huggingface · polinaeterna · Aug 22, 2022 · Jun 20, 2022 · Jun 20, 2022 · Jun 21, 2022
diff --git a/datasets/audiofolder/README.md b/datasets/audiofolder/README.md
@@ -0,0 +1,4 @@
+
+### Contributions
+
+Thanks to [@polinaeterna](https://github.com/polinaeterna), [@nateraw](https://github.com/nateraw), [@lhoestq](https://github.com/lhoestq) and [@mariosasko](https://github.com/mariosasko) for adding this dataset.
diff --git a/datasets/audiofolder/dummy/0.0.0/dummy_data.zip b/datasets/audiofolder/dummy/0.0.0/dummy_data.zip
diff --git a/docs/source/audio_load.mdx b/docs/source/audio_load.mdx
@@ -55,3 +55,126 @@ If you only want to load the underlying path to the audio dataset without decodi
  'transcription': 'I would like to set up a joint account with my partner'}
 ```
 
+## AudioFolder
+
+You can also load a dataset with an `AudioFolder` dataset builder. It does not require writing a custom dataloader, making it useful for quickly loading audio data. Your dataset structure might look like:
+
+```
+folder/train/first_audio_file.wav
+folder/train/second_audio_file.wav
+folder/train/third_audio_file.wav
+
+folder/train/last_audio_file.wav
+```
+
+Load your audio dataset by specifying `audiofolder` and the directory containing your data in `data_dir`:
+
+```py
+>>> from datasets import load_dataset
+
+>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
+>>> dataset["train"][0]
+{'audio':
+    {'path': '/path/to/extracted/audio/first_audio_file.wav',
+    'array': array([ 0.00088501,  0.0012207 ,  0.00131226, ..., -0.00045776, -0.00054932, -0.00054932], dtype=float32),
+    'sampling_rate': 16000}
+}
+```
+
+Load remote datasets from their URLs with the `data_files` parameter:
+
+```py
+>>> dataset = load_dataset("audiofolder", data_files="https://s3.amazonaws.com/datasets.huggingface.co/SpeechCommands/v0.01/v0.01_test.tar.gz")
+```
+
+## AudioFolder with metadata
+
+Metadata associated with your dataset can also be loaded. Make sure your dataset has a `metadata.jsonl` file:
+
+```
+folder/train/metadata.jsonl
+folder/train/first_audio_file.mp3
+folder/train/second_audio_file.mp3
+folder/train/third_audio_file.mp3
+```
+
+Your `metadata.jsonl` file must have a `file_name` column which links audio files with their metadata:
+
+```jsonl
+{"file_name": "first_audio_file.mp3", "additional_feature": "This is a first value of a text feature you linked with your audio files"}
+{"file_name": "second_audio_file.mp3", "additional_feature": "This is a second value of a text feature you linked with your audio files"}
+{"file_name": "third_audio_file.mp3", "additional_feature": "This is a third value of a text feature you linked with your audio files"}
+```
+
+### Automatic Speech Recognition
+
+ASR datasets contain text transcriptions of recorded audio files. An example `metadata.jsonl` might look like:
+
+```
+{"file_name": "11295_11059_000000.flac", "transcription": "znowu się duch z ciałem zrośnie w młodocianej wstaniesz wiosnie i możesz skutkiem tych leków umierać wstawać wiek wieków dalej tam były przestrogi jak siekać głowę jak nogi"}
+{"file_name": "11295_11059_000001.flac", "transcription": "już u źwierzyńca podwojów król zasiada przy nim książęta i panowie rada a gdzie wzniosły krążył ganek rycerze obok kochanek król skinął palcem zaczęto igrzysko"}
+{"file_name": "11295_11059_000002.flac", "transcription": "pewnie kędyś w obłędzie ubite minęły szlaki zaczekajmy dzień jaki poślemy szukać wszędzie dziś jutro pewnie będzie posłali wszędzie sługi czekali dzień i drugi gdy nic nie doczekali z płaczem chcą jechać dali"}
+{"file_name": "11295_11059_000003.flac", "transcription": "na miejscach które dziś piaskiem zaniosło gdzie car i trzcina zarasta po których teraz wasze biega wiosło stał okrąg pięknego miasta"}
+```
+
+Load the dataset with `AudioFolder`, and it will create a `transcription` column with text transcriptions:
+
+```py
+>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder", split="train")
+>>> dataset[0]["transcription"]
+"znowu się duch z ciałem zrośnie w młodocianej wstaniesz wiosnie i możesz skutkiem tych leków umierać wstawać wiek wieków dalej tam były przestrogi jak siekać głowę jak nogi"
+```
+
+## AudioFolder with labels
+
+If you have an audio classification task, set `drop_labels=False` to infer labels from directory names, as defined in [`~datasets.packaged_modules.audiofolder.AudioFolderConfig`].
+Your dataset structure should look like this:
+
+```
+folder/train/label_0/first_audio_label_0.mp3
+folder/train/label_0/second_audio_label_0.mp3
+folder/train/label_1/first_audio_label_1.mp3
+
+folder/train/label_9/last_audio_label_99.mp3
+```
+
+`AudioFolder` creates a `label` column of type [`~datasets.features.ClassLabel`] based on the directory name.
+
+### Language identification
+
+Language identification datasets have audio recordings of speech in multiple languages:
+
+```
+folder/train/ar/0197_720_0207_190.mp3
+folder/train/ar/0179_830_0185_540.mp3
+folder/train/ar/0179_830_0185_540.mp3
+
+folder/train/zh/0442_690_0454_380.mp3
+```
+
+Load the dataset with `drop_labels=False`, and `AudioFolder` will create a `label` column with the language id based on the directory name:
+
+```py
+>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder", drop_labels=False)
+>>> dataset["train"][0]
+{'audio':
+    {'path': '/path/to/extracted/audio/0197_720_0207_190.mp3',
+    'array': array([-3.6621094e-04, -6.1035156e-05,  6.1035156e-05, ..., -5.1879883e-04, -1.0070801e-03, -7.6293945e-04]),
+    'sampling_rate': 16000},
+ 'label': 0  # "ar"
+}
+
+>>> dataset["train"][-1]
+{'audio':
+    {'path': '/path/to/extracted/audio/0442_690_0454_380.mp3',
+    'array': array([1.8920898e-03, 9.4604492e-04, 1.9226074e-03, ..., 9.1552734e-05, 1.8310547e-04, 6.1035156e-05]),
+    'sampling_rate': 16000},
+ 'label': 99  # "zh"
+}
+```
+
+<Tip>
+
+Alternatively, you can add a `label` column to your `metadata.jsonl` file.
+
+</Tip>
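
Note on the `metadata.jsonl` layout documented in the hunk above: the file holds one JSON object per line, each with a `file_name` key plus any extra columns. Below is a minimal sketch of producing such a file with the standard library only. The helper names `write_metadata` and `read_metadata` and the sample transcriptions are hypothetical, made up for illustration; they are not part of `datasets` or of this PR.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(split_dir, transcriptions):
    """Write metadata.jsonl into split_dir, one JSON object per audio file.

    `transcriptions` maps a file name (relative to split_dir) to its text.
    Hypothetical helper, not a `datasets` API.
    """
    metadata_path = Path(split_dir) / "metadata.jsonl"
    with metadata_path.open("w", encoding="utf-8") as f:
        for file_name, text in transcriptions.items():
            row = {"file_name": file_name, "transcription": text}
            # ensure_ascii=False keeps non-ASCII transcriptions human-readable
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return metadata_path


def read_metadata(metadata_path):
    """Read the JSON Lines file back into a list of dicts."""
    with Path(metadata_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up sample data, written to a temporary folder/train directory
with tempfile.TemporaryDirectory() as tmp:
    split_dir = Path(tmp) / "train"
    split_dir.mkdir()
    path = write_metadata(
        split_dir,
        {
            "first_audio_file.mp3": "hello world",
            "second_audio_file.mp3": "zażółć gęślą jaźń",
        },
    )
    rows = read_metadata(path)

print(rows[0]["file_name"])
```

The audio files themselves would sit next to `metadata.jsonl` in the same split directory, as in the directory listing shown in the docs.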
diff --git a/docs/source/package_reference/loading_methods.mdx b/docs/source/package_reference/loading_methods.mdx
@@ -68,3 +68,7 @@ load_dataset("csv", data_dir="path/to/data/dir", sep="\t")
 ### Images
 
 [[autodoc]] datasets.packaged_modules.imagefolder.ImageFolderConfig
+
+### Audio
+
+[[autodoc]] datasets.packaged_modules.audiofolder.AudioFolderConfig
diff --git a/src/datasets/packaged_modules/__init__.py b/src/datasets/packaged_modules/__init__.py
@@ -3,6 +3,7 @@
 from hashlib import sha256
 from typing import List
 
+from .audiofolder import audiofolder
 from .csv import csv
 from .imagefolder import imagefolder
 from .json import json
@@ -32,6 +33,7 @@ def _hash_python_lines(lines: List[str]) -> str:
     "parquet": (parquet.__name__, _hash_python_lines(inspect.getsource(parquet).splitlines())),
     "text": (text.__name__, _hash_python_lines(inspect.getsource(text).splitlines())),
     "imagefolder": (imagefolder.__name__, _hash_python_lines(inspect.getsource(imagefolder).splitlines())),
+    "audiofolder": (audiofolder.__name__, _hash_python_lines(inspect.getsource(audiofolder).splitlines())),
 }
 
 _EXTENSION_TO_MODULE = {
@@ -42,7 +44,8 @@ def _hash_python_lines(lines: List[str]) -> str:
     "parquet": ("parquet", {}),
     "txt": ("text", {}),
 }
-_EXTENSION_TO_MODULE.update({ext[1:]: ("imagefolder", {}) for ext in imagefolder.ImageFolder.IMAGE_EXTENSIONS})
-_EXTENSION_TO_MODULE.update({ext[1:].upper(): ("imagefolder", {}) for ext in imagefolder.ImageFolder.IMAGE_EXTENSIONS})
-
-_MODULE_SUPPORTS_METADATA = {"imagefolder"}
+_EXTENSION_TO_MODULE.update({ext[1:]: ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})
+_EXTENSION_TO_MODULE.update({ext[1:].upper(): ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})
+_EXTENSION_TO_MODULE.update({ext[1:]: ("audiofolder", {}) for ext in audiofolder.AudioFolder.EXTENSIONS})
+_EXTENSION_TO_MODULE.update({ext[1:].upper(): ("audiofolder", {}) for ext in audiofolder.AudioFolder.EXTENSIONS})
+_MODULE_SUPPORTS_METADATA = {"imagefolder", "audiofolder"}
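
Note on the hunk above: each extension is registered twice, once in lower case and once in upper case, with the leading dot stripped. The sketch below mirrors that pattern on a small subset of extensions to show what a lookup would and would not match. The `infer_module` function and the shortened extension lists are assumptions made for this example; the library's actual resolution logic is not shown in this diff.

```python
from pathlib import PurePosixPath

# Illustrative subsets only; the real lists live in the packaged modules
IMAGE_EXTENSIONS = [".png", ".jpg"]
AUDIO_EXTENSIONS = [".wav", ".mp3", ".flac"]

extension_to_module = {"csv": ("csv", {}), "txt": ("text", {})}
for module_name, extensions in (
    ("imagefolder", IMAGE_EXTENSIONS),
    ("audiofolder", AUDIO_EXTENSIONS),
):
    for ext in extensions:
        # ext[1:] drops the leading dot, as in the diff above
        extension_to_module[ext[1:]] = (module_name, {})
        extension_to_module[ext[1:].upper()] = (module_name, {})


def infer_module(file_name):
    """Return the module name registered for a file's extension, or None.

    Hypothetical helper for illustration, not a `datasets` function.
    """
    suffix = PurePosixPath(file_name).suffix[1:]
    module, _kwargs = extension_to_module.get(suffix, (None, {}))
    return module


print(infer_module("clip.wav"), infer_module("clip.WAV"), infer_module("clip.Wav"))
```

With only the lower-case and upper-case spellings registered, a mixed-case suffix such as `.Wav` has no entry in this sketch.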
diff --git a/src/datasets/packaged_modules/audiofolder/__init__.py b/src/datasets/packaged_modules/audiofolder/__init__.py
diff --git a/src/datasets/packaged_modules/audiofolder/audiofolder.py b/src/datasets/packaged_modules/audiofolder/audiofolder.py
@@ -0,0 +1,80 @@
+from dataclasses import dataclass
+from typing import ClassVar, List
+
+import datasets
+
+from ..base import autofolder
+
+
+logger = datasets.utils.logging.get_logger(__name__)
+
+
+@dataclass
+class AudioFolderConfig(autofolder.AutoFolderConfig):
+    """Builder Config for AudioFolder."""
+
+    _base_feature: ClassVar = datasets.Audio()
+    drop_labels: bool = True  # usually we don't need labels as classification is not the main audio task
+    drop_metadata: bool = None
+
+
+class AudioFolder(autofolder.AutoFolder):
+    BUILDER_CONFIG_CLASS = AudioFolderConfig
+    EXTENSIONS: List[str] = []  # definition at the bottom of the script
+
+    def _info(self):
+        return datasets.DatasetInfo(features=self.config.features)
+
+    def _split_generators(self, dl_manager):
+        # _prepare_split_generators() sets self.info.features,
+        # infers labels, finds metadata files if needed and returns splits
+        return self._prepare_split_generators(dl_manager)
+
+    def _generate_examples(self, files, metadata_files, split_name, add_metadata, add_labels):
+        generator = self._prepare_generate_examples(files, metadata_files, split_name, add_metadata, add_labels)
+        for key, example in generator:
+            yield key, example
+
+
+# Obtained with:
+# ```
+# import soundfile as sf
+#
+# AUDIO_EXTENSIONS = [f".{format.lower()}" for format in sf.available_formats().keys()]
+#
+# # .mp3 is currently decoded via `torchaudio`, .opus decoding is supported if version of `libsndfile` >= 1.0.30:
+# AUDIO_EXTENSIONS.extend([".mp3", ".opus"])
+# ```
+# We intentionally do not run this code on launch because:
+# (1) Soundfile is an optional dependency, so importing it in global namespace is not allowed
+# (2) To ensure the list of supported extensions is deterministic
+AUDIO_EXTENSIONS = [
+    ".aiff",
+    ".au",
+    ".avr",
+    ".caf",
+    ".flac",
+    ".htk",
+    ".svx",
+    ".mat4",
+    ".mat5",
+    ".mpc2k",
+    ".ogg",
+    ".paf",
+    ".pvf",
+    ".raw",
+    ".rf64",
+    ".sd2",
+    ".sds",
+    ".ircam",
+    ".voc",
+    ".w64",
+    ".wav",
+    ".nist",
+    ".wavex",
+    ".wve",
+    ".xi",
+    ".mp3",
+    ".opus",
+]
+AudioFolder.EXTENSIONS = AUDIO_EXTENSIONS
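
Note on the file above: a rough sketch of how a static extension list like `AUDIO_EXTENSIONS` could be used to pick out audio files and, when labels are kept, to derive class indices from parent directory names. This is not the `AutoFolder` implementation, which is not shown in this diff. The function names are hypothetical, the case-insensitive suffix match is a simplification chosen for the example, and the alphabetical ordering of class names is an assumption.

```python
from pathlib import PurePosixPath

# Short subset of the AUDIO_EXTENSIONS list above, for illustration
SUPPORTED = {".wav", ".mp3", ".flac", ".opus"}


def select_audio_files(paths):
    """Keep only paths whose suffix is a supported audio extension."""
    return [p for p in paths if PurePosixPath(p).suffix.lower() in SUPPORTED]


def infer_labels(paths):
    """Map each audio file to the index of its parent directory name.

    Assumption: class names are ordered alphabetically.
    """
    audio_files = select_audio_files(paths)
    names = sorted({PurePosixPath(p).parent.name for p in audio_files})
    labels = {p: names.index(PurePosixPath(p).parent.name) for p in audio_files}
    return names, labels


paths = [
    "folder/train/ar/0197_720_0207_190.mp3",
    "folder/train/zh/0442_690_0454_380.mp3",
    "folder/train/metadata.jsonl",  # skipped: not an audio extension
]
names, labels = infer_labels(paths)
print(names, labels)
```

In this two-language toy layout `zh` maps to index 1; the index in a real dataset depends on how many label directories exist.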
diff --git a/src/datasets/packaged_modules/base/__init__.py b/src/datasets/packaged_modules/base/__init__.py