Add AudioFolder packaged loader (#4530)
* add audiofolder loader (almost identical to imagefolder, except that inferring labels is not the default)

* add instruction on how to obtain list of audio extensions

* add a generic loader

* patch autofolder for streaming manually

* align autofolder with the latest imagefolder implementation

* update tests

* add test for duplicate label col

* add tests for autofolder (+copied from imagefolder)

* add missed audio_file fixture

* add documentation

* remove boilerplate, make the base feature a class attribute of the builder instead of a config parameter

* remove self.config.label_name, use hardcoded 'label'

* patch parents that inherit from DatasetBuilder, revert get_imports

* rename autofolder -> folder_builder

* make base column name an abstract attribute of FolderBuilder instead of a config parameter

* Update src/datasets/streaming.py

Co-authored-by: Mario 艩a拧ko <mario@huggingface.co>

* rename FolderBuilder -> FolderBasedBuilder

* set drop_labels to None by default for AudioFolder

* update documentation

* check that a builder being extended for streaming is not defined in the datasets.builder module

Co-authored-by: Mario 艩a拧ko <mario@huggingface.co>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
3 people committed Aug 22, 2022
1 parent 16f6cd7 commit 6ea46d8
Showing 17 changed files with 1,479 additions and 368 deletions.
4 changes: 4 additions & 0 deletions datasets/audiofolder/README.md
@@ -0,0 +1,4 @@

### Contributions

Thanks to [@polinaeterna](https://github.com/polinaeterna), [@nateraw](https://github.com/nateraw), [@lhoestq](https://github.com/lhoestq) and [@mariosasko](https://github.com/mariosasko) for adding this dataset.
Binary file added datasets/audiofolder/dummy/0.0.0/dummy_data.zip
96 changes: 96 additions & 0 deletions docs/source/audio_load.mdx
@@ -55,3 +55,99 @@ If you only want to load the underlying path to the audio dataset without decodi
'transcription': 'I would like to set up a joint account with my partner'}
```

## AudioFolder

You can also load a dataset with an `AudioFolder` dataset builder. It does not require writing a custom dataloader, making it useful for quickly loading audio data.

## AudioFolder with metadata

To link your audio files with metadata information, make sure your dataset has a `metadata.jsonl` file. Your dataset structure might look like:

```
folder/train/metadata.jsonl
folder/train/first_audio_file.mp3
folder/train/second_audio_file.mp3
folder/train/third_audio_file.mp3
```

Your `metadata.jsonl` file must have a `file_name` column which links audio files with their metadata. An example `metadata.jsonl` file might look like:

```python
{"file_name": "first_audio_file.mp3", "transcription": "znowu si臋 duch z cia艂em zro艣nie w m艂odocianej wstaniesz wiosnie i mo偶esz skutkiem tych lek贸w umiera膰 wstawa膰 wiek wiek贸w dalej tam by艂y przestrogi jak sieka膰 g艂ow臋 jak nogi"}
{"file_name": "second_audio_file.mp3", "transcription": "ju偶 u 藕wierzy艅ca podwoj贸w kr贸l zasiada przy nim ksi膮偶臋ta i panowie rada a gdzie wznios艂y kr膮偶y艂 ganek rycerze obok kochanek kr贸l skin膮艂 palcem zacz臋to igrzysko"}
{"file_name": "third_audio_file.mp3", "transcription": "pewnie k臋dy艣 w ob艂臋dzie ubite min臋艂y szlaki zaczekajmy dzie艅 jaki po艣lemy szuka膰 wsz臋dzie dzi艣 jutro pewnie b臋dzie pos艂ali wsz臋dzie s艂ugi czekali dzie艅 i drugi gdy nic nie doczekali z p艂aczem chc膮 jecha膰 dali"}
```
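
If you generate the metadata file programmatically, a minimal sketch could look like the following (the `transcriptions` dict and the output path are hypothetical, and the transcriptions are truncated for brevity):

```py
import json

# Hypothetical mapping from audio file names to their transcriptions.
transcriptions = {
    "first_audio_file.mp3": "znowu się duch z ciałem zrośnie ...",
    "second_audio_file.mp3": "już u źwierzyńca podwojów król zasiada ...",
}

# Write one JSON object per line; `file_name` is the required key.
with open("folder/train/metadata.jsonl", "w", encoding="utf-8") as f:
    for file_name, transcription in transcriptions.items():
        f.write(json.dumps({"file_name": file_name, "transcription": transcription}, ensure_ascii=False) + "\n")
```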

Load your audio dataset by specifying `audiofolder` and the directory containing your data in `data_dir`:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
```

`AudioFolder` will load audio data and create a `transcription` column containing texts from `metadata.jsonl`:

```py
>>> dataset["train"][0]
{'audio':
{'path': '/path/to/extracted/audio/first_audio_file.mp3',
'array': array([ 0.00088501, 0.0012207 , 0.00131226, ..., -0.00045776, -0.00054932, -0.00054932], dtype=float32),
'sampling_rate': 16000},
'transcription': 'znowu si臋 duch z cia艂em zro艣nie w m艂odocianej wstaniesz wiosnie i mo偶esz skutkiem tych lek贸w umiera膰 wstawa膰 wiek wiek贸w dalej tam by艂y przestrogi jak sieka膰 g艂ow臋 jak nogi'
}
```
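
The `audio` column is decoded with the file's original sampling rate; if your model expects a different one, you can resample after loading with [`~datasets.Dataset.cast_column`]. A short sketch (the target rate here is an arbitrary example):

```py
>>> from datasets import Audio

>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
```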

You can load remote datasets from their URLs with the `data_files` parameter:

```py
>>> dataset = load_dataset("audiofolder", data_files="https://s3.amazonaws.com/datasets.huggingface.co/SpeechCommands/v0.01/v0.01_test.tar.gz")
```
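
As with other packaged builders, `data_files` also accepts a dict mapping split names to files or URLs, so you can load several splits at once — a sketch with hypothetical archive URLs:

```py
>>> data_files = {
...     "train": "https://example.com/audio_train.tar.gz",  # hypothetical URL
...     "test": "https://example.com/audio_test.tar.gz",  # hypothetical URL
... }
>>> dataset = load_dataset("audiofolder", data_files=data_files)
```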

## AudioFolder with labels

If your data directory doesn't contain any metadata files, `AudioFolder` automatically adds a `label` column of [`~datasets.features.ClassLabel`] type by default, with labels inferred from the directory names. This is useful for audio classification tasks.

### Language identification

Language identification datasets have audio recordings of speech in multiple languages:

```
folder/train/ar/0197_720_0207_190.mp3
folder/train/ar/0179_830_0185_540.mp3
folder/train/zh/0442_690_0454_380.mp3
```

As there are no metadata files, `AudioFolder` will create a `label` column with the language id based on the directory name:

```py
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder", drop_labels=False)
>>> dataset["train"][0]
{'audio':
    {'path': '/path/to/extracted/audio/0197_720_0207_190.mp3',
    'array': array([-3.6621094e-04, -6.1035156e-05, 6.1035156e-05, ..., -5.1879883e-04, -1.0070801e-03, -7.6293945e-04], dtype=float32),
    'sampling_rate': 16000},
 'label': 0  # "ar"
}
>>> dataset["train"][-1]
{'audio':
    {'path': '/path/to/extracted/audio/0442_690_0454_380.mp3',
    'array': array([1.8920898e-03, 9.4604492e-04, 1.9226074e-03, ..., 9.1552734e-05, 1.8310547e-04, 6.1035156e-05], dtype=float32),
    'sampling_rate': 16000},
 'label': 99  # "zh"
}
```
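
The `label` column stores integer class ids; to map them back to the directory-derived names, you can use the `ClassLabel.int2str` method — a short sketch, assuming the class ids shown above:

```py
>>> labels = dataset["train"].features["label"]
>>> labels.int2str(0)
'ar'
```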

If you have metadata files inside your data directory, but you still want to infer labels from directory names, set `drop_labels=False` as defined in [`~datasets.packaged_modules.audiofolder.AudioFolderConfig`].

<Tip>

Alternatively, you can add a `label` column to your `metadata.jsonl` file.

</Tip>
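
For example, a `metadata.jsonl` that supplies labels directly might contain lines like these (file names reused from above for illustration):

```python
{"file_name": "0197_720_0207_190.mp3", "label": "ar"}
{"file_name": "0442_690_0454_380.mp3", "label": "zh"}
```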

If you have no metadata files and want to drop automatically created labels, set `drop_labels=True`. In this case, your dataset will contain only an `audio` column, as in the sketch below.
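
A quick sketch of that case:

```py
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder", drop_labels=True)
>>> dataset["train"].column_names
['audio']
```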
2 changes: 1 addition & 1 deletion docs/source/image_load.mdx
@@ -47,7 +47,7 @@ If you only want to load the underlying path to the image dataset without decodi

## ImageFolder

-You can also load a dataset with a `ImageFolder` dataset builder. It does not require writing a custom dataloader, making it useful for quickly loading a dataset for certain vision tasks. Your image dataset structure should look like this:
+You can also load a dataset with an `ImageFolder` dataset builder. It does not require writing a custom dataloader, making it useful for quickly loading a dataset for certain vision tasks. Your image dataset structure should look like this:

```
folder/train/dog/golden_retriever.png
4 changes: 4 additions & 0 deletions docs/source/package_reference/loading_methods.mdx
@@ -68,3 +68,7 @@ load_dataset("csv", data_dir="path/to/data/dir", sep="\t")
### Images

[[autodoc]] datasets.packaged_modules.imagefolder.ImageFolderConfig

### Audio

[[autodoc]] datasets.packaged_modules.audiofolder.AudioFolderConfig
11 changes: 7 additions & 4 deletions src/datasets/packaged_modules/__init__.py
@@ -3,6 +3,7 @@
from hashlib import sha256
from typing import List

from .audiofolder import audiofolder
from .csv import csv
from .imagefolder import imagefolder
from .json import json
@@ -32,6 +33,7 @@ def _hash_python_lines(lines: List[str]) -> str:
"parquet": (parquet.__name__, _hash_python_lines(inspect.getsource(parquet).splitlines())),
"text": (text.__name__, _hash_python_lines(inspect.getsource(text).splitlines())),
"imagefolder": (imagefolder.__name__, _hash_python_lines(inspect.getsource(imagefolder).splitlines())),
"audiofolder": (audiofolder.__name__, _hash_python_lines(inspect.getsource(audiofolder).splitlines())),
}

_EXTENSION_TO_MODULE = {
@@ -42,7 +44,8 @@ def _hash_python_lines(lines: List[str]) -> str:
"parquet": ("parquet", {}),
"txt": ("text", {}),
}
-_EXTENSION_TO_MODULE.update({ext[1:]: ("imagefolder", {}) for ext in imagefolder.ImageFolder.IMAGE_EXTENSIONS})
-_EXTENSION_TO_MODULE.update({ext[1:].upper(): ("imagefolder", {}) for ext in imagefolder.ImageFolder.IMAGE_EXTENSIONS})
-
-_MODULE_SUPPORTS_METADATA = {"imagefolder"}
+_EXTENSION_TO_MODULE.update({ext[1:]: ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})
+_EXTENSION_TO_MODULE.update({ext[1:].upper(): ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})
+_EXTENSION_TO_MODULE.update({ext[1:]: ("audiofolder", {}) for ext in audiofolder.AudioFolder.EXTENSIONS})
+_EXTENSION_TO_MODULE.update({ext[1:].upper(): ("audiofolder", {}) for ext in audiofolder.AudioFolder.EXTENSIONS})
+_MODULE_SUPPORTS_METADATA = {"imagefolder", "audiofolder"}
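
To make the registration logic above concrete, here is a minimal standalone sketch of what those dict comprehensions produce (trimmed, illustrative extension list — the real one comes from `AudioFolder.EXTENSIONS`):

```python
# Trimmed, illustrative extension list.
EXTENSIONS = [".wav", ".mp3", ".flac"]

_EXTENSION_TO_MODULE = {}
# Register each extension without its leading dot, in both lower and upper case,
# so files like "clip.wav" and "CLIP.WAV" both resolve to the audiofolder module.
_EXTENSION_TO_MODULE.update({ext[1:]: ("audiofolder", {}) for ext in EXTENSIONS})
_EXTENSION_TO_MODULE.update({ext[1:].upper(): ("audiofolder", {}) for ext in EXTENSIONS})

assert _EXTENSION_TO_MODULE["wav"] == ("audiofolder", {})
assert _EXTENSION_TO_MODULE["WAV"] == ("audiofolder", {})
```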
66 changes: 66 additions & 0 deletions src/datasets/packaged_modules/audiofolder/audiofolder.py
@@ -0,0 +1,66 @@
from typing import List

import datasets

from ..folder_based_builder import folder_based_builder


logger = datasets.utils.logging.get_logger(__name__)


class AudioFolderConfig(folder_based_builder.FolderBasedBuilderConfig):
"""Builder Config for AudioFolder."""

drop_labels: bool = None
drop_metadata: bool = None


class AudioFolder(folder_based_builder.FolderBasedBuilder):
BASE_FEATURE = datasets.Audio()
BASE_COLUMN_NAME = "audio"
BUILDER_CONFIG_CLASS = AudioFolderConfig
EXTENSIONS: List[str] # definition at the bottom of the script


# Obtained with:
# ```
# import soundfile as sf
#
# AUDIO_EXTENSIONS = [f".{format.lower()}" for format in sf.available_formats().keys()]
#
# # .mp3 is currently decoded via `torchaudio`, .opus decoding is supported if version of `libsndfile` >= 1.0.30:
# AUDIO_EXTENSIONS.extend([".mp3", ".opus"])
# ```
# We intentionally do not run this code on launch because:
# (1) Soundfile is an optional dependency, so importing it in global namespace is not allowed
# (2) To ensure the list of supported extensions is deterministic
AUDIO_EXTENSIONS = [
".aiff",
".au",
".avr",
".caf",
".flac",
".htk",
".svx",
".mat4",
".mat5",
".mpc2k",
".ogg",
".paf",
".pvf",
".raw",
".rf64",
".sd2",
".sds",
".ircam",
".voc",
".w64",
".wav",
".nist",
".wavex",
".wve",
".xi",
".mp3",
".opus",
]
AudioFolder.EXTENSIONS = AUDIO_EXTENSIONS
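
For reference, the `FolderBasedBuilder` pattern above generalizes to other modalities: a builder only declares a base feature, a base column name, a config class, and a list of extensions. A hypothetical sketch, meant to live alongside `audiofolder.py` inside `datasets/packaged_modules` (the class, feature choice, and extensions are illustrative, not part of this commit):

```python
from typing import List

import datasets

from ..folder_based_builder import folder_based_builder


class DocumentFolderConfig(folder_based_builder.FolderBasedBuilderConfig):
    """Hypothetical builder config for a text-document folder loader."""

    drop_labels: bool = None
    drop_metadata: bool = None


class DocumentFolder(folder_based_builder.FolderBasedBuilder):
    # With a plain string feature, the base column would hold the file path.
    BASE_FEATURE = datasets.Value("string")
    BASE_COLUMN_NAME = "document"
    BUILDER_CONFIG_CLASS = DocumentFolderConfig
    EXTENSIONS: List[str] = [".txt", ".md"]  # hypothetical extension list
```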

1 comment on commit 6ea46d8

@github-actions

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.008568 / 0.011353 (-0.002785) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004133 / 0.011008 (-0.006875) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.031041 / 0.038508 (-0.007467) |
| read_batch_unformated after write_array2d | 0.038965 / 0.023109 (0.015856) |
| read_batch_unformated after write_flattened_sequence | 0.298009 / 0.275898 (0.022111) |
| read_batch_unformated after write_nested_sequence | 0.373878 / 0.323480 (0.050398) |
| read_col_formatted_as_numpy after write_array2d | 0.006394 / 0.007986 (-0.001592) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005131 / 0.004328 (0.000803) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.007232 / 0.004250 (0.002982) |
| read_col_unformated after write_array2d | 0.060589 / 0.037052 (0.023536) |
| read_col_unformated after write_flattened_sequence | 0.320574 / 0.258489 (0.062085) |
| read_col_unformated after write_nested_sequence | 0.358694 / 0.293841 (0.064853) |
| read_formatted_as_numpy after write_array2d | 0.031759 / 0.128546 (-0.096787) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009777 / 0.075646 (-0.065870) |
| read_formatted_as_numpy after write_nested_sequence | 0.268304 / 0.419271 (-0.150968) |
| read_unformated after write_array2d | 0.055693 / 0.043533 (0.012161) |
| read_unformated after write_flattened_sequence | 0.299577 / 0.255139 (0.044438) |
| read_unformated after write_nested_sequence | 0.317759 / 0.283200 (0.034559) |
| write_array2d | 0.123204 / 0.141683 (-0.018479) |
| write_flattened_sequence | 1.538594 / 1.452155 (0.086440) |
| write_nested_sequence | 1.566491 / 1.492716 (0.073775) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.276269 / 0.018006 (0.258262) |
| get_batch_of_1024_rows | 0.513867 / 0.000490 (0.513377) |
| get_first_row | 0.004325 / 0.000200 (0.004125) |
| get_last_row | 0.000089 / 0.000054 (0.000034) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.025969 / 0.037411 (-0.011443) |
| shard | 0.108319 / 0.014526 (0.093793) |
| shuffle | 0.118701 / 0.176557 (-0.057856) |
| sort | 0.184598 / 0.737135 (-0.552538) |
| train_test_split | 0.127123 / 0.296338 (-0.169215) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.396280 / 0.215209 (0.181071) |
| read 50000 | 3.968533 / 2.077655 (1.890878) |
| read_batch 50000 10 | 1.768430 / 1.504120 (0.264310) |
| read_batch 50000 100 | 1.581403 / 1.541195 (0.040209) |
| read_batch 50000 1000 | 1.669893 / 1.468490 (0.201403) |
| read_formatted numpy 5000 | 0.426448 / 4.584777 (-4.158329) |
| read_formatted pandas 5000 | 3.779929 / 3.745712 (0.034217) |
| read_formatted tensorflow 5000 | 3.567399 / 5.269862 (-1.702462) |
| read_formatted torch 5000 | 1.757791 / 4.565676 (-2.807885) |
| read_formatted_batch numpy 5000 10 | 0.051756 / 0.424275 (-0.372519) |
| read_formatted_batch numpy 5000 1000 | 0.011488 / 0.007607 (0.003881) |
| shuffled read 5000 | 0.507868 / 0.226044 (0.281824) |
| shuffled read 50000 | 5.073895 / 2.268929 (2.804967) |
| shuffled read_batch 50000 10 | 2.240790 / 55.444624 (-53.203834) |
| shuffled read_batch 50000 100 | 1.897565 / 6.876477 (-4.978912) |
| shuffled read_batch 50000 1000 | 2.098987 / 2.142072 (-0.043085) |
| shuffled read_formatted numpy 5000 | 0.546676 / 4.805227 (-4.258552) |
| shuffled read_formatted_batch numpy 5000 10 | 0.121978 / 6.500664 (-6.378686) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.062944 / 0.075469 (-0.012525) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.437529 / 1.841788 (-0.404258) |
| map fast-tokenizer batched | 14.268878 / 8.074308 (6.194570) |
| map identity | 25.325708 / 10.191392 (15.134316) |
| map identity batched | 0.831804 / 0.680424 (0.151380) |
| map no-op batched | 0.530936 / 0.534201 (-0.003265) |
| map no-op batched numpy | 0.386568 / 0.579283 (-0.192715) |
| map no-op batched pandas | 0.432729 / 0.434364 (-0.001635) |
| map no-op batched pytorch | 0.275455 / 0.540337 (-0.264882) |
| map no-op batched tensorflow | 0.271256 / 1.386936 (-1.115680) |
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.006715 / 0.011353 (-0.004638) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004304 / 0.011008 (-0.006704) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.028648 / 0.038508 (-0.009860) |
| read_batch_unformated after write_array2d | 0.036007 / 0.023109 (0.012898) |
| read_batch_unformated after write_flattened_sequence | 0.349896 / 0.275898 (0.073998) |
| read_batch_unformated after write_nested_sequence | 0.433444 / 0.323480 (0.109964) |
| read_col_formatted_as_numpy after write_array2d | 0.004390 / 0.007986 (-0.003596) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003813 / 0.004328 (-0.000516) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.005052 / 0.004250 (0.000801) |
| read_col_unformated after write_array2d | 0.053632 / 0.037052 (0.016580) |
| read_col_unformated after write_flattened_sequence | 0.358833 / 0.258489 (0.100344) |
| read_col_unformated after write_nested_sequence | 0.396284 / 0.293841 (0.102443) |
| read_formatted_as_numpy after write_array2d | 0.030928 / 0.128546 (-0.097618) |
| read_formatted_as_numpy after write_flattened_sequence | 0.010090 / 0.075646 (-0.065556) |
| read_formatted_as_numpy after write_nested_sequence | 0.266533 / 0.419271 (-0.152738) |
| read_unformated after write_array2d | 0.059650 / 0.043533 (0.016117) |
| read_unformated after write_flattened_sequence | 0.341938 / 0.255139 (0.086799) |
| read_unformated after write_nested_sequence | 0.363815 / 0.283200 (0.080615) |
| write_array2d | 0.116797 / 0.141683 (-0.024886) |
| write_flattened_sequence | 1.513282 / 1.452155 (0.061127) |
| write_nested_sequence | 1.515856 / 1.492716 (0.023139) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.280721 / 0.018006 (0.262715) |
| get_batch_of_1024_rows | 0.519628 / 0.000490 (0.519138) |
| get_first_row | 0.001184 / 0.000200 (0.000984) |
| get_last_row | 0.000089 / 0.000054 (0.000034) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.025878 / 0.037411 (-0.011534) |
| shard | 0.107856 / 0.014526 (0.093331) |
| shuffle | 0.116456 / 0.176557 (-0.060101) |
| sort | 0.162067 / 0.737135 (-0.575068) |
| train_test_split | 0.122049 / 0.296338 (-0.174289) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.426752 / 0.215209 (0.211543) |
| read 50000 | 4.255081 / 2.077655 (2.177426) |
| read_batch 50000 10 | 2.040571 / 1.504120 (0.536451) |
| read_batch 50000 100 | 1.847850 / 1.541195 (0.306655) |
| read_batch 50000 1000 | 1.927181 / 1.468490 (0.458691) |
| read_formatted numpy 5000 | 0.431994 / 4.584777 (-4.152783) |
| read_formatted pandas 5000 | 3.782285 / 3.745712 (0.036572) |
| read_formatted tensorflow 5000 | 2.095792 / 5.269862 (-3.174069) |
| read_formatted torch 5000 | 1.275580 / 4.565676 (-3.290096) |
| read_formatted_batch numpy 5000 10 | 0.052149 / 0.424275 (-0.372126) |
| read_formatted_batch numpy 5000 1000 | 0.011051 / 0.007607 (0.003444) |
| shuffled read 5000 | 0.531085 / 0.226044 (0.305041) |
| shuffled read 50000 | 5.291369 / 2.268929 (3.022440) |
| shuffled read_batch 50000 10 | 2.516737 / 55.444624 (-52.927887) |
| shuffled read_batch 50000 100 | 2.175954 / 6.876477 (-4.700523) |
| shuffled read_batch 50000 1000 | 2.355976 / 2.142072 (0.213903) |
| shuffled read_formatted numpy 5000 | 0.552290 / 4.805227 (-4.252937) |
| shuffled read_formatted_batch numpy 5000 10 | 0.122648 / 6.500664 (-6.378016) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.062470 / 0.075469 (-0.012999) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.552523 / 1.841788 (-0.289265) |
| map fast-tokenizer batched | 14.641604 / 8.074308 (6.567296) |
| map identity | 25.302524 / 10.191392 (15.111132) |
| map identity batched | 0.959987 / 0.680424 (0.279563) |
| map no-op batched | 0.642189 / 0.534201 (0.107988) |
| map no-op batched numpy | 0.388598 / 0.579283 (-0.190685) |
| map no-op batched pandas | 0.427223 / 0.434364 (-0.007141) |
| map no-op batched pytorch | 0.265775 / 0.540337 (-0.274563) |
| map no-op batched tensorflow | 0.271770 / 1.386936 (-1.115166) |

