Add AudioFolder packaged loader (#4530)
* add audiofolder loader (almost identical to imagefolder, except that inferring labels is not the default)

* add instruction on how to obtain list of audio extensions

* add a generic loader

* patch autofolder for streaming manually

* align autofolder with the latest imagefolder implementation

* update tests

* add test for duplicate label col

* add tests for autofolder (+copied from imagefolder)

* add missed audio_file fixture

* add documentation

* remove boilerplate, make the base feature a class attribute of the builder instead of a config parameter

* remove self.config.label_name, use hardcoded 'label'

* patch parents that inherit from DatasetBuilder, revert get_imports

* rename autofolder -> folder_builder

* make base column name an abstract attribute of FolderBuilder instead of a config parameter

* Update src/datasets/streaming.py

Co-authored-by: Mario 艩a拧ko <mario@huggingface.co>

* rename FolderBuilder -> FolderBasedBuilder

* set drop_labels to None by default for AudioFolder

* update documentation

* check that a builder being extended for streaming is not defined in the datasets.builder module

Co-authored-by: Mario 艩a拧ko <mario@huggingface.co>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
3 people committed Aug 22, 2022
1 parent 16f6cd7 commit 6ea46d8
Showing 17 changed files with 1,479 additions and 368 deletions.
4 changes: 4 additions & 0 deletions datasets/audiofolder/README.md
@@ -0,0 +1,4 @@

### Contributions

Thanks to [@polinaeterna](https://github.com/polinaeterna), [@nateraw](https://github.com/nateraw), [@lhoestq](https://github.com/lhoestq) and [@mariosasko](https://github.com/mariosasko) for adding this dataset.
Binary file added datasets/audiofolder/dummy/0.0.0/dummy_data.zip
96 changes: 96 additions & 0 deletions docs/source/audio_load.mdx
@@ -55,3 +55,99 @@ If you only want to load the underlying path to the audio dataset without decodi
'transcription': 'I would like to set up a joint account with my partner'}
```

## AudioFolder

You can also load a dataset with an `AudioFolder` dataset builder. It does not require writing a custom dataloader, making it useful for quickly loading audio data.

## AudioFolder with metadata

To link your audio files with metadata information, make sure your dataset has a `metadata.jsonl` file. Your dataset structure might look like:

```
folder/train/metadata.jsonl
folder/train/first_audio_file.mp3
folder/train/second_audio_file.mp3
folder/train/third_audio_file.mp3
```

Your `metadata.jsonl` file must have a `file_name` column which links audio files with their metadata. An example `metadata.jsonl` file might look like:

```python
{"file_name": "first_audio_file.mp3", "transcription": "znowu si臋 duch z cia艂em zro艣nie w m艂odocianej wstaniesz wiosnie i mo偶esz skutkiem tych lek贸w umiera膰 wstawa膰 wiek wiek贸w dalej tam by艂y przestrogi jak sieka膰 g艂ow臋 jak nogi"}
{"file_name": "second_audio_file.mp3", "transcription": "ju偶 u 藕wierzy艅ca podwoj贸w kr贸l zasiada przy nim ksi膮偶臋ta i panowie rada a gdzie wznios艂y kr膮偶y艂 ganek rycerze obok kochanek kr贸l skin膮艂 palcem zacz臋to igrzysko"}
{"file_name": "third_audio_file.mp3", "transcription": "pewnie k臋dy艣 w ob艂臋dzie ubite min臋艂y szlaki zaczekajmy dzie艅 jaki po艣lemy szuka膰 wsz臋dzie dzi艣 jutro pewnie b臋dzie pos艂ali wsz臋dzie s艂ugi czekali dzie艅 i drugi gdy nic nie doczekali z p艂aczem chc膮 jecha膰 dali"}
```
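
If you generate the metadata file programmatically, a minimal sketch could look like the following (the `transcriptions` dict and the output path are hypothetical, and the transcriptions are truncated for brevity):

```py
import json

# Hypothetical mapping from audio file names to their transcriptions.
transcriptions = {
    "first_audio_file.mp3": "znowu się duch z ciałem zrośnie ...",
    "second_audio_file.mp3": "już u źwierzyńca podwojów król zasiada ...",
}

# Write one JSON object per line; `file_name` is the required key.
with open("folder/train/metadata.jsonl", "w", encoding="utf-8") as f:
    for file_name, transcription in transcriptions.items():
        f.write(json.dumps({"file_name": file_name, "transcription": transcription}, ensure_ascii=False) + "\n")
```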

Load your audio dataset by specifying `audiofolder` and the directory containing your data in `data_dir`:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
```

`AudioFolder` will load audio data and create a `transcription` column containing texts from `metadata.jsonl`:

```py
>>> dataset["train"][0]
{'audio':
{'path': '/path/to/extracted/audio/first_audio_file.mp3',
'array': array([ 0.00088501, 0.0012207 , 0.00131226, ..., -0.00045776, -0.00054932, -0.00054932], dtype=float32),
'sampling_rate': 16000},
'transcription': 'znowu si臋 duch z cia艂em zro艣nie w m艂odocianej wstaniesz wiosnie i mo偶esz skutkiem tych lek贸w umiera膰 wstawa膰 wiek wiek贸w dalej tam by艂y przestrogi jak sieka膰 g艂ow臋 jak nogi'
}
```
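
The `audio` column is decoded with the file's original sampling rate; if your model expects a different one, you can resample after loading with [`~datasets.Dataset.cast_column`]. A short sketch (the target rate here is an arbitrary example):

```py
>>> from datasets import Audio

>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
```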

You can load remote datasets from their URLs with the `data_files` parameter:

```py
>>> dataset = load_dataset("audiofolder", data_files="https://s3.amazonaws.com/datasets.huggingface.co/SpeechCommands/v0.01/v0.01_test.tar.gz")
```
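
As with other packaged builders, `data_files` also accepts a dict mapping split names to files or URLs, so you can load several splits at once — a sketch with hypothetical archive URLs:

```py
>>> data_files = {
...     "train": "https://example.com/audio_train.tar.gz",  # hypothetical URL
...     "test": "https://example.com/audio_test.tar.gz",  # hypothetical URL
... }
>>> dataset = load_dataset("audiofolder", data_files=data_files)
```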

## AudioFolder with labels

If your data directory doesn't contain any metadata files, `AudioFolder` automatically adds a `label` column of [`~datasets.features.ClassLabel`] type by default, with labels inferred from the directory names. This is useful for audio classification tasks.

### Language identification

Language identification datasets have audio recordings of speech in multiple languages:

```
folder/train/ar/0197_720_0207_190.mp3
folder/train/ar/0179_830_0185_540.mp3
folder/train/zh/0442_690_0454_380.mp3
```

As there are no metadata files, `AudioFolder` will create a `label` column with the language id based on the directory name:

```py
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder", drop_labels=False)
>>> dataset["train"][0]
{'audio':
    {'path': '/path/to/extracted/audio/0197_720_0207_190.mp3',
    'array': array([-3.6621094e-04, -6.1035156e-05, 6.1035156e-05, ..., -5.1879883e-04, -1.0070801e-03, -7.6293945e-04], dtype=float32),
    'sampling_rate': 16000},
 'label': 0  # "ar"
}
>>> dataset["train"][-1]
{'audio':
    {'path': '/path/to/extracted/audio/0442_690_0454_380.mp3',
    'array': array([1.8920898e-03, 9.4604492e-04, 1.9226074e-03, ..., 9.1552734e-05, 1.8310547e-04, 6.1035156e-05], dtype=float32),
    'sampling_rate': 16000},
 'label': 99  # "zh"
}
```
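
The `label` column stores integer class ids; to map them back to the directory-derived names, you can use the `ClassLabel.int2str` method — a short sketch, assuming the class ids shown above:

```py
>>> labels = dataset["train"].features["label"]
>>> labels.int2str(0)
'ar'
```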

If you have metadata files inside your data directory, but you still want to infer labels from directory names, set `drop_labels=False` as defined in [`~datasets.packaged_modules.audiofolder.AudioFolderConfig`].

<Tip>

Alternatively, you can add a `label` column to your `metadata.jsonl` file.

</Tip>
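
For example, a `metadata.jsonl` that supplies labels directly might contain lines like these (file names reused from above for illustration):

```python
{"file_name": "0197_720_0207_190.mp3", "label": "ar"}
{"file_name": "0442_690_0454_380.mp3", "label": "zh"}
```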

If you have no metadata files and want to drop automatically created labels, set `drop_labels=True`. In this case, your dataset will contain only an `audio` column, as in the sketch below.
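
A quick sketch of that case:

```py
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder", drop_labels=True)
>>> dataset["train"].column_names
['audio']
```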
2 changes: 1 addition & 1 deletion docs/source/image_load.mdx
@@ -47,7 +47,7 @@ If you only want to load the underlying path to the image dataset without decodi

## ImageFolder

-You can also load a dataset with a `ImageFolder` dataset builder. It does not require writing a custom dataloader, making it useful for quickly loading a dataset for certain vision tasks. Your image dataset structure should look like this:
+You can also load a dataset with an `ImageFolder` dataset builder. It does not require writing a custom dataloader, making it useful for quickly loading a dataset for certain vision tasks. Your image dataset structure should look like this:

```
folder/train/dog/golden_retriever.png
4 changes: 4 additions & 0 deletions docs/source/package_reference/loading_methods.mdx
@@ -68,3 +68,7 @@ load_dataset("csv", data_dir="path/to/data/dir", sep="\t")
### Images

[[autodoc]] datasets.packaged_modules.imagefolder.ImageFolderConfig

### Audio

[[autodoc]] datasets.packaged_modules.audiofolder.AudioFolderConfig
11 changes: 7 additions & 4 deletions src/datasets/packaged_modules/__init__.py
@@ -3,6 +3,7 @@
from hashlib import sha256
from typing import List

from .audiofolder import audiofolder
from .csv import csv
from .imagefolder import imagefolder
from .json import json
@@ -32,6 +33,7 @@ def _hash_python_lines(lines: List[str]) -> str:
"parquet": (parquet.__name__, _hash_python_lines(inspect.getsource(parquet).splitlines())),
"text": (text.__name__, _hash_python_lines(inspect.getsource(text).splitlines())),
"imagefolder": (imagefolder.__name__, _hash_python_lines(inspect.getsource(imagefolder).splitlines())),
"audiofolder": (audiofolder.__name__, _hash_python_lines(inspect.getsource(audiofolder).splitlines())),
}

_EXTENSION_TO_MODULE = {
@@ -42,7 +44,8 @@ def _hash_python_lines(lines: List[str]) -> str:
"parquet": ("parquet", {}),
"txt": ("text", {}),
}
-_EXTENSION_TO_MODULE.update({ext[1:]: ("imagefolder", {}) for ext in imagefolder.ImageFolder.IMAGE_EXTENSIONS})
-_EXTENSION_TO_MODULE.update({ext[1:].upper(): ("imagefolder", {}) for ext in imagefolder.ImageFolder.IMAGE_EXTENSIONS})
-
-_MODULE_SUPPORTS_METADATA = {"imagefolder"}
+_EXTENSION_TO_MODULE.update({ext[1:]: ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})
+_EXTENSION_TO_MODULE.update({ext[1:].upper(): ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})
+_EXTENSION_TO_MODULE.update({ext[1:]: ("audiofolder", {}) for ext in audiofolder.AudioFolder.EXTENSIONS})
+_EXTENSION_TO_MODULE.update({ext[1:].upper(): ("audiofolder", {}) for ext in audiofolder.AudioFolder.EXTENSIONS})
+_MODULE_SUPPORTS_METADATA = {"imagefolder", "audiofolder"}
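
To make the registration logic above concrete, here is a minimal standalone sketch of what those dict comprehensions produce (trimmed, illustrative extension list — the real one comes from `AudioFolder.EXTENSIONS`):

```python
# Trimmed, illustrative extension list.
EXTENSIONS = [".wav", ".mp3", ".flac"]

_EXTENSION_TO_MODULE = {}
# Register each extension without its leading dot, in both lower and upper case,
# so files like "clip.wav" and "CLIP.WAV" both resolve to the audiofolder module.
_EXTENSION_TO_MODULE.update({ext[1:]: ("audiofolder", {}) for ext in EXTENSIONS})
_EXTENSION_TO_MODULE.update({ext[1:].upper(): ("audiofolder", {}) for ext in EXTENSIONS})

assert _EXTENSION_TO_MODULE["wav"] == ("audiofolder", {})
assert _EXTENSION_TO_MODULE["WAV"] == ("audiofolder", {})
```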
66 changes: 66 additions & 0 deletions src/datasets/packaged_modules/audiofolder/audiofolder.py
@@ -0,0 +1,66 @@
from typing import List

import datasets

from ..folder_based_builder import folder_based_builder


logger = datasets.utils.logging.get_logger(__name__)


class AudioFolderConfig(folder_based_builder.FolderBasedBuilderConfig):
"""Builder Config for AudioFolder."""

drop_labels: bool = None
drop_metadata: bool = None


class AudioFolder(folder_based_builder.FolderBasedBuilder):
BASE_FEATURE = datasets.Audio()
BASE_COLUMN_NAME = "audio"
BUILDER_CONFIG_CLASS = AudioFolderConfig
EXTENSIONS: List[str] # definition at the bottom of the script


# Obtained with:
# ```
# import soundfile as sf
#
# AUDIO_EXTENSIONS = [f".{format.lower()}" for format in sf.available_formats().keys()]
#
# # .mp3 is currently decoded via `torchaudio`, .opus decoding is supported if version of `libsndfile` >= 1.0.30:
# AUDIO_EXTENSIONS.extend([".mp3", ".opus"])
# ```
# We intentionally do not run this code on launch because:
# (1) Soundfile is an optional dependency, so importing it in global namespace is not allowed
# (2) To ensure the list of supported extensions is deterministic
AUDIO_EXTENSIONS = [
".aiff",
".au",
".avr",
".caf",
".flac",
".htk",
".svx",
".mat4",
".mat5",
".mpc2k",
".ogg",
".paf",
".pvf",
".raw",
".rf64",
".sd2",
".sds",
".ircam",
".voc",
".w64",
".wav",
".nist",
".wavex",
".wve",
".xi",
".mp3",
".opus",
]
AudioFolder.EXTENSIONS = AUDIO_EXTENSIONS
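
For reference, the `FolderBasedBuilder` pattern above generalizes to other modalities: a builder only declares a base feature, a base column name, a config class, and a list of extensions. A hypothetical sketch, meant to live alongside `audiofolder.py` inside `datasets/packaged_modules` (the class, feature choice, and extensions are illustrative, not part of this commit):

```python
from typing import List

import datasets

from ..folder_based_builder import folder_based_builder


class DocumentFolderConfig(folder_based_builder.FolderBasedBuilderConfig):
    """Hypothetical builder config for a text-document folder loader."""

    drop_labels: bool = None
    drop_metadata: bool = None


class DocumentFolder(folder_based_builder.FolderBasedBuilder):
    # With a plain string feature, the base column would hold the file path.
    BASE_FEATURE = datasets.Value("string")
    BASE_COLUMN_NAME = "document"
    BUILDER_CONFIG_CLASS = DocumentFolderConfig
    EXTENSIONS: List[str] = [".txt", ".md"]  # hypothetical extension list
```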

1 comment on commit 6ea46d8

@github-actions

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.008568 / 0.011353 (-0.002785) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004133 / 0.011008 (-0.006875) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.031041 / 0.038508 (-0.007467) |
| read_batch_unformated after write_array2d | 0.038965 / 0.023109 (0.015856) |
| read_batch_unformated after write_flattened_sequence | 0.298009 / 0.275898 (0.022111) |
| read_batch_unformated after write_nested_sequence | 0.373878 / 0.323480 (0.050398) |
| read_col_formatted_as_numpy after write_array2d | 0.006394 / 0.007986 (-0.001592) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005131 / 0.004328 (0.000803) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.007232 / 0.004250 (0.002982) |
| read_col_unformated after write_array2d | 0.060589 / 0.037052 (0.023536) |
| read_col_unformated after write_flattened_sequence | 0.320574 / 0.258489 (0.062085) |
| read_col_unformated after write_nested_sequence | 0.358694 / 0.293841 (0.064853) |
| read_formatted_as_numpy after write_array2d | 0.031759 / 0.128546 (-0.096787) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009777 / 0.075646 (-0.065870) |
| read_formatted_as_numpy after write_nested_sequence | 0.268304 / 0.419271 (-0.150968) |
| read_unformated after write_array2d | 0.055693 / 0.043533 (0.012161) |
| read_unformated after write_flattened_sequence | 0.299577 / 0.255139 (0.044438) |
| read_unformated after write_nested_sequence | 0.317759 / 0.283200 (0.034559) |
| write_array2d | 0.123204 / 0.141683 (-0.018479) |
| write_flattened_sequence | 1.538594 / 1.452155 (0.086440) |
| write_nested_sequence | 1.566491 / 1.492716 (0.073775) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.276269 / 0.018006 (0.258262) |
| get_batch_of_1024_rows | 0.513867 / 0.000490 (0.513377) |
| get_first_row | 0.004325 / 0.000200 (0.004125) |
| get_last_row | 0.000089 / 0.000054 (0.000034) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.025969 / 0.037411 (-0.011443) |
| shard | 0.108319 / 0.014526 (0.093793) |
| shuffle | 0.118701 / 0.176557 (-0.057856) |
| sort | 0.184598 / 0.737135 (-0.552538) |
| train_test_split | 0.127123 / 0.296338 (-0.169215) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.396280 / 0.215209 (0.181071) |
| read 50000 | 3.968533 / 2.077655 (1.890878) |
| read_batch 50000 10 | 1.768430 / 1.504120 (0.264310) |
| read_batch 50000 100 | 1.581403 / 1.541195 (0.040209) |
| read_batch 50000 1000 | 1.669893 / 1.468490 (0.201403) |
| read_formatted numpy 5000 | 0.426448 / 4.584777 (-4.158329) |
| read_formatted pandas 5000 | 3.779929 / 3.745712 (0.034217) |
| read_formatted tensorflow 5000 | 3.567399 / 5.269862 (-1.702462) |
| read_formatted torch 5000 | 1.757791 / 4.565676 (-2.807885) |
| read_formatted_batch numpy 5000 10 | 0.051756 / 0.424275 (-0.372519) |
| read_formatted_batch numpy 5000 1000 | 0.011488 / 0.007607 (0.003881) |
| shuffled read 5000 | 0.507868 / 0.226044 (0.281824) |
| shuffled read 50000 | 5.073895 / 2.268929 (2.804967) |
| shuffled read_batch 50000 10 | 2.240790 / 55.444624 (-53.203834) |
| shuffled read_batch 50000 100 | 1.897565 / 6.876477 (-4.978912) |
| shuffled read_batch 50000 1000 | 2.098987 / 2.142072 (-0.043085) |
| shuffled read_formatted numpy 5000 | 0.546676 / 4.805227 (-4.258552) |
| shuffled read_formatted_batch numpy 5000 10 | 0.121978 / 6.500664 (-6.378686) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.062944 / 0.075469 (-0.012525) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.437529 / 1.841788 (-0.404258) |
| map fast-tokenizer batched | 14.268878 / 8.074308 (6.194570) |
| map identity | 25.325708 / 10.191392 (15.134316) |
| map identity batched | 0.831804 / 0.680424 (0.151380) |
| map no-op batched | 0.530936 / 0.534201 (-0.003265) |
| map no-op batched numpy | 0.386568 / 0.579283 (-0.192715) |
| map no-op batched pandas | 0.432729 / 0.434364 (-0.001635) |
| map no-op batched pytorch | 0.275455 / 0.540337 (-0.264882) |
| map no-op batched tensorflow | 0.271256 / 1.386936 (-1.115680) |
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.006715 / 0.011353 (-0.004638) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004304 / 0.011008 (-0.006704) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.028648 / 0.038508 (-0.009860) |
| read_batch_unformated after write_array2d | 0.036007 / 0.023109 (0.012898) |
| read_batch_unformated after write_flattened_sequence | 0.349896 / 0.275898 (0.073998) |
| read_batch_unformated after write_nested_sequence | 0.433444 / 0.323480 (0.109964) |
| read_col_formatted_as_numpy after write_array2d | 0.004390 / 0.007986 (-0.003596) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003813 / 0.004328 (-0.000516) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.005052 / 0.004250 (0.000801) |
| read_col_unformated after write_array2d | 0.053632 / 0.037052 (0.016580) |
| read_col_unformated after write_flattened_sequence | 0.358833 / 0.258489 (0.100344) |
| read_col_unformated after write_nested_sequence | 0.396284 / 0.293841 (0.102443) |
| read_formatted_as_numpy after write_array2d | 0.030928 / 0.128546 (-0.097618) |
| read_formatted_as_numpy after write_flattened_sequence | 0.010090 / 0.075646 (-0.065556) |
| read_formatted_as_numpy after write_nested_sequence | 0.266533 / 0.419271 (-0.152738) |
| read_unformated after write_array2d | 0.059650 / 0.043533 (0.016117) |
| read_unformated after write_flattened_sequence | 0.341938 / 0.255139 (0.086799) |
| read_unformated after write_nested_sequence | 0.363815 / 0.283200 (0.080615) |
| write_array2d | 0.116797 / 0.141683 (-0.024886) |
| write_flattened_sequence | 1.513282 / 1.452155 (0.061127) |
| write_nested_sequence | 1.515856 / 1.492716 (0.023139) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.280721 / 0.018006 (0.262715) |
| get_batch_of_1024_rows | 0.519628 / 0.000490 (0.519138) |
| get_first_row | 0.001184 / 0.000200 (0.000984) |
| get_last_row | 0.000089 / 0.000054 (0.000034) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.025878 / 0.037411 (-0.011534) |
| shard | 0.107856 / 0.014526 (0.093331) |
| shuffle | 0.116456 / 0.176557 (-0.060101) |
| sort | 0.162067 / 0.737135 (-0.575068) |
| train_test_split | 0.122049 / 0.296338 (-0.174289) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.426752 / 0.215209 (0.211543) |
| read 50000 | 4.255081 / 2.077655 (2.177426) |
| read_batch 50000 10 | 2.040571 / 1.504120 (0.536451) |
| read_batch 50000 100 | 1.847850 / 1.541195 (0.306655) |
| read_batch 50000 1000 | 1.927181 / 1.468490 (0.458691) |
| read_formatted numpy 5000 | 0.431994 / 4.584777 (-4.152783) |
| read_formatted pandas 5000 | 3.782285 / 3.745712 (0.036572) |
| read_formatted tensorflow 5000 | 2.095792 / 5.269862 (-3.174069) |
| read_formatted torch 5000 | 1.275580 / 4.565676 (-3.290096) |
| read_formatted_batch numpy 5000 10 | 0.052149 / 0.424275 (-0.372126) |
| read_formatted_batch numpy 5000 1000 | 0.011051 / 0.007607 (0.003444) |
| shuffled read 5000 | 0.531085 / 0.226044 (0.305041) |
| shuffled read 50000 | 5.291369 / 2.268929 (3.022440) |
| shuffled read_batch 50000 10 | 2.516737 / 55.444624 (-52.927887) |
| shuffled read_batch 50000 100 | 2.175954 / 6.876477 (-4.700523) |
| shuffled read_batch 50000 1000 | 2.355976 / 2.142072 (0.213903) |
| shuffled read_formatted numpy 5000 | 0.552290 / 4.805227 (-4.252937) |
| shuffled read_formatted_batch numpy 5000 10 | 0.122648 / 6.500664 (-6.378016) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.062470 / 0.075469 (-0.012999) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.552523 / 1.841788 (-0.289265) |
| map fast-tokenizer batched | 14.641604 / 8.074308 (6.567296) |
| map identity | 25.302524 / 10.191392 (15.111132) |
| map identity batched | 0.959987 / 0.680424 (0.279563) |
| map no-op batched | 0.642189 / 0.534201 (0.107988) |
| map no-op batched numpy | 0.388598 / 0.579283 (-0.190685) |
| map no-op batched pandas | 0.427223 / 0.434364 (-0.007141) |
| map no-op batched pytorch | 0.265775 / 0.540337 (-0.274563) |
| map no-op batched tensorflow | 0.271770 / 1.386936 (-1.115166) |

