Add AudioFolder packaged loader #4530

Merged
merged 81 commits into from Aug 22, 2022
81 commits
cc87b0e
add audiofolder loader (almost identical to imagefolder except for in…
polinaeterna Jun 20, 2022
0adcd56
check codestyle
polinaeterna Jun 20, 2022
e46eecb
add tests
polinaeterna Jun 21, 2022
7cc4ab9
remove unused imports
polinaeterna Jun 21, 2022
a648530
add dummy data
polinaeterna Jun 21, 2022
5cbbad1
add instruction on how to obtain list of audio extensions
polinaeterna Jun 21, 2022
53e9ce3
fix comment
polinaeterna Jun 22, 2022
60760ea
add audiofolder dummy files in tests
polinaeterna Jun 22, 2022
e4bb688
Merge branch 'master' into add-audio-folder-new
polinaeterna Jun 22, 2022
d0b2592
check if two separate files fix test error (i guess not but just in c…
polinaeterna Jun 22, 2022
15ca3cf
remove unused imports
polinaeterna Jun 22, 2022
420dd2b
Revert "check if two separate files fix test error (i guess not but j…
polinaeterna Jun 22, 2022
68b7f5a
add uppercased formats, modify test for zip archive (check that array…
polinaeterna Jun 22, 2022
7c75e81
Merge branch 'huggingface:master' into add-audio-folder-new
polinaeterna Jun 22, 2022
eecc449
add contributors
polinaeterna Jun 23, 2022
a081364
Merge branch 'add-audio-folder-new' of github.com:polinaeterna/datase…
polinaeterna Jun 23, 2022
93c6afa
Merge branch 'huggingface:master' into add-audio-folder-new
polinaeterna Jun 23, 2022
f5d9841
Merge branch 'add-audio-folder-new' of github.com:polinaeterna/datase…
polinaeterna Jun 23, 2022
91afd92
Merge branch 'huggingface:master' into add-audio-folder-new
polinaeterna Jun 29, 2022
cce1ebf
Merge branch 'master' into add-audio-folder-new
polinaeterna Jul 4, 2022
3c6c56a
add a generic loader
polinaeterna Jul 5, 2022
6840eab
change name of get_patterns
polinaeterna Jul 5, 2022
ffa6d14
update autofolder, align imagefolder and audiofolder with it
polinaeterna Jul 5, 2022
aa2f246
align audiofolder
polinaeterna Jul 5, 2022
d27266c
move autofolder
polinaeterna Jul 6, 2022
24a65fd
fix bug with incorrect itaration over archives (incorrect copypaste -_-)
polinaeterna Jul 6, 2022
f9ee90d
get back comment
polinaeterna Jul 6, 2022
c905c1b
patch autofolder for streaming manually
polinaeterna Jul 6, 2022
f0ddbef
Merge branch 'huggingface:main' into add-audio-folder-new
polinaeterna Jul 13, 2022
6c7a1f9
check fro AutoFolder class specifically in patching, not its string n…
polinaeterna Jul 15, 2022
9f9551c
Merge branch 'huggingface:main' into add-audio-folder-new
polinaeterna Jul 15, 2022
96189c2
Merge branch 'add-audio-folder-new' of github.com:polinaeterna/datase…
polinaeterna Jul 15, 2022
66d3877
pass missing use_auth_token for AutoFolder patching
polinaeterna Jul 15, 2022
ba9d059
fix docstrings
polinaeterna Jul 15, 2022
1346488
Merge branch 'main' into add-audio-folder-new
polinaeterna Jul 15, 2022
a0b4093
align autofolder with the latest imagefolder implementation
polinaeterna Aug 2, 2022
86fbb99
Merge branch 'main' into add-audio-folder-new
polinaeterna Aug 2, 2022
d1e4a64
update tests
polinaeterna Aug 3, 2022
b9eace0
add test for duplicate label col
polinaeterna Aug 3, 2022
6a841df
copy test for dir names
polinaeterna Aug 3, 2022
eabece2
add tests for autofolder (+copied from imagefolder)
polinaeterna Aug 5, 2022
d250bfd
Merge branch 'add-audio-folder-new' of github.com:polinaeterna/datase…
polinaeterna Aug 5, 2022
edd4803
Merge branch 'huggingface:main' into add-audio-folder-new
polinaeterna Aug 5, 2022
aab4746
Merge branch 'add-audio-folder-new' of github.com:polinaeterna/datase…
polinaeterna Aug 5, 2022
997a01b
add missed audio_file fixture
polinaeterna Aug 5, 2022
56d35aa
add __name__ to audio/image features to avoid documentation building …
polinaeterna Aug 6, 2022
42627ba
fix CI on windows
polinaeterna Aug 8, 2022
76c319f
check for __name__ attr too when creating base_feature_name
polinaeterna Aug 8, 2022
dce047e
add documentation
polinaeterna Aug 8, 2022
74474fd
make base_feature a private attr to be excluded from docs
polinaeterna Aug 9, 2022
91c130b
fix docs
polinaeterna Aug 9, 2022
0c33f73
fix comment (rename base_feature)
polinaeterna Aug 9, 2022
7a8e384
Merge branch 'huggingface:main' into add-audio-folder-new
polinaeterna Aug 10, 2022
75ac1f4
fix typos (from code review)
polinaeterna Aug 11, 2022
3ab6136
fix typo (from code review)
polinaeterna Aug 11, 2022
0b60893
remove boilerplate, make base feature builder's class arg instead of …
polinaeterna Aug 13, 2022
bc1fb3d
patch relative imports from parent folder too
polinaeterna Aug 13, 2022
b4c8a2d
Merge remote-tracking branch 'upstream/main' into add-audio-folder-new
polinaeterna Aug 15, 2022
676e6f3
Merge remote-tracking branch 'upstream/main' into add-audio-folder-new
polinaeterna Aug 16, 2022
724782e
remove self.config.label_name, use hardcoded 'label'
polinaeterna Aug 16, 2022
bfecab4
patch parents that inherit from DatasetBuilder, revert get_imports
polinaeterna Aug 16, 2022
90dc043
rename autofolder -> folder_builder
polinaeterna Aug 16, 2022
292a8c5
remove autofolder dir
polinaeterna Aug 16, 2022
3e32181
remove axtending for streaming from tests, it should work without man…
polinaeterna Aug 16, 2022
fe80766
make base column name an abstract attr of FolderBuilder instead of co…
polinaeterna Aug 16, 2022
f74922c
Update src/datasets/streaming.py
polinaeterna Aug 17, 2022
227ce04
rename FolderBuilder -> FolderBasedBuilder
polinaeterna Aug 17, 2022
034b88c
set drop_labels to None by default for AudioFolder
polinaeterna Aug 17, 2022
54c6cf2
remove dataclass decorator from audio/image folder configs as they do…
polinaeterna Aug 17, 2022
7f6719b
remove ABC from FolderBasedBuilder as it does nothing
polinaeterna Aug 18, 2022
748576b
update documentation
polinaeterna Aug 18, 2022
615a839
fix docs
polinaeterna Aug 18, 2022
02f8f57
SORRY another small fix in docs
polinaeterna Aug 18, 2022
fc41118
get back abc and dataclasses just because of the magical thinking ¯\_…
polinaeterna Aug 19, 2022
9ee04ed
Revert "get back abc and dataclasses just because of the magical thin…
polinaeterna Aug 22, 2022
accb8cd
Merge remote-tracking branch 'upstream/main' into add-audio-folder-new
polinaeterna Aug 22, 2022
adccfd8
Merge remote-tracking branch 'upstream/main' into add-audio-folder-new
polinaeterna Aug 22, 2022
6a79a5f
check if builder extending for streaming is not in datasets.builder m…
polinaeterna Aug 22, 2022
189e98b
Merge branch 'add-audio-folder-new' of github.com:polinaeterna/datase…
polinaeterna Aug 22, 2022
89e298c
fix linters
polinaeterna Aug 22, 2022
fbef2b0
add comment to the patching thing
polinaeterna Aug 22, 2022
4 changes: 4 additions & 0 deletions datasets/audiofolder/README.md
@@ -0,0 +1,4 @@

### Contributions

Thanks to [@polinaeterna](https://github.com/polinaeterna), [@nateraw](https://github.com/nateraw), [@lhoestq](https://github.com/lhoestq) and [@mariosasko](https://github.com/mariosasko) for adding this dataset.
Binary file added datasets/audiofolder/dummy/0.0.0/dummy_data.zip
Binary file not shown.
96 changes: 96 additions & 0 deletions docs/source/audio_load.mdx
@@ -55,3 +55,99 @@ If you only want to load the underlying path to the audio dataset without decodi
'transcription': 'I would like to set up a joint account with my partner'}
```

## AudioFolder

You can also load a dataset with an `AudioFolder` dataset builder. It does not require writing a custom dataloader, making it useful for quickly loading audio data.

## AudioFolder with metadata
Review comment (Member): I think I would put this section first, since this is the main use case anyway.


To link your audio files with metadata information, make sure your dataset has a `metadata.jsonl` file. Your dataset structure might look like:

```
folder/train/metadata.jsonl
folder/train/first_audio_file.mp3
folder/train/second_audio_file.mp3
folder/train/third_audio_file.mp3
```

Your `metadata.jsonl` file must have a `file_name` column which links audio files with their metadata. An example `metadata.jsonl` file might look like:

```json
{"file_name": "first_audio_file.mp3", "transcription": "znowu się duch z ciałem zrośnie w młodocianej wstaniesz wiosnie i możesz skutkiem tych leków umierać wstawać wiek wieków dalej tam były przestrogi jak siekać głowę jak nogi"}
{"file_name": "second_audio_file.mp3", "transcription": "już u źwierzyńca podwojów król zasiada przy nim książęta i panowie rada a gdzie wzniosły krążył ganek rycerze obok kochanek król skinął palcem zaczęto igrzysko"}
{"file_name": "third_audio_file.mp3", "transcription": "pewnie kędyś w obłędzie ubite minęły szlaki zaczekajmy dzień jaki poślemy szukać wszędzie dziś jutro pewnie będzie posłali wszędzie sługi czekali dzień i drugi gdy nic nie doczekali z płaczem chcą jechać dali"}
```

Load your audio dataset by specifying `audiofolder` and the directory containing your data in `data_dir`:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
```

`AudioFolder` will load audio data and create a `transcription` column containing texts from `metadata.jsonl`:

```py
>>> dataset["train"][0]
{'audio':
    {'path': '/path/to/extracted/audio/first_audio_file.mp3',
     'array': array([ 0.00088501, 0.0012207 , 0.00131226, ..., -0.00045776, -0.00054932, -0.00054932], dtype=float32),
     'sampling_rate': 16000},
 'transcription': 'znowu się duch z ciałem zrośnie w młodocianej wstaniesz wiosnie i możesz skutkiem tych leków umierać wstawać wiek wieków dalej tam były przestrogi jak siekać głowę jak nogi'
}
```

You can load remote datasets from their URLs with the `data_files` parameter:

```py
>>> dataset = load_dataset("audiofolder", data_files="https://s3.amazonaws.com/datasets.huggingface.co/SpeechCommands/v0.01/v0.01_test.tar.gz")
```

## AudioFolder with labels

If your data directory doesn't contain any metadata files, `AudioFolder` by default adds a `label` column of [`~datasets.features.ClassLabel`] type, with labels inferred from the directory names.
This is useful for audio classification tasks.

### Language identification

Language identification datasets have audio recordings of speech in multiple languages:

```
folder/train/ar/0197_720_0207_190.mp3
folder/train/ar/0179_830_0185_540.mp3

folder/train/zh/0442_690_0454_380.mp3
```

As there are no metadata files, `AudioFolder` will create a `label` column with the language id based on the directory name:

```py
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder", drop_labels=False)
>>> dataset["train"][0]
{'audio':
    {'path': '/path/to/extracted/audio/0197_720_0207_190.mp3',
     'array': array([-3.6621094e-04, -6.1035156e-05, 6.1035156e-05, ..., -5.1879883e-04, -1.0070801e-03, -7.6293945e-04]),
     'sampling_rate': 16000},
 'label': 0 # "ar"
}

>>> dataset["train"][-1]
{'audio':
    {'path': '/path/to/extracted/audio/0442_690_0454_380.mp3',
     'array': array([1.8920898e-03, 9.4604492e-04, 1.9226074e-03, ..., 9.1552734e-05, 1.8310547e-04, 6.1035156e-05]),
     'sampling_rate': 16000},
 'label': 99 # "zh"
}
```
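
The idea of deriving labels from directory names can be sketched in plain Python. This is an illustration only, not the `AudioFolder` implementation: the helper `infer_labels` is hypothetical, and it assumes integer ids follow the sorted directory names.

```python
from pathlib import PurePosixPath


def infer_labels(file_paths):
    """Map each file to an integer label derived from its parent directory name."""
    # Sorted unique directory names give a deterministic name -> id mapping
    label_names = sorted({PurePosixPath(p).parent.name for p in file_paths})
    label_ids = {name: i for i, name in enumerate(label_names)}
    return label_names, [label_ids[PurePosixPath(p).parent.name] for p in file_paths]


paths = [
    "folder/train/ar/0197_720_0207_190.mp3",
    "folder/train/ar/0179_830_0185_540.mp3",
    "folder/train/zh/0442_690_0454_380.mp3",
]
names, labels = infer_labels(paths)
print(names)   # ['ar', 'zh']
print(labels)  # [0, 0, 1]
```

With only two directories in this toy listing, `zh` maps to `1`; in a dataset with many language directories the id would be correspondingly larger.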

If you have metadata files inside your data directory, but you still want to infer labels from directory names, set `drop_labels=False` as defined in [`~datasets.packaged_modules.audiofolder.AudioFolderConfig`].

<Tip>

Alternatively, you can add a `label` column to your `metadata.jsonl` file.

</Tip>

If you have no metadata files and want to drop automatically created labels, set `drop_labels=True`. In this case your dataset would contain only an `audio` column.
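
The interaction between `drop_labels` and the presence of metadata files, as described in this section, can be summarized as a small decision function. This is a sketch of the documented behavior under the assumption that `drop_labels=None` means "decide from the presence of metadata"; the function `should_add_labels` is hypothetical, and the real builder may apply further checks.

```python
from typing import Optional


def should_add_labels(drop_labels: Optional[bool], has_metadata: bool) -> bool:
    """Return True if a `label` column would be added, per the rules described above."""
    if drop_labels is None:
        # Default: infer labels only when no metadata files are present
        return not has_metadata
    # An explicit setting always wins over the default
    return not drop_labels
```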
2 changes: 1 addition & 1 deletion docs/source/image_load.mdx
@@ -47,7 +47,7 @@ If you only want to load the underlying path to the image dataset without decodi

## ImageFolder

- You can also load a dataset with a `ImageFolder` dataset builder. It does not require writing a custom dataloader, making it useful for quickly loading a dataset for certain vision tasks. Your image dataset structure should look like this:
+ You can also load a dataset with an `ImageFolder` dataset builder. It does not require writing a custom dataloader, making it useful for quickly loading a dataset for certain vision tasks. Your image dataset structure should look like this:

```
folder/train/dog/golden_retriever.png
4 changes: 4 additions & 0 deletions docs/source/package_reference/loading_methods.mdx
@@ -68,3 +68,7 @@ load_dataset("csv", data_dir="path/to/data/dir", sep="\t")
### Images

[[autodoc]] datasets.packaged_modules.imagefolder.ImageFolderConfig

### Audio

[[autodoc]] datasets.packaged_modules.audiofolder.AudioFolderConfig
11 changes: 7 additions & 4 deletions src/datasets/packaged_modules/__init__.py
@@ -3,6 +3,7 @@
from hashlib import sha256
from typing import List

from .audiofolder import audiofolder
from .csv import csv
from .imagefolder import imagefolder
from .json import json
@@ -32,6 +33,7 @@ def _hash_python_lines(lines: List[str]) -> str:
"parquet": (parquet.__name__, _hash_python_lines(inspect.getsource(parquet).splitlines())),
"text": (text.__name__, _hash_python_lines(inspect.getsource(text).splitlines())),
"imagefolder": (imagefolder.__name__, _hash_python_lines(inspect.getsource(imagefolder).splitlines())),
"audiofolder": (audiofolder.__name__, _hash_python_lines(inspect.getsource(audiofolder).splitlines())),
}

_EXTENSION_TO_MODULE = {
@@ -42,7 +44,8 @@ def _hash_python_lines(lines: List[str]) -> str:
"parquet": ("parquet", {}),
"txt": ("text", {}),
}
- _EXTENSION_TO_MODULE.update({ext[1:]: ("imagefolder", {}) for ext in imagefolder.ImageFolder.IMAGE_EXTENSIONS})
- _EXTENSION_TO_MODULE.update({ext[1:].upper(): ("imagefolder", {}) for ext in imagefolder.ImageFolder.IMAGE_EXTENSIONS})
-
- _MODULE_SUPPORTS_METADATA = {"imagefolder"}
+ _EXTENSION_TO_MODULE.update({ext[1:]: ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})
+ _EXTENSION_TO_MODULE.update({ext[1:].upper(): ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})
+ _EXTENSION_TO_MODULE.update({ext[1:]: ("audiofolder", {}) for ext in audiofolder.AudioFolder.EXTENSIONS})
+ _EXTENSION_TO_MODULE.update({ext[1:].upper(): ("audiofolder", {}) for ext in audiofolder.AudioFolder.EXTENSIONS})
+ _MODULE_SUPPORTS_METADATA = {"imagefolder", "audiofolder"}
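
The effect of registering each extension in both lower and upper case can be seen in a standalone sketch. The shortened `EXTENSIONS` list and the lookup helper `module_for` are illustrative assumptions; only the two `update` calls mirror the pattern in the diff.

```python
import os

EXTENSIONS = [".wav", ".mp3", ".flac"]  # shortened, illustrative list

extension_to_module = {}
# Register the extension without its leading dot, in lower and upper case
extension_to_module.update({ext[1:]: ("audiofolder", {}) for ext in EXTENSIONS})
extension_to_module.update({ext[1:].upper(): ("audiofolder", {}) for ext in EXTENSIONS})


def module_for(file_name):
    """Look up the module registered for a file's extension (hypothetical helper)."""
    ext = os.path.splitext(file_name)[1][1:]  # "clip.WAV" -> "WAV"
    return extension_to_module.get(ext)


print(module_for("clip.wav"))  # ('audiofolder', {})
print(module_for("clip.WAV"))  # ('audiofolder', {})
print(module_for("clip.Wav"))  # None
```

Because only the all-lowercase and all-uppercase spellings are registered, a mixed-case extension such as `.Wav` is not matched by an exact dictionary lookup like this one.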
Empty file.
66 changes: 66 additions & 0 deletions src/datasets/packaged_modules/audiofolder/audiofolder.py
@@ -0,0 +1,66 @@
from typing import List

import datasets

from ..folder_based_builder import folder_based_builder


logger = datasets.utils.logging.get_logger(__name__)


class AudioFolderConfig(folder_based_builder.FolderBasedBuilderConfig):
    """Builder Config for AudioFolder."""

    drop_labels: bool = None
    drop_metadata: bool = None


class AudioFolder(folder_based_builder.FolderBasedBuilder):
    BASE_FEATURE = datasets.Audio()
    BASE_COLUMN_NAME = "audio"
    BUILDER_CONFIG_CLASS = AudioFolderConfig
    EXTENSIONS: List[str]  # definition at the bottom of the script


# Obtained with:
# ```
# import soundfile as sf
#
# AUDIO_EXTENSIONS = [f".{format.lower()}" for format in sf.available_formats().keys()]
#
# # .mp3 is currently decoded via `torchaudio`, .opus decoding is supported if version of `libsndfile` >= 1.0.30:
# AUDIO_EXTENSIONS.extend([".mp3", ".opus"])
# ```
# We intentionally do not run this code on launch because:
# (1) Soundfile is an optional dependency, so importing it in global namespace is not allowed
# (2) To ensure the list of supported extensions is deterministic
AUDIO_EXTENSIONS = [
".aiff",
".au",
".avr",
".caf",
".flac",
".htk",
".svx",
".mat4",
".mat5",
".mpc2k",
".ogg",
".paf",
".pvf",
".raw",
".rf64",
".sd2",
".sds",
".ircam",
".voc",
".w64",
".wav",
".nist",
".wavex",
".wve",
".xi",
".mp3",
".opus",
]
AudioFolder.EXTENSIONS = AUDIO_EXTENSIONS
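
The layout above, where a subclass only supplies class attributes and the extension list is assigned after the class body, can be illustrated with stand-in classes. This is a sketch under stated assumptions: `FolderBuilderSketch`, `AudioFolderSketch`, and the `supports` method are hypothetical and do not reproduce the real `FolderBasedBuilder`.

```python
import os
from typing import List


class FolderBuilderSketch:
    """Stand-in for a folder-based base class; not the real FolderBasedBuilder."""

    BASE_COLUMN_NAME: str
    EXTENSIONS: List[str]

    @classmethod
    def supports(cls, file_name: str) -> bool:
        # Compare the suffix case-insensitively against the subclass's list
        return os.path.splitext(file_name)[1].lower() in cls.EXTENSIONS


class AudioFolderSketch(FolderBuilderSketch):
    BASE_COLUMN_NAME = "audio"
    EXTENSIONS: List[str]  # assigned below, mirroring the layout in the diff


# Keeping the (long) list at the bottom leaves the class definition compact
AudioFolderSketch.EXTENSIONS = [".wav", ".mp3", ".opus"]  # shortened list
```

The shared logic lives once in the base class, so a new modality only needs its own column name and extension list.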
Empty file.