Add AudioFolder packaged loader #4530
Changes from 65 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
|
||
### Contributions | ||
|
||
Thanks to [@polinaeterna](https://github.com/polinaeterna), [@nateraw](https://github.com/nateraw), [@lhoestq](https://github.com/lhoestq) and [@mariosasko](https://github.com/mariosasko) for adding this dataset. |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -55,3 +55,126 @@ If you only want to load the underlying path to the audio dataset without decodi | |
'transcription': 'I would like to set up a joint account with my partner'} | ||
``` | ||
|
||
## AudioFolder | ||
|
||
You can also load a dataset with an `AudioFolder` dataset builder. It does not require writing a custom dataloader, making it useful for quickly loading audio data. Your dataset structure might look like: | ||
|
||
``` | ||
folder/train/first_audio_file.wav | ||
folder/train/second_audio_file.wav | ||
folder/train/third_audio_file.wav | ||
|
||
folder/train/last_audio_file.wav | ||
``` | ||
|
||
Load your audio dataset by specifying `audiofolder` and the directory containing your data in `data_dir`: | ||
|
||
```py | ||
>>> from datasets import load_dataset | ||
|
||
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder") | ||
>>> dataset["train"][0] | ||
{'audio': | ||
{'path': '/path/to/extracted/audio/first_audio_file.wav', | ||
'array': array([ 0.00088501, 0.0012207 , 0.00131226, ..., -0.00045776, -0.00054932, -0.00054932], dtype=float32), | ||
'sampling_rate': 16000} | ||
} | ||
``` | ||
|
||
Load remote datasets from their URLs with the `data_files` parameter: | ||
|
||
```py | ||
>>> dataset = load_dataset("audiofolder", data_files="https://s3.amazonaws.com/datasets.huggingface.co/SpeechCommands/v0.01/v0.01_test.tar.gz") | ||
``` | ||
|
||
## AudioFolder with metadata | ||
Review comment: I think I would put this section first, since this is the main use case anyway. |
||
|
||
Metadata associated with your dataset can also be loaded. Make sure your dataset has a `metadata.jsonl` file: | ||
|
||
``` | ||
folder/train/metadata.jsonl | ||
folder/train/first_audio_file.mp3 | ||
folder/train/second_audio_file.mp3 | ||
folder/train/third_audio_file.mp3 | ||
``` | ||
|
||
Your `metadata.jsonl` file must have a `file_name` column which links audio files with their metadata: | ||
|
||
```jsonl | ||
{"file_name": "first_audio_file.mp3", "additional_feature": "This is a first value of a text feature you linked with your audio files"} | ||
{"file_name": "second_audio_file.mp3", "additional_feature": "This is a second value of a text feature you linked with your audio files"} | ||
{"file_name": "third_audio_file.mp3", "additional_feature": "This is a third value of a text feature you linked with your audio files"} | ||
``` | ||
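As a rough sketch of how such a file could be produced, the snippet below writes one JSON object per line with the standard `json` module. The helper name `write_metadata` and the example rows are hypothetical and not part of this PR; the only thing taken from the text above is that each row needs a `file_name` key.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(folder, rows):
    """Write `rows` (a list of dicts) to `<folder>/metadata.jsonl`, one JSON object per line.

    Hypothetical helper for illustration only; it is not part of `datasets`.
    """
    for row in rows:
        if "file_name" not in row:
            raise ValueError(f"every row needs a 'file_name' key, got: {row}")
    path = Path(folder) / "metadata.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            # ensure_ascii=False keeps non-ASCII text (e.g. transcriptions) readable in the file
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path


# Example rows (made up for this sketch)
rows = [
    {"file_name": "first_audio_file.mp3", "additional_feature": "first value"},
    {"file_name": "second_audio_file.mp3", "additional_feature": "second value"},
]

with tempfile.TemporaryDirectory() as tmp:
    metadata_path = write_metadata(tmp, rows)
    lines = metadata_path.read_text(encoding="utf-8").splitlines()
    loaded = [json.loads(line) for line in lines]
```

In practice the file would be written into the same directory as the audio files (for example `folder/train/`), rather than into a temporary directory.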
|
||
### Automatic Speech Recognition | ||
|
||
ASR datasets contain text transcriptions of recorded audio files. An example `metadata.jsonl` might look like: | ||
|
||
```jsonl | ||
{"file_name": "11295_11059_000000.flac", "transcription": "znowu się duch z ciałem zrośnie w młodocianej wstaniesz wiosnie i możesz skutkiem tych leków umierać wstawać wiek wieków dalej tam były przestrogi jak siekać głowę jak nogi"} | ||
{"file_name": "11295_11059_000001.flac", "transcription": "już u źwierzyńca podwojów król zasiada przy nim książęta i panowie rada a gdzie wzniosły krążył ganek rycerze obok kochanek król skinął palcem zaczęto igrzysko"} | ||
{"file_name": "11295_11059_000002.flac", "transcription": "pewnie kędyś w obłędzie ubite minęły szlaki zaczekajmy dzień jaki poślemy szukać wszędzie dziś jutro pewnie będzie posłali wszędzie sługi czekali dzień i drugi gdy nic nie doczekali z płaczem chcą jechać dali"} | ||
{"file_name": "11295_11059_000003.flac", "transcription": "na miejscach które dziś piaskiem zaniosło gdzie car i trzcina zarasta po których teraz wasze biega wiosło stał okrąg pięknego miasta"} | ||
``` | ||
|
||
Load the dataset with `AudioFolder`, and it will create a `transcription` column with text transcriptions: | ||
|
||
```py | ||
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder", split="train") | ||
>>> dataset[0]["transcription"] | ||
"znowu się duch z ciałem zrośnie w młodocianej wstaniesz wiosnie i możesz skutkiem tych leków umierać wstawać wiek wieków dalej tam były przestrogi jak siekać głowę jak nogi" | ||
``` | ||
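Conceptually, each row of `metadata.jsonl` is matched to an audio file through its `file_name` value. The standalone sketch below shows that lookup using only the standard library. It illustrates the linking idea and is not how `AudioFolder` is implemented; `link_metadata` and the file names are hypothetical.

```python
import json


def link_metadata(audio_files, metadata_lines):
    """Attach metadata rows to audio file names via the `file_name` key.

    Hypothetical helper: files without a metadata row are skipped in this sketch.
    """
    metadata = {}
    for line in metadata_lines:
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        # Remove `file_name` from the row and use it as the lookup key
        metadata[row.pop("file_name")] = row
    return [{"audio": name, **metadata[name]} for name in audio_files if name in metadata]


metadata_lines = [
    '{"file_name": "a.flac", "transcription": "hello world"}',
    '{"file_name": "b.flac", "transcription": "good morning"}',
]
examples = link_metadata(["a.flac", "b.flac", "c.flac"], metadata_lines)
```

Here `c.flac` has no metadata row, so it does not appear in `examples`; how the real loader treats such files is not shown in this diff.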
|
||
## AudioFolder with labels | ||
|
||
If you have an audio classification task, set `drop_labels=False` to infer labels from the directory names, as defined in [`~datasets.packaged_modules.audiofolder.AudioFolderConfig`]. | ||
Your dataset structure should look like this: | ||
|
||
``` | ||
folder/train/label_0/first_audio_label_0.mp3 | ||
folder/train/label_0/second_audio_label_0.mp3 | ||
folder/train/label_1/first_audio_label_1.mp3 | ||
|
||
folder/train/label_9/last_audio_label_99.mp3 | ||
``` | ||
|
||
`AudioFolder` will create a `label` column of a [`~datasets.features.ClassLabel`] type based on the directory name. | ||
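
To make the idea of directory-based labels concrete, here is a minimal standard-library sketch that derives an integer id for each file from its parent directory name. It is an illustration under stated assumptions, not the `AudioFolder` implementation: the function `infer_labels` is hypothetical, and assigning ids by alphabetically sorted directory names is an assumption of this sketch.

```python
import tempfile
from pathlib import Path


def infer_labels(split_dir, extensions=(".mp3", ".wav")):
    """Return (class_names, [(file_name, label_id), ...]) for files under `split_dir`.

    Hypothetical helper: the label is the name of the directory that contains the file.
    """
    files = sorted(
        p for p in Path(split_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )
    # Assumption: ids follow the alphabetical order of the directory names
    names = sorted({p.parent.name for p in files})
    name_to_id = {name: i for i, name in enumerate(names)}
    return names, [(p.name, name_to_id[p.parent.name]) for p in files]


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny fake dataset: two label directories with empty placeholder files
    for rel in ["label_0/a.mp3", "label_0/c.mp3", "label_1/b.mp3"]:
        path = Path(tmp) / "train" / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    class_names, labeled_files = infer_labels(Path(tmp) / "train")
```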
|
||
### Language identification | ||
|
||
Language identification datasets have audio recordings of speech in multiple languages: | ||
|
||
``` | ||
folder/train/ar/0197_720_0207_190.mp3 | ||
folder/train/ar/0179_830_0185_540.mp3 | ||
|
||
folder/train/zh/0442_690_0454_380.mp3 | ||
``` | ||
|
||
Load the dataset with `drop_labels=False`, and `AudioFolder` will create a `label` column with the language id based on the directory name: | ||
|
||
```py | ||
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder", drop_labels=False) | ||
>>> dataset["train"][0] | ||
{'audio': | ||
{'path': '/path/to/extracted/audio/0197_720_0207_190.mp3', | ||
'array': array([-3.6621094e-04, -6.1035156e-05, 6.1035156e-05, ..., -5.1879883e-04, -1.0070801e-03, -7.6293945e-04], dtype=float32), | ||
'sampling_rate': 16000}, | ||
'label': 0 # "ar" | ||
} | ||
|
||
>>> dataset["train"][-1] | ||
{'audio': | ||
{'path': '/path/to/extracted/audio/0442_690_0454_380.mp3', | ||
'array': array([1.8920898e-03, 9.4604492e-04, 1.9226074e-03, ..., 9.1552734e-05, 1.8310547e-04, 6.1035156e-05], dtype=float32), | ||
'sampling_rate': 16000}, | ||
'label': 99 # "zh" | ||
} | ||
``` | ||
|
||
<Tip> | ||
|
||
Alternatively, you can add a `label` column to your `metadata.jsonl` file. | ||
|
||
</Tip> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,68 @@ | ||
from dataclasses import dataclass | ||
from typing import List | ||
|
||
import datasets | ||
|
||
from ..folder_builder import folder_builder | ||
|
||
|
||
logger = datasets.utils.logging.get_logger(__name__) | ||
|
||
|
||
@dataclass | ||
class AudioFolderConfig(folder_builder.FolderBuilderConfig): | ||
"""Builder Config for AudioFolder.""" | ||
|
||
drop_labels: bool = True # usually we don't need labels as classification is not the main audio task | ||
Review comment: Just to say that I'm still in favor of setting it to None by default for consistency with imagefolder ^^'
Review comment: well I don't have a strong opinion here anymore :D
Review comment: @mariosasko what do you think about that?
Review comment: It's good to be consistent, so I agree with @lhoestq.
Review comment: I'm OK with that, but it makes explaining things in the documentation a bit more complicated...
Review comment: I've updated the docs and tried to make it simpler. I'm still not sure that this logic with the default None value of "drop_labels" is clear, but I guess we'd better see what users say.
Review comment: @lhoestq @mariosasko what do you think about it now? 🤗
Review comment: Looks all good to me! :D Not sure what's happening with the CI though. I just re-launched one job to see if it was caused by a bug in GitHub Actions or the Windows runners. |
||
drop_metadata: bool = None | ||
|
||
|
||
class AudioFolder(folder_builder.FolderBuilder): | ||
BASE_FEATURE = datasets.Audio() | ||
BASE_COLUMN_NAME = "audio" | ||
BUILDER_CONFIG_CLASS = AudioFolderConfig | ||
EXTENSIONS: List[str] # definition at the bottom of the script | ||
|
||
|
||
# Obtained with: | ||
# ``` | ||
# import soundfile as sf | ||
# | ||
# AUDIO_EXTENSIONS = [f".{format.lower()}" for format in sf.available_formats().keys()] | ||
# | ||
# # .mp3 is currently decoded via `torchaudio`, .opus decoding is supported if version of `libsndfile` >= 1.0.30: | ||
# AUDIO_EXTENSIONS.extend([".mp3", ".opus"]) | ||
# ``` | ||
# We intentionally do not run this code on launch because: | ||
# (1) Soundfile is an optional dependency, so importing it in global namespace is not allowed | ||
# (2) To ensure the list of supported extensions is deterministic | ||
AUDIO_EXTENSIONS = [ | ||
".aiff", | ||
".au", | ||
".avr", | ||
".caf", | ||
".flac", | ||
".htk", | ||
".svx", | ||
".mat4", | ||
".mat5", | ||
".mpc2k", | ||
".ogg", | ||
".paf", | ||
".pvf", | ||
".raw", | ||
".rf64", | ||
".sd2", | ||
".sds", | ||
".ircam", | ||
".voc", | ||
".w64", | ||
".wav", | ||
".nist", | ||
".wavex", | ||
".wve", | ||
".xi", | ||
".mp3", | ||
".opus", | ||
] | ||
AudioFolder.EXTENSIONS = AUDIO_EXTENSIONS |
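
As a hedged illustration of what a fixed extension list like this enables, the sketch below filters candidate paths by suffix. It is not code from this PR: `is_audio_file` is a hypothetical helper, only a subset of the list above is repeated, and matching case-insensitively is an assumption of this sketch rather than confirmed behavior of the builder.

```python
from pathlib import PurePosixPath

# Subset of the AUDIO_EXTENSIONS list above, repeated so the sketch is self-contained
AUDIO_EXTENSIONS = {".flac", ".mp3", ".opus", ".wav"}


def is_audio_file(path):
    """Return True if the path's final suffix is a known audio extension.

    Hypothetical helper; lowercasing the suffix is an assumption of this sketch.
    """
    return PurePosixPath(path).suffix.lower() in AUDIO_EXTENSIONS


candidates = [
    "folder/train/first_audio_file.wav",
    "folder/train/metadata.jsonl",
    "folder/train/clip.FLAC",
    "folder/train/archive.tar.gz",
]
audio_files = [p for p in candidates if is_audio_file(p)]
```

Keeping the list hard-coded, as the comment in the diff explains, means a check like this gives the same answer whether or not `soundfile` is installed.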
Review comment: I think you can remove this part, since it's already presented later in "AudioFolder with labels"