Add AudioFolder packaged loader #4530

Merged
merged 81 commits into from Aug 22, 2022
Changes from 53 commits
cc87b0e
add audiofolder loader (almost identical to imagefolder except for in…
polinaeterna Jun 20, 2022
0adcd56
check codestyle
polinaeterna Jun 20, 2022
e46eecb
add tests
polinaeterna Jun 21, 2022
7cc4ab9
remove unused imports
polinaeterna Jun 21, 2022
a648530
add dummy data
polinaeterna Jun 21, 2022
5cbbad1
add instruction on how to obtain list of audio extensions
polinaeterna Jun 21, 2022
53e9ce3
fix comment
polinaeterna Jun 22, 2022
60760ea
add audiofolder dummy files in tests
polinaeterna Jun 22, 2022
e4bb688
Merge branch 'master' into add-audio-folder-new
polinaeterna Jun 22, 2022
d0b2592
check if two separate files fix test error (i guess not but just in c…
polinaeterna Jun 22, 2022
15ca3cf
remove unused imports
polinaeterna Jun 22, 2022
420dd2b
Revert "check if two separate files fix test error (i guess not but j…
polinaeterna Jun 22, 2022
68b7f5a
add uppercased formats, modify test for zip archive (check that array…
polinaeterna Jun 22, 2022
7c75e81
Merge branch 'huggingface:master' into add-audio-folder-new
polinaeterna Jun 22, 2022
eecc449
add contributors
polinaeterna Jun 23, 2022
a081364
Merge branch 'add-audio-folder-new' of github.com:polinaeterna/datase…
polinaeterna Jun 23, 2022
93c6afa
Merge branch 'huggingface:master' into add-audio-folder-new
polinaeterna Jun 23, 2022
f5d9841
Merge branch 'add-audio-folder-new' of github.com:polinaeterna/datase…
polinaeterna Jun 23, 2022
91afd92
Merge branch 'huggingface:master' into add-audio-folder-new
polinaeterna Jun 29, 2022
cce1ebf
Merge branch 'master' into add-audio-folder-new
polinaeterna Jul 4, 2022
3c6c56a
add a generic loader
polinaeterna Jul 5, 2022
6840eab
change name of get_patterns
polinaeterna Jul 5, 2022
ffa6d14
update autofolder, align imagefolder and audiofolder with it
polinaeterna Jul 5, 2022
aa2f246
align audiofolder
polinaeterna Jul 5, 2022
d27266c
move autofolder
polinaeterna Jul 6, 2022
24a65fd
fix bug with incorrect itaration over archives (incorrect copypaste -_-)
polinaeterna Jul 6, 2022
f9ee90d
get back comment
polinaeterna Jul 6, 2022
c905c1b
patch autofolder for streaming manually
polinaeterna Jul 6, 2022
f0ddbef
Merge branch 'huggingface:main' into add-audio-folder-new
polinaeterna Jul 13, 2022
6c7a1f9
check fro AutoFolder class specifically in patching, not its string n…
polinaeterna Jul 15, 2022
9f9551c
Merge branch 'huggingface:main' into add-audio-folder-new
polinaeterna Jul 15, 2022
96189c2
Merge branch 'add-audio-folder-new' of github.com:polinaeterna/datase…
polinaeterna Jul 15, 2022
66d3877
pass missing use_auth_token for AutoFolder patching
polinaeterna Jul 15, 2022
ba9d059
fix docstrings
polinaeterna Jul 15, 2022
1346488
Merge branch 'main' into add-audio-folder-new
polinaeterna Jul 15, 2022
a0b4093
align autofolder with the latest imagefolder implementation
polinaeterna Aug 2, 2022
86fbb99
Merge branch 'main' into add-audio-folder-new
polinaeterna Aug 2, 2022
d1e4a64
update tests
polinaeterna Aug 3, 2022
b9eace0
add test for duplicate label col
polinaeterna Aug 3, 2022
6a841df
copy test for dir names
polinaeterna Aug 3, 2022
eabece2
add tests for autofolder (+copied from imagefolder)
polinaeterna Aug 5, 2022
d250bfd
Merge branch 'add-audio-folder-new' of github.com:polinaeterna/datase…
polinaeterna Aug 5, 2022
edd4803
Merge branch 'huggingface:main' into add-audio-folder-new
polinaeterna Aug 5, 2022
aab4746
Merge branch 'add-audio-folder-new' of github.com:polinaeterna/datase…
polinaeterna Aug 5, 2022
997a01b
add missed audio_file fixture
polinaeterna Aug 5, 2022
56d35aa
add __name__ to audio/image features to avoid documentation building …
polinaeterna Aug 6, 2022
42627ba
fix CI on windows
polinaeterna Aug 8, 2022
76c319f
check for __name__ attr too when creating base_feature_name
polinaeterna Aug 8, 2022
dce047e
add documentation
polinaeterna Aug 8, 2022
74474fd
make base_feature a private attr to be excluded from docs
polinaeterna Aug 9, 2022
91c130b
fix docs
polinaeterna Aug 9, 2022
0c33f73
fix comment (rename base_feature)
polinaeterna Aug 9, 2022
7a8e384
Merge branch 'huggingface:main' into add-audio-folder-new
polinaeterna Aug 10, 2022
75ac1f4
fix typos (from code review)
polinaeterna Aug 11, 2022
3ab6136
fix typo (from code review)
polinaeterna Aug 11, 2022
0b60893
remove boilerplate, make base feature builder's class arg instead of …
polinaeterna Aug 13, 2022
bc1fb3d
patch relative imports from parent folder too
polinaeterna Aug 13, 2022
b4c8a2d
Merge remote-tracking branch 'upstream/main' into add-audio-folder-new
polinaeterna Aug 15, 2022
676e6f3
Merge remote-tracking branch 'upstream/main' into add-audio-folder-new
polinaeterna Aug 16, 2022
724782e
remove self.config.label_name, use hardcoded 'label'
polinaeterna Aug 16, 2022
bfecab4
patch parents that inherit from DatasetBuilder, revert get_imports
polinaeterna Aug 16, 2022
90dc043
rename autofolder -> folder_builder
polinaeterna Aug 16, 2022
292a8c5
remove autofolder dir
polinaeterna Aug 16, 2022
3e32181
remove axtending for streaming from tests, it should work without man…
polinaeterna Aug 16, 2022
fe80766
make base column name an abstract attr of FolderBuilder instead of co…
polinaeterna Aug 16, 2022
f74922c
Update src/datasets/streaming.py
polinaeterna Aug 17, 2022
227ce04
rename FolderBuilder -> FolderBasedBuilder
polinaeterna Aug 17, 2022
034b88c
set drop_labels to None by default for AudioFolder
polinaeterna Aug 17, 2022
54c6cf2
remove dataclass decorator from audio/image folder configs as they do…
polinaeterna Aug 17, 2022
7f6719b
remove ABC from FolderBasedBuilder as it does nothing
polinaeterna Aug 18, 2022
748576b
update documentation
polinaeterna Aug 18, 2022
615a839
fix docs
polinaeterna Aug 18, 2022
02f8f57
SORRY another small fix in docs
polinaeterna Aug 18, 2022
fc41118
get back abc and dataclasses just because of the magical thinking ¯\_…
polinaeterna Aug 19, 2022
9ee04ed
Revert "get back abc and dataclasses just because of the magical thin…
polinaeterna Aug 22, 2022
accb8cd
Merge remote-tracking branch 'upstream/main' into add-audio-folder-new
polinaeterna Aug 22, 2022
adccfd8
Merge remote-tracking branch 'upstream/main' into add-audio-folder-new
polinaeterna Aug 22, 2022
6a79a5f
check if builder extending for streaming is not in datasets.builder m…
polinaeterna Aug 22, 2022
189e98b
Merge branch 'add-audio-folder-new' of github.com:polinaeterna/datase…
polinaeterna Aug 22, 2022
89e298c
fix linters
polinaeterna Aug 22, 2022
fbef2b0
add comment to the patching thing
polinaeterna Aug 22, 2022
4 changes: 4 additions & 0 deletions datasets/audiofolder/README.md
@@ -0,0 +1,4 @@

### Contributions

Thanks to [@polinaeterna](https://github.com/polinaeterna), [@nateraw](https://github.com/nateraw), [@lhoestq](https://github.com/lhoestq) and [@mariosasko](https://github.com/mariosasko) for adding this dataset.
Binary file added datasets/audiofolder/dummy/0.0.0/dummy_data.zip
Binary file not shown.
123 changes: 123 additions & 0 deletions docs/source/audio_load.mdx
@@ -55,3 +55,126 @@ If you only want to load the underlying path to the audio dataset without decodi
'transcription': 'I would like to set up a joint account with my partner'}
```

## AudioFolder

You can also load a dataset with an `AudioFolder` dataset builder. It does not require writing a custom dataloader, which makes it useful for quickly loading audio data. Your dataset structure might look like:

```
folder/train/first_audio_file.wav
folder/train/second_audio_file.wav
folder/train/third_audio_file.wav

folder/train/last_audio_file.wav
```

Load your audio dataset by specifying `audiofolder` and the directory containing your data in `data_dir`:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
>>> dataset["train"][0]
{'audio':
    {'path': '/path/to/extracted/audio/first_audio_file.wav',
     'array': array([ 0.00088501,  0.0012207 ,  0.00131226, ..., -0.00045776, -0.00054932, -0.00054932], dtype=float32),
     'sampling_rate': 16000}
}
```
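
To try the loader without real recordings, a toy folder in the layout above can be generated with Python's standard `wave` module. This is only a sketch: the helper name `write_silent_wav`, the temporary directory, and the one-second silent clips are assumptions made for illustration and are not part of this PR.

```python
import tempfile
import wave
from pathlib import Path


def write_silent_wav(path: Path, sampling_rate: int = 16000, seconds: float = 1.0) -> None:
    """Write a mono 16-bit PCM WAV file filled with silence (hypothetical helper)."""
    n_frames = int(sampling_rate * seconds)
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)


# Build folder/train/*.wav inside a temporary directory.
root = Path(tempfile.mkdtemp()) / "folder"
train_dir = root / "train"
train_dir.mkdir(parents=True)
for name in ("first_audio_file.wav", "second_audio_file.wav", "third_audio_file.wav"):
    write_silent_wav(train_dir / name)

created = sorted(p.name for p in train_dir.glob("*.wav"))
print(created)

# With `datasets` and an audio decoding backend installed, the folder could then be loaded with:
#   from datasets import load_dataset
#   dataset = load_dataset("audiofolder", data_dir=str(root))
```

The `load_dataset` call is left as a comment so that the sketch runs with the standard library alone.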

Load remote datasets from their URLs with the `data_files` parameter:

```py
>>> dataset = load_dataset("audiofolder", data_files="https://s3.amazonaws.com/datasets.huggingface.co/SpeechCommands/v0.01/v0.01_test.tar.gz")
```

## AudioFolder with metadata
Member

I think I would put this section first, since this is the main use case anyway.


Metadata associated with your dataset can also be loaded. Make sure your dataset has a `metadata.jsonl` file:

```
folder/train/metadata.jsonl
folder/train/first_audio_file.mp3
folder/train/second_audio_file.mp3
folder/train/third_audio_file.mp3
```

Your `metadata.jsonl` file must have a `file_name` column which links audio files with their metadata:

```jsonl
{"file_name": "first_audio_file.mp3", "additional_feature": "This is a first value of a text feature you linked with your audio files"}
{"file_name": "second_audio_file.mp3", "additional_feature": "This is a second value of a text feature you linked with your audio files"}
{"file_name": "third_audio_file.mp3", "additional_feature": "This is a third value of a text feature you linked with your audio files"}
```
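
A `metadata.jsonl` file like the one above can be produced with the standard `json` module. The following is a minimal sketch, assuming the rows are already available as Python dictionaries; the shortened feature values and the temporary directory are placeholders rather than anything prescribed by `AudioFolder`.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rows: one dict per audio file, each with the required `file_name` column.
rows = [
    {"file_name": "first_audio_file.mp3", "additional_feature": "first value"},
    {"file_name": "second_audio_file.mp3", "additional_feature": "second value"},
    {"file_name": "third_audio_file.mp3", "additional_feature": "third value"},
]

train_dir = Path(tempfile.mkdtemp()) / "folder" / "train"
train_dir.mkdir(parents=True)
metadata_path = train_dir / "metadata.jsonl"

# JSON Lines: one JSON object per line, with no enclosing list.
with metadata_path.open("w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Read the file back and check that every row carries `file_name`.
with metadata_path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]
missing = [row for row in loaded if "file_name" not in row]
print(len(loaded), "rows,", len(missing), "without file_name")
```

`ensure_ascii=False` keeps non-ASCII text, such as the Polish transcriptions used later in this document, readable in the file.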

### Automatic Speech Recognition

ASR datasets contain text transcriptions of recorded audio files. An example `metadata.jsonl` might look like:

```jsonl
{"file_name": "11295_11059_000000.flac", "transcription": "znowu się duch z ciałem zrośnie w młodocianej wstaniesz wiosnie i możesz skutkiem tych leków umierać wstawać wiek wieków dalej tam były przestrogi jak siekać głowę jak nogi"}
{"file_name": "11295_11059_000001.flac", "transcription": "już u źwierzyńca podwojów król zasiada przy nim książęta i panowie rada a gdzie wzniosły krążył ganek rycerze obok kochanek król skinął palcem zaczęto igrzysko"}
{"file_name": "11295_11059_000002.flac", "transcription": "pewnie kędyś w obłędzie ubite minęły szlaki zaczekajmy dzień jaki poślemy szukać wszędzie dziś jutro pewnie będzie posłali wszędzie sługi czekali dzień i drugi gdy nic nie doczekali z płaczem chcą jechać dali"}
{"file_name": "11295_11059_000003.flac", "transcription": "na miejscach które dziś piaskiem zaniosło gdzie car i trzcina zarasta po których teraz wasze biega wiosło stał okrąg pięknego miasta"}
```

Load the dataset with `AudioFolder`, and it will create a `transcription` column with text transcriptions:

```py
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder", split="train")
>>> dataset[0]["transcription"]
"znowu się duch z ciałem zrośnie w młodocianej wstaniesz wiosnie i możesz skutkiem tych leków umierać wstawać wiek wieków dalej tam były przestrogi jak siekać głowę jak nogi"
```

## AudioFolder with labels

If you have an audio classification task, set `drop_labels=False` to infer labels from directory names, as defined in [`~datasets.packaged_modules.audiofolder.AudioFolderConfig`].
Your dataset structure should look like this:

```
folder/train/label_0/first_audio_label_0.mp3
folder/train/label_0/second_audio_label_0.mp3
folder/train/label_1/first_audio_label_1.mp3

folder/train/label_9/last_audio_label_99.mp3
```

`AudioFolder` creates a `label` column of type [`~datasets.features.ClassLabel`] based on the directory name.
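
The idea of taking labels from directory names can be illustrated with plain `pathlib`. This is a sketch of the concept, not the builder's actual code: the file listing is hypothetical, and assigning integer ids in sorted order of the directory names is an assumption.

```python
from pathlib import PurePosixPath

# Hypothetical file listing following the layout above.
files = [
    "folder/train/label_1/first_audio_label_1.mp3",
    "folder/train/label_0/first_audio_label_0.mp3",
    "folder/train/label_0/second_audio_label_0.mp3",
]

# The label of a file is the name of its parent directory.
label_names = sorted({PurePosixPath(f).parent.name for f in files})
# Assumption: integer ids follow the sorted order of the directory names.
name_to_id = {name: idx for idx, name in enumerate(label_names)}

examples = [{"file": f, "label": name_to_id[PurePosixPath(f).parent.name]} for f in files]
for example in examples:
    print(example)
```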

### Language identification

Language identification datasets have audio recordings of speech in multiple languages:

```
folder/train/ar/0197_720_0207_190.mp3
folder/train/ar/0179_830_0185_540.mp3

folder/train/zh/0442_690_0454_380.mp3
```

Load the dataset with `drop_labels=False`, and `AudioFolder` will create a `label` column with the language id based on the directory name:

```py
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder", drop_labels=False)
>>> dataset["train"][0]
{'audio':
    {'path': '/path/to/extracted/audio/0197_720_0207_190.mp3',
     'array': array([-3.6621094e-04, -6.1035156e-05,  6.1035156e-05, ..., -5.1879883e-04, -1.0070801e-03, -7.6293945e-04]),
     'sampling_rate': 16000},
 'label': 0  # "ar"
}

>>> dataset["train"][-1]
{'audio':
    {'path': '/path/to/extracted/audio/0442_690_0454_380.mp3',
     'array': array([1.8920898e-03, 9.4604492e-04, 1.9226074e-03, ..., 9.1552734e-05, 1.8310547e-04, 6.1035156e-05]),
     'sampling_rate': 16000},
 'label': 99  # "zh"
}
```

<Tip>

Alternatively, you can add a `label` column to your `metadata.jsonl` file.

</Tip>
4 changes: 4 additions & 0 deletions docs/source/package_reference/loading_methods.mdx
@@ -68,3 +68,7 @@ load_dataset("csv", data_dir="path/to/data/dir", sep="\t")
### Images

[[autodoc]] datasets.packaged_modules.imagefolder.ImageFolderConfig

### Audio

[[autodoc]] datasets.packaged_modules.audiofolder.AudioFolderConfig
11 changes: 7 additions & 4 deletions src/datasets/packaged_modules/__init__.py
@@ -3,6 +3,7 @@
from hashlib import sha256
from typing import List

from .audiofolder import audiofolder
from .csv import csv
from .imagefolder import imagefolder
from .json import json
@@ -32,6 +33,7 @@ def _hash_python_lines(lines: List[str]) -> str:
    "parquet": (parquet.__name__, _hash_python_lines(inspect.getsource(parquet).splitlines())),
    "text": (text.__name__, _hash_python_lines(inspect.getsource(text).splitlines())),
    "imagefolder": (imagefolder.__name__, _hash_python_lines(inspect.getsource(imagefolder).splitlines())),
    "audiofolder": (audiofolder.__name__, _hash_python_lines(inspect.getsource(audiofolder).splitlines())),
}

_EXTENSION_TO_MODULE = {
@@ -42,7 +44,8 @@ def _hash_python_lines(lines: List[str]) -> str:
    "parquet": ("parquet", {}),
    "txt": ("text", {}),
}
_EXTENSION_TO_MODULE.update({ext[1:]: ("imagefolder", {}) for ext in imagefolder.ImageFolder.IMAGE_EXTENSIONS})
_EXTENSION_TO_MODULE.update({ext[1:].upper(): ("imagefolder", {}) for ext in imagefolder.ImageFolder.IMAGE_EXTENSIONS})

_MODULE_SUPPORTS_METADATA = {"imagefolder"}
_EXTENSION_TO_MODULE.update({ext[1:]: ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})
_EXTENSION_TO_MODULE.update({ext[1:].upper(): ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})
_EXTENSION_TO_MODULE.update({ext[1:]: ("audiofolder", {}) for ext in audiofolder.AudioFolder.EXTENSIONS})
_EXTENSION_TO_MODULE.update({ext[1:].upper(): ("audiofolder", {}) for ext in audiofolder.AudioFolder.EXTENSIONS})
_MODULE_SUPPORTS_METADATA = {"imagefolder", "audiofolder"}
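
The effect of registering each extension in both lowercase and uppercase can be seen in a small standalone sketch. The extension lists below are illustrative subsets, and `module_for` is a hypothetical helper written for this example, not a function from `datasets`.

```python
# Illustrative subsets only; the real lists live on the builder classes.
IMAGE_EXTENSIONS = [".png", ".jpg"]
AUDIO_EXTENSIONS = [".wav", ".flac", ".mp3"]

extension_to_module = {"csv": ("csv", {}), "txt": ("text", {})}
for module_name, extensions in (("imagefolder", IMAGE_EXTENSIONS), ("audiofolder", AUDIO_EXTENSIONS)):
    # ext[1:] strips the leading dot; lowercase and uppercase spellings are both registered.
    extension_to_module.update({ext[1:]: (module_name, {}) for ext in extensions})
    extension_to_module.update({ext[1:].upper(): (module_name, {}) for ext in extensions})


def module_for(filename: str):
    """Return the (module, kwargs) registered for a file's extension, or None (hypothetical helper)."""
    suffix = filename.rsplit(".", 1)[-1] if "." in filename else ""
    return extension_to_module.get(suffix)


print(module_for("clip.WAV"))
print(module_for("photo.jpg"))
print(module_for("notes.Wav"))
```

In this sketch a mixed-case suffix such as `Wav` is not matched, because only the all-lowercase and all-uppercase spellings are added to the mapping.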
Empty file.
80 changes: 80 additions & 0 deletions src/datasets/packaged_modules/audiofolder/audiofolder.py
@@ -0,0 +1,80 @@
from dataclasses import dataclass
from typing import ClassVar, List

import datasets

from ..base import autofolder


logger = datasets.utils.logging.get_logger(__name__)


@dataclass
class AudioFolderConfig(autofolder.AutoFolderConfig):
    """Builder Config for AudioFolder."""

    _base_feature: ClassVar = datasets.Audio()
    drop_labels: bool = True  # usually we don't need labels as classification is not the main audio task
Member

Just to say that I'm still in favor of setting it to None by default for consistency with imagefolder ^^'

Contributor Author

well I don't have a strong opinion here anymore :D
if we set drop_labels=None by default as you suggested, it might be confusing in cases when users provide only audio files, without metadata (or with broken metadata?). this is probably quite unlikely, so I'm ok with your suggestion.

Contributor Author

@mariosasko what do you think about that?

Collaborator

It's good to be consistent, so I agree with @lhoestq.

Contributor Author

i'm ok with that but it makes explaining things in documentation a bit more complicated...

Contributor Author

i've updated the docs, tried to make it simpler. still not sure that this logic with default None value of "drop_labels" is clear but I guess we'd better see what users say.

Contributor Author

@lhoestq @mariosasko what do you think about it now? 🤗
also, don't you know what's happening with the CI? why it takes forever and finally some jobs are cancelled?

Member

Looks all good to me ! :D

Not sure what's happening with the CI though. I just re-launched one job to see if it was caused by a bug in github actions or the windows runners

    drop_metadata: bool = None


class AudioFolder(autofolder.AutoFolder):
    BUILDER_CONFIG_CLASS = AudioFolderConfig
    EXTENSIONS: List[str] = []  # definition at the bottom of the script

    def _info(self):
        return datasets.DatasetInfo(features=self.config.features)

    def _split_generators(self, dl_manager):
        # _prepare_split_generators() sets self.info.features,
        # infers labels, finds metadata files if needed and returns splits
        return self._prepare_split_generators(dl_manager)

    def _generate_examples(self, files, metadata_files, split_name, add_metadata, add_labels):
        generator = self._prepare_generate_examples(files, metadata_files, split_name, add_metadata, add_labels)
        for _, example in generator:
            yield _, example


# Obtained with:
# ```
# import soundfile as sf
#
# AUDIO_EXTENSIONS = [f".{format.lower()}" for format in sf.available_formats().keys()]
#
# # .mp3 is currently decoded via `torchaudio`, .opus decoding is supported if version of `libsndfile` >= 1.0.30:
# AUDIO_EXTENSIONS.extend([".mp3", ".opus"])
# ```
# We intentionally do not run this code on launch because:
# (1) Soundfile is an optional dependency, so importing it in global namespace is not allowed
# (2) To ensure the list of supported extensions is deterministic
AUDIO_EXTENSIONS = [
    ".aiff",
    ".au",
    ".avr",
    ".caf",
    ".flac",
    ".htk",
    ".svx",
    ".mat4",
    ".mat5",
    ".mpc2k",
    ".ogg",
    ".paf",
    ".pvf",
    ".raw",
    ".rf64",
    ".sd2",
    ".sds",
    ".ircam",
    ".voc",
    ".w64",
    ".wav",
    ".nist",
    ".wavex",
    ".wve",
    ".xi",
    ".mp3",
    ".opus",
]
AudioFolder.EXTENSIONS = AUDIO_EXTENSIONS
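
The comment above explains why the list is hardcoded rather than computed at import time. For contrast, a lazy lookup that tolerates a missing `soundfile` could look like the sketch below. The function name `discover_audio_extensions` and the short fallback list are assumptions for illustration. Unlike the hardcoded list, the result depends on the installed `libsndfile`, which is the non-determinism the PR avoids.

```python
# Static fallback: a small illustrative subset of the hardcoded list above.
STATIC_AUDIO_EXTENSIONS = [".flac", ".ogg", ".wav", ".mp3", ".opus"]


def discover_audio_extensions():
    """Return extensions reported by soundfile, or the static subset if soundfile is unavailable."""
    try:
        import soundfile as sf  # optional dependency, so it is imported lazily inside the function
    except (ImportError, OSError):  # OSError: the libsndfile shared library may be missing
        return list(STATIC_AUDIO_EXTENSIONS)
    extensions = [f".{fmt.lower()}" for fmt in sf.available_formats().keys()]
    # Added by hand, mirroring the snippet in the comment above.
    for extra in (".mp3", ".opus"):
        if extra not in extensions:
            extensions.append(extra)
    return extensions


extensions = discover_audio_extensions()
print(len(extensions), "extensions;", ".wav" in extensions)
```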
Empty file.