Audio datasets are loaded from the audio column, which contains three important fields:
array: the decoded audio data represented as a 1-dimensional array.
path: the path to the downloaded audio file.
sampling_rate: the sampling rate of the audio data.
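As a small illustration of how these fields relate, the sketch below computes a clip's duration from array and sampling_rate. The dictionary is hand-built for the example (a real one comes from indexing a dataset), and duration_seconds is a hypothetical helper, not part of the datasets library.

```python
def duration_seconds(audio: dict) -> float:
    """Return clip length in seconds: number of samples / samples per second."""
    return len(audio["array"]) / audio["sampling_rate"]


# Hand-built stand-in for one decoded audio entry (values are illustrative).
example_audio = {
    "array": [0.0] * 16000,      # 16,000 samples of silence
    "path": "path/to/audio_1",   # placeholder path
    "sampling_rate": 8000,       # samples per second
}

print(duration_seconds(example_audio))  # 16000 samples / 8000 Hz = 2.0
```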
To work with audio datasets, you need to have the audio dependency installed. Check out the installation guide to learn how to install it.
When you load an audio dataset and call the audio column, the [Audio] feature automatically decodes and resamples the audio file:
>>> from datasets import load_dataset, Audio
>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
>>> dataset[0]["audio"]
{'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414,
0. , 0. ], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
'sampling_rate': 8000}
Index into an audio dataset using the row index first and then the audio column (dataset[0]["audio"]) to avoid decoding and resampling all the audio files in the dataset. Decoding every file can be slow if you have a large dataset.
For a guide on how to load any type of dataset, take a look at the general loading guide.
The path is useful for loading your own dataset. Use the [~Dataset.cast_column] function to take a column of audio file paths and decode it into arrays with the [Audio] feature:
>>> from datasets import Dataset, Audio
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", "path/to/audio_2", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
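In practice the list of paths usually comes from a directory on disk. The following is a minimal sketch of that step, assuming your files use one of a few common extensions. Both collect_audio_paths and build_audio_dataset are hypothetical helper names, and the extension list is an assumption to adjust for your data.

```python
from pathlib import Path


def collect_audio_paths(root: str, extensions=(".wav", ".mp3", ".flac")) -> dict:
    """Build the {"audio": [...]} mapping that Dataset.from_dict accepts."""
    paths = sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )
    return {"audio": paths}


def build_audio_dataset(root: str):
    """Create a dataset whose "audio" column is cast to the Audio feature."""
    # Imported lazily so collect_audio_paths works without the library installed.
    from datasets import Audio, Dataset

    return Dataset.from_dict(collect_audio_paths(root)).cast_column("audio", Audio())
```

Keeping the path collection separate from the cast makes it easy to inspect or filter the file list before any audio is decoded.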
If you only want to load the underlying path to the audio file without decoding it into an array, set decode=False in the [Audio] feature:
>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train").cast_column("audio", Audio(decode=False))
>>> dataset[0]
{'audio': {'bytes': None,
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav'},
'english_transcription': 'I would like to set up a joint account with my partner',
'intent_class': 11,
'lang_id': 4,
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
'transcription': 'I would like to set up a joint account with my partner'}
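With decoding turned off, you decide how and when each file is read. As one possible approach, the sketch below inspects an undecoded entry with Python's standard-library wave module. It assumes the file is uncompressed PCM WAV (the only format wave reads), and wav_info is a hypothetical helper rather than a datasets API.

```python
import io
import wave


def wav_info(audio: dict) -> dict:
    """Read basic properties from an undecoded {'bytes': ..., 'path': ...} entry."""
    # Prefer in-memory bytes when present; otherwise open the file at 'path'.
    source = io.BytesIO(audio["bytes"]) if audio["bytes"] is not None else audio["path"]
    with wave.open(source, "rb") as wav_file:
        return {
            "sampling_rate": wav_file.getframerate(),
            "num_frames": wav_file.getnframes(),
            "num_channels": wav_file.getnchannels(),
        }
```

For example, wav_info(dataset[0]["audio"]) would return the header values without loading the samples into an array.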
You can also load a dataset with an AudioFolder dataset builder. It does not require writing a custom dataloader, making it useful for quickly loading audio data.
To link your audio files with metadata information, make sure your dataset has a metadata.csv file. Your dataset structure might look like:
folder/train/metadata.csv
folder/train/first_audio_file.mp3
folder/train/second_audio_file.mp3
folder/train/third_audio_file.mp3
Your metadata.csv file must have a file_name column which links audio files with their metadata. An example metadata.csv file might look like:
file_name,transcription
first_audio_file.mp3,znowu się duch z ciałem zrośnie w młodocianej wstaniesz wiosnie i możesz skutkiem tych leków umierać wstawać wiek wieków dalej tam były przestrogi jak siekać głowę jak nogi
second_audio_file.mp3,już u źwierzyńca podwojów król zasiada przy nim książęta i panowie rada a gdzie wzniosły krążył ganek rycerze obok kochanek król skinął palcem zaczęto igrzysko
third_audio_file.mp3,pewnie kędyś w obłędzie ubite minęły szlaki zaczekajmy dzień jaki poślemy szukać wszędzie dziś jutro pewnie będzie posłali wszędzie sługi czekali dzień i drugi gdy nic nie doczekali z płaczem chcą jechać dali
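If your transcriptions live in a Python structure rather than a file, a metadata.csv like the one above can be generated with the standard csv module. This is a minimal sketch: write_metadata_csv is a hypothetical helper, and the transcription column name simply mirrors the example.

```python
import csv
from pathlib import Path


def write_metadata_csv(split_dir: str, transcriptions: dict) -> Path:
    """Write metadata.csv next to the audio files in split_dir.

    transcriptions maps each audio file name (relative to split_dir) to its text.
    """
    metadata_path = Path(split_dir) / "metadata.csv"
    # newline="" lets the csv module control line endings; utf-8 keeps
    # non-ASCII transcriptions intact.
    with open(metadata_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name", "transcription"])
        writer.writeheader()
        for file_name, text in transcriptions.items():
            writer.writerow({"file_name": file_name, "transcription": text})
    return metadata_path
```

Writing the file as UTF-8 matters for non-ASCII text such as the Polish transcriptions in the example.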
Metadata can also be specified as JSON Lines, in which case use metadata.jsonl as the name of the metadata file. This format is helpful when one of the columns is complex, e.g. a list of floats, because it avoids parsing errors and keeps complex values from being read as strings.
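To make the JSON Lines layout concrete, here is a minimal sketch that writes one JSON object per line. write_metadata_jsonl is a hypothetical helper, and any column other than file_name (for instance a list-of-floats column) is purely illustrative.

```python
import json
from pathlib import Path


def write_metadata_jsonl(split_dir: str, rows: list) -> Path:
    """Write one JSON object per line to metadata.jsonl in split_dir."""
    metadata_path = Path(split_dir) / "metadata.jsonl"
    with open(metadata_path, "w", encoding="utf-8") as f:
        for row in rows:
            if "file_name" not in row:
                raise ValueError("every row needs a file_name key")
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return metadata_path
```

Because each line is parsed as JSON, a value such as [0.1, 0.2] comes back as a list of floats rather than a string.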
Load your audio dataset by specifying audiofolder and the directory containing your data in data_dir:
>>> from datasets import load_dataset
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
AudioFolder will load audio data and create a transcription column containing texts from metadata.csv:
>>> dataset["train"][0]
{'audio':
{'path': '/path/to/extracted/audio/first_audio_file.mp3',
'array': array([ 0.00088501, 0.0012207 , 0.00131226, ..., -0.00045776, -0.00054932, -0.00054932], dtype=float32),
'sampling_rate': 16000},
'transcription': 'znowu się duch z ciałem zrośnie w młodocianej wstaniesz wiosnie i możesz skutkiem tych leków umierać wstawać wiek wieków dalej tam były przestrogi jak siekać głowę jak nogi'
}
You can load remote datasets from their URLs with the data_files parameter:
>>> dataset = load_dataset("audiofolder", data_files="https://s3.amazonaws.com/datasets.huggingface.co/SpeechCommands/v0.01/v0.01_test.tar.gz")
If your data directory doesn't contain any metadata files, by default AudioFolder automatically adds a label column of [~datasets.features.ClassLabel] type, with labels based on the directory name. This can be useful if you have an audio classification task.
Language identification datasets have audio recordings of speech in multiple languages:
folder/train/ar/0197_720_0207_190.mp3
folder/train/ar/0179_830_0185_540.mp3
folder/train/zh/0442_690_0454_380.mp3
As there are no metadata files, AudioFolder will create a label column with the language id based on the directory name:
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder", drop_labels=False)
>>> dataset["train"][0]
{'audio':
{'path': '/path/to/extracted/audio/0197_720_0207_190.mp3',
'array': array([-3.6621094e-04, -6.1035156e-05, 6.1035156e-05, ..., -5.1879883e-04, -1.0070801e-03, -7.6293945e-04]),
'sampling_rate': 16000},
'label': 0 # "ar"
}
>>> dataset["train"][-1]
{'audio':
{'path': '/path/to/extracted/audio/0442_690_0454_380.mp3',
'array': array([1.8920898e-03, 9.4604492e-04, 1.9226074e-03, ..., 9.1552734e-05, 1.8310547e-04, 6.1035156e-05]),
'sampling_rate': 16000},
'label': 99 # "zh"
}
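To preview which integer id each directory would receive before loading anything, you can enumerate the class directories yourself. The sketch below assumes ids follow the sorted order of the directory names. That ordering is an assumption, and infer_labels is a hypothetical helper rather than the builder's own code, so confirm the real mapping on the loaded dataset's features.

```python
from pathlib import Path


def infer_labels(split_dir: str) -> dict:
    """Map each class directory name under split_dir to an integer id.

    Assumption: ids follow the sorted order of the directory names.
    """
    class_names = sorted(p.name for p in Path(split_dir).iterdir() if p.is_dir())
    return {name: index for index, name in enumerate(class_names)}
```

Under this assumption, a split directory containing only ar and zh subdirectories would map ar to 0 and zh to 1.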
If you have metadata files inside your data directory, but you still want to infer labels from directory names, set drop_labels=False as defined in [~datasets.packaged_modules.audiofolder.AudioFolderConfig].
Alternatively, you can add a label column to your metadata.csv file.
If you have no metadata files and want to drop the automatically created labels, set drop_labels=True. In this case, your dataset will contain only an audio column.
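The interaction between metadata files and drop_labels can be summarized as a small decision function. This sketch only restates the rules described in this section. It is not the library's implementation, adds_label_column is a hypothetical name, and using None to model "drop_labels left unset" is an assumption.

```python
from typing import Optional


def adds_label_column(has_metadata: bool, drop_labels: Optional[bool] = None) -> bool:
    """Return True when a label column is expected, per the rules in this section."""
    if drop_labels is None:
        # Default: infer labels only when no metadata files are present.
        return not has_metadata
    # An explicit setting wins regardless of metadata.
    return not drop_labels


print(adds_label_column(has_metadata=False))                    # default, no metadata
print(adds_label_column(has_metadata=True, drop_labels=False))  # metadata, labels kept
print(adds_label_column(has_metadata=False, drop_labels=True))  # labels dropped
```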