audio_load.mdx

Load audio data

Audio datasets are loaded from the audio column, which contains three important fields:

  • array: the decoded audio data represented as a 1-dimensional array.
  • path: the path to the downloaded audio file.
  • sampling_rate: the sampling rate of the audio data.

To work with audio datasets, you need to have the audio dependency installed. Check out the installation guide to learn how to install it.

When you load an audio dataset and access the audio column, the [Audio] feature automatically decodes and resamples the audio file:

>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
>>> dataset[0]["audio"]
{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
         0.        ,  0.        ], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 8000}
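
The array and sampling_rate fields together determine how long a clip is. As a minimal sketch, the hypothetical helper `duration_seconds` below (not part of the library) divides the number of samples by the sampling rate. The example dictionary is made up and only mimics the shape of the output above.

```python
def duration_seconds(audio: dict) -> float:
    """Return the clip length in seconds for a decoded audio dictionary."""
    return len(audio["array"]) / audio["sampling_rate"]


# Made-up sample shaped like dataset[0]["audio"]: 16,000 samples at 8 kHz.
example = {"array": [0.0] * 16000, "path": "example.wav", "sampling_rate": 8000}
print(duration_seconds(example))  # 2.0
```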

Index into an audio dataset using the row index first and then the audio column - dataset[0]["audio"] - so that only the requested file is decoded and resampled. Indexing the column first decodes and resamples every audio file in the dataset, which can be slow for a large dataset.
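
The cost difference between the two indexing orders can be illustrated with a toy model. The class `ToyAudioDataset` below is hypothetical and is not how the library is implemented; it only counts how many files would be "decoded" under the behavior described in this guide.

```python
class ToyAudioDataset:
    """Toy stand-in for a dataset with a lazily decoded audio column."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.decode_calls = 0  # counts how many files were "decoded"

    def _decode(self, path):
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 8000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: decode only the requested file.
            return {"audio": self._decode(self.paths[key])}
        if key == "audio":
            # Column access: decode every file in the dataset.
            return [self._decode(p) for p in self.paths]
        raise KeyError(key)


paths = [f"clip_{i}.wav" for i in range(1000)]

row_first = ToyAudioDataset(paths)
row_first[0]["audio"]
print(row_first.decode_calls)  # 1

column_first = ToyAudioDataset(paths)
column_first["audio"][0]
print(column_first.decode_calls)  # 1000
```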

For a guide on how to load any type of dataset, take a look at the general loading guide.

Local files

The path is useful for loading your own dataset. Use the [~Dataset.cast_column] function to take a column of audio file paths and decode it into arrays with the [Audio] feature:

>>> from datasets import Dataset, Audio

>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", "path/to/audio_2", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
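
To try this without real recordings, one option is to generate a few placeholder WAV files and collect their paths. The sketch below uses only the Python standard library; the helper `write_sine_wav`, the file names, and the tone parameters are all made up for illustration.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path


def write_sine_wav(path: Path, freq_hz: float = 440.0, seconds: float = 0.1,
                   sampling_rate: int = 8000) -> None:
    """Write a short mono 16-bit PCM sine tone to `path`."""
    n_samples = int(seconds * sampling_rate)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / sampling_rate)))
        for i in range(n_samples)
    )
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(frames)


folder = Path(tempfile.mkdtemp())
for name in ("audio_1.wav", "audio_2.wav"):
    write_sine_wav(folder / name)

# Dictionary with one column of file paths, ready to be turned into a dataset.
data = {"audio": sorted(str(p) for p in folder.glob("*.wav"))}
print(data)
```

Passing `data` to Dataset.from_dict and then calling cast_column("audio", Audio()), as in the snippet above, would turn the column of paths into an audio column.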

If you only want to load the underlying path to the audio dataset without decoding the audio file into an array, set decode=False in the [Audio] feature:

>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train").cast_column("audio", Audio(decode=False))
>>> dataset[0]
{'audio': {'bytes': None,
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav'},
 'english_transcription': 'I would like to set up a joint account with my partner',
 'intent_class': 11,
 'lang_id': 4,
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'transcription': 'I would like to set up a joint account with my partner'}
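
With decode=False you receive only bytes and path, so any inspection of the audio is up to you. As one possible approach for WAV files, the hypothetical helper `wav_info` below reads the header with the standard-library `wave` module. It assumes uncompressed WAV input, and the silent test file is generated on the spot rather than taken from the dataset.

```python
import io
import tempfile
import wave
from pathlib import Path


def wav_info(audio: dict) -> dict:
    """Read basic WAV header fields from an undecoded audio dictionary."""
    # Prefer in-memory bytes when present; otherwise open the file at `path`.
    source = io.BytesIO(audio["bytes"]) if audio.get("bytes") is not None else audio["path"]
    with wave.open(source, "rb") as wav_file:
        return {
            "sampling_rate": wav_file.getframerate(),
            "num_channels": wav_file.getnchannels(),
            "num_frames": wav_file.getnframes(),
        }


# Create a tiny silent WAV file so the sketch runs without any dataset.
wav_path = Path(tempfile.mkdtemp()) / "silence.wav"
with wave.open(str(wav_path), "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(8000)
    wav_file.writeframes(b"\x00\x00" * 8000)  # one second of silence

undecoded = {"bytes": None, "path": str(wav_path)}
print(wav_info(undecoded))
# {'sampling_rate': 8000, 'num_channels': 1, 'num_frames': 8000}
```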

AudioFolder

You can also load a dataset with an AudioFolder dataset builder. It does not require writing a custom dataloader, making it useful for quickly loading audio data.

AudioFolder with metadata

To link your audio files with metadata information, make sure your dataset has a metadata.csv file. Your dataset structure might look like:

folder/train/metadata.csv
folder/train/first_audio_file.mp3
folder/train/second_audio_file.mp3
folder/train/third_audio_file.mp3

Your metadata.csv file must have a file_name column which links audio files with their metadata. An example metadata.csv file might look like:

file_name,transcription
first_audio_file.mp3,znowu się duch z ciałem zrośnie w młodocianej wstaniesz wiosnie i możesz skutkiem tych leków umierać wstawać wiek wieków dalej tam były przestrogi jak siekać głowę jak nogi
second_audio_file.mp3,już u źwierzyńca podwojów król zasiada przy nim książęta i panowie rada a gdzie wzniosły krążył ganek rycerze obok kochanek król skinął palcem zaczęto igrzysko
third_audio_file.mp3,pewnie kędyś w obłędzie ubite minęły szlaki zaczekajmy dzień jaki poślemy szukać wszędzie dziś jutro pewnie będzie posłali wszędzie sługi czekali dzień i drugi gdy nic nie doczekali z płaczem chcą jechać dali
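
A metadata.csv file like the one above can be written by hand or generated. The following sketch uses the standard-library `csv` module; the temporary directory and the placeholder transcriptions are assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical file names and transcriptions, used only for illustration.
rows = [
    {"file_name": "first_audio_file.mp3", "transcription": "first transcription"},
    {"file_name": "second_audio_file.mp3", "transcription": "second transcription"},
]

train_dir = Path(tempfile.mkdtemp()) / "folder" / "train"
train_dir.mkdir(parents=True)
metadata_path = train_dir / "metadata.csv"

# newline="" lets the csv module control line endings; UTF-8 keeps non-ASCII text intact.
with metadata_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["file_name", "transcription"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the file_name column is present.
with metadata_path.open(newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))
print(loaded[0]["file_name"])  # first_audio_file.mp3
```

Writing the file as UTF-8 matters for transcriptions with non-ASCII characters, such as the Polish text in the example above.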

Metadata can also be specified as JSON Lines, in which case use metadata.jsonl as the name of the metadata file. This format is helpful when one of the columns is complex, e.g. a list of floats, because it avoids parsing errors and keeps complex values from being read as strings.
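
As a small sketch of the JSON Lines variant, the code below writes one JSON object per line and reads it back. The `pitch_contour` column is a made-up example of a list-valued field, not something AudioFolder requires.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records; "pitch_contour" is a made-up list-valued column.
records = [
    {"file_name": "first_audio_file.mp3", "pitch_contour": [110.0, 112.5, 115.0]},
    {"file_name": "second_audio_file.mp3", "pitch_contour": [220.0, 218.5]},
]

metadata_path = Path(tempfile.mkdtemp()) / "metadata.jsonl"

# JSON Lines: one JSON object per line.
with metadata_path.open("w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

with metadata_path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

# The list survives the round trip as a list of floats, not a string.
print(type(loaded[0]["pitch_contour"]).__name__)  # list
```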

Load your audio dataset by specifying audiofolder and the directory containing your data in data_dir:

>>> from datasets import load_dataset

>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder")

AudioFolder will load audio data and create a transcription column containing texts from metadata.csv:

>>> dataset["train"][0]
{'audio':
    {'path': '/path/to/extracted/audio/first_audio_file.mp3',
    'array': array([ 0.00088501,  0.0012207 ,  0.00131226, ..., -0.00045776, -0.00054932, -0.00054932], dtype=float32),
    'sampling_rate': 16000},
 'transcription': 'znowu się duch z ciałem zrośnie w młodocianej wstaniesz wiosnie i możesz skutkiem tych leków umierać wstawać wiek wieków dalej tam były przestrogi jak siekać głowę jak nogi'
}

You can load remote datasets from their URLs with the data_files parameter:

>>> dataset = load_dataset("audiofolder", data_files="https://s3.amazonaws.com/datasets.huggingface.co/SpeechCommands/v0.01/v0.01_test.tar.gz")

AudioFolder with labels

If your data directory doesn't contain any metadata files, by default AudioFolder automatically adds a label column of [~datasets.features.ClassLabel] type, with labels based on the directory name. This is useful for audio classification tasks.

Language identification

Language identification datasets have audio recordings of speech in multiple languages:

folder/train/ar/0197_720_0207_190.mp3
folder/train/ar/0179_830_0185_540.mp3

folder/train/zh/0442_690_0454_380.mp3

As there are no metadata files, AudioFolder will create a label column with the language id based on the directory name:

>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder", drop_labels=False)
>>> dataset["train"][0]
{'audio':
    {'path': '/path/to/extracted/audio/0197_720_0207_190.mp3',
    'array': array([-3.6621094e-04, -6.1035156e-05,  6.1035156e-05, ..., -5.1879883e-04, -1.0070801e-03, -7.6293945e-04], dtype=float32),
    'sampling_rate': 16000},
 'label': 0  # "ar"
}

>>> dataset["train"][-1]
{'audio':
    {'path': '/path/to/extracted/audio/0442_690_0454_380.mp3',
    'array': array([1.8920898e-03, 9.4604492e-04, 1.9226074e-03, ..., 9.1552734e-05, 1.8310547e-04, 6.1035156e-05], dtype=float32),
    'sampling_rate': 16000},
 'label': 99  # "zh"
}
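
The mapping from directory names to integer labels can be sketched with the standard library alone. This is not AudioFolder's code: it assumes class ids follow the sorted order of the directory names, and it creates empty placeholder files instead of real audio.

```python
import tempfile
from pathlib import Path

# Build an empty placeholder tree: folder/train/<language>/<file>.
root = Path(tempfile.mkdtemp()) / "folder" / "train"
for language, file_name in [("zh", "0442_690_0454_380.mp3"), ("ar", "0197_720_0207_190.mp3")]:
    (root / language).mkdir(parents=True, exist_ok=True)
    (root / language / file_name).touch()

# Assumption: class ids are assigned in sorted order of directory names.
class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
label_ids = {name: index for index, name in enumerate(class_names)}

# Each file gets the id of the directory that contains it.
examples = [
    {"path": str(p), "label": label_ids[p.parent.name]}
    for p in sorted(root.glob("*/*.mp3"))
]
print(label_ids)  # {'ar': 0, 'zh': 1}
```

With only two directories in this sketch, "zh" maps to 1. In a dataset with more language directories, the id would be different.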

If you have metadata files inside your data directory, but you still want to infer labels from directory names, set drop_labels=False as defined in [~datasets.packaged_modules.audiofolder.AudioFolderConfig].

Alternatively, you can add a label column to your metadata.csv file.

If you have no metadata files and want to drop automatically created labels, set drop_labels=True. In this case your dataset would contain only an audio column.
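
The rules in this section can be summarized as a small decision function. The helper `keeps_label_column` is hypothetical and only restates the prose above. Modeling the unset default as `None` is an assumption, not a statement about the library's internals.

```python
from typing import Optional


def keeps_label_column(has_metadata: bool, drop_labels: Optional[bool] = None) -> bool:
    """Summarize when a directory-based label column ends up in the dataset."""
    if drop_labels is None:
        # Default described above: labels are inferred only when no metadata files exist.
        return not has_metadata
    # An explicit drop_labels value overrides the default either way.
    return not drop_labels


for has_metadata in (False, True):
    for drop_labels in (None, False, True):
        print(has_metadata, drop_labels, keeps_label_column(has_metadata, drop_labels))
```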