Add AudioFolder packaged loader #4530

polinaeterna · 2022-06-20T12:54:02Z

will close #3964

AudioFolder is almost identical to ImageFolder, except that inferring labels is not the default behavior (drop_labels is set to True in the config). The option to infer them is preserved, though.

Something weird is happening with test_data_files_with_metadata_and_archives when streaming is True. Here is the log from the CI:


../.pyenv/versions/3.6.15/lib/python3.6/site-packages/datasets/features/audio.py:237: in _decode_non_mp3_path_like
    array, sampling_rate = librosa.load(f, sr=self.sampling_rate, mono=self.mono)
../.pyenv/versions/3.6.15/lib/python3.6/site-packages/librosa/util/decorators.py:88: in inner_f
    return f(*args, **kwargs)
../.pyenv/versions/3.6.15/lib/python3.6/site-packages/librosa/core/audio.py:176: in load
    raise (exc)
../.pyenv/versions/3.6.15/lib/python3.6/site-packages/librosa/core/audio.py:155: in load
    context = sf.SoundFile(path)
../.pyenv/versions/3.6.15/lib/python3.6/site-packages/soundfile.py:629: in __init__
    self._file = self._open(file, mode_int, closefd)
../.pyenv/versions/3.6.15/lib/python3.6/site-packages/soundfile.py:1184: in _open
    "Error opening {0!r}: ".format(self.name))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

err = 72
prefix = "Error opening <zipfile.ZipExtFile name='audio_file.wav' mode='r' compress_type=deflate>: "

    def _error_check(err, prefix=""):
        """Pretty-print a numerical error code if there is an error."""
        if err != 0:
            err_str = _snd.sf_error_number(err)
>           raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
E           RuntimeError: Error opening <zipfile.ZipExtFile name='audio_file.wav' mode='r' compress_type=deflate>: Error in WAV file. No 'data' chunk marker.

I wasn't able to reproduce this locally until I created the same test environment (that is, with pip install .[tests]) on Python 3.6. The same environment on Python 3.8 passes the test! I couldn't figure out what's wrong. I also tried simply replacing the test wav file and still got the same error. The versions of soundfile, librosa and libsndfile are identical. Might it be something with zip compression? Sounds weird, but I don't have any other ideas...
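One way to probe the zip hypothesis is to read the archived file fully into memory before decoding, so that the decoder receives a seekable buffer instead of the `zipfile.ZipExtFile` stream named in the error. The sketch below uses only the standard library: `wave` stands in for soundfile/librosa, and the helper names are hypothetical. As far as I know, `ZipExtFile` only gained `seek()` support in Python 3.7, which would fit the symptom, but that cause is an assumption and is not confirmed in this thread, and this is not the fix adopted in the PR.

```python
import io
import wave
import zipfile


def make_wav_bytes(n_frames=8, sample_rate=16000):
    """Build a tiny mono 16-bit WAV file in memory (dummy samples)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x01" * n_frames)
    return buf.getvalue()


def read_wav_from_zip(zip_bytes, member):
    """Copy the archive member into a seekable BytesIO, then decode it."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(member) as f:
            data = io.BytesIO(f.read())  # seekable copy of the stream
    with wave.open(data, "rb") as w:
        return w.getframerate(), w.getnframes()


# Create a deflate-compressed archive like the one in the failing test.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("audio_file.wav", make_wav_bytes())

result = read_wav_from_zip(archive.getvalue(), "audio_file.wav")
```

If a decoder fails on the raw `ZipExtFile` but succeeds on the in-memory copy, that would point at stream seekability rather than at the compression itself.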

TODO:

align with Fix ImageFolder with parameters drop_metadata=True and drop_labels=False (when metadata.jsonl is present) #4622
documentation
tests for AutoFolder?

HuggingFaceDocBuilderDev · 2022-06-22T13:03:51Z

The documentation is not available anymore as the PR was closed or merged.

polinaeterna · 2022-06-22T18:05:05Z

@lhoestq @mariosasko I don't know what to do with the test, do you have any ideas? :)

polinaeterna · 2022-06-23T11:05:19Z

Also, it passed in pyarrow_latest_WIN.

lhoestq · 2022-06-23T12:35:19Z

If the error only happens on 3.6, maybe #4460 can help ^^' It seems to work in 3.7 on the windows CI

inferring labels is not the default behavior (drop_labels is set to True in config)

I think it's a missed opportunity to have a consistent API between imagefolder and audiofolder, since they do everything the same way. Can you give more details on why you think we should drop the labels by default ?

mariosasko · 2022-06-23T13:30:01Z

Considering that classification is not as common a task for audio as it is for images, I'm ok with having different config defaults as long as they are properly documented (check Papers With Code for stats and compare the classification numbers to the other tasks; do this for both modalities).

Also, WDYT about creating a generic folder loader that ImageFolder and AudioFolder then subclass to avoid having to update both of them when there is something to update/fix?
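The generic-loader idea can be sketched as a base class that holds the shared logic, with each modality only declaring its column name and file extensions. The class and method names below are hypothetical toy stand-ins, not the classes that ended up in `datasets`, and the extension lists are illustrative only.

```python
class ToyFolderBuilder:
    """Shared logic; subclasses only declare what differs per modality."""

    BASE_COLUMN_NAME = None  # must be set by subclasses, e.g. "audio"
    EXTENSIONS = ()          # file extensions accepted by the subclass

    def select_files(self, paths):
        if self.BASE_COLUMN_NAME is None:
            raise NotImplementedError("subclasses must set BASE_COLUMN_NAME")
        exts = tuple(ext.lower() for ext in self.EXTENSIONS)
        # Keep only files with a supported extension (case-insensitive).
        return [
            {self.BASE_COLUMN_NAME: path}
            for path in paths
            if path.lower().endswith(exts)
        ]


class ToyAudioFolder(ToyFolderBuilder):
    BASE_COLUMN_NAME = "audio"
    EXTENSIONS = (".wav", ".flac", ".mp3")  # illustrative subset


class ToyImageFolder(ToyFolderBuilder):
    BASE_COLUMN_NAME = "image"
    EXTENSIONS = (".png", ".jpg")  # illustrative subset


paths = ["a.wav", "b.PNG", "c.txt"]
audio_rows = ToyAudioFolder().select_files(paths)
image_rows = ToyImageFolder().select_files(paths)
```

With this shape, a fix to `select_files` lands in one place and both loaders pick it up.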

polinaeterna · 2022-06-23T14:10:04Z

@lhoestq I think it doesn't change the API itself; it just doesn't infer labels by default, but you can still pass drop_labels=False to load_dataset and the labels will be inferred.
Suppose that one has data structured as follows:

data/
   train/
      audio/
          file1.wav
          file2.wav
          file3.wav
      metadata.jsonl
   test/
      audio/
          file1.wav
          file2.wav
          file3.wav
      metadata.jsonl

If users load this dataset with load_dataset("audiofolder", data_dir="data") (the most natural way), they will get a label feature that is always equal to 0 (= "audio"). To avoid this, they would always have to pass load_dataset("audiofolder", data_dir="data", drop_labels=True) explicitly, which I believe is inconvenient.

At the same time, a label column can be added just by passing one argument: load_dataset("audiofolder", data_dir="data", drop_labels=False). As classification is not as common a task, I think it should be the one that requires more symbols in the code :D

But this definitely should be explained in the docs, which I forgot to update... I'll add this section soon.

Also +1 to the generic loader, will work on it.

lhoestq · 2022-06-23T14:17:15Z

If a metadata.jsonl file is present, then it doesn't have to infer the labels, I agree. Note that this is already the case for imagefolder ;) In your case, load_dataset("audiofolder", data_dir="data") won't return labels !

Labels are only inferred if there is no metadata.jsonl
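The rule described here can be sketched as a small function: with `drop_labels=None`, labels are taken from parent directory names only when no `metadata.jsonl` is among the files, while an explicit `True` or `False` overrides that. The function name `infer_labels` and its return shape are hypothetical; this is a simplified illustration of the rule as stated in the thread, not the loader's actual code.

```python
from pathlib import PurePosixPath

METADATA_FILENAME = "metadata.jsonl"


def infer_labels(files, drop_labels=None):
    """Return {file: label} from parent directory names, or None if dropped."""
    has_metadata = any(
        PurePosixPath(f).name == METADATA_FILENAME for f in files
    )
    if drop_labels is None:
        # Default: infer labels only when there is no metadata file.
        drop_labels = has_metadata
    if drop_labels:
        return None
    return {
        f: PurePosixPath(f).parent.name
        for f in files
        if PurePosixPath(f).name != METADATA_FILENAME
    }


with_metadata = ["data/train/audio/file1.wav", "data/train/metadata.jsonl"]
without_metadata = ["data/train/cat/a.wav", "data/train/dog/b.wav"]

default_with_metadata = infer_labels(with_metadata)
default_without_metadata = infer_labels(without_metadata)
forced_labels = infer_labels(with_metadata, drop_labels=False)
```

The last call shows the case raised earlier in the thread: forcing labels on the `audio/` layout gives every file the single label "audio".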

lhoestq

Thanks @polinaeterna and @mariosasko this is super cool ! IMO once the class name is settled we're good to go. We can re-organize the tests in a subsequent PR I think :)

I also added a comment about the default for drop_labels again. I think that if the documentation focuses first on audio+metadata, it's ok to have the same default as ImageFolder

lhoestq · 2022-08-17T15:38:31Z

tests/packaged_modules/test_autofolder.py

@@ -0,0 +1,439 @@
+import importlib


+1 on this !

lhoestq · 2022-08-17T15:40:19Z

src/datasets/streaming.py

+    for module in parent_builder_modules:
+        extend_module_for_streaming(module, use_auth_token=builder.use_auth_token)


Cool ! Sounds good to me
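The loop in this snippet iterates over modules of parent builder classes. A minimal sketch of how such a list could be collected is shown below: walk the class's MRO, keep parents that derive from a given base, and skip the base's own module. The helper name `parent_module_names` is hypothetical, and the standard-library `email` classes are used purely as a stand-in hierarchy so the example runs without `datasets` installed; how the PR actually builds `parent_builder_modules` is not shown in this excerpt.

```python
import importlib
from email.message import Message
from email.mime.text import MIMEText


def parent_module_names(cls, base):
    """Module names of cls's parents that derive from base.

    Skips cls itself (first MRO entry) and anything defined in the
    same module as base.
    """
    return [
        parent.__module__
        for parent in cls.__mro__[1:]
        if issubclass(parent, base) and parent.__module__ != base.__module__
    ]


module_names = parent_module_names(MIMEText, Message)

# Each name can then be imported and handed to a patching function.
modules = [importlib.import_module(name) for name in module_names]
```

In the PR's setting, `cls` would be the concrete builder class and `base` the common builder base class, so that intermediate parents defined in other modules also get patched.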

lhoestq · 2022-08-17T15:46:20Z

src/datasets/packaged_modules/audiofolder/audiofolder.py

+class AudioFolderConfig(folder_builder.FolderBuilderConfig):
+    """Builder Config for AudioFolder."""
+
+    drop_labels: bool = True  # usually we don't need labels as classification is not the main audio task


Just to say that I'm still in favor of setting it to None by default for consistency with imagefolder ^^'

polinaeterna · 2022-08-17

Well, I don't have a strong opinion here anymore :D
If we set drop_labels=None by default as you suggested, it might be confusing in cases when users provide only audio files, without metadata (or with broken metadata?). This is probably quite unlikely, so I'm ok with your suggestion.

@mariosasko what do you think about that?

mariosasko · 2022-08-17

It's good to be consistent, so I agree with @lhoestq.

polinaeterna · 2022-08-17

I'm ok with that, but it makes explaining things in the documentation a bit more complicated...

polinaeterna · 2022-08-18

I've updated the docs and tried to make them simpler. I'm still not sure that this logic with the default None value of "drop_labels" is clear, but I guess we'd better see what users say.

polinaeterna · 2022-08-19

@lhoestq @mariosasko what do you think about it now? 🤗
Also, do you know what's happening with the CI? Why does it take forever, with some jobs finally being cancelled?

lhoestq · 2022-08-19

Looks all good to me ! :D

Not sure what's happening with the CI though. I just re-launched one job to see if it was caused by a bug in github actions or the windows runners

lhoestq · 2022-08-17T16:08:13Z

docs/source/audio_load.mdx

+>>> dataset = load_dataset("audiofolder", data_files="https://s3.amazonaws.com/datasets.huggingface.co/SpeechCommands/v0.01/v0.01_test.tar.gz")
+```
+
+## AudioFolder with metadata


I think I would put this section first, since this is the main use case anyway.
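For the metadata use case, a `metadata.jsonl` file is one JSON object per line, linking each audio file to its extra columns. The sketch below writes such a file with the standard library. The `file_name` key follows the convention used for ImageFolder metadata, `transcription` is just an example column, and the helper name, paths, and text values are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(folder, rows):
    """Write one JSON object per line to <folder>/metadata.jsonl."""
    path = Path(folder) / "metadata.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path


# Paths are relative to the directory that contains metadata.jsonl.
rows = [
    {"file_name": "audio/file1.wav", "transcription": "example text one"},
    {"file_name": "audio/file2.wav", "transcription": "example text two"},
]

with tempfile.TemporaryDirectory() as tmp:
    metadata_path = write_metadata(tmp, rows)
    # Read the file back to check that each line is a standalone object.
    loaded = [
        json.loads(line)
        for line in metadata_path.read_text(encoding="utf-8").splitlines()
    ]
```

A directory prepared this way is the kind of input that the `load_dataset("audiofolder", data_dir="data")` call discussed above is meant to handle.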

lhoestq · 2022-08-17T16:09:38Z

docs/source/audio_load.mdx

@@ -55,3 +55,126 @@ If you only want to load the underlying path to the audio dataset without decodi
 'transcription': 'I would like to set up a joint account with my partner'}
 ```

+## AudioFolder
+
+You can also load a dataset with an `AudioFolder` dataset builder. It does not require writing a custom dataloader, making it useful for quickly loading audio data. Your dataset structure might look like:


I think you can remove this part, since it's already presented later in "AudioFolder with labels"

src/datasets/streaming.py

…odule Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

lhoestq

LGTM !

polinaeterna added 9 commits June 20, 2022 14:52

add audiofolder loader (almost identical to imagefolder except for inferring labels is not default)

cc87b0e

check codestyle

0adcd56

add tests

e46eecb

remove unused imports

7cc4ab9

add dummy data

a648530

add instruction on how to obtain list of audio extensions

5cbbad1

fix comment

53e9ce3

add audiofolder dummy files in tests

60760ea

Merge branch 'master' into add-audio-folder-new

e4bb688

polinaeterna added 5 commits June 22, 2022 15:26

check if two separate files fix test error (i guess not but just in case)

d0b2592

remove unused imports

15ca3cf

Revert "check if two separate files fix test error (i guess not but just in case)"

420dd2b

This reverts commit d0b2592.

add uppercased formats, modify test for zip archive (check that arrays are different)

68b7f5a

Merge branch 'huggingface:master' into add-audio-folder-new

7c75e81

polinaeterna requested review from lhoestq and mariosasko June 22, 2022 18:03

polinaeterna marked this pull request as ready for review June 22, 2022 18:03

polinaeterna added 4 commits June 23, 2022 13:22

add contributors

eecc449

Merge branch 'add-audio-folder-new' of github.com:polinaeterna/datasets into add-audio-folder-new

a081364

Merge branch 'huggingface:master' into add-audio-folder-new

93c6afa

Merge branch 'add-audio-folder-new' of github.com:polinaeterna/datasets into add-audio-folder-new

f5d9841

polinaeterna self-assigned this Jun 23, 2022

polinaeterna added the enhancement label Jun 23, 2022

polinaeterna requested a review from lhoestq August 16, 2022 11:30

polinaeterna added 5 commits August 16, 2022 17:04

patch parents that inherit from DatasetBuilder, revert get_imports

bfecab4

rename autofolder -> folder_builder

90dc043

remove autofolder dir

292a8c5

remove axtending for streaming from tests, it should work without manual extending

3e32181

make base column name an abstract attr of FolderBuilder instead of config's parameter

fe80766

lhoestq reviewed Aug 17, 2022

mariosasko reviewed Aug 17, 2022

polinaeterna and others added 7 commits August 17, 2022 19:18

Update src/datasets/streaming.py

f74922c

Co-authored-by: Mario Šaško <mario@huggingface.co>

rename FolderBuilder -> FolderBasedBuilder

227ce04

set drop_labels to None by default for AudioFolder

034b88c

remove dataclass decorator from audio/image folder configs as they don't introduce any new params to be more consistent with usual datasets script as it doesn't do anything here anyway

54c6cf2

remove ABC from FolderBasedBuilder as it does nothing

7f6719b

update documentation

748576b

fix docs

615a839

polinaeterna requested review from lhoestq and mariosasko August 18, 2022 15:18

polinaeterna added 5 commits August 18, 2022 17:22

SORRY another small fix in docs

02f8f57

get back abc and dataclasses just because of the magical thinking ¯\_(ツ)_/¯

fc41118

Revert "get back abc and dataclasses just because of the magical thinking ¯\_(ツ)_/¯"

9ee04ed

This reverts commit fc41118.

Merge remote-tracking branch 'upstream/main' into add-audio-folder-new

accb8cd

Merge remote-tracking branch 'upstream/main' into add-audio-folder-new

adccfd8

lhoestq reviewed Aug 22, 2022

polinaeterna and others added 4 commits August 22, 2022 12:38

Merge branch 'add-audio-folder-new' of github.com:polinaeterna/datasets into add-audio-folder-new

189e98b

fix linters

89e298c

add comment to the patching thing

fbef2b0

lhoestq approved these changes Aug 22, 2022

polinaeterna merged commit 6ea46d8 into huggingface:main Aug 22, 2022

polinaeterna deleted the add-audio-folder-new branch August 22, 2022 14:21
