Add AudioFolder packaged loader #4530

Merged
merged 81 commits into from Aug 22, 2022

Conversation

polinaeterna
Contributor

@polinaeterna polinaeterna commented Jun 20, 2022

will close #3964

AudioFolder is almost identical to ImageFolder, except that inferring labels is not the default behavior (drop_labels is set to True in the config). The option to infer them is preserved, though.

Something weird is happening with test_data_files_with_metadata_and_archives when streaming is True. Here is the log from the CI:


../.pyenv/versions/3.6.15/lib/python3.6/site-packages/datasets/features/audio.py:237: in _decode_non_mp3_path_like
    array, sampling_rate = librosa.load(f, sr=self.sampling_rate, mono=self.mono)
../.pyenv/versions/3.6.15/lib/python3.6/site-packages/librosa/util/decorators.py:88: in inner_f
    return f(*args, **kwargs)
../.pyenv/versions/3.6.15/lib/python3.6/site-packages/librosa/core/audio.py:176: in load
    raise (exc)
../.pyenv/versions/3.6.15/lib/python3.6/site-packages/librosa/core/audio.py:155: in load
    context = sf.SoundFile(path)
../.pyenv/versions/3.6.15/lib/python3.6/site-packages/soundfile.py:629: in __init__
    self._file = self._open(file, mode_int, closefd)
../.pyenv/versions/3.6.15/lib/python3.6/site-packages/soundfile.py:1184: in _open
    "Error opening {0!r}: ".format(self.name))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

err = 72
prefix = "Error opening <zipfile.ZipExtFile name='audio_file.wav' mode='r' compress_type=deflate>: "

    def _error_check(err, prefix=""):
        """Pretty-print a numerical error code if there is an error."""
        if err != 0:
            err_str = _snd.sf_error_number(err)
>           raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
E           RuntimeError: Error opening <zipfile.ZipExtFile name='audio_file.wav' mode='r' compress_type=deflate>: Error in WAV file. No 'data' chunk marker.

I wasn't able to reproduce this locally until I created the same test environment (that is, with pip install .[tests]) on python3.6. The same environment with python3.8 passes the test! I haven't managed to figure out what's wrong. I also tried simply replacing the test wav file and still got the same error. The versions of soundfile, librosa and libsndfile are identical. Might it be something with zip compression? Sounds weird, but I don't have any other ideas...
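One way to probe the zip hypothesis is to take the archive member's file object out of the picture. The sketch below is not the fix adopted in this PR. It uses only the standard library (`wave` stands in for soundfile/librosa, and `make_zipped_wav` and `read_wav_from_zip` are hypothetical helpers) to copy a deflate-compressed member into an `io.BytesIO` buffer before decoding. It also reports whether the raw member object claims to be seekable on the running interpreter, which is a possible factor here (an assumption, not confirmed in this thread).

```python
import io
import wave
import zipfile


def make_zipped_wav() -> bytes:
    """Build an in-memory zip holding one short, deflate-compressed WAV file."""
    wav_buf = io.BytesIO()
    with wave.open(wav_buf, "wb") as w:
        w.setnchannels(1)       # mono
        w.setsampwidth(2)       # 16-bit samples
        w.setframerate(16000)
        w.writeframes(b"\x00\x00" * 160)  # 160 frames of silence
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("audio_file.wav", wav_buf.getvalue())
    return zip_buf.getvalue()


def read_wav_from_zip(zip_bytes: bytes, member: str):
    """Decode a WAV member after buffering it fully in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(member) as f:
            member_seekable = f.seekable()  # depends on the Python version
            buffered = io.BytesIO(f.read())  # always seekable
    with wave.open(buffered, "rb") as w:
        return member_seekable, w.getframerate(), w.getnframes()


member_seekable, rate, n_frames = read_wav_from_zip(make_zipped_wav(), "audio_file.wav")
print(member_seekable, rate, n_frames)
```

If decoding succeeds from the buffer but fails on the raw member object under python3.6, that would point at the file object rather than at the WAV file itself.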

TODO:

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Jun 22, 2022

The documentation is not available anymore as the PR was closed or merged.

@polinaeterna polinaeterna marked this pull request as ready for review June 22, 2022 18:03
@polinaeterna
Contributor Author

polinaeterna commented Jun 22, 2022

@lhoestq @mariosasko I don't know what to do with the test, do you have any ideas? :)

@polinaeterna
Contributor Author

polinaeterna commented Jun 23, 2022

Also, it passes in pyarrow_latest_WIN.

@lhoestq
Member

lhoestq commented Jun 23, 2022

If the error only happens on 3.6, maybe #4460 can help ^^' It seems to work in 3.7 on the windows CI

inferring labels is not the default behavior (drop_labels is set to True in config)

I think it's a missed opportunity to have a consistent API between imagefolder and audiofolder, since they do everything the same way. Can you give more details on why you think we should drop the labels by default ?

@mariosasko
Contributor

Considering that classification is not as common a task for audio as it is for images, I'm OK with having different config defaults as long as they are properly documented (check Papers With Code for stats and compare the classification numbers to the other tasks; do this for both modalities).

Also, WDYT about creating a generic folder loader that ImageFolder and AudioFolder then subclass to avoid having to update both of them when there is something to update/fix?
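A minimal sketch of what such a shared base could look like, assuming hypothetical class names (`FolderLoader`, `ImageFolderLoader`, `AudioFolderLoader`) and illustrative extension sets. This is not the implementation from this PR; it only shows how modality-specific subclasses could override a single attribute while the shared logic is fixed in one place.

```python
from pathlib import PurePosixPath
from typing import ClassVar, FrozenSet, Iterable, List


class FolderLoader:
    """Hypothetical base class: shared file-selection logic lives here."""

    # Subclasses override this with the extensions they can decode.
    EXTENSIONS: ClassVar[FrozenSet[str]] = frozenset()

    def select_files(self, paths: Iterable[str]) -> List[str]:
        # Keep only files whose extension the concrete loader supports.
        return [p for p in paths if PurePosixPath(p).suffix.lower() in self.EXTENSIONS]


class ImageFolderLoader(FolderLoader):
    EXTENSIONS = frozenset({".png", ".jpg", ".jpeg"})  # illustrative subset


class AudioFolderLoader(FolderLoader):
    EXTENSIONS = frozenset({".wav", ".mp3", ".flac"})  # illustrative subset


paths = ["train/a.wav", "train/b.PNG", "train/metadata.jsonl"]
print(AudioFolderLoader().select_files(paths))
print(ImageFolderLoader().select_files(paths))
```

A fix to `select_files` would then apply to both loaders without touching either subclass.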

@polinaeterna
Contributor Author

polinaeterna commented Jun 23, 2022

@lhoestq I think it doesn't change the API itself; it just doesn't infer labels by default, but you can still pass drop_labels=False to load_dataset and the labels will be inferred.
Suppose that one has data structured as follows:

data/
   train/
      audio/
          file1.wav
          file2.wav
          file3.wav
      metadata.jsonl
   test/
      audio/
          file1.wav
          file2.wav
          file3.wav
      metadata.jsonl

If users load this dataset with load_dataset("audiofolder", data_dir="data") (the most natural way), they will get a label feature that is always equal to 0 (= "audio"). To avoid this, they would always have to specify load_dataset("audiofolder", data_dir="data", drop_labels=True) explicitly, and I believe that's not convenient.
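To make the concern concrete, here is a small sketch of directory-based label inference, assuming labels come from each file's parent directory name. `infer_labels_from_dirs` is a hypothetical helper, not the library's code. With the layout above, every file sits in a directory called `audio`, so every inferred label is the same.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def infer_labels_from_dirs(paths: List[str]) -> Dict[str, int]:
    """Map each file to the index of its parent directory name (names sorted)."""
    class_names = sorted({PurePosixPath(p).parent.name for p in paths})
    return {p: class_names.index(PurePosixPath(p).parent.name) for p in paths}


train_files = [
    "data/train/audio/file1.wav",
    "data/train/audio/file2.wav",
    "data/train/audio/file3.wav",
]
labels = infer_labels_from_dirs(train_files)
print(labels)  # every file maps to the single class "audio"
```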

At the same time, a label column can be added as easily as passing one argument: load_dataset("audiofolder", data_dir="data", drop_labels=False). As the classification task is not as common, I think it should be the one that requires more symbols in the code :D

But this definitely should be explained in the docs, which I forgot to update... I'll add this section soon.

Also, +1 to the generic loader; I'll work on it.

@polinaeterna polinaeterna self-assigned this Jun 23, 2022
@polinaeterna polinaeterna added the enhancement New feature or request label Jun 23, 2022
@lhoestq
Member

lhoestq commented Jun 23, 2022

If a metadata.jsonl file is present, then it doesn't have to infer the labels, I agree. Note that this is already the case for imagefolder ;) In your case, load_dataset("audiofolder", data_dir="data") won't return labels !

Labels are only inferred if there is no metadata.jsonl
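The rule described in this thread can be sketched as a small decision function. `should_add_labels` is a hypothetical name, and this is a reading of the discussion (a None default that defers to the presence of metadata.jsonl, with an explicit True or False taking precedence), not the library's actual code.

```python
from typing import Optional


def should_add_labels(drop_labels: Optional[bool], has_metadata: bool) -> bool:
    """Decide whether a label column is inferred from directory names."""
    if drop_labels is None:
        # Default: infer labels only when no metadata.jsonl is present.
        return not has_metadata
    # An explicit True/False from the user always wins.
    return not drop_labels


for drop_labels in (None, True, False):
    for has_metadata in (True, False):
        print(drop_labels, has_metadata, should_add_labels(drop_labels, has_metadata))
```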

Member

@lhoestq lhoestq left a comment

Thanks @polinaeterna and @mariosasko this is super cool ! IMO once the class name is settled we're good to go. We can re-organize the tests in a subsequent PR I think :)

I also added a comment about the default for drop_labels again. I think that if the documentation focuses first on audio+metadata, it's ok to have the same default as ImageFolder

@@ -0,0 +1,439 @@
import importlib
Member

+1 on this !

Comment on lines +127 to +128
for module in parent_builder_modules:
    extend_module_for_streaming(module, use_auth_token=builder.use_auth_token)
Member

Cool ! Sounds good to me
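The loop in the diff above patches every module in parent_builder_modules; how that list is built is not shown in this excerpt. One possible way to collect it, sketched under assumptions (the stand-in `DatasetBuilder`, the helper `parent_builder_module_names`, and the `hypothetical.*` module names are all invented for illustration), is to walk the builder class's MRO and keep the modules of intermediate parent classes:

```python
from typing import List


class DatasetBuilder:
    """Stand-in for the library's base builder class."""


def parent_builder_module_names(builder_cls: type, base_cls: type) -> List[str]:
    """Names of modules defining parent builders, excluding the base's own module."""
    names: List[str] = []
    for cls in builder_cls.__mro__[1:]:  # skip the builder class itself
        if not issubclass(cls, base_cls):
            continue  # e.g. `object`
        if cls.__module__ == base_cls.__module__:
            continue  # assumed to be patched separately
        if cls.__module__ not in names:
            names.append(cls.__module__)
    return names


# Simulate classes that live in different modules.
FolderBuilder = type("FolderBuilder", (DatasetBuilder,), {"__module__": "hypothetical.folder_builder"})
AudioFolder = type("AudioFolder", (FolderBuilder,), {"__module__": "hypothetical.audiofolder"})

print(parent_builder_module_names(AudioFolder, DatasetBuilder))
```

This matters once AudioFolder subclasses a generic folder builder: the file-opening code then lives in the parent's module, so patching only the AudioFolder module would miss it.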

class AudioFolderConfig(folder_builder.FolderBuilderConfig):
    """Builder Config for AudioFolder."""

    drop_labels: bool = True  # usually we don't need labels as classification is not the main audio task
Member

Just to say that I'm still in favor of setting it to None by default for consistency with imagefolder ^^'

Contributor Author

Well, I don't have a strong opinion here anymore :D
If we set drop_labels=None by default as you suggested, it might be confusing in cases where users provide only audio files, without metadata (or with broken metadata?). This is probably quite unlikely, so I'm OK with your suggestion.

Contributor Author

@mariosasko what do you think about that?

Contributor

It's good to be consistent, so I agree with @lhoestq.

Contributor Author

I'm OK with that, but it makes explaining things in the documentation a bit more complicated...

Contributor Author

I've updated the docs and tried to make them simpler. I'm still not sure that this logic with the default None value of "drop_labels" is clear, but I guess we'd better see what users say.

Contributor Author

@lhoestq @mariosasko what do you think about it now? 🤗
Also, do you know what's happening with the CI? Why does it take forever, with some jobs eventually cancelled?

Member

Looks all good to me ! :D

Not sure what's happening with the CI though. I just re-launched one job to see if it was caused by a bug in github actions or the windows runners

>>> dataset = load_dataset("audiofolder", data_files="https://s3.amazonaws.com/datasets.huggingface.co/SpeechCommands/v0.01/v0.01_test.tar.gz")
```

## AudioFolder with metadata
Member

I think I would put this section first, since this is the main use case anyway.

@@ -55,3 +55,126 @@ If you only want to load the underlying path to the audio dataset without decodi
'transcription': 'I would like to set up a joint account with my partner'}
```

## AudioFolder

You can also load a dataset with an `AudioFolder` dataset builder. It does not require writing a custom dataloader, making it useful for quickly loading audio data. Your dataset structure might look like:
Member

I think you can remove this part, since it's already presented later in "AudioFolder with labels"

src/datasets/streaming.py Outdated
Member

@lhoestq lhoestq left a comment

LGTM !

@polinaeterna polinaeterna merged commit 6ea46d8 into huggingface:main Aug 22, 2022
@polinaeterna polinaeterna deleted the add-audio-folder-new branch August 22, 2022 14:21
Labels
enhancement New feature or request
Successfully merging this pull request may close these issues.

Add default Audio Loader
4 participants