Add Audio Spectogram Transformer #19981

NielsRogge · 2022-10-31T13:23:52Z

What does this PR do?

This PR adds the Audio Spectrogram Transformer (AST) model from MIT.

Similar to Whisper (and actually predating it), the model treats audio as an image and applies a Vision Transformer to it.

The model gets SOTA results on audio classification benchmarks.

sanchit-gandhi

Thanks for adding this audio-classification model @NielsRogge! The modelling components in general look good to me 👌 A couple of questions regarding the feature extractor (namely the mean / std normalisation). Also, it would be great to document the performant audio-classification capabilities of this model with an inference/fine-tuning example at examples/pytorch/audio-classification! 🚀 Very excited to see the accompanying tutorial video for this model addition 😉

Also one thing you're gonna kick yourself for... it's "spectrogram" not "spectogram" 😅 hope the renaming isn't too arduous!

docs/source/en/model_doc/audio-spectogram-transformer.mdx

...ansformers/models/audio_spectogram_transformer/configuration_audio_spectogram_transformer.py

docs/source/en/model_doc/audio-spectogram-transformer.mdx

...rmers/models/audio_spectogram_transformer/feature_extraction_audio_spectogram_transformer.py

src/transformers/models/audio_spectogram_transformer/modeling_audio_spectogram_transformer.py

src/transformers/models/audio_spectogram_transformer/test.py

.../models/audio_spectogram_transformer/test_feature_extraction_audio_spectogram_transformer.py

tests/models/audio_spectogram_transformer/test_modeling_audio_spectogram_transformer.py

sanchit-gandhi · 2022-10-31T17:21:32Z

...rmers/models/audio_spectogram_transformer/feature_extraction_audio_spectogram_transformer.py

+            raw_speech (`np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`):
+                The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float
+                values, a list of numpy arrays or a list of list of float values.


Wonder whether it makes sense to adapt this to handle torch tensors as inputs? I know we don't do this currently in Speech2Text2, but I saw that in quite a few of the tests we go from: torch -> numpy -> feat-extractor -> torch, which seems pretty convoluted!
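To make the suggestion above concrete, here is a minimal sketch of an input-normalisation step that accepts tensor-like objects as well as plain lists. It is not the feature extractor's actual code: `to_float_lists` and `FakeTensor` are hypothetical names introduced for illustration, and the sketch only assumes that the input objects expose a `tolist()` method, as both `torch.Tensor` and `np.ndarray` do.

```python
def to_float_lists(raw_speech):
    """Hypothetical helper: normalise one clip or a batch into List[List[float]]."""
    if hasattr(raw_speech, "tolist"):
        # A whole tensor/array: 1-D (one clip) or 2-D (a batch).
        raw_speech = raw_speech.tolist()
    else:
        # A Python list: convert tensor/array elements, leave plain values alone.
        raw_speech = [c.tolist() if hasattr(c, "tolist") else c for c in raw_speech]
    # A flat list of numbers is a single clip; wrap it so the result is always a batch.
    if raw_speech and not isinstance(raw_speech[0], (list, tuple)):
        raw_speech = [raw_speech]
    return [[float(v) for v in clip] for clip in raw_speech]


class FakeTensor:
    """Stand-in for a torch.Tensor or np.ndarray; only `tolist()` is imitated."""

    def __init__(self, data):
        self._data = data

    def tolist(self):
        return self._data


single = to_float_lists([0, 1, 2])                           # one clip as a flat list
batch = to_float_lists([FakeTensor([0.5, 1.5]), [2, 3, 4]])  # mixed batch
whole = to_float_lists(FakeTensor([[1, 2], [3, 4]]))         # 2-D tensor-like batch
```

With something along these lines at the top of the extractor's `__call__`, callers would not need to convert torch tensors to numpy by hand first. Whether that belongs in the library is the open question raised above.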

sgugger

Very clean, thanks a lot for adding this new model!

...sformers/models/audio_spectrogram_transformer/configuration_audio_spectrogram_transformer.py

...ers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py

HuggingFaceDocBuilderDev · 2022-11-01T16:36:20Z

The documentation is not available anymore as the PR was closed or merged.

FrankFundel · 2022-11-04T10:07:45Z

Hey, how can I use it already? I installed the branch, but I'm unsure how to load the model. I'm new to Hugging Face :D

sanchit-gandhi · 2022-11-04T17:12:30Z

Hey @FrankFundel - hoping @NielsRogge adds a nice example as part of this PR documenting just that 🤞 In the meantime, you can try adapting the example from https://huggingface.co/docs/transformers/tasks/audio_classification

You'll need to change the repo names from facebook/wav2vec2-base to the appropriate Audio Spectrogram Transformer repo name. You'll also need to change the preprocess function (https://huggingface.co/docs/transformers/tasks/audio_classification#preprocess) to something like:

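def preprocess_function(examples):
    # Collect the raw waveform of every example in the batch
    audio_arrays = [x["array"] for x in examples["audio"]]
    # Extract features for the whole batch at the extractor's expected sampling rate
    input_features = feature_extractor(audio_arrays, sampling_rate=feature_extractor.sampling_rate)
    return input_features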

This is all currently untested, so might require some playing around to make it work.
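For anyone who wants to check the data flow before wiring in the real model, the sketch below runs the same preprocess function on a hand-made batch. The `StubFeatureExtractor` class is a hypothetical stand-in, not the real AST feature extractor: it only mimics the call signature and returns an `input_values` entry (the key mentioned in this PR's commit list), recording each clip's length instead of computing spectrogram features. The 16 kHz rate is also just an assumed value for the stub.

```python
class StubFeatureExtractor:
    """Hypothetical stand-in for the real feature extractor (illustration only)."""

    sampling_rate = 16000  # assumed value for this stub

    def __call__(self, raw_speech, sampling_rate):
        if sampling_rate != self.sampling_rate:
            raise ValueError("unexpected sampling rate")
        # The real extractor would compute spectrogram features here;
        # the stub just records how many samples each clip contains.
        return {"input_values": [[float(len(clip))] for clip in raw_speech]}


feature_extractor = StubFeatureExtractor()


def preprocess_function(examples):
    # `examples` is a batch: a dict mapping column names to lists of values
    audio_arrays = [x["array"] for x in examples["audio"]]
    return feature_extractor(audio_arrays, sampling_rate=feature_extractor.sampling_rate)


# A batch shaped like the one a batched dataset map would pass in
batch = {"audio": [{"array": [0.0, 0.1, -0.1]}, {"array": [0.2, 0.0]}]}
features = preprocess_function(batch)
```

The point to notice is that the function receives a whole batch, so the list of arrays (not a single array) is what goes to the extractor. Swapping the stub for the real feature extractor should leave `preprocess_function` itself unchanged.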

HuggingFaceDocBuilderDev · 2022-11-15T08:34:18Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

* Optimizes DonutProcessor token2json method for speed * Applies black formatting * Updates Donut pretrained model name in test file * remaining pytorch type hints (#20217) * Update modeling_flava.py * Update modeling_markuplm.py * Update modeling_glpn.py * Update modeling_roc_bert.py * Update modeling_segformer.py * Update modeling_tapas.py * Update modeling_tapas.py * Update modeling_tapas.py * Update modeling_tapas.py * Update modeling_trocr.py * Update modeling_videomae.py * Update modeling_videomae.py * Update modeling_videomae.py * Update modeling_yolos.py * Update modeling_wav2vec2.py * Update modeling_jukebox.py * Update modeling_jukebox.py * Update modeling_jukebox.py * Update modeling_jukebox.py * Data collator for token classification pads labels column when receives pytorch tensors (#20244) * token cls data_collator pads labels column * remove walrus operator for code quality * remove redundat space * remove comment that was fixed * PR comments fix Co-authored-by: Alexander Markov <amarkov.me@gmail.com> * [Doctest] Add configuration_deformable_detr.py (#20273) * Update configuration_deformable_detr.py comment * Add DeformableDetrConfig to documentation_tests.txt * Fix summarization script (#20286) * [DOCTEST] Fix the documentation of RoCBert (#20142) * update part of the doc * add temp values, fix part of the doc * add template outputs * add correct models and outputss * style * fixup * [bnb] Let's warn users when saving 8-bit models (#20282) * add warning on 8-bit models - added tests - added wrapper * move to a private attribute - remove wrapper - changed `save_pretrained` method * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * fix suggestions Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Adding `zero-shot-object-detection` pipeline doctest. (#20274) * Adding `zero-shot-object-detection` pipeline doctest. * Remove nested_simplify. 
* Adding doctest for `object-detection` pipeline. (#20258) * Adding doctest for `object-detection` pipeline. * Removed nested_simplify. * Image transforms functionality used instead (#20278) * Image transforms functionality used instead * Import torch * Import rather than copy * Update src/transformers/models/conditional_detr/feature_extraction_conditional_detr.py * TF: add test for `PushToHubCallback` (#20231) * test hub tf callback * create repo before cloning it * Generate: general TF XLA constrastive search are now slow tests (#20277) * move contrastive search test to slow * Fixing the doctests failures. (#20294) * Fixing the doctests failures. * Fixup. * set the default cache_enable to True, aligned with the default value in pytorch cpu/cuda amp autocast (#20289) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Add docstrings for canine model (#19457) * Add docstrings for canine model * Update CanineForTokenClassification Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Add AutoBackbone + ResNetBackbone (#20229) * Add ResNetBackbone * Define channels and strides as property * Remove file * Add test for backbone * Update BackboneOutput class * Remove strides property * Fix docstring * Add backbones to SHOULD_HAVE_THEIR_OWN_PAGE * Fix auto mapping name * Add sanity check for out_features * Set stage names based on depths * Update to tuple Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local> * Add missing report button for Example test (#20293) Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * refactor test (#20300) - simplifies the devce checking test * [Tiny model creation] deal with `ImageProcessor` (#20298) Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Fix blender bot missleading doc (#20301) * fix the doc to specify that add_prefix_space = False * add correct expected output * remove two tokens that should not be suppressed (#20302) * [ASR Examples] Update 
README for Whisper (#20230) * [ASR Examples] Update README for seq2seq * add language info * add training results * re-word * Add padding image transformation (#19838) * Add padding transformation * Add in upstream changes * Update tests & docs * Code formatting tuples in docstring * Pin TensorFlow (#20313) * Pin to the right version... * Also pin TensorFlow CPU * Add AnyPrecisionAdamW optimizer (#18961) * Add AnyPrecisionAdamW optimizer * Add optim_args argument to TrainingArgs * Add tests for AnyPrecisionOptimizer * Change AnyPrecisionAdam default params to float32 * Move default_anyprecision_kwargs in trainer test * Rename AnyPrecisionAdamW * [Proposal] Breaking change `zero-shot-object-detection` for improved consistency. (#20280) * [Proposal] Breaking change `zero-shot-object-detection` for improved consistency. This is a proposal to modify the output of `zero-shot-object-detection` to provide better alignment with other pipelines. The output is now strictly the same as `object-detection` whereas before it would output lists of lists. The name `candidate_labels` is used throughout for consistency with other `zero-shot` pipelines. The pipeline is changed to `ChunkPipeline` to support batching cleanly. This removes all the lists and list of lists shenanigans, it's now a matter of the base pipeline handling all this not this specific one. **Breaking change**: It did remove complex calls potentials `pipe(images = [image1, image2], text_queries=[candidates1, candidates2])` to support only `pipe([{"image": image1, "candidate_labels": candidates1}, {"image": image2, "candidate_labels": candidates2}])` when dealing with lists and/or datasets. We could keep them, but it will add a lot of complexity to the code base, since the pipeline is rather young, I'd rather break to keep the code simpler, but we can revert this. **Breaking change**: The name of the argument is now `image` instead of `images` since it expects by default only 1 image. 
This is revertable like the previous one. **Breaking change**: The types is now simplified and flattened: `pipe(inputs) == [{**object1}, {**object2}]` instead of the previous `pipe(inputs) == [[{**object1}, {**object1}], [{**object2}]]` Where the different instances would be grouped by candidate labels within lists. IMHO this is not really desirable, since it would output empty lists and is only adding superflous indirection compared to `zero-shot-object-detection`. It is relatively change free in terms of how the results, it does change computation however since now the batching is handled by the pipeline itself. It **did** change the results for the small models so there seems to be a real difference in how the models handle this. * Fixing the doctests. * Behind is_torch_available. * Fix flakey test with seed (#20318) * Pin TF 2.10.1 for Push CI (#20319) Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Remove double brackets (#20307) * remove double brackets * oops get other bracket * TF: future proof our keras imports (#20317) * future proof our tf code * parse tf versions * Add Neighborhood Attention Transformer (NAT) and Dilated NAT (DiNAT) models (#20219) * Add DiNAT * Adds DiNAT + tests * Minor fixes * Added HF model * Add natten to dependencies. * Cleanup * Minor fixup * Reformat * Optional NATTEN import. * Reformat & add doc to _toctree * Reformat (finally) * Dummy objects for DiNAT * Add NAT + minor changes Adds NAT as its own independent model + docs, tests Adds NATTEN to ext deps to ensure ci picks it up. * Remove natten from `all` and `dev-torch` deps, add manual pip install to ci tests * Minor fixes. * Fix READMEs. * Requested changes to docs + minor fixes. * Requested changes. * Add NAT/DiNAT tests to layoutlm_job * Correction to Dinat doc. * Requested changes. 
* organize pipelines by modality (#20306) * Fix torch device issues (#20304) * fix device issue Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Generate: add generation config class (#20218) Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * translate zh quicktour(#20095) (#20181) * zh quicktour(#20095) * add zh to doc workflow * remove untranslation from toctree Co-authored-by: BeifangSusu <BeifangSusu@bfss.com> * Add Spanish translation of serialization.mdx (#20245) * Update _toctree and clone original content * Translate first three sections * Add more translated chapters. Only 3 more left. * Finish translation * Run style from doc-builder * Address recommended changes from reviewer * Add LayerScale to NAT/DiNAT (#20325) * Add LayerScale to NAT/DiNAT. Completely dropped the ball on LayerScale in the original PR (#20219). This is just an optional argument in both models, and is only activated for larger variants in order to provide training stability. * Add LayerScale to NAT/DiNAT. Minor error fixed. 
Co-authored-by: Ali Hassani <ahassanijr@gmail.com> * [Switch Transformers] Fix failing slow test (#20346) * run slow test on GPU * remove unnecessary device assignment * use `torch_device` instead * fix: "BigSicence" typo in docs (#20331) * add MobileNetV1 model (#17799) * add model files etc for MobileNetV2 rename files for MobileNetV1 initial implementation of MobileNetV1 fix conversion script cleanup write docs tweaks fix conversion script extract hidden states fix test cases make fixup fixup it all remove main from doc link fixes fix tests fix up use google org fix weird assert * fixup * use google organization for checkpoints * Generate: `model_kwargs` can also be an input to `prepare_inputs_for_generation` (#20353) * Update Special Language Tokens for PLBART (#19980) * Update Special Language Tokens for PLBART * fix format * making mapping for language codes and updating tests: * fix format * fix consistency * add assert to both tokenizer tests. * fix format * Update src/transformers/models/plbart/tokenization_plbart.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * improvin readability, setting self.tgt_lang * fixing * readability Co-authored-by: jordiclive <jordiclive19@imperial.ac.uk> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Add resources (#20296) Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local> * Enhance HfArgumentParser functionality and ease of use (#20323) * Enhance HfArgumentParser * Fix type hints for older python versions * Fix and add tests (+formatting) * Add changes * doc-builder formatting * Remove unused import "Call" * Add Audio Spectogram Transformer (#19981) * First draft * Make conversion script work * Add id2label mapping, run code quality * Fix copies * Add first draft of feature extractor * Update conversion script to use feature extractor * Make more tests pass * Add docs * update input_features to input_values + pad by default to max length * Fix doc tests * 
Add feature extractor tests * Add proper padding/truncation to feature extractor * Add support for conversion of all audioset checkpoints * Improve docs and extend conversion script * Fix README * Rename spectogram to spectrogram * Fix copies * Add integration test * Remove dummy conv * Update to ast * Update organization * Fix init * Rename model to AST * Add require_torchaudio annotator * Move import of ASTFeatureExtractor under a is_speech_available * Fix rebase * Add pipeline config * Update name of classifier head * Rename time_dimension and frequency_dimension for clarity * Remove print statement * Fix pipeline test * Fix pipeline test * Fix index table * Fix init * Fix conversion script * Rename to ForAudioClassification * Fix index table Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local> * Add inference section to task guides (#18781) * 📝 start adding inference section to task guides * ✨ make style * 📝 add multiple choice * add rest of inference sections * make style * add compute_metric, push_to_hub, pipeline * make style * add updated sequence and token classification * make style * make edits in token classification * add audio classification * make style * add asr * make style * add image classification * make style * add summarization * make style * add translation * make style * add multiple choice * add language modeling * add qa * make style * review and edits * apply reviews * make style * fix call to processor * apply audio reviews * update to better asr model * make style * Fix toctree for Section 3 in Spanish Documentation (#20360) * Order and group topics in the right section * Translate "Computer Vision" Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: IMvision12 <88665786+IMvision12@users.noreply.github.com> Co-authored-by: Alexander Markov <almarkv@yandex.ru> Co-authored-by: Alexander Markov <amarkov.me@gmail.com> Co-authored-by: Saad Mahmud <shuvro.mahmud79@gmail.com> Co-authored-by: Zachary Mueller 
<muellerzr@gmail.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com> Co-authored-by: Wang, Yi <yi.a.wang@intel.com> Co-authored-by: raghavanone <115454562+raghavanone@users.noreply.github.com> Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local> Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Co-authored-by: Sylvain Gugger <Sylvain.gugger@gmail.com> Co-authored-by: atturaioe <76523524+atturaioe@users.noreply.github.com> Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com> Co-authored-by: BFSS <31245245+bfss@users.noreply.github.com> Co-authored-by: BeifangSusu <BeifangSusu@bfss.com> Co-authored-by: Ian C <7807897+donelianc@users.noreply.github.com> Co-authored-by: Ali Hassani <ahassanijr@gmail.com> Co-authored-by: Raj Rajhans <me@rajrajhans.com> Co-authored-by: Matthijs Hollemans <mail@hollance.com> Co-authored-by: Jordan Clive <jordan.clive19@imperial.ac.uk> Co-authored-by: jordiclive <jordiclive19@imperial.ac.uk> Co-authored-by: Konstantin Dobler <konstantin.j.dobler@gmail.com>

* First draft
* Make conversion script work
* Add id2label mapping, run code quality
* Fix copies
* Add first draft of feature extractor
* Update conversion script to use feature extractor
* Make more tests pass
* Add docs
* update input_features to input_values + pad by default to max length
* Fix doc tests
* Add feature extractor tests
* Add proper padding/truncation to feature extractor
* Add support for conversion of all audioset checkpoints
* Improve docs and extend conversion script
* Fix README
* Rename spectogram to spectrogram
* Fix copies
* Add integration test
* Remove dummy conv
* Update to ast
* Update organization
* Fix init
* Rename model to AST
* Add require_torchaudio annotator
* Move import of ASTFeatureExtractor under a is_speech_available
* Fix rebase
* Add pipeline config
* Update name of classifier head
* Rename time_dimension and frequency_dimension for clarity
* Remove print statement
* Fix pipeline test
* Fix pipeline test
* Fix index table
* Fix init
* Fix conversion script
* Rename to ForAudioClassification
* Fix index table

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>

Co-authored-by: Sylvain Gugger <Sylvain.gugger@gmail.com> Co-authored-by: atturaioe <76523524+atturaioe@users.noreply.github.com> Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com> Co-authored-by: BFSS <31245245+bfss@users.noreply.github.com> Co-authored-by: BeifangSusu <BeifangSusu@bfss.com> Co-authored-by: Ian C <7807897+donelianc@users.noreply.github.com> Co-authored-by: Ali Hassani <ahassanijr@gmail.com> Co-authored-by: Raj Rajhans <me@rajrajhans.com> Co-authored-by: Matthijs Hollemans <mail@hollance.com> Co-authored-by: Jordan Clive <jordan.clive19@imperial.ac.uk> Co-authored-by: jordiclive <jordiclive19@imperial.ac.uk> Co-authored-by: Konstantin Dobler <konstantin.j.dobler@gmail.com>

NielsRogge requested review from sgugger, patrickvonplaten and sanchit-gandhi and removed request for patrickvonplaten October 31, 2022 13:24

sanchit-gandhi reviewed Oct 31, 2022

sgugger approved these changes Nov 1, 2022

sgugger mentioned this pull request Nov 4, 2022 in "Attempting to test automatically the _keys_to_ignore." #20042 (merged)

NielsRogge force-pushed the add_ast branch 2 times, most recently from 687f02b to a754720 Compare November 14, 2022 19:24

NielsRogge force-pushed the add_ast branch from 5b4eea6 to 5065bfb Compare November 15, 2022 15:06

NielsRogge force-pushed the add_ast branch 2 times, most recently from ac423e1 to c0918c5 Compare November 21, 2022 10:14

Niels Rogge added 12 commits November 21, 2022 16:24

First draft

8c10464

Make conversion script work

353ead3

Add id2label mapping, run code quality

f996abe

Fix copies

7006ed8

Add first draft of feature extractor

488e7e5

Update conversion script to use feature extractor

6a27c2e

Make more tests pass

e8feefc

Add docs

c0ec268

update input_features to input_values + pad by default to max length

d952e5f

Fix doc tests

3547d7c

Add feature extractor tests

0047122

Add proper padding/truncation to feature extractor

56aafc5

Niels Rogge added 25 commits November 21, 2022 16:24

Add support for conversion of all audioset checkpoints

f194e9d

Improve docs and extend conversion script

df8575c

Fix README

a08338f

Rename spectogram to spectrogram

1628536

Fix copies

9dede0c

Add integration test

fdead74

Remove dummy conv

70c948a

Update to ast

fd10b76

Update organization

341ade2

Fix init

5685d73

Rename model to AST

a0e7d50

Add require_torchaudio annotator

19cf9f6

Move import of ASTFeatureExtractor under a is_speech_available

6588eba

Fix rebase

3de1686

Add pipeline config

3b36797

Update name of classifier head

0a8afca

Rename time_dimension and frequency_dimension for clarity

4deea23

Remove print statement

8113ed1

Fix pipeline test

3154869

Fix pipeline test

4282f68

Fix index table

7e38d96

Fix init

519481c

Fix conversion script

5dbf899

Rename to ForAudioClassification

369ed79

Fix index table

b8e31b9

NielsRogge force-pushed the add_ast branch from 474ba04 to b8e31b9 Compare November 21, 2022 15:25

NielsRogge merged commit 4973d2a into huggingface:main Nov 21, 2022

NielsRogge commented Oct 31, 2022 (edited)

sanchit-gandhi left a comment

sanchit-gandhi Oct 31, 2022

sgugger left a comment

HuggingFaceDocBuilderDev commented Nov 1, 2022 (edited)

FrankFundel commented Nov 4, 2022

sanchit-gandhi commented Nov 4, 2022

HuggingFaceDocBuilderDev commented Nov 15, 2022

HuggingFaceDocBuilderDev commented Nov 15, 2022

HuggingFaceDocBuilderDev commented Nov 15, 2022
