Add WhisperModel to transformers #19166

Merged
merged 167 commits into from Oct 5, 2022
Changes from 161 commits
cd94d03
simplify loop
ArthurZucker Sep 19, 2022
569338e
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 22, 2022
46b0ebe
add featur extractor
ArthurZucker Sep 22, 2022
af9d14f
add model
ArthurZucker Sep 22, 2022
00cdcbe
start conversion
ArthurZucker Sep 22, 2022
a916bf1
add dropout
ArthurZucker Sep 22, 2022
7ebda7d
initial commit of test files
ArthurZucker Sep 22, 2022
974235f
copnversion for all models
ArthurZucker Sep 22, 2022
40c42ab
update processor for correct padding
ArthurZucker Sep 22, 2022
792d964
update feature extraction
ArthurZucker Sep 22, 2022
339f95c
update integration test logits match
ArthurZucker Sep 22, 2022
3a26273
fmnt: off for the logits
ArthurZucker Sep 22, 2022
ad5f990
on the fly mel bank
ArthurZucker Sep 22, 2022
d58b7a0
small nit
ArthurZucker Sep 22, 2022
c61258b
update test
ArthurZucker Sep 22, 2022
6acc131
update tokenizer
ArthurZucker Sep 23, 2022
71b3be8
nit feature extraction
ArthurZucker Sep 23, 2022
b4983e4
update
ArthurZucker Sep 23, 2022
e66815a
update tokenizer test
ArthurZucker Sep 23, 2022
a980ccc
adds logit processor and update tokenizer to get supress tokens
ArthurZucker Sep 23, 2022
001dff2
style
ArthurZucker Sep 23, 2022
81a7099
clean convert
ArthurZucker Sep 23, 2022
cbf1b4a
revert to original modeling tf utils
ArthurZucker Sep 23, 2022
b2d0e5d
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 23, 2022
66237cb
Update
ArthurZucker Sep 25, 2022
d2cfce3
update
ArthurZucker Sep 25, 2022
ca2a225
nit
ArthurZucker Sep 25, 2022
5afbaad
clean convert file
ArthurZucker Sep 25, 2022
9ce0bc9
update tests and nits
ArthurZucker Sep 25, 2022
16033a5
quality
ArthurZucker Sep 25, 2022
1f95255
slow generation test
ArthurZucker Sep 25, 2022
830528a
ffn_dim to allow customization
ArthurZucker Sep 25, 2022
9fc86bc
update readme
ArthurZucker Sep 25, 2022
0497d0f
add to toctreee
ArthurZucker Sep 25, 2022
57cb281
start fixing integration tests
ArthurZucker Sep 26, 2022
a27eb00
update tests and code
ArthurZucker Sep 26, 2022
ef6e08e
fix feature extractor
ArthurZucker Sep 26, 2022
fc5ce23
fix config tests common
ArthurZucker Sep 26, 2022
655e460
update code to fix tests
ArthurZucker Sep 26, 2022
d7dcfbd
fix feature exctractor
ArthurZucker Sep 26, 2022
c64e8a6
nit feature extraction
ArthurZucker Sep 26, 2022
edce53e
update test for new feature extractor
ArthurZucker Sep 26, 2022
e661aef
style
ArthurZucker Sep 26, 2022
2859221
add absrtact
ArthurZucker Sep 26, 2022
9a69dbd
large logits wioth custom decoder input ids
ArthurZucker Sep 26, 2022
6f1858d
wraap around is otrch available
ArthurZucker Sep 26, 2022
ac40d6d
fix feature extractor
ArthurZucker Sep 26, 2022
f8f7463
correct logits for whisper small.en
ArthurZucker Sep 26, 2022
5540472
nit
ArthurZucker Sep 26, 2022
71ac3f7
fix encoder_attentino_mask
ArthurZucker Sep 26, 2022
5f4a1f9
some fixes
patrickvonplaten Sep 26, 2022
cda4759
remove unnecessary inputs
patrickvonplaten Sep 26, 2022
7b892dd
nits
ArthurZucker Sep 27, 2022
d49614b
add normalizer file
ArthurZucker Sep 27, 2022
171e034
update etst tokenization
ArthurZucker Sep 27, 2022
cae0269
fix attention mask not defined
ArthurZucker Sep 27, 2022
de47019
Add model to README
NielsRogge Sep 27, 2022
9a8f99f
Fix doc tests
NielsRogge Sep 27, 2022
12b1ca5
fix generate
ArthurZucker Sep 27, 2022
b4c0cb9
remove uncoder attention mask useless
ArthurZucker Sep 27, 2022
f6b7550
update test modeling whisper
ArthurZucker Sep 27, 2022
8c96dfd
update condfig to add second non supress tokens
ArthurZucker Sep 27, 2022
378841c
nits on feature exrtactor
ArthurZucker Sep 27, 2022
3a2c411
nit for test tokenizers
ArthurZucker Sep 27, 2022
4dfbba1
update etsts
ArthurZucker Sep 27, 2022
0a39f49
update tests
ArthurZucker Sep 28, 2022
1fd1d52
update tokenization test
ArthurZucker Sep 28, 2022
2a900f4
fixup
ArthurZucker Sep 28, 2022
d16d3e1
invalidated hf token. Clean convert openai to whisper
ArthurZucker Sep 28, 2022
f62fd14
fix logit tests
ArthurZucker Sep 28, 2022
d4efa53
fixup
ArthurZucker Sep 28, 2022
4fe81c6
Merge pull request #1 from NielsRogge/add-whisper
ArthurZucker Sep 28, 2022
cf156ce
clean merge
ArthurZucker Sep 28, 2022
570941b
revert toc_tree changes
ArthurZucker Sep 28, 2022
261c1f2
remove useless LogitProcessor
ArthurZucker Sep 28, 2022
d65e755
Update whisper .mdx
ArthurZucker Sep 28, 2022
aa95777
update config file doc
ArthurZucker Sep 28, 2022
0a23c18
update configuration docstring
ArthurZucker Sep 28, 2022
5300956
update test tokenization
ArthurZucker Sep 28, 2022
afcf30d
update test tokenization
ArthurZucker Sep 28, 2022
03f11d8
update tokenization whisper
ArthurZucker Sep 28, 2022
017010f
update feature extraction
ArthurZucker Sep 28, 2022
9f0f332
nit test name
ArthurZucker Sep 28, 2022
fde6e99
style
ArthurZucker Sep 28, 2022
9cca7eb
quality
ArthurZucker Sep 28, 2022
fa69008
remove get suppress tokens and update non_speech tokens global variables
ArthurZucker Sep 28, 2022
69e2dce
Update src/transformers/models/whisper/feature_extraction_whisper.py
ArthurZucker Sep 28, 2022
4243f7d
clean modeling whisper and test
ArthurZucker Sep 28, 2022
044e371
fix large test
ArthurZucker Sep 28, 2022
1268f4b
Add multilingual audio test, and translate test
ArthurZucker Sep 28, 2022
1578988
style
ArthurZucker Sep 28, 2022
8387ce8
fix larg multilingual test
ArthurZucker Sep 28, 2022
6b14b67
nits
ArthurZucker Sep 28, 2022
cafe5f1
Update docs/source/en/model_doc/whisper.mdx
ArthurZucker Sep 28, 2022
bbf35b1
add copied from for attention layer
ArthurZucker Sep 28, 2022
40284ef
remove attention masks in doc
ArthurZucker Sep 28, 2022
ebb79e9
add english normalizer
ArthurZucker Sep 28, 2022
2fce16c
Merge branch 'add-whisper' of https://github.com/ArthurZucker/transfo…
ArthurZucker Sep 28, 2022
1b6a09c
update tokenization test
ArthurZucker Sep 28, 2022
5ca9dcb
remove copied from in whisper attention : no bias in k_proj only
ArthurZucker Sep 29, 2022
d0cf660
wrap around dependencies in english normalizer
ArthurZucker Sep 29, 2022
378b84b
style
ArthurZucker Sep 29, 2022
a71ead9
correct import generation logits
ArthurZucker Sep 29, 2022
bdc1259
for now, wrap feature extractor with torch
ArthurZucker Sep 29, 2022
bd99c23
Update src/transformers/models/whisper/convert_openai_whisper_to_tfms.py
ArthurZucker Sep 29, 2022
e204a51
Update src/transformers/models/whisper/configuration_whisper.py
ArthurZucker Sep 29, 2022
7fa70db
Update docs/source/en/model_doc/whisper.mdx
ArthurZucker Sep 29, 2022
5d95201
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Sep 29, 2022
5d62a99
remove torch depencies for feature extraction and style
ArthurZucker Sep 29, 2022
96be7b5
Merge branch 'add-whisper' of https://github.com/ArthurZucker/transfo…
ArthurZucker Sep 29, 2022
f448015
fixup
ArthurZucker Sep 29, 2022
62b2572
nit
ArthurZucker Sep 29, 2022
351c942
update logitds
ArthurZucker Sep 29, 2022
6adeabe
style
ArthurZucker Sep 29, 2022
4b07c61
nit
ArthurZucker Sep 29, 2022
a276b07
nits and fix final tests
ArthurZucker Sep 29, 2022
07dd529
add `is_more_itertools_available` to utils
ArthurZucker Sep 29, 2022
bbafa58
quality
ArthurZucker Sep 29, 2022
07164fa
add begin supress tokens, supress tokens to generate args and config
ArthurZucker Sep 30, 2022
fd0c7e9
clean supressTokensLogitProcessor in generation logits
ArthurZucker Sep 30, 2022
1f4fe24
Nit naming
ArthurZucker Sep 30, 2022
848f1c3
add supressTokensAtBegin
ArthurZucker Sep 30, 2022
2498086
udpate tests, supress tokens to None or correct values
ArthurZucker Sep 30, 2022
3269d57
nit and style
ArthurZucker Sep 30, 2022
6b2ebd4
update RAG to fit test and generate_logit
ArthurZucker Sep 30, 2022
dff15c2
add copy pasted statment on english normalizer
ArthurZucker Sep 30, 2022
7c51de1
add arguments to config_common_kwargs
ArthurZucker Sep 30, 2022
da99700
Update src/transformers/generation_utils.py
ArthurZucker Sep 30, 2022
af1beac
Update src/transformers/generation_logits_process.py
ArthurZucker Sep 30, 2022
8277239
Update src/transformers/models/whisper/configuration_whisper.py
ArthurZucker Sep 30, 2022
7b5e793
Apply suggestions from code review
ArthurZucker Sep 30, 2022
325d088
revert changes based on reviews
ArthurZucker Sep 30, 2022
2f88dc8
update doc and nits
ArthurZucker Sep 30, 2022
2724add
Merge branch 'add-whisper' of https://github.com/ArthurZucker/transfo…
ArthurZucker Sep 30, 2022
8df5a58
more nits
ArthurZucker Sep 30, 2022
cef34fd
last nits
ArthurZucker Sep 30, 2022
90c9180
update test configuration common
ArthurZucker Oct 1, 2022
72e86ed
add BART name in decoder attention mask documentation
ArthurZucker Oct 1, 2022
7d69c3c
Update src/transformers/models/whisper/modeling_whisper.py
ArthurZucker Oct 1, 2022
7cf0df0
Merge branch 'add-whisper' of https://github.com/ArthurZucker/transfo…
ArthurZucker Oct 1, 2022
84c25dc
style
ArthurZucker Oct 1, 2022
bbf84b1
nit
ArthurZucker Oct 1, 2022
f2ac0f5
nit
ArthurZucker Oct 1, 2022
93e9b2a
add english.json file to git
ArthurZucker Oct 3, 2022
e6ee5b5
nits on documentation
ArthurZucker Oct 3, 2022
a6361d4
nit
ArthurZucker Oct 3, 2022
f92b9a8
nits
ArthurZucker Oct 3, 2022
8d40196
last styling
ArthurZucker Oct 3, 2022
392563e
add main toctree file
ArthurZucker Oct 3, 2022
2911736
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Oct 3, 2022
009cdef
remove sentence piece dependency
ArthurZucker Oct 3, 2022
ef83269
clean init file
ArthurZucker Oct 3, 2022
78d1ed2
fix tokenizer that has no dependencies on sentencepiece
ArthurZucker Oct 3, 2022
f572f5f
update whisper init file, nit
ArthurZucker Oct 3, 2022
837a410
remove english.json file
ArthurZucker Oct 4, 2022
ede09a4
add get decoder prompt id
ArthurZucker Oct 4, 2022
529746a
revert changes and add forced logit processor
ArthurZucker Oct 4, 2022
d403c9d
nit
ArthurZucker Oct 4, 2022
b82fe09
clean normalizer
ArthurZucker Oct 4, 2022
ff8aa6c
remove protected
ArthurZucker Oct 4, 2022
40461a9
update
ArthurZucker Oct 5, 2022
c5a2581
Update src/transformers/models/whisper/configuration_whisper.py
ArthurZucker Oct 5, 2022
2c61839
update based on review
ArthurZucker Oct 5, 2022
9eb7cc3
Merge branch 'add-whisper' of https://github.com/ArthurZucker/transfo…
ArthurZucker Oct 5, 2022
135be7d
Update src/transformers/models/whisper/configuration_whisper.py
ArthurZucker Oct 5, 2022
8e047f9
add batched tests
ArthurZucker Oct 5, 2022
3ea5d78
Merge branch 'add-whisper' of https://github.com/ArthurZucker/transfo…
ArthurZucker Oct 5, 2022
1 change: 1 addition & 0 deletions README.md
@@ -393,6 +393,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[Whisper](https://huggingface.co/docs/transformers/main/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1 change: 1 addition & 0 deletions README_ko.md
@@ -343,6 +343,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[Whisper](https://huggingface.co/docs/transformers/main/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1 change: 1 addition & 0 deletions README_zh-hans.md
@@ -367,6 +367,7 @@ conda install -c huggingface transformers
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (来自 Facebook AI) 伴随论文 [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 由 Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino 发布。
1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (来自 Facebook AI) 伴随论文 [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) 由 Qiantong Xu, Alexei Baevski, Michael Auli 发布。
1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[Whisper](https://huggingface.co/docs/transformers/main/model_doc/whisper)** (来自 OpenAI) 伴随论文 [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) 由 Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever 发布。
1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (来自 Microsoft Research) 伴随论文 [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) 由 Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling 发布。
1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (来自 Facebook) 伴随论文 [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) 由 Guillaume Lample and Alexis Conneau 发布。
1 change: 1 addition & 0 deletions README_zh-hant.md
@@ -379,6 +379,7 @@ conda install -c huggingface transformers
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[Whisper](https://huggingface.co/docs/transformers/main/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -447,6 +447,8 @@
title: Wav2Vec2Phoneme
- local: model_doc/wavlm
title: WavLM
- local: model_doc/whisper
title: Whisper
- local: model_doc/xls_r
title: XLS-R
- local: model_doc/xlsr_wav2vec2
2 changes: 2 additions & 0 deletions docs/source/en/index.mdx
@@ -183,6 +183,7 @@ The documentation is organized into five sections:
1. **[Wav2Vec2-Conformer](model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1. **[Wav2Vec2Phoneme](model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
1. **[WavLM](model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[Whisper](model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
1. **[X-CLIP](model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
1. **[XGLM](model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
@@ -329,6 +330,7 @@ Flax), PyTorch, and/or TensorFlow.
| Wav2Vec2 | ✅ | ❌ | ✅ | ✅ | ✅ |
| Wav2Vec2-Conformer | ❌ | ❌ | ✅ | ❌ | ❌ |
| WavLM | ❌ | ❌ | ✅ | ❌ | ❌ |
| Whisper | ✅ | ❌ | ✅ | ❌ | ❌ |
| X-CLIP | ❌ | ❌ | ✅ | ❌ | ❌ |
| XGLM | ✅ | ✅ | ✅ | ✅ | ✅ |
| XLM | ✅ | ❌ | ✅ | ✅ | ❌ |
68 changes: 68 additions & 0 deletions docs/source/en/model_doc/whisper.mdx
@@ -0,0 +1,68 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Whisper

## Overview

The Whisper model was proposed in [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.

The abstract from the paper is the following:

*We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zeroshot transfer setting without the need for any finetuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.*


Tips:

- The model usually performs well without requiring any finetuning.
- The architecture follows a classic encoder-decoder design, so inference relies on the [`~generation_utils.GenerationMixin.generate`] function.
- One can use [`WhisperProcessor`] to prepare audio for the model and to decode the predicted IDs back into text.
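
The tips above can be sketched as a single helper. This is a minimal, hedged example rather than the reference usage: the `transcribe` function name, the `openai/whisper-tiny.en` checkpoint, and the 16 kHz default sampling rate are assumptions chosen for illustration, and the snippet assumes PyTorch is installed alongside `transformers`.

```python
def transcribe(waveform, sampling_rate=16000, checkpoint="openai/whisper-tiny.en"):
    """Transcribe a 1-D float waveform and return a list with one string.

    `transcribe` and the default checkpoint name are illustrative assumptions.
    """
    # Imported lazily so that defining the helper does not require the
    # heavy dependencies to be installed.
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(checkpoint)
    model = WhisperForConditionalGeneration.from_pretrained(checkpoint)

    # The processor's feature extractor turns raw audio into input features.
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")

    # Inference goes through generate(), as for other encoder-decoder models.
    generated_ids = model.generate(inputs.input_features)

    # The processor's tokenizer maps the predicted IDs back into text.
    return processor.batch_decode(generated_ids, skip_special_tokens=True)
```

Calling `transcribe(audio_array)` with a NumPy array of mono audio should return the decoded text; loading the processor and model once outside the helper would be preferable when transcribing many clips.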

This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ).
The original code can be found [here](https://github.com/openai/whisper).


## WhisperConfig

[[autodoc]] WhisperConfig

## WhisperTokenizer

[[autodoc]] WhisperTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary

## WhisperFeatureExtractor

[[autodoc]] WhisperFeatureExtractor
- __call__

## WhisperProcessor

[[autodoc]] WhisperProcessor
- __call__
- from_pretrained
- save_pretrained
- batch_decode
- decode

## WhisperModel

[[autodoc]] WhisperModel
- forward

## WhisperForConditionalGeneration

[[autodoc]] WhisperForConditionalGeneration
- forward
28 changes: 28 additions & 0 deletions src/transformers/__init__.py
@@ -390,6 +390,13 @@
"WAVLM_PRETRAINED_CONFIG_ARCHIVE_MAP",
"WavLMConfig",
],
"models.whisper": [
"WHISPER_PRETRAINED_CONFIG_ARCHIVE_MAP",
"WhisperConfig",
"WhisperFeatureExtractor",
"WhisperProcessor",
"WhisperTokenizer",
],
"models.x_clip": [
"XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP",
"XCLIPConfig",
@@ -1861,6 +1868,14 @@
"Speech2TextPreTrainedModel",
]
)
_import_structure["models.whisper"].extend(
[
"WHISPER_PRETRAINED_MODEL_ARCHIVE_LIST",
"WhisperForConditionalGeneration",
"WhisperModel",
"WhisperPreTrainedModel",
]
)
_import_structure["models.speech_to_text_2"].extend(["Speech2Text2ForCausalLM", "Speech2Text2PreTrainedModel"])
_import_structure["models.splinter"].extend(
[
@@ -3337,6 +3352,13 @@
from .models.wav2vec2_phoneme import Wav2Vec2PhonemeCTCTokenizer
from .models.wav2vec2_with_lm import Wav2Vec2ProcessorWithLM
from .models.wavlm import WAVLM_PRETRAINED_CONFIG_ARCHIVE_MAP, WavLMConfig
from .models.whisper import (
WHISPER_PRETRAINED_CONFIG_ARCHIVE_MAP,
WhisperConfig,
WhisperFeatureExtractor,
WhisperProcessor,
WhisperTokenizer,
)
from .models.x_clip import (
XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP,
XCLIPConfig,
@@ -4720,6 +4742,12 @@
WavLMModel,
WavLMPreTrainedModel,
)
from .models.whisper import (
WHISPER_PRETRAINED_MODEL_ARCHIVE_LIST,
WhisperForConditionalGeneration,
WhisperModel,
WhisperPreTrainedModel,
)
from .models.x_clip import (
XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST,
XCLIPModel,
2 changes: 2 additions & 0 deletions src/transformers/configuration_utils.py
@@ -299,6 +299,8 @@ def __init__(self, **kwargs):
self.forced_eos_token_id = kwargs.pop("forced_eos_token_id", None)
self.remove_invalid_values = kwargs.pop("remove_invalid_values", False)
self.exponential_decay_length_penalty = kwargs.pop("exponential_decay_length_penalty", None)
self.suppress_tokens = kwargs.pop("suppress_tokens", None)
self.begin_suppress_tokens = kwargs.pop("begin_suppress_tokens", None)

# Fine-tuning task arguments
self.architectures = kwargs.pop("architectures", None)
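
For readers unfamiliar with the two new config arguments, here is a rough, pure-Python sketch of the idea behind them, assuming `suppress_tokens` lists token ids that should never be sampled and `begin_suppress_tokens` lists ids that are blocked only at the first generated position. The `apply_suppression` helper and its `step` / `begin_index` parameters are hypothetical names for illustration; this is not the logits processor code added in this PR.

```python
import math


def apply_suppression(scores, step, suppress_tokens=None,
                      begin_suppress_tokens=None, begin_index=0):
    """Return a copy of `scores` with suppressed token ids set to -inf.

    scores: list of floats, one logit per vocabulary id (single sequence).
    step: position currently being generated (hypothetical bookkeeping).
    """
    masked = list(scores)  # copy so the caller's logits are left untouched

    # Ids in `suppress_tokens` are blocked at every generation step.
    for token_id in suppress_tokens or []:
        masked[token_id] = -math.inf

    # Ids in `begin_suppress_tokens` are blocked only at the first step.
    if step == begin_index:
        for token_id in begin_suppress_tokens or []:
            masked[token_id] = -math.inf

    return masked
```

Setting a logit to negative infinity gives the token zero probability after the softmax, which is why leaving both arguments at `None` (the defaults in the hunk above) changes nothing.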