Add Switch transformers #19323

Merged
merged 112 commits on Nov 15, 2022
Changes from 65 commits

Commits
d45930b
first commit
younesbelkada Oct 4, 2022
59c6512
add more comments
younesbelkada Oct 4, 2022
0906870
add router v1
younesbelkada Oct 4, 2022
9c7643c
clean up
younesbelkada Oct 4, 2022
cddfce7
clean up
younesbelkada Oct 4, 2022
85c34e9
clean up
younesbelkada Oct 4, 2022
d5d092c
v0 routers
younesbelkada Oct 4, 2022
62d34bd
added more router
younesbelkada Oct 6, 2022
2eea820
last router
younesbelkada Oct 6, 2022
a65c7e4
improved docstring
younesbelkada Oct 8, 2022
7f1026d
v0 sparse mlp
younesbelkada Oct 9, 2022
7e888c9
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker Oct 12, 2022
4847670
replace wrong naming
ArthurZucker Oct 12, 2022
eeb2877
forward pass run
younesbelkada Oct 13, 2022
5f42b6b
update MOE layer
ArthurZucker Oct 14, 2022
60ab566
small router update
ArthurZucker Oct 14, 2022
ae2fbc4
fixup
ArthurZucker Oct 14, 2022
d7ba596
consistency
ArthurZucker Oct 14, 2022
1ae5563
remove scatter router
ArthurZucker Oct 14, 2022
60ec299
remove abstract layer
ArthurZucker Oct 14, 2022
6181bfa
update test and model for integration testing
ArthurZucker Oct 14, 2022
a6a7d57
v1 conversion
younesbelkada Oct 20, 2022
2e2be49
update
ArthurZucker Oct 20, 2022
6ede608
hardcode hack
ArthurZucker Oct 20, 2022
b9cac05
all keys match
Oct 24, 2022
6276ce7
add gin conversion, without additional libraries
ArthurZucker Oct 24, 2022
d30c6f4
Merge branch 'add_switch_transformers' of https://github.com/younesbe…
ArthurZucker Oct 24, 2022
0c4e54a
update conversion sctipy
ArthurZucker Oct 24, 2022
3b0ee25
delete router file
ArthurZucker Oct 24, 2022
5bd7a62
update tests wrt router deletion
ArthurZucker Oct 24, 2022
751cfdc
fix router issues
ArthurZucker Oct 24, 2022
76d0199
update expert code
ArthurZucker Oct 25, 2022
4fde649
update, logits match, code needsREFACTORING
ArthurZucker Oct 25, 2022
c526376
Refactor code
ArthurZucker Oct 25, 2022
5673476
add generate tests
ArthurZucker Oct 26, 2022
25ec9b6
add support for router loss
ArthurZucker Oct 26, 2022
55ab162
fix forward error
younesbelkada Oct 26, 2022
fa47eef
refactor a bit
younesbelkada Oct 26, 2022
afb3d37
remove `FlaxSwitchTransformers` modules
younesbelkada Oct 26, 2022
ccaaf61
more tests pass
younesbelkada Oct 26, 2022
805fe69
Update code
ArthurZucker Oct 27, 2022
955a811
fixup
ArthurZucker Oct 27, 2022
0152722
fix tests
younesbelkada Oct 27, 2022
5bcf84f
fix doc
younesbelkada Oct 27, 2022
aeff41c
fix doc + tokenization
younesbelkada Oct 27, 2022
58b9426
fix tokenizer test
younesbelkada Oct 27, 2022
ba8bf87
fix test
younesbelkada Oct 27, 2022
41dadd1
fix loss output
younesbelkada Oct 28, 2022
e8bff00
update code for backward pass
ArthurZucker Oct 28, 2022
7a43eae
Merge branch 'add_switch_transformers' of https://github.com/younesbe…
ArthurZucker Oct 28, 2022
84a6447
add loss support
younesbelkada Oct 28, 2022
5890744
update documentation
ArthurZucker Oct 28, 2022
7c0fa4b
fix documentation, clean tokenizer
ArthurZucker Oct 28, 2022
f426067
Merge branch 'add_switch_transformers' of https://github.com/younesbe…
ArthurZucker Oct 28, 2022
57acea7
more doc fix, cleanup example_switch
ArthurZucker Oct 28, 2022
2e3d7b1
fix failing test
younesbelkada Oct 28, 2022
d21f9e0
fix test
younesbelkada Oct 28, 2022
de60172
fix test
younesbelkada Oct 28, 2022
bae848a
fix loss issue
younesbelkada Oct 28, 2022
1f6b91a
move layer
ArthurZucker Oct 28, 2022
8d52978
Merge branch 'add_switch_transformers' of https://github.com/younesbe…
ArthurZucker Oct 28, 2022
cdb7768
update doc and fix router capacity usage
ArthurZucker Oct 28, 2022
aac7137
fixup
ArthurZucker Oct 28, 2022
7129478
add sparse mlp index for documentation on hub
ArthurZucker Oct 28, 2022
56dd559
fixup
ArthurZucker Oct 28, 2022
00186a8
test sparse mix architecture
ArthurZucker Oct 28, 2022
9d62a07
Apply suggestions from code review
ArthurZucker Oct 28, 2022
7827d08
Update docs/source/en/model_doc/switch_transformers.mdx
ArthurZucker Oct 28, 2022
a2f725d
fixup on update
ArthurZucker Oct 28, 2022
20b076e
fix tests
younesbelkada Oct 28, 2022
5e694ea
Merge branch 'add_switch_transformers' of https://github.com/younesbe…
younesbelkada Oct 28, 2022
b19e392
fix another test
younesbelkada Oct 28, 2022
26f5387
attempt fix
younesbelkada Oct 28, 2022
4444688
Update src/transformers/models/switch_transformers/configuration_swit…
younesbelkada Oct 28, 2022
44b8a81
Update src/transformers/models/switch_transformers/convert_switch_tra…
younesbelkada Oct 28, 2022
32903e3
try
younesbelkada Oct 28, 2022
bcff9e4
all tests pass
younesbelkada Oct 28, 2022
da48000
fix jitter noise
younesbelkada Oct 28, 2022
fe9c6b9
Apply suggestions from code review
younesbelkada Oct 31, 2022
0f8139e
doc tests pass
younesbelkada Oct 31, 2022
7ad488d
Update src/transformers/models/switch_transformers/modeling_switch_tr…
younesbelkada Nov 2, 2022
deb2b47
Update src/transformers/models/switch_transformers/modeling_switch_tr…
younesbelkada Nov 2, 2022
88e68b5
remove assert
younesbelkada Nov 2, 2022
16e23c4
change config order
younesbelkada Nov 2, 2022
203f053
Merge remote-tracking branch 'upstream_2/main' into add_switch_transf…
younesbelkada Nov 2, 2022
1231e2b
fix readme japanese
younesbelkada Nov 2, 2022
b4360d2
Apply suggestions from code review
younesbelkada Nov 3, 2022
0240906
remove parallelizable tests + add one liners
younesbelkada Nov 3, 2022
0978871
remove ONNX config
younesbelkada Nov 3, 2022
ccc28a9
fix nits
younesbelkada Nov 3, 2022
ef1fa19
remove `_get_router`
younesbelkada Nov 3, 2022
e2dc2b6
remove asserts
younesbelkada Nov 3, 2022
a2d786b
add check in test for `router_dtype`
younesbelkada Nov 3, 2022
ab67f48
add `SwitchTransformersConfig` in `run_pipeline_test`
younesbelkada Nov 3, 2022
7c3e5aa
Update tests/pipelines/test_pipelines_summarization.py
younesbelkada Nov 4, 2022
1326126
add huge model conversion script
ArthurZucker Nov 4, 2022
a3269a2
Merge branch 'add_switch_transformers' of https://github.com/younesbe…
ArthurZucker Nov 4, 2022
c7dec49
fix slow tests
younesbelkada Nov 4, 2022
edc7723
add make dir
ArthurZucker Nov 4, 2022
4fc8b67
Merge branch 'add_switch_transformers' of https://github.com/younesbe…
ArthurZucker Nov 4, 2022
d3a7795
style on new script
ArthurZucker Nov 4, 2022
437eef7
fix nits
younesbelkada Nov 4, 2022
f82f0cf
Update src/transformers/models/switch_transformers/configuration_swit…
younesbelkada Nov 7, 2022
cbb4c77
add google as authors
younesbelkada Nov 7, 2022
2f58069
fix year
younesbelkada Nov 7, 2022
2dd9cff
remove last `assert` statements
younesbelkada Nov 7, 2022
16e7ff5
standardize vertical spaces
younesbelkada Nov 7, 2022
98c62ef
Merge remote-tracking branch 'upstream/main' into add_switch_transfor…
younesbelkada Nov 15, 2022
1208b86
fix failing import
younesbelkada Nov 15, 2022
fc4921d
fix another failing test
younesbelkada Nov 15, 2022
21010f7
Remove strange àuthorized_keys`
ArthurZucker Nov 15, 2022
0188acd
removing todo and padding that is never used
ArthurZucker Nov 15, 2022
1 change: 1 addition & 0 deletions README.md
@@ -373,6 +373,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
1. **[SwitchTransformers](https://huggingface.co/docs/transformers/main/model_doc/switch_transformers)** (from Google) released with the paper [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961) by William Fedus, Barret Zoph, Noam Shazeer.
1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[TAPAS](https://huggingface.co/docs/transformers/model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
1 change: 1 addition & 0 deletions README_es.md
@@ -373,6 +373,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
1. **[SwitchTransformers](https://huggingface.co/docs/transformers/main/model_doc/switch_transformers)** (from Google) released with the paper [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961) by William Fedus, Barret Zoph, Noam Shazeer.
1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[TAPAS](https://huggingface.co/docs/transformers/model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
1 change: 1 addition & 0 deletions README_ko.md
@@ -323,6 +323,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
1. **[SwitchTransformers](https://huggingface.co/docs/transformers/main/model_doc/switch_transformers)** (from Google) released with the paper [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961) by William Fedus, Barret Zoph, Noam Shazeer.
1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[TAPAS](https://huggingface.co/docs/transformers/model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
1 change: 1 addition & 0 deletions README_zh-hans.md
@@ -347,6 +347,7 @@ conda install -c huggingface transformers
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (来自 Berkeley) 伴随论文 [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) 由 Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer 发布。
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (来自 Microsoft) 伴随论文 [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) 由 Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo 发布。
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (来自 Microsoft) 伴随论文 [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) 由 Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo 发布。
1. **[SwitchTransformers](https://huggingface.co/docs/transformers/main/model_doc/switch_transformers)** (from Google) released with the paper [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961) by William Fedus, Barret Zoph, Noam Shazeer.
1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (来自 Google AI) 伴随论文 [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) 由 Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu 发布。
1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (来自 Google AI) 伴随论文 [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) 由 Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu 发布。
1. **[TAPAS](https://huggingface.co/docs/transformers/model_doc/tapas)** (来自 Google AI) 伴随论文 [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) 由 Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos 发布。
1 change: 1 addition & 0 deletions README_zh-hant.md
@@ -359,6 +359,7 @@ conda install -c huggingface transformers
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
1. **[SwitchTransformers](https://huggingface.co/docs/transformers/main/model_doc/switch_transformers)** (from Google) released with the paper [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961) by William Fedus, Barret Zoph, Noam Shazeer.
1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released with the paper [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[TAPAS](https://huggingface.co/docs/transformers/model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -343,6 +343,8 @@
title: Splinter
- local: model_doc/squeezebert
title: SqueezeBERT
- local: model_doc/switch_transformers
title: SwitchTransformers
- local: model_doc/t5
title: T5
- local: model_doc/t5v1.1
2 changes: 2 additions & 0 deletions docs/source/en/index.mdx
@@ -162,6 +162,7 @@ The documentation is organized into five sections:
1. **[SqueezeBERT](model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[Swin Transformer](model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1. **[Swin Transformer V2](model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
1. **[SwitchTransformers](model_doc/switch_transformers)** (from Google) released with the paper [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961) by William Fedus, Barret Zoph, Noam Shazeer.
1. **[T5](model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[T5v1.1](model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[TAPAS](model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
@@ -312,6 +313,7 @@ Flax), PyTorch, and/or TensorFlow.
| SqueezeBERT | ✅ | ✅ | ✅ | ❌ | ❌ |
| Swin Transformer | ❌ | ❌ | ✅ | ✅ | ❌ |
| Swin Transformer V2 | ❌ | ❌ | ✅ | ❌ | ❌ |
| SwitchTransformers | ❌ | ❌ | ✅ | ❌ | ❌ |
| T5 | ✅ | ✅ | ✅ | ✅ | ✅ |
| TAPAS | ✅ | ❌ | ✅ | ✅ | ❌ |
| Time Series Transformer | ❌ | ❌ | ✅ | ❌ | ❌ |
64 changes: 64 additions & 0 deletions docs/source/en/model_doc/switch_transformers.mdx
@@ -0,0 +1,64 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# SwitchTransformers

## Overview

The SwitchTransformers model was proposed in [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961) by William Fedus, Barret Zoph, Noam Shazeer.

The Switch Transformers model uses a sparse T5 encoder-decoder architecture, where the MLPs are replaced by a Mixture of Experts (MoE). A routing mechanism (top-1 in this case) assigns each token to one of the experts, where each expert is a dense MLP. While Switch Transformers have many more weights than their equivalent dense models, the sparsity allows for better scaling.
During a forward pass, only a fraction of the weights are used. The routing mechanism allows the model to select relevant weights on the fly, which increases the model capacity. #TODO add the intuition about moving the loss curve.
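
The snippet below is a minimal, self-contained sketch of this top-1 routing idea in plain PyTorch. It is only meant to illustrate the mechanism; the actual classes added in this PR (`SwitchTransformersTop1Router`, `SwitchTransformersSparseMLP`) additionally handle expert capacity, jitter noise, the router dtype, and the auxiliary load-balancing loss.

```python
# Illustrative top-1 routing sketch (not the library implementation).
import torch
import torch.nn as nn


class Top1Router(nn.Module):
    """Assigns each token to the single expert with the highest router probability."""

    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_size)
        router_logits = self.classifier(hidden_states)      # (batch, seq_len, num_experts)
        router_probs = torch.softmax(router_logits, dim=-1)
        expert_index = router_probs.argmax(dim=-1)           # top-1 expert per token
        max_prob = router_probs.max(dim=-1).values           # used to scale the expert output
        return expert_index, max_prob


class SparseMLP(nn.Module):
    """Each token is processed by only one expert, itself a dense MLP."""

    def __init__(self, hidden_size: int, ff_size: int, num_experts: int):
        super().__init__()
        self.router = Top1Router(hidden_size, num_experts)
        self.experts = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(hidden_size, ff_size), nn.ReLU(), nn.Linear(ff_size, hidden_size))
                for _ in range(num_experts)
            ]
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        expert_index, max_prob = self.router(hidden_states)
        output = torch.zeros_like(hidden_states)
        for i, expert in enumerate(self.experts):
            mask = expert_index == i                         # tokens routed to expert i
            if mask.any():
                output[mask] = expert(hidden_states[mask])
        # Scaling by the router probability keeps the routing decision differentiable.
        return output * max_prob.unsqueeze(-1)


layer = SparseMLP(hidden_size=64, ff_size=128, num_experts=4)
tokens = torch.randn(2, 10, 64)
print(layer(tokens).shape)  # torch.Size([2, 10, 64])
```

Only one expert's MLP runs per token, so the compute per token stays roughly constant while the total parameter count grows with the number of experts.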


The abstract from the paper is the following:

*In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.*

Tips:

- SwitchTransformers uses the T5Tokenizer, which can be loaded directly from each model's repository.
- The released weights are pretrained on an English [Masked Language Modeling](https://moon-ci-docs.huggingface.co/docs/transformers/pr_19323/en/glossary#general-terms) task and should be fine-tuned; see the usage sketch below.
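
The sketch below loads a checkpoint and runs T5-style span-corruption inference. The checkpoint name `google/switch-base-8` is an assumption; substitute whichever released Switch Transformers checkpoint you want to use.

```python
# Minimal inference sketch; "google/switch-base-8" is an assumed checkpoint name.
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

checkpoint = "google/switch-base-8"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # resolves to the T5 tokenizer stored in the repo
model = SwitchTransformersForConditionalGeneration.from_pretrained(checkpoint)

# The weights were pretrained with T5-style masked language modeling (span corruption),
# so sentinel tokens such as <extra_id_0> mark the spans the model should fill in.
inputs = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```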

This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArtZucker).
The original code can be found [here](https://github.com/google/flaxformer/tree/main/flaxformer/architectures/moe).


## SwitchTransformersConfig

[[autodoc]] SwitchTransformersConfig

## SwitchTransformersTop1Router

[[autodoc]] SwitchTransformersTop1Router
- _compute_router_probabilities
- forward

## SwitchTransformersSparseMLP

[[autodoc]] SwitchTransformersSparseMLP
- forward

## SwitchTransformersModel

[[autodoc]] SwitchTransformersModel
- forward

## SwitchTransformersForConditionalGeneration

[[autodoc]] SwitchTransformersForConditionalGeneration
- forward

## SwitchTransformersEncoderModel

[[autodoc]] SwitchTransformersEncoderModel
- forward
1 change: 1 addition & 0 deletions docs/source/en/serialization.mdx
@@ -95,6 +95,7 @@ Ready-made configurations include the following architectures:
- SegFormer
- SqueezeBERT
- Swin Transformer
- SwitchTransformers
- T5
- Vision Encoder decoder
- ViT