Add BEiT #12994

Merged
merged 28 commits into from Aug 4, 2021
3219cb1
First pass
NielsRogge Jul 28, 2021
98f8e04
Make conversion script work
NielsRogge Jul 29, 2021
feb6f0b
Improve conversion script
NielsRogge Jul 30, 2021
1bbaf73
Fix bug, conversion script working
NielsRogge Jul 30, 2021
10408e1
Improve conversion script, implement BEiTFeatureExtractor
NielsRogge Aug 2, 2021
bfac3d5
Make conversion script work based on URL
NielsRogge Aug 2, 2021
3f306eb
Improve conversion script
NielsRogge Aug 2, 2021
da3de43
Add tests, add documentation
NielsRogge Aug 2, 2021
1daf6b9
Fix bug in conversion script
NielsRogge Aug 2, 2021
188b442
Fix another bug
NielsRogge Aug 2, 2021
c3683e3
Add support for converting masked image modeling model
NielsRogge Aug 2, 2021
ec0608b
Add support for converting masked image modeling
NielsRogge Aug 3, 2021
98651a8
Fix bug
NielsRogge Aug 3, 2021
40c0e73
Add print statement for debugging
NielsRogge Aug 3, 2021
1b83592
Fix another bug
NielsRogge Aug 3, 2021
f30d05e
Make conversion script finally work for masked image modeling models
NielsRogge Aug 3, 2021
291dbc6
Move id2label for datasets to JSON files on the hub
NielsRogge Aug 3, 2021
e07fd07
Make sure id's are read in as integers
NielsRogge Aug 3, 2021
c9978ee
Add integration tests
NielsRogge Aug 3, 2021
5bc33e8
Make style & quality
NielsRogge Aug 3, 2021
1cb1ea5
Fix test, add BEiT to README
NielsRogge Aug 3, 2021
ad790ec
Apply suggestions from @sgugger's review
NielsRogge Aug 4, 2021
6ca6486
Apply suggestions from code review
NielsRogge Aug 4, 2021
12831d1
Make quality
NielsRogge Aug 4, 2021
25cdcc2
Replace nielsr by microsoft in tests, add docs
NielsRogge Aug 4, 2021
0dd112c
Rename BEiT to Beit
NielsRogge Aug 4, 2021
c0e9237
Minor fix
NielsRogge Aug 4, 2021
f2796fe
Fix docs of BeitForMaskedImageModeling
NielsRogge Aug 4, 2021
1 change: 1 addition & 0 deletions README.md
@@ -211,6 +211,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[ALBERT](https://huggingface.co/transformers/model_doc/albert.html)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
1. **[BART](https://huggingface.co/transformers/model_doc/bart.html)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
1. **[BARThez](https://huggingface.co/transformers/model_doc/barthez.html)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
1. **[BEiT](https://huggingface.co/transformers/master/model_doc/beit.html)** (from Microsoft) released with the paper [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) by Hangbo Bao, Li Dong, Furu Wei.
1. **[BERT](https://huggingface.co/transformers/model_doc/bert.html)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
1. **[BERT For Sequence Generation](https://huggingface.co/transformers/model_doc/bertgeneration.html)** (from Google) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
1. **[BigBird-RoBERTa](https://huggingface.co/transformers/model_doc/bigbird.html)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
132 changes: 69 additions & 63 deletions docs/source/index.rst

Large diffs are not rendered by default.

97 changes: 97 additions & 0 deletions docs/source/model_doc/beit.rst
@@ -0,0 +1,97 @@
..
    Copyright 2021 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

BEiT
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The BEiT model was proposed in `BEiT: BERT Pre-Training of Image Transformers <https://arxiv.org/abs/2106.08254>`__ by
Hangbo Bao, Li Dong and Furu Wei. Inspired by BERT, BEiT is the first work in which self-supervised pre-training of
Vision Transformers (ViTs) outperforms supervised pre-training. Rather than pre-training the model to predict the class
of an image (as done in the `original ViT paper <https://arxiv.org/abs/2010.11929>`__), BEiT models are pre-trained to
predict visual tokens from the codebook of OpenAI's `DALL-E model <https://arxiv.org/abs/2102.12092>`__ given masked
patches.

The abstract from the paper is the following:

*We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation
from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image
modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e, image
patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into
visual tokens. Then we randomly mask some image patches and fed them into the backbone Transformer. The pre-training
objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we
directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder.
Experimental results on image classification and semantic segmentation show that our model achieves competitive results
with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K,
significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains
86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).*

Tips:

- BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They
  outperform both the original model (ViT) and Data-efficient Image Transformers (DeiT) when fine-tuned on
  ImageNet-1K and CIFAR-100.
- As the BEiT models expect each image to be of the same size (resolution), one can use
  :class:`~transformers.BeitFeatureExtractor` to resize (or rescale) and normalize images for the model.
- Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
  each checkpoint. For example, :obj:`microsoft/beit-base-patch16-224` refers to a base-sized architecture with patch
  resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the `hub
  <https://huggingface.co/models?search=microsoft/beit>`__.
- The available checkpoints are either (1) pre-trained on `ImageNet-22k <http://www.image-net.org/>`__ (a collection of
  14 million images and 22k classes) only, (2) also fine-tuned on ImageNet-22k or (3) also fine-tuned on `ImageNet-1k
  <http://www.image-net.org/challenges/LSVRC/2012/>`__ (also referred to as ILSVRC 2012, a collection of 1.3 million
  images and 1,000 classes).
- BEiT uses relative position embeddings, inspired by the T5 model. During pre-training, the authors shared the
  relative position bias among the self-attention layers. During fine-tuning, each layer's relative position
  bias is initialized with the shared relative position bias obtained after pre-training. Note that, if one wants to
  pre-train a model from scratch, one needs to set either the :obj:`use_relative_position_bias` or the
  :obj:`use_shared_relative_position_bias` attribute of :class:`~transformers.BeitConfig` to :obj:`True` in order to add
  position embeddings.

This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
<https://github.com/microsoft/unilm/tree/master/beit>`__.

BeitConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.BeitConfig
    :members:


BeitFeatureExtractor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.BeitFeatureExtractor
    :members: __call__


BeitModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.BeitModel
    :members: forward


BeitForMaskedImageModeling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.BeitForMaskedImageModeling
    :members: forward


BeitForImageClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.BeitForImageClassification
    :members: forward
22 changes: 20 additions & 2 deletions src/transformers/__init__.py
@@ -147,6 +147,7 @@
],
"models.bart": ["BartConfig", "BartTokenizer"],
"models.barthez": [],
"models.beit": ["BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "BeitConfig"],
"models.bert": [
"BERT_PRETRAINED_CONFIG_ARCHIVE_MAP",
"BasicTokenizer",
@@ -412,6 +413,7 @@
# Vision-specific objects
if is_vision_available():
_import_structure["image_utils"] = ["ImageFeatureExtractionMixin"]
_import_structure["models.beit"].append("BeitFeatureExtractor")
_import_structure["models.clip"].append("CLIPFeatureExtractor")
_import_structure["models.clip"].append("CLIPProcessor")
_import_structure["models.deit"].append("DeiTFeatureExtractor")
@@ -510,7 +512,6 @@
"load_tf_weights_in_albert",
]
)

_import_structure["models.auto"].extend(
[
"MODEL_FOR_CAUSAL_LM_MAPPING",
@@ -542,7 +543,6 @@
"AutoModelWithLMHead",
]
)

_import_structure["models.bart"].extend(
[
"BART_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -555,6 +555,15 @@
"PretrainedBartModel",
]
)
_import_structure["models.beit"].extend(
[
"BEIT_PRETRAINED_MODEL_ARCHIVE_LIST",
"BeitForImageClassification",
"BeitForMaskedImageModeling",
"BeitModel",
"BeitPreTrainedModel",
]
)
_import_structure["models.bert"].extend(
[
"BERT_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -1813,6 +1822,7 @@
AutoTokenizer,
)
from .models.bart import BartConfig, BartTokenizer
from .models.beit import BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP, BeitConfig
from .models.bert import (
BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
BasicTokenizer,
@@ -2048,6 +2058,7 @@

if is_vision_available():
from .image_utils import ImageFeatureExtractionMixin
from .models.beit import BeitFeatureExtractor
from .models.clip import CLIPFeatureExtractor, CLIPProcessor
from .models.deit import DeiTFeatureExtractor
from .models.detr import DetrFeatureExtractor
@@ -2170,6 +2181,13 @@
BartPretrainedModel,
PretrainedBartModel,
)
from .models.beit import (
BEIT_PRETRAINED_MODEL_ARCHIVE_LIST,
BeitForImageClassification,
BeitForMaskedImageModeling,
BeitModel,
BeitPreTrainedModel,
)
from .models.bert import (
BERT_PRETRAINED_MODEL_ARCHIVE_LIST,
BertForMaskedLM,
2 changes: 2 additions & 0 deletions src/transformers/image_utils.py
@@ -21,6 +21,8 @@

IMAGENET_DEFAULT_MEAN = [0.485, 0.456, 0.406]
IMAGENET_DEFAULT_STD = [0.229, 0.224, 0.225]
IMAGENET_STANDARD_MEAN = [0.5, 0.5, 0.5]
IMAGENET_STANDARD_STD = [0.5, 0.5, 0.5]


def is_torch_tensor(obj):
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -21,6 +21,7 @@
auto,
bart,
barthez,
beit,
bert,
bert_generation,
bert_japanese,
4 changes: 4 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -20,6 +20,7 @@
from ...configuration_utils import PretrainedConfig
from ..albert.configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig
from ..bart.configuration_bart import BART_PRETRAINED_CONFIG_ARCHIVE_MAP, BartConfig
from ..beit.configuration_beit import BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP, BeitConfig
from ..bert.configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig
from ..bert_generation.configuration_bert_generation import BertGenerationConfig
from ..big_bird.configuration_big_bird import BIG_BIRD_PRETRAINED_CONFIG_ARCHIVE_MAP, BigBirdConfig
@@ -97,6 +98,7 @@
(key, value)
for pretrained_map in [
# Add archive maps here
BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP,
REMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP,
@@ -158,6 +160,7 @@
CONFIG_MAPPING = OrderedDict(
[
# Add configs here
("beit", BeitConfig),
("rembert", RemBertConfig),
("visual_bert", VisualBertConfig),
("canine", CanineConfig),
@@ -225,6 +228,7 @@
MODEL_NAMES_MAPPING = OrderedDict(
[
# Add full (and cased) model names here
("beit", "BEiT"),
("rembert", "RemBERT"),
("visual_bert", "VisualBert"),
("canine", "Canine"),
5 changes: 3 additions & 2 deletions src/transformers/models/auto/feature_extraction_auto.py
@@ -17,9 +17,9 @@
import os
from collections import OrderedDict

-from transformers import DeiTFeatureExtractor, Speech2TextFeatureExtractor, ViTFeatureExtractor
+from transformers import BeitFeatureExtractor, DeiTFeatureExtractor, Speech2TextFeatureExtractor, ViTFeatureExtractor

-from ... import DeiTConfig, PretrainedConfig, Speech2TextConfig, ViTConfig, Wav2Vec2Config
+from ... import BeitConfig, DeiTConfig, PretrainedConfig, Speech2TextConfig, ViTConfig, Wav2Vec2Config
from ...feature_extraction_utils import FeatureExtractionMixin

# Build the list of all feature extractors
@@ -30,6 +30,7 @@

FEATURE_EXTRACTOR_MAPPING = OrderedDict(
[
(BeitConfig, BeitFeatureExtractor),
(DeiTConfig, DeiTFeatureExtractor),
(Speech2TextConfig, Speech2TextFeatureExtractor),
(ViTConfig, ViTFeatureExtractor),
4 changes: 4 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -37,6 +37,7 @@
BartForSequenceClassification,
BartModel,
)
from ..beit.modeling_beit import BeitForImageClassification, BeitModel
from ..bert.modeling_bert import (
BertForMaskedLM,
BertForMultipleChoice,
@@ -321,6 +322,7 @@
from .configuration_auto import (
AlbertConfig,
BartConfig,
BeitConfig,
BertConfig,
BertGenerationConfig,
BigBirdConfig,
@@ -388,6 +390,7 @@
MODEL_MAPPING = OrderedDict(
[
# Base model mapping
(BeitConfig, BeitModel),
(RemBertConfig, RemBertModel),
(VisualBertConfig, VisualBertModel),
(CanineConfig, CanineModel),
@@ -579,6 +582,7 @@
# Model for Image Classification mapping
(ViTConfig, ViTForImageClassification),
(DeiTConfig, (DeiTForImageClassification, DeiTForImageClassificationWithTeacher)),
(BeitConfig, BeitForImageClassification),
]
)

59 changes: 59 additions & 0 deletions src/transformers/models/beit/__init__.py
@@ -0,0 +1,59 @@
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.

# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import TYPE_CHECKING

from ...file_utils import _LazyModule, is_torch_available, is_vision_available


_import_structure = {
    "configuration_beit": ["BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "BeitConfig"],
}

if is_vision_available():
    _import_structure["feature_extraction_beit"] = ["BeitFeatureExtractor"]

if is_torch_available():
    _import_structure["modeling_beit"] = [
        "BEIT_PRETRAINED_MODEL_ARCHIVE_LIST",
        "BeitForImageClassification",
        "BeitForMaskedImageModeling",
        "BeitModel",
        "BeitPreTrainedModel",
    ]

if TYPE_CHECKING:
    from .configuration_beit import BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP, BeitConfig

    if is_vision_available():
        from .feature_extraction_beit import BeitFeatureExtractor

    if is_torch_available():
        from .modeling_beit import (
            BEIT_PRETRAINED_MODEL_ARCHIVE_LIST,
            BeitForImageClassification,
            BeitForMaskedImageModeling,
            BeitModel,
            BeitPreTrainedModel,
        )


else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
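
For readers unfamiliar with the pattern in this `__init__.py`: the `_import_structure` dictionary maps each submodule to the names it exports, and the module object placed in `sys.modules` resolves those names only when they are first accessed. The following is a rough sketch of that idea, not the library's actual `_LazyModule`. The `LazyModule` class is a hypothetical, simplified stand-in, and standard-library modules (`json`, `math`) are used in place of the BEiT submodules so the example runs anywhere.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Hypothetical, simplified stand-in for a lazy module; not the transformers class."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Invert {module: [names]} into {name: module} for quick lookup.
        self._name_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }
        self.__all__ = list(self._name_to_module)

    def __getattr__(self, name):
        # Reached only when normal lookup fails, i.e. on the first access of a lazy name.
        name_to_module = self.__dict__.get("_name_to_module", {})
        if name not in name_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {name!r}")
        module = importlib.import_module(name_to_module[name])  # the deferred import
        value = getattr(module, name)
        setattr(self, name, value)  # cache so later accesses skip __getattr__
        return value


# Stand-in import structure: stdlib modules play the role of the model submodules.
lazy = LazyModule("demo_lazy", {"json": ["dumps"], "math": ["sqrt"]})
print("sqrt" in vars(lazy))  # False: nothing has been resolved yet
print(lazy.sqrt(9.0))        # 3.0: "math" is imported on first access
print("sqrt" in vars(lazy))  # True: the resolved attribute is now cached
```

The design choice this illustrates is that optional dependencies are only imported when a name that needs them is actually used, which keeps the package import itself cheap.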