
[feat] Add FLAVA model #16654

Merged: 3 commits merged into huggingface:main from flava_integration on May 11, 2022

Conversation

apsdehal
Contributor

@apsdehal apsdehal commented Apr 7, 2022

This PR aims to add the FLAVA model to the transformers repo.

The following checklist delineates the things to be done for this PR
to be complete:

  • Flava init
  • Flava base models
  • Flava layers
  • Flava Configs
  • Flava encoders
  • Flava pretraining models
  • Flava codebook
  • Flava feature extractors
  • Flava classification/retrieval models (in progress)
  • Documentation updates
  • Encoding utilities
  • Imports updates
  • Argstring updates
  • Flava pretrained checkpoints
  • Flava unit tests
  • Flava integration tests
  • Sanity check
  • Lint
  • Flava Processors

@apsdehal apsdehal force-pushed the flava_integration branch 8 times, most recently from eb4e9d5 to c1489f5 Compare April 14, 2022 21:59
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Apr 14, 2022

The documentation is not available anymore as the PR was closed or merged.

@apsdehal apsdehal force-pushed the flava_integration branch 11 times, most recently from 8ef39ad to 5c2db5d Compare April 19, 2022 16:03
@apsdehal apsdehal marked this pull request as ready for review April 19, 2022 16:38
@apsdehal apsdehal changed the title [WIP] Add FLAVA model [feat] Add FLAVA model Apr 19, 2022
Collaborator

@sgugger sgugger left a comment

Main comment is a little bit below. Sorry, I've clicked review by mistake 😬 .

README.md Outdated
@@ -265,6 +265,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc_flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
Collaborator

Suggested change
1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc_flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
1. **[FLAVA](https://huggingface.co/docs/transformers/main/model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.

Typo + the model will only be in the main branch until next release, so the link needs to point to the main version of the doc.

@@ -0,0 +1,105 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Collaborator

Suggested change
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Comment on lines 25 to 31
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety
of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal
(with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising
direction would be to use a single holistic universal model, as a "foundation", that targets all modalities
at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and
cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate
impressive performance on a wide range of 35 tasks spanning these target modalities.
Collaborator

Suggested change
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety
of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal
(with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising
direction would be to use a single holistic universal model, as a "foundation", that targets all modalities
at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and
cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate
impressive performance on a wide range of 35 tasks spanning these target modalities.
*State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety
of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal
(with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising
direction would be to use a single holistic universal model, as a "foundation", that targets all modalities
at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and
cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate
impressive performance on a wide range of 35 tasks spanning these target modalities.*

impressive performance on a wide range of 35 tasks spanning these target modalities.


<!-- Tips: -->
Collaborator

Needs to be filled.

<!-- Tips: -->

This model was contributed by [aps](https://huggingface.co/aps).
<!-- The original code can be found [here](<INSERT LINK TO GITHUB REPO HERE>).-->
Collaborator

Needs to be filled.

## FLAVAConfig

[[autodoc]] FLAVAConfig
- from_configs
Collaborator

Not sure it's necessary here. Will be automatically added.

Suggested change
- from_configs

<!-- The original code can be found [here](<INSERT LINK TO GITHUB REPO HERE>).-->


## FLAVAConfig
Collaborator

Will put this in the general comment as well, but all names need to be re-cased to FlavaXxx. We don't use BERTConfig, BERTModel, etc., but BertConfig, BertModel...

Contributor Author

@apsdehal apsdehal May 3, 2022

Re this topic: there are multiple examples in the codebase that actually don't follow this convention: XLM, WavLM, ResNet, GPT2, CLIP, CTRL, and many others. Overall, the choice of casing seems random and not really enforced everywhere. I am not sure why those models have this privilege and FLAVA can't?

Collaborator

ResNet and WavLM are not all caps, so I don't really see your point there. If FLAVA stands for two different words, we can capitalize the second one, but I don't think that's the case.
This is not a privilege, this is from the beginnings of Transformers. We then got user feedback that those casings made it difficult for them to find the model classes, so we stopped.

Comment on lines 549 to 551
_import_structure["models.flava"].append("FLAVAFeatureExtractor")
_import_structure["models.flava"].append("FLAVACodebookFeatureExtractor")
_import_structure["models.flava"].append("FLAVAProcessor")
Collaborator

All in one line with an extend
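
i.e., something along these lines (class names taken from the snippet above):

_import_structure["models.flava"].extend(["FLAVAFeatureExtractor", "FLAVACodebookFeatureExtractor", "FLAVAProcessor"])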

Comment on lines 222 to 228
# (
# "flava",
# (
# "CLIPTokenizer",
# "CLIPTokenizerFast" if is_tokenizers_available() else None,
# ),
# ),
Collaborator

Should be cleaned up or fixed. Looking at the processor file, it seems to be using the BertTokenizer?

# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.

# Copyright 2021 Meta Platforms authors and The HuggingFace Team. All rights reserved.
Collaborator

Suggested change
# Copyright 2021 Meta Platforms authors and The HuggingFace Team. All rights reserved.
# Copyright 2022 Meta Platforms authors and The HuggingFace Team. All rights reserved.

Collaborator

@sgugger sgugger left a comment

Thanks for your PR, exciting to see this model arrive soon in the library!

One thing is that all names need to be re-cased to FlavaXxx. We don't use BERTConfig, BERTModel etc., but BertConfig, BertModel... Likewise we should have FlavaModel, FlavaConfig etc.

You use Meta or Facebook in several places. You should pick one and stick to it.

@@ -0,0 +1,611 @@
# coding=utf-8
# Copyright 2021 Meta Platforms authors and The HuggingFace Team. All rights reserved.
Collaborator

Suggested change
# Copyright 2021 Meta Platforms authors and The HuggingFace Team. All rights reserved.
# Copyright 2022 Meta Platforms authors and The HuggingFace Team. All rights reserved.

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" FLAVA model configuration"""
Collaborator

Suggested change
""" FLAVA model configuration"""
""" FLAVA model configurations"""

logger = logging.get_logger(__name__)

FLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"flava-full": "https://huggingface.co/aps/flava-full/resolve/main/config.json",
Collaborator

Suggested change
"flava-full": "https://huggingface.co/aps/flava-full/resolve/main/config.json",
"aps/flava-full": "https://huggingface.co/aps/flava-full/resolve/main/config.json",

The namespace also needs to be in the key.

This is the configuration class to store the configuration of a [`FLAVAImageModel`]. It is used to instantiate an
FLAVA model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the FLAVA
[full](https://huggingface.co/aps/flava-full) architecture.
Collaborator

Suggested change
[full](https://huggingface.co/aps/flava-full) architecture.
[aps/flava-full](https://huggingface.co/aps/flava-full) architecture.

Contributor Author

I would leave this as it is to avoid ambiguity in the meaning of the sentence itself.

Collaborator

No, it is ambiguous as it doesn't show the real name of the checkpoint.

Comment on lines 63 to 67
image_size (`int`, *optional*, defaults to `224`):
The size (resolution) of each image.
patch_size (`int`, *optional*, defaults to `16`):
The size (resolution) of each patch.
num_channels (`int`, *optional*, defaults to `3`):
Collaborator

Numbers for default values should not be between backticks.
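
i.e., the docstring entries would read:

image_size (`int`, *optional*, defaults to 224):
    The size (resolution) of each image.
patch_size (`int`, *optional*, defaults to 16):
    The size (resolution) of each patch.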

@@ -0,0 +1,1234 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
Collaborator

Suggested change
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.

@@ -0,0 +1,416 @@
# coding=utf-8
# Copyright 2021 HuggingFace Inc.
Collaborator

Suggested change
# Copyright 2021 HuggingFace Inc.
# Copyright 2022 HuggingFace Inc.

Comment on lines 630 to 636
def __init__(
self,
parent,
batch_size=12,
image_size=112,
num_channels=3,
):
Collaborator

Can all fit in one line.
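
i.e.:

def __init__(self, parent, batch_size=12, image_size=112, num_channels=3):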

Comment on lines 795 to 801
def _test_model(
self,
config,
inputs,
test_image=False,
test_text=False,
):
Collaborator

Can all fit in one line.
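
i.e.:

def _test_model(self, config, inputs, test_image=False, test_text=False):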

Comment on lines 41 to 57
vocab_tokens = [
"[UNK]",
"[CLS]",
"[SEP]",
"[PAD]",
"[MASK]",
"want",
"##want",
"##ed",
"wa",
"un",
"runn",
"##ing",
",",
"low",
"lowest",
]
Collaborator

Should all be in one line, with black deactivated for it to make it more readable (# fmt: off before and # fmt: on after).
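
i.e., something like:

# fmt: off
vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]", "want", "##want", "##ed", "wa", "un", "runn", "##ing", ",", "low", "lowest"]
# fmt: on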

@apsdehal
Contributor Author

@sgugger Thanks for your review on the PR. About lowercasing things, I followed what I saw in the repo as not all models were lower case. An example of this would be CLIP where classes are named as CLIPConfig, CLIPTextEncoder and so on. Would it be possible to keep FLAVA as uppercase?

@sgugger
Collaborator

sgugger commented Apr 20, 2022

CLIP was a mistake IMO, and it makes it harder on users to guess the right names of the models if we have different casings all the time. It's not because we made a mistake once that we should repeat it again and again just because there is a precedent ;-)

@require_torch
class FLAVAForPreTrainingIntegrationTest(unittest.TestCase):
@slow
def test_inference(self):
Contributor

cool test!

model_name = "aps/flava-full"
codebook_name = "aps/flava-codebook"
model = FLAVAForPreTraining.from_pretrained(model_name).to(torch_device)
codebook = FLAVACodebook.from_pretrained(codebook_name).to(torch_device)
Contributor

Hmm, I don't really like that flava-full and flava-codebook are in two different repositories. All of our models usually work with one and only one modeling file. Can't we merge the FlavaCodebook model into the FlavaForPreTraining class? I don't like that inference is dependent on two different repos.

Contributor Author

I would ideally like to avoid that because the codebook is somewhat independent of FLAVA itself, as it is mostly just a custom DALLE encoder. A different codebook can potentially be used with FLAVA. If I join them into one, I will lose that benefit. Wdyt?

Contributor

I see, I guess we have kind of a composition-model problem here (maybe similar to Blenderbot2 or RAG, cc @sgugger).
My opinion here is the following. The paper only used the DALLE encoder (no?) and that's what we expect people to use when reproducing the paper. In our experience few people expect things that have not been done in the paper to work out of the box.

The way it's currently implemented doesn't really allow the user to be able to use another encoder out-of-the-box either. E.g. the user would have to code up a new encoder class which respects the logic of:

mim_labels = codebook.get_codebook_indices(**codebook_fe([image, image], return_tensors="pt").to(torch_device))

i.e. the model would also require a get_codebook_indices(...) function, etc. This already requires quite some knowledge about the model, in which case it's probably as easy to fork modeling_flava.py and replace:

self.codebook = FlavaCodebook inside the ForPreTraining class with a different encoder (which might even be easier to understand).
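
(For context, a minimal sketch of what such a replacement codebook would need to provide, based on the get_codebook_indices interface quoted above; the class name, shapes, and dummy output here are purely illustrative, not part of the PR:)

import torch
from torch import nn

class MyCustomCodebook(nn.Module):
    """Hypothetical drop-in replacement for FlavaCodebook (illustrative only)."""

    def __init__(self, num_codebook_tokens: int = 8192, num_patches: int = 196):
        super().__init__()
        self.num_codebook_tokens = num_codebook_tokens
        self.num_patches = num_patches

    def get_codebook_indices(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Map each image patch to a discrete code id, as a DALL-E style encoder would.
        # A real implementation would run the image through its encoder; here we return dummy ids.
        batch_size = pixel_values.shape[0]
        return torch.zeros(batch_size, self.num_patches, dtype=torch.long)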

Contributor

So my opinion here is that we should put FlavaCodebook inside the modeling classes that need it, so that we only have a single pytorch_model.bin file, which makes it easier to reproduce / reuse the default Flava as described in the paper.

Contributor

In general "self-containment" is more important then "flexibility of modularization" for Transformers in my opinion.

Contributor

If we had multiple pretrained checkpoints for different codebook classes, we could have added a codebook_type=dalle flag to the config that would have switched in the correct codebook class, but nevertheless I think we should put it inside the FlavaForPreTraining class.

Let's see what @sgugger thinks here

Contributor Author

The way it's currently implemented doesn't really allow the user to be able to use another encoder out-of-the-box either. E.g. the user would have to code up a new encoder class which respects the logic of:

Not really, because that is just one interface for creating mim_labels. What FLAVAPretrainedModel is concerned with is mim_labels, not where they came from or how they were created. I would ideally like to avoid creating mim_labels inside the pretraining class if possible.


@require_vision
@require_torch
class FLAVAModelIntegrationTest(unittest.TestCase):
Contributor

Shouldn't Flava also work for text classification (GLUE) or image classification? Can't we add those heads here as well?

Contributor

And the corresponding classes?

Contributor Author

They should! They are planned for a separate PR. I am trying to unblock some use cases by first adding the pretraining classes.

Contributor

Ok that's fine by me!

""",
FLAVA_START_DOCSTRING.format(config="FLAVACodebookConfig"),
)
class FLAVACodebook(FLAVAPreTrainedModel):
Contributor

Just to understand this a bit better - would this model class ever be used as a stand-alone model or in inference? Or does it purely exist for pretraining?

If this model is only used for pretraining (MIM), I'd prefer to not make it its own class, but to just add it as a submodule of ...ForPreTraining. It's quite important that we only have one modeling file IMO and don't create dependencies between different model repos that can get out of sync.

The way I currently understand it is that this model class should never be used as a standalone model

Contributor Author

There are multiple reasons:

  1. The codebook generates only MIM labels, which are not needed for running inference on the model. It would unnecessarily bloat the model if we added the codebook to the pretrained model even when the user doesn't need it. As others have discussed, the logic for generating mim_labels should be independent of the model itself.
  2. Any other codebook can be used with FLAVA, not just this one. Keeping it separate helps with that.
  3. FLAVACodebook can also be used independently of FLAVA for other projects. It is very similar to DALLE's encoder.

Contributor

Ok, but then let's just add the codebook to ForPreTraining but not to Model; then it would not bloat inference, no?

Contributor

Same for ForImageClassification - no need to add it there, but I think it's better to add it to ForPreTraining so that a single forward pass of ForPreTraining is enough to get a loss.

Contributor Author

A lot of tasks such as Winoground, zero-shot classification, and ITM retrieval use ForPreTraining out of the box, and that's what I meant by inference. I would really like to avoid generating MIM labels unnecessarily inside the model, as they are not even needed for inference. This is in general the case for targets, no? Models are provided targets, they don't generate them, unless it is a teacher-student model, which this is not. Wdyt?

Contributor

Ok I might not have been very clear here.

For 2), the codebook should not generate the labels; it should process the image into the labels (the codebook is an encoder if I understand correctly), which means that by passing a mask that states which part of the codebook output is masked, together with the pixel_values (or whatever will be processed by the codebook), we get the labels. It's not like the codebook model generates labels out of nothing, no?

Contributor

That's what is done for Wav2Vec2

Contributor

For me, having a single PyTorch file / single model repo is important, which is what I'm after here.

I don't think we have any architectures that require multiple model repos / model directories for inference, fine-tuning, or pretraining (no?). I'd like to keep it that way. RAG and CLIP are composite models, but they are self-contained in a single model repo.

cc @patil-suraj @LysandreJik as well here

Contributor Author

Copy-pasting some offline discussion that I had with Patrick for completeness:

P:

Yeah, I think we're more or less on the same page. I also think we should try to avoid unnecessarily using memory, and labels should not be created inside the model. But if the creation of labels depends on pretrained weights (which seems to be the case here, and which is also the case for Wav2Vec2), IMO we should add the weights to the model architecture and pass something like mim_label_mask_indices, which will be the indices used to mask the output of the codebook.
This has the following advantages:

  • Label creation is deterministic because it all depends on mim_label_mask_indices
  • Single pytorch file for training (that's very important for us). Otherwise people don't have a self-contained model folder that contains all the weights when doing model.save_pretrained("...") and it creates difficult dependencies between repos
  • We don't really try to have a modularized architecture style, rather a self-contained style

A:

I think that is bool_masked_pos.
What you are saying is correct. The codebook generates mim_labels, but they are masked using bool_masked_pos to determine which ones we need to predict.

P:

Cool, couldn't we have bool_masked_pos be an input to ForPreTraining? Just like in Wav2Vec2?

A:

We have it as of now. mim_labels are masked using this later on (before the cross-entropy calculation).

For the moment, I have been using the same pretrained weights for FlavaModel and FlavaForPreTraining. Using the FlavaForPreTraining weights to load the FlavaModel will throw a lot of unexpected-keys errors, but that is fine I guess?

P:

This is expected - that happens every time I call Wav2Vec2Model.from_pretrained(...).
It also happens every time you do BertModel.from_pretrained("bert-base-cased").

A:

Sounds good. I think I am okay with this. I will also add an option add_codebook which is true by default, then?
I would also allow the user to pass mim_labels to the forward. If they do, we use those; otherwise we use the codebook to generate the labels and mask them.
The only thing left is how we generate mlm_labels more elegantly.
The DataCollator doesn't work for me as it updates input_ids, which we need as is. Maybe a new data collator, I am not sure. There should be something simpler.

P:

That's what I do for Wav2Vec2:

Then this, for pretraining:
sampled_negative_indices = _sample_negative_indices(

A:

I also remembered that the feature extractor will now also need to output a separately processed image for the codebook, and the forward needs to take it as a parameter. That was one more reason to keep them separate.

P:

Ah interesting, so you need 1 tokenizer and 2 feature extractors for FLAVA?
IMO, the best approach would then be to add a return_pixel_values_for_codebook parameter to the feature extractor.
Actually, it'd be good if we discussed this publicly on GitHub so that Sylvain, Lysandre and Suraj could also take a look and give their opinion 🙂

So after that, we are back to GitHub.

Would be good to have your opinions @sgugger @LysandreJik @patil-suraj
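
(For context, a rough sketch of the feature-extractor interface proposed just above. The return_pixel_values_for_codebook flag and the codebook_pixel_values output key are taken from or inferred from the discussion and are assumptions, not necessarily the final API; the class name follows the casing used in the PR at this point.)

import numpy as np
from PIL import Image
from transformers import FLAVAFeatureExtractor  # class name as used in the PR at this point

image = Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))  # dummy image for illustration

feature_extractor = FLAVAFeatureExtractor.from_pretrained("aps/flava-full")
# One call would return both the regular pixel values and a second set of images
# preprocessed for the codebook (flag name is hypothetical, from the suggestion above):
inputs = feature_extractor([image, image], return_pixel_values_for_codebook=True, return_tensors="pt")
# inputs["pixel_values"]          -> fed to the image encoder
# inputs["codebook_pixel_values"] -> fed to the codebook to derive MIM targets (key name illustrative)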

Collaborator

I agree that everything should be in one repo, we never use two separate repos for one model, so that does imply the codebook should be part of the model that uses it, to be in the same checkpoint (and if the same codebook weights can be reused for different FLAVA models, it's completely supported if the same codebook weights are in all the checkpoints).

The only alternatives (while keeping a single repo) would be to have the weights of the codebook under some ad-hoc prefix corresponding to the base prefix of the codebook and different from the base prefix of the other FLAVA models, which is a rather ugly hack; or to use a different branch for the codebook, which would result in codebook = Xxx.from_pretrained(checkpoint, revision="codebook").

Contributor

@patrickvonplaten patrickvonplaten left a comment

Model looks super cool! Very excited to try it out!

My main remarks are:

    1. Why don't we add any heads for image classification or text classification? FLAVA got pretty good results on those, no?
    2. I'm not a fan of loading checkpoints from two different repos for training / inference. I think FlavaCodebook should be refactored as a submodule of all classes that actually require it (just pretraining?).

@kshitijkg

kshitijkg commented May 7, 2022

Hi! I was wondering if there are any updates on this and if this will be merged soon?

@apsdehal apsdehal force-pushed the flava_integration branch 5 times, most recently from 814ed8f to 806a056 Compare May 9, 2022 21:36
This PR aims to add the [FLAVA](https://arxiv.org/abs/2112.04482) model to the transformers repo.

The following checklist delineates the things to be done for this PR
to be complete:

[x] Flava init
[x] Flava base models
[x] Flava layers
[x] Flava Configs
[x] Flava encoders
[x] Flava pretraining models
[ ] Flava classification/retrieval models (in progress)
[x] Documentation updates (in progress)
[x] Imports updates (in progress)
[x] Argstring updates
[x] Flava pretrained checkpoints (in progress)
[x] Flava tests
[x] Flava processors (in progress)
[x] Sanity check
[x] Lint
@apsdehal
Contributor Author

@kshitijkg This is almost ready and should land in the core soon.

@sgugger @patrickvonplaten This is ready for another round of review. I have resolved all of your comments except a few where I have asked for clarifications. Hoping to land it soon.

Collaborator

@sgugger sgugger left a comment

Thanks a lot for addressing all comments. I'll let @patrickvonplaten comment on the part he felt strongly about, for me all looks good apart from the two preprocessing functions inside the modeling file.

Comment on lines 17 to 18
The FLAVA model was proposed in [FLAVA: A Foundational Language And Vision Alignment Model
](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela and is accepted at CVPR 2022.
Collaborator

Suggested change
The FLAVA model was proposed in [FLAVA: A Foundational Language And Vision Alignment Model
](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela and is accepted at CVPR 2022.
The FLAVA model was proposed in [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela and is accepted at CVPR 2022.

return tuple(self[k] if k not in transformer_outputs else getattr(self, k).to_tuple() for k in self.keys())


def add_masked_text(
Collaborator

This and add_whole_word_masked_text are not linked to the model per se, so they should go somewhere else (probably in an example script showing how to use FLAVA). It's as if the word-masking code was in the BERT file (which it's not).

Contributor

Agree

Contributor

That's only relevant for pretraining no?

Contributor Author

So, I am somewhat conflicted about these. I think Flava is incomplete without these, and I am also using them in the tests to verify that the pretraining loss calculation logic is working as expected. These are actually similar in spirit to the processing functions in modeling_wav2vec2. I would ideally like to keep them if possible, and I can move them to processing_flava so as not to pollute the modeling code if that helps. What do you think?

Contributor

Note that while _compute_mask_indices is also used for pretraining in Wav2Vec2, its main purpose is to apply SpecAugment during fine-tuning (it's called within the model), which is similar in spirit to dropout. Also, no tokenizer logic is used in _compute_mask_indices; it is therefore not dependent on any settings of the tokenizer, just the model.

Contributor

I'd still think it's better to not put them there, mainly because we want Flava to have a similar design as BERT, no? BERT's MLM code can be found in the examples.

Contributor

The main points for me are:

  • add_masked_text seems to me to only be used in pretraining
  • it is not called within the model, but prepares inputs and labels
  • it depends on the tokenizer
  • BERT's mlm is in the examples

=> because of those points, I'd favor not putting it in the modeling file

Contributor Author

The logic for MLM is inside a data collator, and these days people usually also want to probe the pretrained model on its pretraining objectives (I actually want to probe it as well), so that's why I believe having this as a separate function would make sense (even if only in the processing_flava file). Another question: how would I do the integration test for users without these functions, hardcode masked indices?

Contributor Author

I have removed these for now. I will look into some other alternative to provide them in an easier way later.
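
(For context, the BERT-style approach referenced above, where MLM inputs and labels are built outside the model by a data collator used in the examples, looks roughly like this; note that, as pointed out above, this collator also rewrites input_ids, which is what the FLAVA discussion wants to avoid:)

from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# The collator masks a fraction of tokens and creates the labels outside the model:
batch = collator([tokenizer("a photo of a cat")])
# batch["input_ids"] -> ids with some tokens replaced by [MASK] (or random tokens)
# batch["labels"]    -> original ids at masked positions, -100 everywhere else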

)
class FLAVAImageModel(FLAVAPreTrainedModel):
config_class = FLAVAImageConfig
base_model_prefix = "flava.image_model"
Collaborator

Ah, I understand what you're trying to do. This is a current limitation of the PreTrainedModel API, as base_model_prefix can only be one value, not several depending on the model. So this is indeed the best way to achieve this.

"Image.Image", # noqa
np.ndarray,
"torch.Tensor", # noqa
List["Image.Image"], # noqa
List[np.ndarray],
List["torch.Tensor"], # noqa
Collaborator

I hate those kinds of type annotations anyway (not sure who they're actually helping, honestly, when they're this long), so do as you prefer :-)

@slow
def test_inference(self):
model_name = "facebook/flava-full"
model = FlavaModel.from_pretrained(model_name).to(torch_device)
Contributor

Thanks for putting everything in one model!
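
(For context, a minimal usage sketch of the single-repo setup being referenced, using the class and checkpoint names from the test snippet above; the processor call and dummy image are assumptions for illustration, not something verified in this thread:)

import numpy as np
from PIL import Image
from transformers import FlavaModel, FlavaProcessor

image = Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))  # dummy image for illustration

model = FlavaModel.from_pretrained("facebook/flava-full")
processor = FlavaProcessor.from_pretrained("facebook/flava-full")

inputs = processor(text=["a photo of a cat"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# outputs holds the image, text and multimodal embeddings produced by the single checkpoint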

embeddings = self.dropout(embeddings)
return embeddings


Contributor

Can we add a # Copied from ... BertSelfAttention here? Aren't they exactly the same?

Contributor

Ok, BERT also allows one to pass encoder hidden states in case it's used as a decoder cross-attention layer. I think we could have the same logic here. But it's also ok for me to not add it if you think it doesn't fit Flava.

Contributor Author

Thanks for the comment. I think that is not useful for FLAVA so I would avoid it for now.

return outputs


class FlavaIntermediate(nn.Module):
Contributor

Add a # Copied from ... BERT?

Contributor Author

Added.
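
(For reference, the marker being discussed is the repo's # Copied from convention; roughly as below, assuming the layer mirrors BERT's intermediate layer - the exact source module may differ in the merged file:)

from torch import nn
from transformers.activations import ACT2FN

# Copied from transformers.models.bert.modeling_bert.BertIntermediate with Bert->Flava
class FlavaIntermediate(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
        if isinstance(config.hidden_act, str):
            self.intermediate_act_fn = ACT2FN[config.hidden_act]
        else:
            self.intermediate_act_fn = config.hidden_act

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.intermediate_act_fn(hidden_states)
        return hidden_states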

Contributor

@patrickvonplaten patrickvonplaten left a comment

Very cool! The PR is also good to go for me. I agree with @sgugger that preprocessing logic for pretraining should not go into the model file, but just live in an example (pretraining_flava.py under examples). For now maybe we'll just remove it and keep it alive in a PR in case we want to add an example?

Also, we could add more # Copied from statements from BERT, which would make the code more robust and easier to maintain, but I understand if you don't want this since it would add encoder_attention_mask / encoder_hidden_states logic which would only be useful when using Flava as a decoder => so it's also ok for me to not do it.

@apsdehal apsdehal merged commit a10f618 into huggingface:main May 11, 2022
@apsdehal apsdehal deleted the flava_integration branch May 11, 2022 21:56
Narsil pushed a commit to Narsil/transformers that referenced this pull request May 12, 2022
* [WIP] Add FLAVA model

This PR aims to add the [FLAVA](https://arxiv.org/abs/2112.04482) model to the transformers repo.

The following checklist delineates the things to be done for this PR
to be complete:

[x] Flava init
[x] Flava base models
[x] Flava layers
[x] Flava Configs
[x] Flava encoders
[x] Flava pretraining models
[ ] Flava classification/retrieval models (To be added in a separate PR)
[x] Documentation updates 
[x] Imports updates 
[x] Argstring updates
[x] Flava pretrained checkpoints 
[x] Flava tests
[x] Flava processors 
[x] Sanity check
[x] Lint
ArthurZucker pushed a commit to ArthurZucker/transformers that referenced this pull request May 12, 2022
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022