
[feat] Add FLAVA model #16654

Merged: 3 commits merged into huggingface:main from flava_integration on May 11, 2022

Conversation

apsdehal
Contributor

@apsdehal apsdehal commented Apr 7, 2022

This PR aims to add the FLAVA model to the transformers repo.

The following checklist delineates the things to be done for this PR
to be complete:

  • Flava init
  • Flava base models
  • Flava layers
  • Flava Configs
  • Flava encoders
  • Flava pretraining models
  • Flava codebook
  • Flava feature extractors
  • Flava classification/retrieval models (in progress)
  • Documentation updates
  • Encoding utilities
  • Imports updates
  • Argstring updates
  • Flava pretrained checkpoints
  • Flava unit tests
  • Flava integration tests
  • Sanity check
  • Lint
  • Flava Processors

@apsdehal apsdehal force-pushed the flava_integration branch 8 times, most recently from eb4e9d5 to c1489f5 Compare April 14, 2022 21:59
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Apr 14, 2022

The documentation is not available anymore as the PR was closed or merged.

@apsdehal apsdehal force-pushed the flava_integration branch 11 times, most recently from 8ef39ad to 5c2db5d Compare April 19, 2022 16:03
@apsdehal apsdehal marked this pull request as ready for review April 19, 2022 16:38
@apsdehal apsdehal changed the title [WIP] Add FLAVA model [feat] Add FLAVA model Apr 19, 2022
Collaborator

@sgugger sgugger left a comment

Main comment is a little bit below. Sorry, I've clicked review by mistake 😬 .

README.md Outdated
@@ -265,6 +265,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc_flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
Collaborator

Suggested change
1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc_flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
1. **[FLAVA](https://huggingface.co/docs/transformers/main/model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.

Typo + the model will only be in the main branch until next release, so the link needs to point to the main version of the doc.

@@ -0,0 +1,105 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Collaborator

Suggested change
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Comment on lines 25 to 31
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety
of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal
(with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising
direction would be to use a single holistic universal model, as a "foundation", that targets all modalities
at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and
cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate
impressive performance on a wide range of 35 tasks spanning these target modalities.
Collaborator

Suggested change
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety
of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal
(with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising
direction would be to use a single holistic universal model, as a "foundation", that targets all modalities
at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and
cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate
impressive performance on a wide range of 35 tasks spanning these target modalities.
*State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety
of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal
(with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising
direction would be to use a single holistic universal model, as a "foundation", that targets all modalities
at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and
cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate
impressive performance on a wide range of 35 tasks spanning these target modalities.*

impressive performance on a wide range of 35 tasks spanning these target modalities.


<!-- Tips: -->
Collaborator

Needs to be filled.

<!-- Tips: -->

This model was contributed by [aps](https://huggingface.co/aps).
<!-- The original code can be found [here](<INSERT LINK TO GITHUB REPO HERE>).-->
Collaborator

Needs to be filled.

## FLAVAConfig

[[autodoc]] FLAVAConfig
- from_configs
Collaborator

Not sure it's necessary here. Will be automatically added.

Suggested change
- from_configs

<!-- The original code can be found [here](<INSERT LINK TO GITHUB REPO HERE>).-->


## FLAVAConfig
Collaborator

Will put this in the general comment as well, but all names need to be re-cased to FlavaXxx. We don't use BERTConfig, BERTModel, etc., but BertConfig, BertModel...

Contributor Author

@apsdehal apsdehal May 3, 2022

Re this topic: there are multiple examples in the codebase that actually don't follow this convention: XLM, WavLM, ResNet, GPT2, CLIP, CTRL, and many others. Overall, the choice of casing seems random and not really enforced everywhere. I am not sure why those models have this privilege and FLAVA can't?

Collaborator

ResNet and WavLM are not all caps, so I don't really see your point there. If FLAVA stands for two different words, we can capitalize the second one, but I don't think that's the case.
This is not a privilege, this is from the beginnings of Transformers. We then got user feedback that those casings made it difficult for them to find the model classes, so we stopped.

Comment on lines 549 to 551
_import_structure["models.flava"].append("FLAVAFeatureExtractor")
_import_structure["models.flava"].append("FLAVACodebookFeatureExtractor")
_import_structure["models.flava"].append("FLAVAProcessor")
Collaborator

All in one line with an extend
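
i.e., something along these lines (class names taken from the snippet above):

_import_structure["models.flava"].extend(["FLAVAFeatureExtractor", "FLAVACodebookFeatureExtractor", "FLAVAProcessor"])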

Comment on lines 222 to 228
# (
# "flava",
# (
# "CLIPTokenizer",
# "CLIPTokenizerFast" if is_tokenizers_available() else None,
# ),
# ),
Collaborator

Should be cleaned up or fixed. Looking at the processor file, it seems to be using the BertTokenizer?

# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.

# Copyright 2021 Meta Platforms authors and The HuggingFace Team. All rights reserved.
Collaborator

Suggested change
# Copyright 2021 Meta Platforms authors and The HuggingFace Team. All rights reserved.
# Copyright 2022 Meta Platforms authors and The HuggingFace Team. All rights reserved.

Collaborator

@sgugger sgugger left a comment

Thanks for your PR, exciting to see this model arrive soon in the library!

One thing is that all names need to be re-cased to FlavaXxx. We don't use BERTConfig, BERTModel etc., but BertConfig, BertModel... Likewise we should have FlavaModel, FlavaConfig etc.

You use Meta or Facebook in several places. You should pick one and stick to it.

@@ -0,0 +1,611 @@
# coding=utf-8
# Copyright 2021 Meta Platforms authors and The HuggingFace Team. All rights reserved.
Collaborator

Suggested change
# Copyright 2021 Meta Platforms authors and The HuggingFace Team. All rights reserved.
# Copyright 2022 Meta Platforms authors and The HuggingFace Team. All rights reserved.

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" FLAVA model configuration"""
Collaborator

Suggested change
""" FLAVA model configuration"""
""" FLAVA model configurations"""

logger = logging.get_logger(__name__)

FLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"flava-full": "https://huggingface.co/aps/flava-full/resolve/main/config.json",
Collaborator

Suggested change
"flava-full": "https://huggingface.co/aps/flava-full/resolve/main/config.json",
"aps/flava-full": "https://huggingface.co/aps/flava-full/resolve/main/config.json",

The namespace also needs to be in the key.

This is the configuration class to store the configuration of a [`FLAVAImageModel`]. It is used to instantiate an
FLAVA model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the FLAVA
[full](https://huggingface.co/aps/flava-full) architecture.
Collaborator

Suggested change
[full](https://huggingface.co/aps/flava-full) architecture.
[aps/flava-full](https://huggingface.co/aps/flava-full) architecture.

Contributor Author

I would leave this as it is to avoid ambiguity in the meaning of the sentence itself.

Collaborator

No, it is ambiguous as it doesn't show the real name of the checkpoint.

Comment on lines 63 to 67
image_size (`int`, *optional*, defaults to `224`):
The size (resolution) of each image.
patch_size (`int`, *optional*, defaults to `16`):
The size (resolution) of each patch.
num_channels (`int`, *optional*, defaults to `3`):
Collaborator

Numbers for default values should not be between backticks.
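
i.e., the docstring entries would read:

image_size (`int`, *optional*, defaults to 224):
    The size (resolution) of each image.
patch_size (`int`, *optional*, defaults to 16):
    The size (resolution) of each patch.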

@@ -0,0 +1,1234 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
Collaborator

Suggested change
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.

@@ -0,0 +1,416 @@
# coding=utf-8
# Copyright 2021 HuggingFace Inc.
Collaborator

Suggested change
# Copyright 2021 HuggingFace Inc.
# Copyright 2022 HuggingFace Inc.

Comment on lines 630 to 636
def __init__(
self,
parent,
batch_size=12,
image_size=112,
num_channels=3,
):
Collaborator

Can all fit in one line.
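
i.e.:

def __init__(self, parent, batch_size=12, image_size=112, num_channels=3):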

Comment on lines 795 to 801
def _test_model(
self,
config,
inputs,
test_image=False,
test_text=False,
):
Collaborator

Can all fit in one line.
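
i.e.:

def _test_model(self, config, inputs, test_image=False, test_text=False):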

Comment on lines 41 to 57
vocab_tokens = [
"[UNK]",
"[CLS]",
"[SEP]",
"[PAD]",
"[MASK]",
"want",
"##want",
"##ed",
"wa",
"un",
"runn",
"##ing",
",",
"low",
"lowest",
]
Collaborator

Should all be in one line, with black deactivated for it to make it more readable (# fmt: off before and # fmt: on after).
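
i.e., something like:

# fmt: off
vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]", "want", "##want", "##ed", "wa", "un", "runn", "##ing", ",", "low", "lowest"]
# fmt: on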

@apsdehal
Contributor Author

@sgugger Thanks for your review on the PR. About lowercasing things, I followed what I saw in the repo as not all models were lower case. An example of this would be CLIP where classes are named as CLIPConfig, CLIPTextEncoder and so on. Would it be possible to keep FLAVA as uppercase?

@sgugger
Collaborator

sgugger commented Apr 20, 2022

CLIP was a mistake IMO, and it makes it harder on users to guess the right names of the models if we have different casings all the time. It's not because we made a mistake once that we should repeat it again and again just because there is a precedent ;-)

@require_torch
class FLAVAForPreTrainingIntegrationTest(unittest.TestCase):
@slow
def test_inference(self):
Contributor

cool test!

model_name = "aps/flava-full"
codebook_name = "aps/flava-codebook"
model = FLAVAForPreTraining.from_pretrained(model_name).to(torch_device)
codebook = FLAVACodebook.from_pretrained(codebook_name).to(torch_device)
Contributor

Hmm, I don't really like that flava-full and flava-codebook are in two different repositories. All of our models usually work with one and only one modeling file. Can't we merge the FlavaCodebook model into the FlavaForPreTraining class? I don't like that inference is dependent on two different repos.

Contributor Author

I would ideally like to avoid that because the codebook is somewhat independent of FLAVA itself, as it is mostly just a custom DALLE encoder. A different codebook can potentially be used with FLAVA. If I join them into one, I will lose that benefit. Wdyt?

Contributor

I see, I guess we have kind of a composition-model problem here (maybe similar to Blenderbot2 or RAG, cc @sgugger).
My opinion here is the following. The paper only used the DALLE encoder (no?) and that's what we expect people to use when reproducing the paper. In our experience few people expect things that have not been done in the paper to work out of the box.

The way it's currently implemented doesn't really allow the user to be able to use another encoder out-of-the-box either. E.g. the user would have to code up a new encoder class which respects the logic of:

mim_labels = codebook.get_codebook_indices(**codebook_fe([image, image], return_tensors="pt").to(torch_device))

i.e. the model would also require a get_codebook_indices(...) function, etc. This already requires quite some knowledge about the model, in which case it's probably as easy to fork modeling_flava.py and replace:

self.codebook = FlavaCodebook inside the ForPreTraining class with a different encoder (which might even be easier to understand).
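
(For context, a minimal sketch of what such a replacement codebook would need to provide, based on the get_codebook_indices interface quoted above; the class name, shapes, and dummy output here are purely illustrative, not part of the PR:)

import torch
from torch import nn

class MyCustomCodebook(nn.Module):
    """Hypothetical drop-in replacement for FlavaCodebook (illustrative only)."""

    def __init__(self, num_codebook_tokens: int = 8192, num_patches: int = 196):
        super().__init__()
        self.num_codebook_tokens = num_codebook_tokens
        self.num_patches = num_patches

    def get_codebook_indices(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Map each image patch to a discrete code id, as a DALL-E style encoder would.
        # A real implementation would run the image through its encoder; here we return dummy ids.
        batch_size = pixel_values.shape[0]
        return torch.zeros(batch_size, self.num_patches, dtype=torch.long)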

Contributor

So my opinion here is that we should put FlavaCodebook inside the modeling classes that need it, so that we only have a single pytorch_model.bin file, which makes it easier to reproduce / reuse the default Flava as described in the paper.

Contributor

In general "self-containment" is more important then "flexibility of modularization" for Transformers in my opinion.

Contributor

If we had multiple pretrained checkpoints for different codebook classes, we could have added a codebook_type=dalle flag to the config that would have switched in the correct codebook class, but nevertheless I think we should put it inside the FlavaForPreTraining class.

Let's see what @sgugger thinks here

Contributor Author

The way it's currently implemented doesn't really allow the user to be able to use another encoder out-of-the-box either. E.g. the user would have to code up a new encoder class which respects the logic of:

Not really, because that is just one interface for creating mim_labels. What FLAVAPretrainedModel is concerned with is mim_labels, not where they came from or how they were created. I would ideally like to avoid creating mim_labels inside the pretraining class if possible.


@require_vision
@require_torch
class FLAVAModelIntegrationTest(unittest.TestCase):
Contributor

Shouldn't Flava also work for text classification (GLUE) or image classification? Can't we add those heads here as well?

Contributor

And the corresponding classes?

Contributor Author

They should! They are planned for a separate PR. I am trying to unblock some use cases by first adding the pretraining classes.

Contributor

Ok that's fine by me!

""",
FLAVA_START_DOCSTRING.format(config="FLAVACodebookConfig"),
)
class FLAVACodebook(FLAVAPreTrainedModel):
Contributor

Just to understand this a bit better - would this model class ever be used as a stand-alone model or in inference? Or does it purely exist for pretraining?

If this model is only used for pretraining (MIM), I'd prefer to not make it its own class, but to just add it as a submodule of ...ForPreTraining. It's quite important that we only have one modeling file IMO and don't create dependencies between different model repos that can get out of sync.

The way I currently understand it is that this model class should never be used as a standalone model

Contributor Author

There are multiple reasons:

  1. The codebook generates only MIM labels, which are not needed for running inference on the model. It would unnecessarily bloat the model if we added the codebook to the pretrained model even when the user doesn't need it. As others have discussed, the logic for generating mim_labels should be independent of the model itself.
  2. Any other codebook can be used with FLAVA, not just this one. Keeping it separate helps with that.
  3. FLAVACodebook can also be used independently of FLAVA for other projects. It is very similar to DALLE's encoder.

Contributor

Ok, but then let's just add the codebook to ForPreTraining but not to Model; then it would not bloat inference, no?

Contributor

Same for ForImageClassification - no need to add it there, but I think it's better to add it to ForPreTraining so that a single forward pass of ForPreTraining is enough to get a loss.

Contributor Author

A lot of tasks such as Winoground, zero-shot classification, and ITM retrieval use ForPreTraining out of the box, and that's what I meant by inference. I would really like to avoid generating MIM labels unnecessarily inside the model, as they are not even needed for inference. This is in general the case for targets, no? Models are provided targets, they don't generate them, unless it is a teacher-student model, which this is not. Wdyt?

Contributor

Ok I might not have been very clear here.

For 2), the codebook should not generate the labels; it should process the image into the labels (the codebook is an encoder if I understand correctly), which means that by passing a mask that states which part of the codebook output is masked, together with the pixel_values (or whatever will be processed by the codebook), we get the labels. It's not like the codebook model generates labels out of nothing, no?

Contributor

That's what is done for Wav2Vec2

Contributor

For me, having a single PyTorch file / single model repo is important, which is what I'm after here.

I don't think we have any architectures that require multiple model repos / model directories for inference, fine-tuning, or pretraining (no?). I'd like to keep it that way. RAG and CLIP are composite models, but they are self-contained in a single model repo.

cc @patil-suraj @LysandreJik as well here

Contributor Author

Copy-pasting some offline discussion that I had with Patrick for completeness:

P:

Yeah, I think we're more or less on the same page. I also think we should try to avoid unnecessarily using memory, and labels should not be created inside the model. But if the creation of labels depends on pretrained weights (which seems to be the case here, and which is also the case for Wav2Vec2), IMO we should add the weights to the model architecture and pass something like mim_label_mask_indices, which will be the indices used to mask the output of the codebook.
This has the following advantages:

  • Label creation is deterministic because it all depends on mim_label_mask_indices
  • Single pytorch file for training (that's very important for us). Otherwise people don't have a self-contained model folder that contains all the weights when doing model.save_pretrained("...") and it creates difficult dependencies between repos
  • We don't really try to have a modularized architecture style, rather a self-contained style

A:

I think that is bool_masked_pos.
What you are saying is correct. The codebook generates mim_labels, but they are masked using bool_masked_pos to determine which ones we need to predict.

P:

Cool, couldn't we have bool_masked_pos be an input to ForPreTraining? Just like in Wav2Vec2?

A:

We have it as of now. mim_labels are masked using this later on (before the cross-entropy calculation).

For the moment, I have been using the same pretrained weights for FlavaModel and FlavaForPreTraining. Using the FlavaForPreTraining weights to load the FlavaModel will throw a lot of unexpected-keys errors, but that is fine I guess?

P:

This is expected - that happens every time I call Wav2Vec2Model.from_pretrained(...).
It also happens every time you do BertModel.from_pretrained("bert-base-cased").

A:

Sounds good. I think I am okay with this. I will also add an option add_codebook which is true by default, then?
I would also allow the user to pass mim_labels to the forward. If they do, we use those; otherwise we use the codebook to generate the labels and mask them.
The only thing left is how we generate mlm_labels more elegantly.
The DataCollator doesn't work for me as it updates input_ids, which we need as is. Maybe a new data collator, I am not sure. There should be something simpler.

P:

That's what I do for Wav2Vec2:

Then this, for pretraining:
sampled_negative_indices = _sample_negative_indices(

A:

I also remembered that the feature extractor will now also need to output a separately processed image for the codebook, and the forward needs to take it as a parameter. That was one more reason to keep them separate.

P:

Ah interesting, so you need 1 tokenizer and 2 feature extractors for FLAVA?
IMO, the best approach would then be to add a return_pixel_values_for_codebook parameter to the feature extractor.
Actually, it'd be good if we discussed this publicly on GitHub so that Sylvain, Lysandre and Suraj could also take a look and give their opinion 🙂

So after that, we are back to GitHub.

Would be good to have your opinions @sgugger @LysandreJik @patil-suraj
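
(For context, a rough sketch of the feature-extractor interface proposed just above. The return_pixel_values_for_codebook flag and the codebook_pixel_values output key are taken from or inferred from the discussion and are assumptions, not necessarily the final API; the class name follows the casing used in the PR at this point.)

import numpy as np
from PIL import Image
from transformers import FLAVAFeatureExtractor  # class name as used in the PR at this point

image = Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))  # dummy image for illustration

feature_extractor = FLAVAFeatureExtractor.from_pretrained("aps/flava-full")
# One call would return both the regular pixel values and a second set of images
# preprocessed for the codebook (flag name is hypothetical, from the suggestion above):
inputs = feature_extractor([image, image], return_pixel_values_for_codebook=True, return_tensors="pt")
# inputs["pixel_values"]          -> fed to the image encoder
# inputs["codebook_pixel_values"] -> fed to the codebook to derive MIM targets (key name illustrative)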

Collaborator

I agree that everything should be in one repo, we never use two separate repos for one model, so that does imply the codebook should be part of the model that uses it, to be in the same checkpoint (and if the same codebook weights can be reused for different FLAVA models, it's completely supported if the same codebook weights are in all the checkpoints).

The only alternatives (while keeping a single repo) would be to have the weights of the codebook under some ad-hoc prefix corresponding to the base prefix of the codebook and different from the base prefix of the other FLAVA models, which is a rather ugly hack; or to use a different branch for the codebook, which would result in codebook = Xxx.from_pretrained(checkpoint, revision="codebook").

Contributor

@patrickvonplaten patrickvonplaten left a comment

Model looks super cool! Very excited to try it out!

My main remarks are:

    1. Why don't we add any heads for image classification or text classification? FLAVA got pretty good results on those, no?
    2. I'm not a fan of loading checkpoints from two different repos for training / inference. I think FlavaCodebook should be refactored as a submodule of all classes that actually require it (just pretraining?).

@kshitijkg

kshitijkg commented May 7, 2022

Hi! I was wondering if there are any updates on this and if this will be merged soon?

@apsdehal apsdehal force-pushed the flava_integration branch 5 times, most recently from 814ed8f to 806a056 Compare May 9, 2022 21:36
This PR aims to add the [FLAVA](https://arxiv.org/abs/2112.04482) model to the transformers repo.

The following checklist delineates the things to be done for this PR
to be complete:

[x] Flava init
[x] Flava base models
[x] Flava layers
[x] Flava Configs
[x] Flava encoders
[x] Flava pretraining models
[ ] Flava classification/retrieval models (in progress)
[x] Documentation updates (in progress)
[x] Imports updates (in progress)
[x] Argstring updates
[x] Flava pretrained checkpoints (in progress)
[x] Flava tests
[x] Flava processors (in progress)
[x] Sanity check
[x] Lint
@apsdehal
Contributor Author

@kshitijkg This is almost ready and should land in the core soon.

@sgugger @patrickvonplaten This is ready for another round of review. I have resolved all of your comments except a few where I have asked for clarifications. Hoping to land it soon.

Collaborator

@sgugger sgugger left a comment

Thanks a lot for addressing all comments. I'll let @patrickvonplaten comment on the part he felt strongly about, for me all looks good apart from the two preprocessing functions inside the modeling file.

Comment on lines 17 to 18
The FLAVA model was proposed in [FLAVA: A Foundational Language And Vision Alignment Model
](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela and is accepted at CVPR 2022.
Collaborator

Suggested change
The FLAVA model was proposed in [FLAVA: A Foundational Language And Vision Alignment Model
](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela and is accepted at CVPR 2022.
The FLAVA model was proposed in [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela and is accepted at CVPR 2022.

return tuple(self[k] if k not in transformer_outputs else getattr(self, k).to_tuple() for k in self.keys())


def add_masked_text(
Collaborator

This and add_whole_word_masked_text are not linked to the model per se, so they should go somewhere else (probably in an example script showing how to use FLAVA). It's as if the word-masking code was in the BERT file (which it's not).

Contributor

Agree

Contributor

That's only relevant for pretraining no?

Contributor Author

So, I am somewhat conflicted about these. I think Flava is incomplete without these, and I am also using them in the tests to verify that the pretraining loss calculation logic is working as expected. These are actually similar in spirit to the processing functions in modeling_wav2vec2. I would ideally like to keep them if possible, and I can move them to processing_flava so as not to pollute the modeling code if that helps. What do you think?

Contributor

Note that while _compute_mask_indices is also used for pretraining in Wav2Vec2, its main purpose is to apply SpecAugment during fine-tuning (it's called within the model), which is similar in spirit to dropout. Also, no tokenizer logic is used in _compute_mask_indices; it is therefore not dependent on any settings of the tokenizer, just the model.

Contributor

I'd still think it's better to not put them there, mainly because we want Flava to have a similar design as BERT, no? BERT's MLM code can be found in the examples.

Contributor

The main points for me are:

  • add_masked_text seems to me to only be used in pretraining
  • it is not called within the model, but prepares inputs and labels
  • it depends on the tokenizer
  • BERT's mlm is in the examples

=> because of those points, I'd favor not putting it in the modeling file

Contributor Author

The logic for MLM is inside a data collator, and these days people usually also want to probe the pretrained model on its pretraining objectives (I actually want to probe it as well), so that's why I believe having this as a separate function would make sense (even if only in the processing_flava file). Another question: how would I do the integration test for users without these functions, hardcode masked indices?

Contributor Author

I have removed these for now. I will look into some other alternative to provide them in an easier way later.
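
(For context, the BERT-style approach referenced above, where MLM inputs and labels are built outside the model by a data collator used in the examples, looks roughly like this; note that, as pointed out above, this collator also rewrites input_ids, which is what the FLAVA discussion wants to avoid:)

from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# The collator masks a fraction of tokens and creates the labels outside the model:
batch = collator([tokenizer("a photo of a cat")])
# batch["input_ids"] -> ids with some tokens replaced by [MASK] (or random tokens)
# batch["labels"]    -> original ids at masked positions, -100 everywhere else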

)
class FLAVAImageModel(FLAVAPreTrainedModel):
config_class = FLAVAImageConfig
base_model_prefix = "flava.image_model"
Collaborator

Ah, I understand what you're trying to do. This is a current limitation of the PreTrainedModel API, as base_model_prefix can only be one value, not several depending on the model. So this is indeed the best way to achieve this.

"Image.Image", # noqa
np.ndarray,
"torch.Tensor", # noqa
List["Image.Image"], # noqa
List[np.ndarray],
List["torch.Tensor"], # noqa
Collaborator

I hate those kinds of type annotations anyway (not sure who they're actually helping, honestly, when they're this long), so do as you prefer :-)

@slow
def test_inference(self):
model_name = "facebook/flava-full"
model = FlavaModel.from_pretrained(model_name).to(torch_device)
Contributor

Thanks for putting everything in one model!
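
(For context, a minimal usage sketch of the single-repo setup being referenced, using the class and checkpoint names from the test snippet above; the processor call and dummy image are assumptions for illustration, not something verified in this thread:)

import numpy as np
from PIL import Image
from transformers import FlavaModel, FlavaProcessor

image = Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))  # dummy image for illustration

model = FlavaModel.from_pretrained("facebook/flava-full")
processor = FlavaProcessor.from_pretrained("facebook/flava-full")

inputs = processor(text=["a photo of a cat"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# outputs holds the image, text and multimodal embeddings produced by the single checkpoint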

embeddings = self.dropout(embeddings)
return embeddings


Contributor

Can we add a # Copied from ... BertSelfAttention here? Aren't they exactly the same?

Contributor

Ok, BERT also allows one to pass encoder hidden states in case it's used as a decoder cross-attention layer. I think we could have the same logic here. But it's also ok for me to not add it if you think it doesn't fit Flava.

Contributor Author

Thanks for the comment. I think that is not useful for FLAVA so I would avoid it for now.

return outputs


class FlavaIntermediate(nn.Module):
Contributor

Add a # Copied from ... BERT?

Contributor Author

Added.
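
(For reference, the marker being discussed is the repo's # Copied from convention; roughly as below, assuming the layer mirrors BERT's intermediate layer - the exact source module may differ in the merged file:)

from torch import nn
from transformers.activations import ACT2FN

# Copied from transformers.models.bert.modeling_bert.BertIntermediate with Bert->Flava
class FlavaIntermediate(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
        if isinstance(config.hidden_act, str):
            self.intermediate_act_fn = ACT2FN[config.hidden_act]
        else:
            self.intermediate_act_fn = config.hidden_act

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.intermediate_act_fn(hidden_states)
        return hidden_states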

Contributor

@patrickvonplaten patrickvonplaten left a comment

Very cool! The PR is also good to go for me. I agree with @sgugger that preprocessing logic for pretraining should not go into the model file, but just live in an example (pretraining_flava.py under examples). For now maybe we'll just remove it and keep it alive in a PR in case we want to add an example?

Also, we could add more # Copied from statements from BERT, which would make the code more robust and easier to maintain, but I understand if you don't want this since it would add encoder_attention_mask / encoder_hidden_states logic which would only be useful when using Flava as a decoder => so it's also ok for me to not do it.

@apsdehal apsdehal merged commit a10f618 into huggingface:main May 11, 2022
@apsdehal apsdehal deleted the flava_integration branch May 11, 2022 21:56
Narsil pushed a commit to Narsil/transformers that referenced this pull request May 12, 2022
* [WIP] Add FLAVA model

This PR aims to add the [FLAVA](https://arxiv.org/abs/2112.04482) model to the transformers repo.

The following checklist delineates the things to be done for this PR
to be complete:

[x] Flava init
[x] Flava base models
[x] Flava layers
[x] Flava Configs
[x] Flava encoders
[x] Flava pretraining models
[ ] Flava classification/retrieval models (To be added in a separate PR)
[x] Documentation updates 
[x] Imports updates 
[x] Argstring updates
[x] Flava pretrained checkpoints 
[x] Flava tests
[x] Flava processors 
[x] Sanity check
[x] Lint
ArthurZucker pushed a commit to ArthurZucker/transformers that referenced this pull request May 12, 2022
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022