Add OWL-ViT model for zero-shot object detection #17938

alaradirik · 2022-06-29T13:35:08Z

What does this PR do?

Adds the OWL-ViT model for open-vocabulary object detection. The model takes one or more text queries per image as input.
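As a rough illustration of what "one or more text queries per image" implies for batching, the sketch below pads each image's query list to the same length so the batch can be flattened into a rectangular set of text inputs. The helper name `pad_queries` and the blank padding string are assumptions made for illustration; they are not taken from this PR's code.

```python
from typing import List, Tuple


def pad_queries(
    queries_per_image: List[List[str]], pad_query: str = " "
) -> Tuple[List[str], int]:
    """Pad every image's query list to the longest one, then flatten the batch.

    Hypothetical helper: the padding string and the flattening order are
    assumptions, not the behaviour of any specific library.
    """
    max_queries = max(len(queries) for queries in queries_per_image)
    flat: List[str] = []
    for queries in queries_per_image:
        # Append placeholder queries so each image contributes max_queries entries.
        flat.extend(queries + [pad_query] * (max_queries - len(queries)))
    return flat, max_queries


# Two images: the first has two queries, the second only one.
batch = [["a photo of a cat", "a photo of a dog"], ["a remote control"]]
flat_queries, max_queries = pad_queries(batch)
```

With this layout, entry `i * max_queries + j` of the flattened list is query `j` of image `i`, so per-image results can be recovered by reshaping.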

Original repo:
https://github.com/google-research/scenic/tree/a41d24676f64a2158bfcd7cb79b0a87673aa875b/scenic/projects/owl_vit

Test notebook:
https://colab.research.google.com/drive/1IMPWZcnlMy-tdnTDrUcOZU3oiGg-hTem?usp=sharing

@sgugger could you review my draft PR, please?

amyeroberts

Looking good :D This is a great contribution and an impressive amount of work!

Most comments are about the logic in the Processor, plus a few nits. I can see there are still tests to be added, so I haven't reviewed those thoroughly yet. Could you add tests for the processor too?

src/transformers/models/owlvit/configuration_owlvit.py

src/transformers/models/owlvit/convert_owlvit_original_flax_to_hf.py

src/transformers/models/owlvit/modeling_owlvit.py

src/transformers/models/owlvit/processing_owlvit.py

src/transformers/models/owlvit/modeling_owlvit.py

amyeroberts

LGTM! ❤️

src/transformers/models/clip/configuration_clip.py

src/transformers/models/owlvit/feature_extraction_owlvit.py

src/transformers/models/owlvit/modeling_owlvit.py

NielsRogge · 2022-07-22T07:39:02Z

src/transformers/models/owlvit/modeling_owlvit.py

+        text_model_last_hidden_states = None
+        vision_model_last_hidden_states = None
+
+        if output_hidden_states:


It seems that if a user specifies output_hidden_states, the input_ids and pixel_values are forwarded twice through the model?
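One way to avoid a second pass, sketched here with toy stand-ins rather than the real OWL-ViT modules, is to run the encoder once and read the hidden states from that single result. `ToyEncoder` and `detect` are hypothetical names; only the `output_hidden_states` flag comes from the discussion above.

```python
from typing import List, Optional, Tuple


class ToyEncoder:
    """Stand-in encoder (hypothetical) that counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, inputs: List[float]) -> Tuple[List[float], List[float]]:
        self.calls += 1
        hidden = [x * 2.0 for x in inputs]    # pretend last hidden state
        pooled = [sum(hidden) / len(hidden)]  # pretend pooled embedding
        return hidden, pooled


def detect(
    encoder: ToyEncoder, inputs: List[float], output_hidden_states: bool = False
) -> Tuple[List[float], Optional[List[float]]]:
    # Single forward pass: both outputs come from the same call.
    hidden, pooled = encoder(inputs)
    # Expose the hidden state only on request instead of re-running the encoder.
    last_hidden = hidden if output_hidden_states else None
    return pooled, last_hidden


encoder = ToyEncoder()
pooled, last_hidden = detect(encoder, [1.0, 2.0, 3.0], output_hidden_states=True)
```

The call counter makes the property easy to assert in a test: requesting hidden states should not change how many times the encoder runs.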

NielsRogge

My remaining comments are:

- When output_hidden_states=True for the object detection model, is a forward pass performed twice?
- It would be great to add an integration test for the processor, where you take, for instance, the cats image with 2 texts and compare against some expected input_ids (similar to this test).

* add owlvit model skeleton * add class and box predictor heads * convert modified flax clip to pytorch * fix box and class predictors * add OwlViTImageTextEmbedder * convert class and box head checkpoints * convert image text embedder checkpoints * add object detection head * fix bugs * update conversion script * update conversion script * fix q,v,k,out weight conversion conversion * add owlvit object detection output * fix bug in image embedder * fix bugs in text embedder * fix positional embeddings * fix bug in inference mode vision pooling * update docs, init tokenizer and processor files * support batch processing * add OwlViTProcessor * remove merge conflicts * readd owlvit imports * fix bug in OwlViTProcessor imports * fix bugs in processor * update docs * fix bugs in processor * update owlvit docs * add OwlViTFeatureExtractor * style changes, add postprocess method to feature extractor * add feature extractor and processor tests * add object detection tests * update conversion script * update config paths * update config paths * fix configuration paths and bugs * fix bugs in OwlViT tests * add import checks to processor * fix docs and minor issues * fix docs and minor issues * fix bugs and issues * fix bugs and issues * fix bugs and issues * fix bugs and issues * update docs and examples * fix bugs and issues * update conversion script, fix positional embeddings * process 2D input ids, update tests * fix style and quality issues * update docs * update docs and imports * update OWL-ViT index.md * fix bug in OwlViT feature ext tests * fix code examples, return_dict by default * return_dict by default * minor fixes, add tests to processor * small fixes * add output_attentions arg to main model * fix bugs * remove output_hidden_states arg from main model * update self.config variables * add option to return last_hidden_states * fix bug in config variables * fix copied from statements * fix small issues and bugs * fix bugs * fix bugs, support greyscale images * 
run fixup * update repo name * merge OwlViTImageTextEmbedder with obj detection head * fix merge conflict * fix merge conflict * make fixup * fix bugs * fix bugs * add additional processor test

innat · 2022-08-05T17:43:12Z

Are there any plans to extend this to a TensorFlow version?
There seems to be an official conversion script.

amyeroberts · 2022-08-05T17:49:23Z

Hi @innat. Yes, @alaradirik is already working on it! The PR is here: #18450

You can find out which models are being implemented by searching the open issues and PRs, for example.

alaradirik and others added 29 commits June 16, 2022 16:16

add owlvit model skeleton

bd08fd0

add class and box predictor heads

cff1597

convert modified flax clip to pytorch

3fb93b5

fix box and class predictors

6b80535

add OwlViTImageTextEmbedder

a57c8c3

convert class and box head checkpoints

298acc4

convert image text embedder checkpoints

aa62cf3

add object detection head

eed0c47

fix bugs

9dfae2e

update conversion script

12b3554

update conversion script

6e88bdc

fix q,v,k,out weight conversion conversion

d342a81

add owlvit object detection output

5a15207

fix bug in image embedder

6adfabd

fix bugs in text embedder

ef94525

fix positional embeddings

d4315a3

fix bug in inference mode vision pooling

e385e33

update docs, init tokenizer and processor files

985025e

support batch processing

6653465

add OwlViTProcessor

5e6e8b4

remove merge conflicts

2e63dde

Merge branch 'huggingface:main' into owlvit

79083c5

readd owlvit imports

35f9f31

fix bug in OwlViTProcessor imports

78b7837

fix bugs in processor

d919422

update docs

4635688

fix bugs in processor

8a1c825

update owlvit docs

363f4d5

add OwlViTFeatureExtractor

161cb2a

amyeroberts reviewed Jul 1, 2022

alaradirik added 5 commits July 20, 2022 17:27

fix bugs

c6cd321

fix bugs, support greyscale images

57c2cb8

run fixup

7ba2c41

update repo name

8c560cb

merge OwlViTImageTextEmbedder with obj detection head

ef2b4f5

NielsRogge reviewed Jul 21, 2022

src/transformers/models/owlvit/modeling_owlvit.py

alaradirik and others added 4 commits July 21, 2022 16:20

fix merge conflict

dfbc6b5

Merge branch 'huggingface:main' into owlvit

27a5ce5

fix merge conflict

405685a

make fixup

a66a879

amyeroberts approved these changes Jul 21, 2022
