Add OWL-ViT model for zero-shot object detection #17938

Merged
merged 87 commits into main from owlvit on Jul 22, 2022
Changes from all commits
Commits (87)
bd08fd0
add owlvit model skeleton
alaradirik Jun 16, 2022
cff1597
add class and box predictor heads
alaradirik Jun 17, 2022
3fb93b5
convert modified flax clip to pytorch
alaradirik Jun 21, 2022
6b80535
fix box and class predictors
alaradirik Jun 22, 2022
a57c8c3
add OwlViTImageTextEmbedder
alaradirik Jun 22, 2022
298acc4
convert class and box head checkpoints
alaradirik Jun 23, 2022
aa62cf3
convert image text embedder checkpoints
alaradirik Jun 23, 2022
eed0c47
add object detection head
alaradirik Jun 23, 2022
9dfae2e
fix bugs
alaradirik Jun 27, 2022
12b3554
update conversion script
alaradirik Jun 27, 2022
6e88bdc
update conversion script
alaradirik Jun 27, 2022
d342a81
fix q,v,k,out weight conversion
alaradirik Jun 27, 2022
5a15207
add owlvit object detection output
alaradirik Jun 28, 2022
6adfabd
fix bug in image embedder
alaradirik Jun 28, 2022
ef94525
fix bugs in text embedder
alaradirik Jun 28, 2022
d4315a3
fix positional embeddings
alaradirik Jun 28, 2022
e385e33
fix bug in inference mode vision pooling
alaradirik Jun 29, 2022
985025e
update docs, init tokenizer and processor files
alaradirik Jun 29, 2022
6653465
support batch processing
alaradirik Jun 30, 2022
5e6e8b4
add OwlViTProcessor
alaradirik Jun 30, 2022
2e63dde
remove merge conflicts
alaradirik Jul 1, 2022
79083c5
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 1, 2022
35f9f31
readd owlvit imports
alaradirik Jul 1, 2022
78b7837
fix bug in OwlViTProcessor imports
alaradirik Jul 1, 2022
d919422
fix bugs in processor
alaradirik Jul 1, 2022
4635688
update docs
alaradirik Jul 1, 2022
8a1c825
fix bugs in processor
alaradirik Jul 1, 2022
363f4d5
update owlvit docs
alaradirik Jul 1, 2022
161cb2a
add OwlViTFeatureExtractor
alaradirik Jul 1, 2022
58aa6ce
style changes, add postprocess method to feature extractor
alaradirik Jul 4, 2022
37e3281
add feature extractor and processor tests
alaradirik Jul 4, 2022
261ed39
add object detection tests
alaradirik Jul 4, 2022
cf0591c
update conversion script
alaradirik Jul 5, 2022
02f3a00
update config paths
alaradirik Jul 5, 2022
ab0be98
update config paths
alaradirik Jul 5, 2022
2b215f5
fix configuration paths and bugs
alaradirik Jul 5, 2022
f97d3de
fix bugs in OwlViT tests
alaradirik Jul 5, 2022
1949b63
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 5, 2022
8680f13
add import checks to processor
alaradirik Jul 5, 2022
e6f51de
fix docs and minor issues
alaradirik Jul 6, 2022
e15988d
fix docs and minor issues
alaradirik Jul 6, 2022
b73a66d
fix bugs and issues
alaradirik Jul 7, 2022
68dd41d
fix bugs and issues
alaradirik Jul 7, 2022
11d5928
fix bugs and issues
alaradirik Jul 7, 2022
cef935d
fix bugs and issues
alaradirik Jul 8, 2022
34069b0
update docs and examples
alaradirik Jul 8, 2022
c4aa766
fix bugs and issues
alaradirik Jul 8, 2022
40a6504
update conversion script, fix positional embeddings
alaradirik Jul 8, 2022
9ce1942
process 2D input ids, update tests
alaradirik Jul 11, 2022
b330dfa
fix style and quality issues
alaradirik Jul 11, 2022
051aea6
update docs
alaradirik Jul 11, 2022
bf903f9
update docs and imports
alaradirik Jul 11, 2022
3592af5
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 11, 2022
60749fe
update OWL-ViT index.md
alaradirik Jul 11, 2022
ee007d6
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 12, 2022
6f1aa2d
fix bug in OwlViT feature ext tests
alaradirik Jul 12, 2022
6af7248
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 12, 2022
ba03dbf
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 12, 2022
865510c
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 13, 2022
df9313d
fix code examples, return_dict by default
alaradirik Jul 13, 2022
57d1b68
return_dict by default
alaradirik Jul 13, 2022
253af8b
minor fixes, add tests to processor
alaradirik Jul 13, 2022
3e180da
small fixes
alaradirik Jul 13, 2022
43c04af
add output_attentions arg to main model
alaradirik Jul 13, 2022
efc1ad3
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 13, 2022
8ceea4e
fix bugs
alaradirik Jul 13, 2022
4d416fe
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 14, 2022
4099199
remove output_hidden_states arg from main model
alaradirik Jul 14, 2022
e73b129
update self.config variables
alaradirik Jul 14, 2022
0f3d56f
add option to return last_hidden_states
alaradirik Jul 14, 2022
47c55ea
fix bug in config variables
alaradirik Jul 14, 2022
db70aee
fix copied from statements
alaradirik Jul 14, 2022
ea1452b
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 20, 2022
456bbb3
fix small issues and bugs
alaradirik Jul 20, 2022
c6cd321
fix bugs
alaradirik Jul 20, 2022
57c2cb8
fix bugs, support greyscale images
alaradirik Jul 21, 2022
7ba2c41
run fixup
alaradirik Jul 21, 2022
8c560cb
update repo name
alaradirik Jul 21, 2022
ef2b4f5
merge OwlViTImageTextEmbedder with obj detection head
alaradirik Jul 21, 2022
dfbc6b5
fix merge conflict
alaradirik Jul 21, 2022
27a5ce5
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 21, 2022
405685a
fix merge conflict
alaradirik Jul 21, 2022
a66a879
make fixup
alaradirik Jul 21, 2022
32525bd
fix bugs
alaradirik Jul 22, 2022
1f931eb
fix bugs
alaradirik Jul 22, 2022
1867147
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 22, 2022
75e5ccf
add additional processor test
alaradirik Jul 22, 2022
1 change: 1 addition & 0 deletions README.md
@@ -332,6 +332,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[NLLB](https://huggingface.co/docs/transformers/main/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1. **[OWL-ViT](https://huggingface.co/docs/transformers/main/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
1 change: 1 addition & 0 deletions README_ko.md
@@ -288,6 +288,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[NLLB](https://huggingface.co/docs/transformers/main/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1. **[OWL-ViT](https://huggingface.co/docs/transformers/main/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
1 change: 1 addition & 0 deletions README_zh-hans.md
@@ -312,6 +312,7 @@ conda install -c huggingface transformers
1. **[NLLB](https://huggingface.co/docs/transformers/main/model_doc/nllb)** (来自 Meta) 伴随论文 [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) 由 the NLLB team 发布。
1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (来自 the University of Wisconsin - Madison) 伴随论文 [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) 由 Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh 发布。
1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (来自 Meta AI) 伴随论文 [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) 由 Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al 发布。
1. **[OWL-ViT](https://huggingface.co/docs/transformers/main/model_doc/owlvit)** (来自 Google AI) 伴随论文 [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) 由 Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby 发布。
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (来自 Google) 伴随论文 [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) 由 Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu 发布。
1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (来自 Deepmind) 伴随论文 [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) 由 Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira 发布。
1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (来自 VinAI Research) 伴随论文 [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) 由 Dat Quoc Nguyen and Anh Tuan Nguyen 发布。
1 change: 1 addition & 0 deletions README_zh-hant.md
@@ -324,6 +324,7 @@ conda install -c huggingface transformers
1. **[NLLB](https://huggingface.co/docs/transformers/main/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1. **[OWL-ViT](https://huggingface.co/docs/transformers/main/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -326,6 +326,8 @@
  title: Nyströmformer
- local: model_doc/opt
  title: OPT
- local: model_doc/owlvit
  title: OWL-ViT
- local: model_doc/pegasus
  title: Pegasus
- local: model_doc/perceiver
2 changes: 2 additions & 0 deletions docs/source/en/index.mdx
@@ -130,6 +130,7 @@ The library currently contains JAX, PyTorch and TensorFlow implementations, pret
1. **[NLLB](model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[Nyströmformer](model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1. **[OPT](master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1. **[OWL-ViT](model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1. **[Pegasus](model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Perceiver IO](model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1. **[PhoBERT](model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
@@ -263,6 +264,7 @@ Flax), PyTorch, and/or TensorFlow.
| OpenAI GPT | ✅ | ✅ | ✅ | ✅ | ❌ |
| OpenAI GPT-2 | ✅ | ✅ | ✅ | ✅ | ✅ |
| OPT | ❌ | ❌ | ✅ | ✅ | ✅ |
| OWL-ViT | ❌ | ❌ | ✅ | ❌ | ❌ |
| Pegasus | ✅ | ✅ | ✅ | ✅ | ✅ |
| Perceiver | ✅ | ❌ | ✅ | ❌ | ❌ |
| PLBart | ✅ | ❌ | ✅ | ❌ | ❌ |
101 changes: 101 additions & 0 deletions docs/source/en/model_doc/owlvit.mdx
@@ -0,0 +1,101 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# OWL-ViT

## Overview

The OWL-ViT model (short for Vision Transformer for Open-World Localization) was proposed in [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. OWL-ViT is an open-vocabulary object detection network trained on a variety of (image, text) pairs. It can be used to query an image with one or multiple text queries to search for and detect target objects described in text.

The abstract from the paper is the following:

*Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.*

## Usage

OWL-ViT is a zero-shot text-conditioned object detection model. OWL-ViT uses [CLIP](clip) as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection.
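
The scoring step can be made concrete with a minimal, self-contained sketch. This is an illustration with random tensors and assumed sizes (576 patch tokens correspond to a 768×768 image with 32×32 patches, as in the base checkpoint), not the model's actual internals: each image patch embedding is compared against each text query embedding in place of a fixed classification layer.

```python
import torch

# Illustrative sizes only: with 32x32 patches on a 768x768 image the vision encoder
# produces 24 x 24 = 576 patch tokens, and the image is queried with 2 text prompts.
num_patches, num_queries, embed_dim = 576, 2, 512
image_embeds = torch.randn(num_patches, embed_dim)  # one embedding per image patch token
query_embeds = torch.randn(num_queries, embed_dim)  # one embedding per text query

# L2-normalize both sides and score every patch against every query;
# the sigmoid turns each dot product into an independent per-query detection score.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
query_embeds = query_embeds / query_embeds.norm(dim=-1, keepdim=True)
pred_logits = image_embeds @ query_embeds.T  # shape: [num_patches, num_queries]
pred_scores = torch.sigmoid(pred_logits)
```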

[`OwlViTFeatureExtractor`] can be used to resize (or rescale) and normalize images for the model and [`CLIPTokenizer`] is used to encode the text. [`OwlViTProcessor`] wraps [`OwlViTFeatureExtractor`] and [`CLIPTokenizer`] into a single instance to both encode the text and prepare the images. The following example shows how to perform object detection using [`OwlViTProcessor`] and [`OwlViTForObjectDetection`].


```python
>>> import requests
>>> from PIL import Image
>>> import torch

>>> from transformers import OwlViTProcessor, OwlViTForObjectDetection

>>> processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
>>> model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(text=[["a photo of a cat", "a photo of a dog"]], images=image, return_tensors="pt")

>>> outputs = model(**inputs)
>>> logits = outputs["logits"] # Prediction logits of shape [batch_size, num_patches, num_max_text_queries]
>>> boxes = outputs["pred_boxes"] # Object box boundaries of shape [batch_size, num_patches, 4]

>>> batch_size = boxes.shape[0]
>>> for i in range(batch_size):  # Loop over sets of images and text queries
...     boxes = outputs["pred_boxes"][i]
...     logits = torch.max(outputs["logits"][i], dim=-1)
...     scores = torch.sigmoid(logits.values)
...     labels = logits.indices
```
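
The predicted boxes are normalized with respect to the image size. The sketch below, continuing from the example above, converts them to absolute corner coordinates and keeps only confident detections; the box center format and the score threshold are assumptions for illustration, and the `post_process` method added to [`OwlViTFeatureExtractor`] in this PR can be used instead for this step.

```python
>>> score_threshold = 0.1  # illustrative threshold, tune for your use case
>>> width, height = image.size

>>> # Assuming normalized (center_x, center_y, width, height) boxes, convert to pixel corner coordinates
>>> cx, cy, w, h = boxes.unbind(-1)
>>> boxes_xyxy = torch.stack(
...     [(cx - w / 2) * width, (cy - h / 2) * height, (cx + w / 2) * width, (cy + h / 2) * height], dim=-1
... )

>>> for box, score, label in zip(boxes_xyxy, scores, labels):
...     if score >= score_threshold:
...         print(f"Query {label.item()}: score {score.item():.2f}, box {box.tolist()}")
```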

This model was contributed by [adirik](https://huggingface.co/adirik). The original code can be found [here](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit).

## OwlViTConfig

[[autodoc]] OwlViTConfig
- from_text_vision_configs

## OwlViTTextConfig

[[autodoc]] OwlViTTextConfig

## OwlViTVisionConfig

[[autodoc]] OwlViTVisionConfig

## OwlViTFeatureExtractor

[[autodoc]] OwlViTFeatureExtractor
- __call__

## OwlViTProcessor

[[autodoc]] OwlViTProcessor

## OwlViTModel

[[autodoc]] OwlViTModel
- forward
- get_text_features
- get_image_features

## OwlViTTextModel

[[autodoc]] OwlViTTextModel
- forward

## OwlViTVisionModel

[[autodoc]] OwlViTVisionModel
- forward

## OwlViTForObjectDetection

[[autodoc]] OwlViTForObjectDetection
- forward