
Add visual prompt to processor of CLIPSeg model #20816

Merged: 7 commits into huggingface:main, Dec 21, 2022

Conversation

@idilsulo (Contributor) commented on Dec 18, 2022

What does this PR do?

Currently, the integrated CLIPSeg model only supports textual prompts. However, a key advantage of CLIPSeg is that one can provide visual prompts instead of textual prompts to perform semantic segmentation. For further details, refer to the original paper, Image Segmentation Using Text and Image Prompts (CVPR 2022).

This change can easily be added to the current CLIPSegProcessor by providing one additional parameter: the processor runs the visual prompt through the image processor and returns the result under an additional key, conditional_pixel_values.
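
For illustration, a minimal sketch of the intended usage (hedged: the image paths are placeholders, the visual_prompt argument is passed by keyword, and the exact merged signature may differ):

import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("scene.jpg")          # image to segment (placeholder path)
prompt_image = Image.open("prompt.jpg")  # visual prompt, e.g. an image of the target object (placeholder path)

# The processor runs the visual prompt through the image processor and returns it
# under the additional key conditional_pixel_values.
inputs = processor(images=image, visual_prompt=prompt_image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)  # forward accepts both pixel_values and conditional_pixel_values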

This PR complements the work done in this previous pull request.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case. -> Not discussed, but this only requires a minor change to fully support the CLIPSeg model.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests? -> Previous tokenizer and image processor tests apply.

Who can review?

Anyone in the community is free to review the PR.

Feel free to tag members/contributors who may be interested in your PR. @NielsRogge @sgugger @alaradirik

@HuggingFaceDocBuilderDev commented on Dec 18, 2022

The documentation is not available anymore as the PR was closed or merged.

@alaradirik (Contributor) left a comment

Hi @idilsulo, thank you for working on this! The changes look good to me.

Could you add a test to test_processor_clipseg.py as well?

cc @NielsRogge @sgugger

@idilsulo (Contributor, Author)

Hi @alaradirik, thanks for the review! Added a test to test_processor_clipseg.py as well.

@NielsRogge (Contributor) left a comment

Hi @idilsulo, thanks a lot for your contribution!

FYI, we're adding a blog post on CLIPSeg, and there @tobiascornille also showcases the usage of visual prompts. At the moment, we're doing:

# encode the image to segment and the visual prompt separately with the image processor
encoded_image = processor(images=[image], return_tensors="pt")
encoded_prompt = processor(images=[prompt], return_tensors="pt")
# predict, conditioning on the visual prompt instead of a text prompt
with torch.no_grad():
    outputs = model(**encoded_image, conditional_pixel_values=encoded_prompt.pixel_values)

to make it work, but it would indeed be cleaner to do it this way :)

However, I'll wait for @sgugger to approve, as adding the visual_prompt argument is a slight breaking change (people currently using the processor assume images is the second argument of the call method). So it might be that one needs to keep using the aforementioned approach to prepare visual prompts for the model.

@sgugger (Collaborator) left a comment

Indeed, please pass the visual_prompt as the last argument to avoid breaking the use

processor(text, images)

(without the argument names). Thanks!
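
(For concreteness, a rough sketch of the backward-compatible behavior this suggests; the image paths below are placeholders, and visual_prompt is shown only as a keyword argument rather than in any particular position:)

from PIL import Image
from transformers import CLIPSegProcessor

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
image = Image.open("scene.jpg")          # placeholder paths for illustration
prompt_image = Image.open("prompt.jpg")
text = ["background", "cat"]

# Existing positional call keeps working unchanged:
inputs_text = processor(text, [image] * 2, return_tensors="pt")

# The new argument is opt-in and passed by keyword, so nothing breaks:
inputs_visual = processor(images=image, visual_prompt=prompt_image, return_tensors="pt")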

@idilsulo (Contributor, Author) commented on Dec 20, 2022

Hello @sgugger - I am aware that the argument can be passed at the end, but this also opens the door to faulty usage by users who do not know how the CLIPSeg model processes their input.

Let's look at a working example:

import torch
import requests
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = ["background", "cat"]
images = [image] * 2
# the processor also returns the tokenized text (which should not be used)
inputs = processor(text, images, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, conditional_pixel_values=inputs.pixel_values)

What did the model process in the above line? Is it visual prompt + image or text prompt + image? It seems like it is still processing the textual prompt + image pair.

Why? Let's try to make it fail:

inputs = processor(text, images, return_tensors="pt")  # text + images, batch of 2
visual_prompt_input = processor(images=[image], return_tensors="pt")  # additional visual prompt with batch size 1
with torch.no_grad():
    outputs = model(**inputs, conditional_pixel_values=visual_prompt_input.pixel_values)

Here, the first processor call takes text and images arguments of length 2, while the second one only takes a single image. This does not make the model fail, as it still processes the text prompt + image pair rather than the visual prompt (the one passed via conditional_pixel_values).

Side note: the processor of OWL-ViT also has an additional argument (query_images) in addition to images and text. An idea might be to add visual_prompt as the third argument (as done in OWL-ViT) so that it would not break anything, as @NielsRogge suggested.
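
(For comparison, a rough sketch of the analogous OWL-ViT pattern; the image paths are placeholders, and only keyword usage is shown:)

from PIL import Image
from transformers import OwlViTProcessor

owlvit_processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
target_image = Image.open("scene.jpg")   # placeholder paths
query_image = Image.open("query.jpg")

# OWL-ViT's processor accepts an extra query_images input for image-guided detection;
# the processed prompt is returned under query_pixel_values.
owlvit_inputs = owlvit_processor(images=target_image, query_images=query_image, return_tensors="pt")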

Thanks for taking the time!

@sgugger (Collaborator) left a comment

Thanks for iterating!

@alaradirik merged commit 0ae5820 into huggingface:main on Dec 21, 2022
Commits referencing this pull request:
  • MKhalusova/transformers (Dec 28, 2022): Adds visual_prompt argument to CLIPSegProcessor to enable image-guided segmentation
  • amyeroberts/transformers (Jan 4, 2023): Adds visual_prompt argument to CLIPSegProcessor to enable image-guided segmentation
  • silverriver/transformers (Jan 6, 2023): Adds visual_prompt argument to CLIPSegProcessor to enable image-guided segmentation
  • venkat-natchi/transformers (Jan 22, 2023): Adds visual_prompt argument to CLIPSegProcessor to enable image-guided segmentation
  • miyu386/transformers (Feb 9, 2023): Adds visual_prompt argument to CLIPSegProcessor to enable image-guided segmentation