
Add visual prompt to processor of CLIPSeg model #20816

Merged: 7 commits into huggingface:main, Dec 21, 2022

Conversation

@idilsulo (Contributor) commented on Dec 18, 2022

What does this PR do?

Currently, the integrated CLIPSeg model only supports textual prompts. However, a key advantage of CLIPSeg is that one can provide visual prompts instead of textual prompts to perform semantic segmentation. For further details, refer to the original paper, Image Segmentation Using Text and Image Prompts (CVPR 2022).

This change can easily be added to the current CLIPSegProcessor by providing one additional parameter: the processor runs the visual prompt through the image processor and returns the result under an additional key, conditional_pixel_values.
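
For illustration, a minimal sketch of the intended usage (hedged: the image paths are placeholders, the visual_prompt argument is passed by keyword, and the exact merged signature may differ):

import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("scene.jpg")          # image to segment (placeholder path)
prompt_image = Image.open("prompt.jpg")  # visual prompt, e.g. an image of the target object (placeholder path)

# The processor runs the visual prompt through the image processor and returns it
# under the additional key conditional_pixel_values.
inputs = processor(images=image, visual_prompt=prompt_image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)  # forward accepts both pixel_values and conditional_pixel_values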

This PR complements the work done in this previous pull request.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case. -> Not discussed, but this only requires a minor change to fully support the CLIPSeg model.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests? -> Previous tokenizer and image processor tests apply.

Who can review?

Anyone in the community is free to review the PR.

Feel free to tag members/contributors who may be interested in your PR. @NielsRogge @sgugger @alaradirik

@HuggingFaceDocBuilderDev commented on Dec 18, 2022

The documentation is not available anymore as the PR was closed or merged.

@alaradirik (Contributor) left a comment

Hi @idilsulo, thank you for working on this! The changes look good to me.

Could you add a test to test_processor_clipseg.py as well?

cc @NielsRogge @sgugger

@idilsulo (Contributor, Author)

Hi @alaradirik, thanks for the review! Added a test to test_processor_clipseg.py as well.

@NielsRogge (Contributor) left a comment

Hi @idilsulo, thanks a lot for your contribution!

FYI, we're adding a blog post on CLIPSeg, and there @tobiascornille also showcases the usage of visual prompts. At the moment, we're doing:

# encode the image to segment and the visual prompt separately with the image processor
encoded_image = processor(images=[image], return_tensors="pt")
encoded_prompt = processor(images=[prompt], return_tensors="pt")
# predict, conditioning on the visual prompt instead of a text prompt
with torch.no_grad():
    outputs = model(**encoded_image, conditional_pixel_values=encoded_prompt.pixel_values)

to make it work, but it would indeed be cleaner to do it this way :)

However, I'll wait for @sgugger to approve, as adding the visual_prompt argument is a slight breaking change (people currently using the processor assume images is the second argument of the call method). So it might be that one needs to keep using the aforementioned approach to prepare visual prompts for the model.

@sgugger (Collaborator) left a comment

Indeed, please pass the visual_prompt as the last argument to avoid breaking the use

processor(text, images)

(without the argument names). Thanks!
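
(For concreteness, a rough sketch of the backward-compatible behavior this suggests; the image paths below are placeholders, and visual_prompt is shown only as a keyword argument rather than in any particular position:)

from PIL import Image
from transformers import CLIPSegProcessor

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
image = Image.open("scene.jpg")          # placeholder paths for illustration
prompt_image = Image.open("prompt.jpg")
text = ["background", "cat"]

# Existing positional call keeps working unchanged:
inputs_text = processor(text, [image] * 2, return_tensors="pt")

# The new argument is opt-in and passed by keyword, so nothing breaks:
inputs_visual = processor(images=image, visual_prompt=prompt_image, return_tensors="pt")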

@idilsulo (Contributor, Author) commented on Dec 20, 2022

Hello @sgugger - I am aware that the argument can be passed at the end, but this also opens the door to faulty usage by users who do not know how the CLIPSeg model processes their input.

Let's look at a working example:

import torch
import requests
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = ["background", "cat"]
images = [image] * 2
# the processor also returns the tokenized text (which should not be used)
inputs = processor(text, images, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, conditional_pixel_values=inputs.pixel_values)

What did the model process in the above line? Is it visual prompt + image or text prompt + image? It seems like it is still processing the textual prompt + image pair.

Why? Let's try to make it fail:

inputs = processor(text, images, return_tensors="pt")  # text + images, batch of 2
visual_prompt_input = processor(images=[image], return_tensors="pt")  # additional visual prompt with batch size 1
with torch.no_grad():
    outputs = model(**inputs, conditional_pixel_values=visual_prompt_input.pixel_values)

Here, the first processor call takes text and images arguments of length 2, while the second one only takes a single image. This does not make the model fail, as it still processes the text prompt + image pair rather than the visual prompt (the one passed via conditional_pixel_values).

Side note: the processor of OWL-ViT also has an additional argument (query_images) in addition to images and text. An idea might be to add visual_prompt as the third argument (as done in OWL-ViT) so that it would not break anything, as @NielsRogge suggested.
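
(For comparison, a rough sketch of the analogous OWL-ViT pattern; the image paths are placeholders, and only keyword usage is shown:)

from PIL import Image
from transformers import OwlViTProcessor

owlvit_processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
target_image = Image.open("scene.jpg")   # placeholder paths
query_image = Image.open("query.jpg")

# OWL-ViT's processor accepts an extra query_images input for image-guided detection;
# the processed prompt is returned under query_pixel_values.
owlvit_inputs = owlvit_processor(images=target_image, query_images=query_image, return_tensors="pt")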

Thanks for taking the time!

@sgugger (Collaborator) left a comment

Thanks for iterating!

@alaradirik merged commit 0ae5820 into huggingface:main on Dec 21, 2022
Commits referencing this pull request:
  • MKhalusova/transformers (Dec 28, 2022): Adds visual_prompt argument to CLIPSegProcessor to enable image-guided segmentation
  • amyeroberts/transformers (Jan 4, 2023): Adds visual_prompt argument to CLIPSegProcessor to enable image-guided segmentation
  • silverriver/transformers (Jan 6, 2023): Adds visual_prompt argument to CLIPSegProcessor to enable image-guided segmentation
  • venkat-natchi/transformers (Jan 22, 2023): Adds visual_prompt argument to CLIPSegProcessor to enable image-guided segmentation
  • miyu386/transformers (Feb 9, 2023): Adds visual_prompt argument to CLIPSegProcessor to enable image-guided segmentation