Add visual prompt to processor of CLIPSeg model #20816
Conversation
Hi @idilsulo, thank you for working on this! The changes look good to me.
Could you add a test to test_processor_clipseg.py as well?
Hi @alaradirik, thanks for the review! Added a test to test_processor_clipseg.py as well.
Hi @idilsulo, thanks a lot for your contribution!
FYI, we're adding a blog post on CLIPSeg, where @tobiascornille also showcases the usage of visual prompts. At the moment, we're doing:

```python
encoded_image = processor(images=[image], return_tensors="pt")
encoded_prompt = processor(images=[prompt], return_tensors="pt")

# predict
with torch.no_grad():
    outputs = model(**encoded_image, conditional_pixel_values=encoded_prompt.pixel_values)
```

to make it work, but it would indeed be cleaner to do it this way :)
However, I'll wait for @sgugger to approve, since adding the visual_prompt argument is a slight breaking change (people currently using the processor assume images is the second positional argument of the call method). So it may be that one needs to use the aforementioned method to prepare visual prompts for the model.
Indeed, please pass visual_prompt as the last argument to avoid breaking the usage processor(text, images) (without the argument names). Thanks!
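Why the last-argument placement matters can be sketched with a minimal, self-contained example (plain Python stand-ins for illustration, not the actual transformers code):

```python
# Callers today pass images positionally as the second argument, so a new
# parameter must be appended last rather than inserted in the middle.

def processor_breaking(text=None, visual_prompt=None, images=None):
    # visual_prompt inserted before images: old positional calls now
    # silently bind their image to visual_prompt instead of images.
    return {"text": text, "visual_prompt": visual_prompt, "images": images}

def processor_safe(text=None, images=None, visual_prompt=None):
    # visual_prompt appended last: processor(text, images) keeps working.
    return {"text": text, "images": images, "visual_prompt": visual_prompt}

print(processor_breaking("a cat", "img.png"))  # image wrongly bound to visual_prompt
print(processor_safe("a cat", "img.png"))      # image still bound to images
```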
Hello @sgugger - I am aware that the argument can be passed at the end, but this also opens the door to faulty usage by users who do not know how the CLIPSeg model processes their input. Let's look at a working example:

What did the model process in the line above? Is it visual prompt + image, or text prompt + image? It seems it is still processing the textual prompt + image pair. Why? Let's try to make it fail:

Here, the processor first computes … Side note: the processor of OWL-ViT also has an additional argument (i.e. …). Thanks for taking your time!
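The ambiguity raised above can be illustrated with a hypothetical stand-in (the function name, signature, and error behavior are assumptions for illustration, not the actual CLIPSegProcessor implementation): when a caller supplies both a text prompt and a visual prompt, the processor must decide which one conditions the model, and failing loudly is safer than silently picking one.

```python
# Hypothetical sketch of a processor that rejects ambiguous input instead
# of silently preferring one prompt type over the other.
def clipseg_processor(text=None, images=None, visual_prompt=None):
    if text is not None and visual_prompt is not None:
        # Silently preferring one prompt would hide the user's mistake.
        raise ValueError("Specify either text or visual_prompt, not both.")
    if text is None and visual_prompt is None:
        raise ValueError("Specify either text or visual_prompt.")
    return {"conditioning": "text" if text is not None else "visual"}

print(clipseg_processor(text="a cat", images="img"))          # text-conditioned
print(clipseg_processor(images="img", visual_prompt="crop"))  # visually conditioned
```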
Thanks for iterating!
Adds visual_prompt argument to CLIPSegProcessor to enable image-guided segmentation
What does this PR do?
Currently, the integrated CLIPSeg model only supports textual prompts. However, a main advantage of CLIPSeg is that one can provide visual prompts instead of textual prompts to perform semantic segmentation. For further details, refer to the original Image Segmentation Using Text and Image Prompts (CVPR 2022) paper here.
This change can easily be adapted to the current CLIPSegProcessor by providing an additional parameter that processes the visual prompt via the image processor and returns the embedding under an additional key, i.e. conditional_pixel_values. This PR complements the work done in this previous pull request.
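The behavior this PR describes can be sketched with a minimal mock (not the real CLIPSegProcessor; names of the helpers are stand-ins): the visual prompt goes through the same image-processing path as the query image, but its output is returned under the extra conditional_pixel_values key.

```python
# Mock of the processor behavior described above: images map to
# pixel_values, visual prompts map to conditional_pixel_values.

def mock_image_processor(images):
    # Stand-in for feature extraction; returns one dummy pixel array per image.
    return [[0.0] * 4 for _ in images]

def mock_clipseg_processor(images=None, visual_prompt=None):
    encoding = {}
    if images is not None:
        encoding["pixel_values"] = mock_image_processor(images)
    if visual_prompt is not None:
        encoding["conditional_pixel_values"] = mock_image_processor(visual_prompt)
    return encoding

enc = mock_clipseg_processor(images=["img"], visual_prompt=["prompt_img"])
print(sorted(enc))  # ['conditional_pixel_values', 'pixel_values']
```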
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. -> Not discussed, but only requires a minor change to fully support the CLIPSeg model.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR.
Feel free to tag members/contributors who may be interested in your PR. @NielsRogge @sgugger @alaradirik