
Add Chinese-CLIP implementation #20368

Merged
merged 123 commits into huggingface:main on Nov 30, 2022

Conversation

@yangapku (Contributor) commented Nov 22, 2022

What does this PR do?

This PR adds the Chinese-CLIP model to the Transformers repo. The Chinese-CLIP model was introduced in Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese. Chinese CLIP is an implementation and adaptation of CLIP (Radford et al., 2021) trained on a large-scale dataset of Chinese image-text pairs. It can perform Chinese-based cross-modal retrieval and can also serve as a vision backbone for tasks such as zero-shot image classification and open-domain object detection. This model was contributed by OFA-Sys; the original GitHub repo of Chinese-CLIP can be found at this link, and we have released our model weights on the Hugging Face Model Hub.

Compared with the original OpenAI CLIP, we changed the text encoder to a Chinese RoBERTa encoder, so we reimplemented the config, modeling, and preprocessor modules of Chinese-CLIP. The necessary unit tests and documentation have been added.
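For reference, a minimal usage sketch (illustrative, not part of the PR itself; it assumes the released OFA-Sys/chinese-clip-vit-base-patch16 checkpoint and the class names this PR introduces):

import torch
import requests
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # any test image works
image = Image.open(requests.get(url, stream=True).raw)
texts = ["一只猫", "一只狗"]  # "a cat", "a dog"

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-text similarity over the candidate texts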

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@yangapku (Contributor, Author):

@ydshieh All the comments mentioned above have been addressed.

@sgugger (Collaborator) left a comment


A couple of final comments on my side and it should be good to merge!

yangapku and others added 3 commits November 29, 2022 22:07
…p.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
…p.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@yangapku (Contributor, Author):

@ydshieh Hi, is the PR ready to be merged now? Thank you very much! ❤️

@ydshieh (Collaborator) commented Nov 29, 2022

@sgugger I feel there must be a very good reason that we have CLIPTextTransformer and CLIPVisionTransformer, and use these components in CLIPTextModel, CLIPVisionModel, and CLIPModel.

(Potentially to avoid nesting a CLIPPreTrainedModel inside another CLIPPreTrainedModel, which might cause some issues, at least if we ever want a TF port.)

Do you think we need to avoid this line

self.text_model = ChineseCLIPTextModel(text_config, add_pooling_layer=False)

and create a ChineseCLIPTextTransformer to use instead?
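(For context, a rough sketch of the two composition patterns under discussion; skeletons with elided bodies, not the actual modeling code:)

import torch.nn as nn
from transformers import PreTrainedModel

# CLIP-style pattern: the inner text component is a plain nn.Module,
# so no PreTrainedModel is nested inside another PreTrainedModel.
class CLIPTextTransformer(nn.Module):
    ...

class CLIPModel(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.text_model = CLIPTextTransformer(config.text_config)

# Pattern in this PR: the inner text component is itself a PreTrainedModel.
class ChineseCLIPTextModel(PreTrainedModel):
    def __init__(self, config, add_pooling_layer=True):
        super().__init__(config)
        ...

class ChineseCLIPModel(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.text_model = ChineseCLIPTextModel(config.text_config, add_pooling_layer=False)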

@ydshieh (Collaborator) commented Nov 29, 2022

Hi @yangapku, other than the above comment, LGTM! But let's wait for @sgugger's response.

There are a few places where I believe we can still use # Copied from (probably with some tweaks); I can help with this before merge.

@sgugger (Collaborator) commented Nov 29, 2022

@ydshieh I think this was done that way just to stay more aligned with the original checkpoints in the CLIP case. Here it works fine with the checkpoint, so I wouldn't over-complicate things.

@ydshieh (Collaborator) commented Nov 29, 2022

@yangapku Before we merge, could you run

RUN_SLOW=1 python -m pytest -v tests/models/chinese_clip/

I got 5 failures. You can focus on the first and last ones for the moment. Let me know if you need help fixing them 🙏

FAILED tests/models/chinese_clip/test_modeling_chinese_clip.py::ChineseCLIPTextModelTest::test_model_from_pretrained - AttributeError: 'ChineseCLIPConfig' object has no attribute 'vocab_size'
FAILED tests/models/chinese_clip/test_modeling_chinese_clip.py::ChineseCLIPModelTest::test_torchscript_output_attentions - AssertionError: Items in the second set but not the first:
FAILED tests/models/chinese_clip/test_modeling_chinese_clip.py::ChineseCLIPModelTest::test_torchscript_output_hidden_state - AssertionError: Items in the second set but not the first:
FAILED tests/models/chinese_clip/test_modeling_chinese_clip.py::ChineseCLIPModelTest::test_torchscript_simple - AssertionError: Items in the second set but not the first:
FAILED tests/models/chinese_clip/test_modeling_chinese_clip.py::ChineseCLIPModelIntegrationTest::test_inference - OSError: Can't load tokenizer for 'OFA-Sys/chinese-clip-vit-base-patch16'. If you were trying to load it
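(For reference, the first failure points at the composite config: ChineseCLIPConfig follows the CLIP-style layout with text_config/vision_config sub-configs, so text-level attributes are not exposed at the top level. A hypothetical reproduction:)

from transformers import ChineseCLIPConfig

config = ChineseCLIPConfig()
print(config.text_config.vocab_size)  # fine: vocab_size lives on the text sub-config
print(config.vocab_size)  # AttributeError: 'ChineseCLIPConfig' object has no attribute 'vocab_size'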

@yangapku (Contributor, Author) commented Nov 30, 2022


Okay I will try to fix them today.

@yangapku (Contributor, Author) commented Nov 30, 2022


@ydshieh The first and last failing cases have been fixed. Now only the TorchScript test failures remain. Meanwhile, to fix the first case, I had to remove the copied-from comment for ChineseCLIPTextModel, since it has diverged from BertModel with our custom config_class ChineseCLIPTextConfig.
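(A sketch of the change just described, with elided class bodies; the "Copied from" marker is what make fix-copies uses to keep a class in sync with its source, so leaving it in place would revert the custom config_class:)

# Before: fix-copies keeps this class in sync with BertModel, reverting local edits.
# # Copied from transformers.models.bert.modeling_bert.BertModel with Bert->ChineseCLIPText
# class ChineseCLIPTextModel(ChineseCLIPPreTrainedModel):
#     ...

# After: no marker, so the class is free to diverge from BertModel.
class ChineseCLIPTextModel(ChineseCLIPPreTrainedModel):
    config_class = ChineseCLIPTextConfig
    ...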

@yangapku (Contributor, Author):

@ydshieh Hi, can the PR be merged now? Do I have to fix the TorchScript-related test cases? If so, I will need more help, since I am not very familiar with TorchScript 😢.

@ydshieh (Collaborator) commented Nov 30, 2022

I will take a look at those 3 tests, @yangapku.

@ydshieh (Collaborator) commented Nov 30, 2022

@yangapku I pushed the remaining fix. Will merge once the final CI is green 🚀 🚀 🚀

Thank you very much for your work! 💯

@ydshieh (Collaborator) left a comment


Thanks again!

@ydshieh merged commit 7217640 into huggingface:main on Nov 30, 2022
@yangapku (Contributor, Author) commented Dec 1, 2022

Thank you very much for your brilliant support! @ydshieh @sgugger

@ydshieh (Collaborator) commented Dec 1, 2022

Hi @yangapku, just a follow-up. From your branch, I see that the file

convert_chinese_clip_original_pytorch_to_hf.py

was last modified on Nov 22 (the change on Nov 29 doesn't count). However, the modeling file has changed quite a lot since then due to our review comments. I just want to make sure the conversion script still works correctly, and that the original checkpoints and the converted HF checkpoints still produce the same outputs on some test examples.

It would be super nice if you could double-check, but it's your call; it's just a suggestion.
(It's always good to make sure users get the right checkpoints to use :-))
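(A minimal parity check along these lines; the checkpoint path is a placeholder and original_logits is assumed to come from running the same example through the original OFA-Sys codebase:)

import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

converted = "path/to/converted_hf_checkpoint"  # output of the conversion script
model = ChineseCLIPModel.from_pretrained(converted).eval()
processor = ChineseCLIPProcessor.from_pretrained(converted)

image = Image.open("test.jpg")  # a fixed test image
inputs = processor(text=["一只猫"], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    hf_logits = model(**inputs).logits_per_image

# original_logits: the same example run through the original codebase (hypothetical here);
# the two should agree within floating-point tolerance.
torch.testing.assert_close(hf_logits, original_logits, rtol=0.0, atol=1e-4)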

@yangapku (Contributor, Author) commented Dec 1, 2022

@ydshieh Hi, I have verified that the conversion script works correctly 😄. In fact, today we also updated the other 3 model scales (ViT-L/14, ViT-L/14@336px, ViT-H/14) on our HF model hub, and I used this script to convert our original models to HF format. After the conversion, I tested all the converted HF checkpoints (the pytorch_model.bin files); all of them work as expected.

@ydshieh (Collaborator) commented Dec 1, 2022

Thank you, @yangapku!

amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Dec 7, 2022
* init chinese-clip model from clip

* init model tests and docs

* implement chinese-clip into hf

* implement chinese-clip into hf

* implement chinese-clip into hf

* implement chinese-clip into hf

* implement chinese-clip into hf

* update usecase example in model implementation

* fix codestyle

* fix model_type typo in readme

* add placeholder in doc

* add placeholder in doc

* update the init script

* update usecase

* fix codestyle

* update testcase

* update testcase

* update testcase

* update testcase

* update testcase

* update testcase

* update testcase

* update testcase

* update testcase

* update testcase

* update testcase

* update testcase

* forward the convert_rgb

* update testcase

* update testcase

* update testcase

* merge the recent update from clip about model_input_name property

* update the doc

* update the doc

* update the doc

* update the doc

* remove unused imports

* reformat code style

* update the doc

* fix isort style

* bypass a weird failed unit test which is unrelated with my PR

* update the doc

* implement independent vision config class

* implement independent vision model class

* fix refactor bug

* fix refactor bug

* fix refactor bug

* make style

* fix refactor bug

* make style

* fix refactor bug

* fix refactor bug

* make style

* fix refactor bug

* fix refactor bug

* doc-build restyle

* implement independent text config class

* implement independent text model class

* implement independent text model class

* make style

* make fix-copies

* fix refactor bug

* fix refactor bug

* fix refactor bug

* fix refactor bug

* fix refactor bug

* fix refactor bug

* fix refactor bug

* fix refactor bug

* fix refactor bug

* fix refactor bug

* make style

* update doc

* black and isort

* update doc

* Update src/transformers/models/chinese_clip/configuration_chinese_clip.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/auto/tokenization_auto.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* modify the model type from chinese-clip to chinese_clip

* format the example comment of ChineseCLIPVisionConfig

* correct the copyright comment

* fix the tokenizer specification

* add copied from for loss function

* remove unused class

* update CHINESE_CLIP_TEXT_INPUTS_DOCSTRING

* update CHINESE_CLIP_INPUTS_DOCSTRING

* update doc

* update doc

* update code comment in config

* update copied from statement

* make style

* rename the doc file

* add copied statement

* remove unused attention_mask, causal_attention_mask in ChineseCLIPVisionEncoder

* remove ChineseCLIPTextPreTrainedModel

* fix bug

* fix bug

* fix bug

* update doc

* make style

* Update src/transformers/models/chinese_clip/configuration_chinese_clip.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/chinese_clip/configuration_chinese_clip.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* update ChineseCLIPImageProcessor in image_processing_auto

* fix config_class of chinesecliptextmodel

* fix the test case

* update the docs

* remove the copied from comment for ChineseCLIPTextModel, since it has diverged from BertModel with customed config_class

* update the testcase

* final fix

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
mpierrau pushed a commit to mpierrau/transformers that referenced this pull request Dec 15, 2022
miyu386 pushed a commit to miyu386/transformers that referenced this pull request Feb 9, 2023