add task_type_id to BERT to support ERNIE-2.0 and ERNIE-3.0 models #18686

Merged
merged 23 commits into from Sep 9, 2022

Conversation

nghuyong
Contributor

@nghuyong nghuyong commented Aug 18, 2022

What does this PR do?

ERNIE2.0 (citation) and ERNIE3.0 (citation) are a series of powerful models based on BERT, and they are especially strong on Chinese tasks. These models introduce task_type_embeddings in the embedding layer, so this PR adds support for that feature.

The configs of the ERNIE2.0 / ERNIE3.0 models have the following two params:

...
"task_type_vocab_size": 3,
"use_task_id": true
...

The released ERNIE2.0 / ERNIE3.0 checkpoints include the weight bert.embeddings.task_type_embeddings.weight.
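As a rough illustration of how these two params could decide whether the extra embedding table is expected, here is a minimal, framework-free sketch. The helper names `uses_task_type_embeddings` and `expected_extra_weights`, and the `hidden_size` value of 768, are assumptions made for illustration; they are not code from this PR.

```python
from typing import Dict, Tuple

# Weight name quoted in the PR description for the released checkpoints.
TASK_TYPE_WEIGHT = "bert.embeddings.task_type_embeddings.weight"


def uses_task_type_embeddings(config: dict) -> bool:
    """Return True when the config asks for a task-type embedding table."""
    return bool(config.get("use_task_id", False))


def expected_extra_weights(config: dict) -> Dict[str, Tuple[int, int]]:
    """Map extra weight names to their (rows, cols) shape for this config."""
    if not uses_task_type_embeddings(config):
        return {}
    # One row per task type, one column per hidden unit.
    return {TASK_TYPE_WEIGHT: (config["task_type_vocab_size"], config["hidden_size"])}


# hidden_size is an assumed value; the other two keys come from the config excerpt.
ernie_config = {"task_type_vocab_size": 3, "use_task_id": True, "hidden_size": 768}
plain_bert_config = {"hidden_size": 768}

print(expected_extra_weights(ernie_config))
print(expected_extra_weights(plain_bert_config))
```

A config without `use_task_id` yields no extra weights, which is why a plain BERT checkpoint is unaffected in this sketch.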

Before submitting

import paddle
import torch
from paddlenlp.transformers import ErnieForMaskedLM, ErnieTokenizer
# this PR version
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('nghuyong/ernie-3.0-base-zh')
model = BertForMaskedLM.from_pretrained('nghuyong/ernie-3.0-base-zh')
input_ids = torch.tensor([tokenizer.encode(text="[MASK][MASK][MASK]是中国神魔小说的经典之作,与《三国演义》《水浒传》《红楼梦》并称为中国古典四大名著。",
                                           add_special_tokens=True)])
model.eval()
with torch.no_grad():
    predictions = model(input_ids)[0][0]
predicted_index = [torch.argmax(predictions[i]).item() for i in range(predictions.shape[0])]
predicted_token = [tokenizer._convert_id_to_token(predicted_index[i]) for i in
                   range(1, (predictions.shape[0] - 1))]
print('huggingface result')
print('predict result:\t', predicted_token)
print('[CLS] logit:\t', predictions[0].numpy())

# Same prediction with the reference PaddleNLP implementation, for comparison
tokenizer = ErnieTokenizer.from_pretrained("ernie-3.0-base-zh")
model = ErnieForMaskedLM.from_pretrained("ernie-3.0-base-zh")
inputs = tokenizer("[MASK][MASK][MASK]是中国神魔小说的经典之作,与《三国演义》《水浒传》《红楼梦》并称为中国古典四大名著。")
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
model.eval()
with paddle.no_grad():
    predictions = model(**inputs)[0]
predicted_index = [paddle.argmax(predictions[i]).item() for i in range(predictions.shape[0])]
predicted_token = [tokenizer._convert_id_to_token(predicted_index[i]) for i in
                   range(1, (predictions.shape[0] - 1))]
print('paddle result')
print('predict result:\t', predicted_token)
print('[CLS] logit:\t', predictions[0].numpy())

"""
huggingface result
predict result:	 ['西', '游', '记', '是', '中', '国', '神', '魔', '小', '说', '的', '经', '典', '之', '作', ',', '与', '《', '三', '国', '演', '义', '》', '《', '水', '浒', '传', '》', '《', '红', '楼', '梦', '》', '并', '称', '为', '中', '国', '古', '典', '四', '大', '名', '著', '。']
[CLS] logit:	 [-20.574057  -29.192085  -15.638802  ...  -1.9127564  -1.4329851  -1.8172828]

paddle result
predict result:	 ['西', '游', '记', '是', '中', '国', '神', '魔', '小', '说', '的', '经', '典', '之', '作', ',', '与', '《', '三', '国', '演', '义', '》', '《', '水', '浒', '传', '》', '《', '红', '楼', '梦', '》', '并', '称', '为', '中', '国', '古', '典', '四', '大', '名', '著', '。']
[CLS] logit:	 [-20.573637  -29.193172  -15.639115  ...  -1.9127647  -1.4330447  -1.816982 ]
"""
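The two [CLS] logit vectors above are close but not bit-identical. A small sketch of how that closeness could be checked numerically follows. It uses only the six values visible in the printed output (the middle of each vector is elided in the log), and the tolerance of 2e-3 is an assumption chosen for illustration, not a threshold stated in the PR.

```python
# Values copied from the printed output; the elided middle part is not included.
hf_logits = [-20.574057, -29.192085, -15.638802, -1.9127564, -1.4329851, -1.8172828]
paddle_logits = [-20.573637, -29.193172, -15.639115, -1.9127647, -1.4330447, -1.816982]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length lists."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


diff = max_abs_diff(hf_logits, paddle_logits)
TOLERANCE = 2e-3  # assumed threshold, not taken from the PR
print(f"max abs diff: {diff:.6f}")
print("outputs match within tolerance:", diff < TOLERANCE)
```

Small differences of this size are typical when the same weights run in two different frameworks, which is why a tolerance is used instead of exact equality.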

Who can review?

@LysandreJik

@nghuyong nghuyong changed the title add task_type_id to BERT to support ERNIE model add task_type_embedding to BERT to support ERNIE model Aug 18, 2022
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Aug 18, 2022

The documentation is not available anymore as the PR was closed or merged.

@nghuyong nghuyong changed the title add task_type_embedding to BERT to support ERNIE model add task_type_id to BERT to support ERNIE-3.0 models Aug 23, 2022
@nghuyong nghuyong changed the title add task_type_id to BERT to support ERNIE-3.0 models add task_type_id to BERT to support ERNIE-2.0 and ERNIE-3.0 models Aug 23, 2022
@LysandreJik
Member

Hey @nghuyong, thanks for your PR!

In this situation, we'd prefer to have a new model class "Ernie" rather than modifying the "Bert" model class. This will result in a larger PR, but it should be very little additional work for you.

I encourage you to follow this guide: https://github.com/huggingface/transformers/tree/main/templates/adding_a_new_model#add-new-model-like-command

It seems like it would be as simple as using the script to add a new model like BERT, but with the ERNIE name, and then applying the changes above.

@ydshieh, would you be down to help @nghuyong if they run into any problem?

Thanks a lot!

@nghuyong
Contributor Author

Thanks for your advice. I will try to do this work.

@ydshieh
Collaborator

ydshieh commented Aug 24, 2022

I agree with @LysandreJik that we should have a new model file for this. It should be fairly straightforward.

Glad to know you already have the checkpoints available! Let me know if you need any help with the PR, @nghuyong.
Looking forward to reviewing it! Thanks, @nghuyong!

@nghuyong nghuyong force-pushed the add_task_type_id branch 2 times, most recently from 11c5c73 to 1e954cd Compare August 27, 2022 15:33
@nghuyong
Contributor Author

@ydshieh @LysandreJik I have updated this PR and added the Ernie model.
Please help review it. If you have any questions, you can @ me. Thanks!

@LysandreJik
Member

Wonderful, thank you @nghuyong!

@ydshieh ydshieh self-assigned this Aug 30, 2022
@ydshieh
Collaborator

ydshieh commented Aug 31, 2022

Hi @nghuyong, Great job! I will review the PR. Currently, the failing tests are caused by

E   AttributeError: module transformers.models.ernie has no attribute BasicTokenizer

which is from your change in

src/transformers/__init__.py

I will discuss with my colleagues to see how we should handle the tokenizers.

@nghuyong
Contributor Author

nghuyong commented Sep 4, 2022

@ydshieh So, is there anything that needs to be updated now?

@ydshieh
Collaborator

ydshieh commented Sep 5, 2022

Hi @nghuyong, thanks a lot :-). I will do a full review this week.

Most of the currently failing tests could be fixed by updating your branch. You can do it as follows:

git checkout main
git pull upstream main
git checkout [YOUR-WORKING-BRANCH]
git rebase main
git push --force-with-lease

Before doing so, it would be a good idea to keep a backup of the current branch in case the commit history gets messed up.

git checkout add_task_type_id
git checkout -b add_task_type_id_backup

Once the PR branch is updated, you can also try to fix the style/quality issues by running

make style
make quality

You can check the CI results in

ci/circleci: check_code_quality
ci/circleci: check_repository_consistency

for the details and suggestions.

Let me know if you encounter any difficulty.

@nghuyong
Contributor Author

nghuyong commented Sep 5, 2022

Thanks, @ydshieh my branch has been synced with master now

@HUSTHY

HUSTHY commented Sep 6, 2022

@nghuyong
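Hello, has your ernie branch already been merged into master? In other words, can I install the latest transformers library and directly use ernie3 with the Chinese ernie3 weights you provided, without having to reinstall transformers the way it was done before: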
pip install git+https://github.com/nghuyong/transformers@add_task_type_id

@nghuyong
Contributor Author

nghuyong commented Sep 6, 2022

@HUSTHY, not yet, and the add_task_type_id branch has changed, so you cannot load ernie3 from it now.

@HUSTHY

HUSTHY commented Sep 6, 2022 via email

@HUSTHY

HUSTHY commented Sep 6, 2022

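@nghuyong OK, I thought it was already usable. The code in the branch you implemented works correctly, right? It should be fine if I just put the code locally and use it directly.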

Collaborator

@ydshieh ydshieh left a comment

Thanks a lot, @nghuyong, for adding this model. It is in great shape already, and the review went quite smoothly 🤗.

I left some comments to be addressed.

Once everything is ready, we have a few steps to take before merge. For example,

make fix-copies

is required. But we can do this at the final commit(s).

src/transformers/models/ernie/modeling_ernie.py (outdated, resolved)
src/transformers/models/ernie/modeling_ernie.py (outdated, resolved)
src/transformers/models/ernie/modeling_ernie.py (outdated, resolved)
src/transformers/models/auto/modeling_auto.py (outdated, resolved)
src/transformers/models/__init__.py (outdated, resolved)
src/transformers/__init__.py (outdated, resolved)
src/transformers/models/auto/tokenization_auto.py (outdated, resolved)
@nghuyong
Contributor Author

nghuyong commented Sep 7, 2022

@ydshieh OK, thanks!!
of course, I do not submit the change to setup.py.

@ydshieh
Collaborator

ydshieh commented Sep 7, 2022

You can ignore the test failure in ci/circleci: run_tests_hub.

@nghuyong
Contributor Author

nghuyong commented Sep 8, 2022

@ydshieh OK, run_tests_hub can be ignored, so maybe the CI has no other problems now.

@ydshieh
Collaborator

ydshieh commented Sep 8, 2022

Hi @nghuyong, I pushed a commit which adds several # Copied from ... comments.

@nghuyong
Contributor Author

nghuyong commented Sep 8, 2022

@ydshieh OK, thanks a lot

@ydshieh ydshieh requested a review from sgugger September 8, 2022 09:20
@ydshieh
Collaborator

ydshieh commented Sep 8, 2022

@sgugger This model is just BERT, but with a new argument task_type_ids, which is used to create a new embedding to be summed with the others. Time for you to have a final review 🙏
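To make "a new embedding to be summed" concrete, here is a minimal, framework-free sketch of the idea. The tiny tables, the `embed` helper, and the hidden size of 2 are hypothetical. A real model would use learned embedding matrices and further processing after the sum, which this sketch leaves out.

```python
from typing import List, Optional

Vector = List[float]

# Toy embedding tables (hypothetical values): one row per id, hidden size 2.
WORD = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
POSITION = [[0.01, 0.02], [0.03, 0.04]]
TOKEN_TYPE = [[1.0, 0.0], [0.0, 1.0]]
# Three rows, mirroring task_type_vocab_size = 3 from the PR description.
TASK_TYPE = [[10.0, 0.0], [0.0, 10.0], [10.0, 10.0]]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two vectors of the same length."""
    return [x + y for x, y in zip(a, b)]


def embed(input_ids: List[int], token_type_ids: List[int],
          task_type_ids: Optional[List[int]] = None) -> List[Vector]:
    """Sum the per-token embeddings; the task-type term is added only when ids are given."""
    out = []
    for pos, (tok, seg) in enumerate(zip(input_ids, token_type_ids)):
        vec = add(add(WORD[tok], POSITION[pos]), TOKEN_TYPE[seg])
        if task_type_ids is not None:
            # The extra ERNIE-style term: one more lookup, added to the same sum.
            vec = add(vec, TASK_TYPE[task_type_ids[pos]])
        out.append(vec)
    return out


bert_like = embed([0, 2], [0, 1])
ernie_like = embed([0, 2], [0, 1], task_type_ids=[2, 2])
print(bert_like)
print(ernie_like)
```

How the merged model behaves when task_type_ids is omitted is not covered in this thread; the sketch simply skips the extra term in that case.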

Collaborator

@sgugger sgugger left a comment

Thanks for your PR! I left a couple of comments, mainly around documentation.
Also please rename everywhere ErnieLMHeadModel to ErnieForCausalLM (we can't do the rename for BERT for backward compatibility reasons, but for new models we prefer this terminology).

src/transformers/__init__.py (outdated, resolved)
src/transformers/models/ernie/modeling_ernie.py (outdated, resolved)
docs/source/en/model_doc/ernie.mdx (resolved)
src/transformers/models/ernie/modeling_ernie.py (outdated, resolved)
src/transformers/models/ernie/modeling_ernie.py (outdated, resolved)
@sgugger
Collaborator

sgugger commented Sep 8, 2022

Thanks again for your contribution! @ydshieh can't merge until you change your "Request changes" to an approval.

@ydshieh
Collaborator

ydshieh commented Sep 9, 2022

Thank you @nghuyong! I pushed a final commit to remove the remaining ErnieLMHeadModel in a log message.

@sgugger I approved :-)

@ydshieh
Collaborator

ydshieh commented Sep 9, 2022

The failed test is unrelated to this PR.

@sgugger sgugger merged commit 22f7218 into huggingface:main Sep 9, 2022
oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request Sep 26, 2022
…uggingface#18686)

* add_ernie

* remove Tokenizer in ernie

* polish code

* format code style

* polish code

* fix style

* update doc

* make fix-copies

* change model name

* change model name

* fix dependency

* add more copied from

* rename ErnieLMHeadModel to ErnieForCausalLM
do not expose ErnieLayer
update doc

* fix

* make style

* polish code

* polish code

* fix

* fix

* fix

* fix

* fix

* final fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>