add task_type_id to BERT to support ERNIE-2.0 and ERNIE-3.0 models #18686

nghuyong · 2022-08-18T17:43:19Z

What does this PR do?

ERNIE2.0 and ERNIE3.0 are a series of powerful BERT-based models that perform especially well on Chinese tasks. They introduce task_type_embeddings in the embedding layer, and this PR adds support for that feature.

The configs of the ERNIE2.0 / ERNIE3.0 models have the following two params:

...
"task_type_vocab_size": 3,
"use_task_id": true
...

The released ERNIE2.0 / ERNIE3.0 checkpoints include the weight bert.embeddings.task_type_embeddings.weight.
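As a rough illustration of how the two config fields relate to that extra weight, here is a minimal pure-Python sketch. The helper names `needs_task_type_embeddings` and `task_type_table_shape` and the `hidden_size` value of 768 are assumptions made up for this example; they are not taken from the PR or from the conversion script.

```python
import json

# Config fragment containing only the two ERNIE-specific fields quoted above.
config = json.loads('{"task_type_vocab_size": 3, "use_task_id": true}')


def needs_task_type_embeddings(cfg: dict) -> bool:
    """Return True when the config asks for a task-type embedding table."""
    return bool(cfg.get("use_task_id", False)) and cfg.get("task_type_vocab_size", 0) > 0


def task_type_table_shape(cfg: dict, hidden_size: int) -> tuple:
    """Shape of the table: one row per task type, one column per hidden unit."""
    return (cfg["task_type_vocab_size"], hidden_size)


# hidden_size=768 is an assumed value, used only for illustration.
if needs_task_type_embeddings(config):
    print("bert.embeddings.task_type_embeddings.weight", task_type_table_shape(config, 768))
```

With `use_task_id` set to false, the sketch would report that no task-type table is expected, which is the plain BERT case.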

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case. https://github.com/huggingface/transformers/issues?q=is%3Aissue+ernie
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?
I wrote a script to convert the officially released ERNIE3.0 model (PaddlePaddle version), and I have checked that the model outputs before and after conversion are consistent (with task_type_embedding added):

import paddle
import torch
from paddlenlp.transformers import ErnieForMaskedLM, ErnieTokenizer
# this PR version
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('nghuyong/ernie-3.0-base-zh')
model = BertForMaskedLM.from_pretrained('nghuyong/ernie-3.0-base-zh')
input_ids = torch.tensor([tokenizer.encode(text="[MASK][MASK][MASK]是中国神魔小说的经典之作，与《三国演义》《水浒传》《红楼梦》并称为中国古典四大名著。",
                                           add_special_tokens=True)])
model.eval()
with torch.no_grad():
    predictions = model(input_ids)[0][0]
predicted_index = [torch.argmax(predictions[i]).item() for i in range(predictions.shape[0])]
predicted_token = [tokenizer._convert_id_to_token(predicted_index[i]) for i in
                   range(1, (predictions.shape[0] - 1))]
print('huggingface result')
print('predict result:\t', predicted_token)
print('[CLS] logit:\t', predictions[0].numpy())
tokenizer = ErnieTokenizer.from_pretrained("ernie-3.0-base-zh")
model = ErnieForMaskedLM.from_pretrained("ernie-3.0-base-zh")
inputs = tokenizer("[MASK][MASK][MASK]是中国神魔小说的经典之作，与《三国演义》《水浒传》《红楼梦》并称为中国古典四大名著。")
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
model.eval()
with paddle.no_grad():
    predictions = model(**inputs)[0]
predicted_index = [paddle.argmax(predictions[i]).item() for i in range(predictions.shape[0])]
predicted_token = [tokenizer._convert_id_to_token(predicted_index[i]) for i in
                   range(1, (predictions.shape[0] - 1))]
print('paddle result')
print('predict result:\t', predicted_token)
print('[CLS] logit:\t', predictions[0].numpy())

"""
huggingface result
predict result:	 ['西', '游', '记', '是', '中', '国', '神', '魔', '小', '说', '的', '经', '典', '之', '作', '，', '与', '《', '三', '国', '演', '义', '》', '《', '水', '浒', '传', '》', '《', '红', '楼', '梦', '》', '并', '称', '为', '中', '国', '古', '典', '四', '大', '名', '著', '。']
[CLS] logit:	 [-20.574057  -29.192085  -15.638802  ...  -1.9127564  -1.4329851  -1.8172828]

paddle result
predict result:	 ['西', '游', '记', '是', '中', '国', '神', '魔', '小', '说', '的', '经', '典', '之', '作', '，', '与', '《', '三', '国', '演', '义', '》', '《', '水', '浒', '传', '》', '《', '红', '楼', '梦', '》', '并', '称', '为', '中', '国', '古', '典', '四', '大', '名', '著', '。']
[CLS] logit:	 [-20.573637  -29.193172  -15.639115  ...  -1.9127647  -1.4330447  -1.816982 ]
"""
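The consistency claim can also be checked numerically instead of by eye. The sketch below compares only the six [CLS] logit values that are visible in the printed output (the middle of each vector is elided there), and the tolerance of 2e-3 is an assumed threshold chosen for this illustration, not one stated in the PR.

```python
import math

# The six [CLS] logit values visible in the printed output; the elided middle is not guessed.
hf_logits = [-20.574057, -29.192085, -15.638802, -1.9127564, -1.4329851, -1.8172828]
paddle_logits = [-20.573637, -29.193172, -15.639115, -1.9127647, -1.4330447, -1.816982]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


def all_close(a, b, abs_tol=2e-3):
    """True when every pair of values differs by at most abs_tol (assumed tolerance)."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=0.0, abs_tol=abs_tol) for x, y in zip(a, b)
    )


print("max abs diff:", max_abs_diff(hf_logits, paddle_logits))
print("within tolerance:", all_close(hf_logits, paddle_logits))
```

Differences of this size are what one would typically expect from the same weights running under two different frameworks, though the acceptable threshold is a judgment call.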

Who can review?

@LysandreJik

HuggingFaceDocBuilderDev · 2022-08-18T17:54:41Z

The documentation is not available anymore as the PR was closed or merged.

LysandreJik · 2022-08-24T09:17:09Z

Hey @nghuyong, thanks for your PR!

In this situation, we'd prefer a new model class "Ernie" over modifying the "Bert" model class. This will result in a larger PR, but it should be very little additional work for you.

I encourage you to follow this guide: https://github.com/huggingface/transformers/tree/main/templates/adding_a_new_model#add-new-model-like-command

It seems like it would be as simple as using the script to add a new model like BERT, but with the ERNIE name, and then applying the changes above.

@ydshieh, would you be down to help @nghuyong if they run into any problem?

Thanks a lot!

nghuyong · 2022-08-24T09:24:06Z

Thanks for your advice. I will try to do this work.

ydshieh · 2022-08-24T09:25:23Z

I agree with @LysandreJik that we should have a new model file for this, and it should be fairly straightforward.

Glad to know you already have the checkpoints available! Let me know if you need any help with the PR, @nghuyong.
Looking forward to reviewing it! Thanks, @nghuyong!

nghuyong · 2022-08-27T15:37:47Z

@ydshieh @LysandreJik I have updated this PR and added the Ernie model.
Please help review it. If you have any questions, you can mention me. Thanks!

LysandreJik · 2022-08-30T12:27:20Z

Wonderful, thank you @nghuyong!

ydshieh · 2022-08-31T15:51:26Z

Hi @nghuyong, great job! I will review the PR. Currently, the failing tests are caused by

E   AttributeError: module transformers.models.ernie has no attribute BasicTokenizer

which is from your change in

src/transformers/__init__.py

I will discuss with my colleagues how we should handle the tokenizers.

src/transformers/__init__.py

src/transformers/models/ernie/modeling_ernie.py

nghuyong · 2022-09-04T04:22:35Z

@ydshieh So, is there anything that needs to be updated now?

ydshieh · 2022-09-05T13:41:51Z

Hi @nghuyong, thanks a lot :-). I will do a full review this week.

Most of the currently failing tests can be fixed by updating your branch. You can do so as follows:

git checkout main
git pull upstream main
git checkout [YOUR-WORKING-BRANCH]
git rebase main
git push --force-with-lease

Before doing so, it would be a good idea to keep a backup of the current branch in case the commit history gets messed up.

git checkout add_task_type_id
git checkout -b add_task_type_id_backup

Once the PR branch is updated, you can also try to fix the style/quality issues by running

make style
make quality

You can check the CI results in

ci/circleci: check_code_quality
ci/circleci: check_repository_consistency

for the details and suggestions.

Let me know if you encounter any difficulty.

nghuyong · 2022-09-05T14:58:56Z

Thanks, @ydshieh. My branch has been synced with master now.

HUSTHY · 2022-09-06T03:46:06Z

@nghuyong
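Hello, has your ernie branch already been merged into master? In other words, can I install the latest transformers library and directly use ernie3 with the Chinese ernie3 weights you provided, without having to reinstall transformers the previous way: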
pip install git+https://github.com/nghuyong/transformers@add_task_type_id

nghuyong · 2022-09-06T03:57:05Z

@HUSTHY, not yet. The add_task_type_id branch has also changed, so you cannot load ernie3 from it right now.

HUSTHY · 2022-09-06T03:57:25Z

Received! Thank you! (Huang Yang)

HUSTHY · 2022-09-06T03:59:18Z

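@nghuyong OK, I thought it was already usable. The code you implemented on that branch works fine, right? I should be OK just keeping a copy of the code locally.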

ydshieh

Thanks a lot @nghuyong for adding this model. It is in great shape already, and the review is quite smooth 🤗 .

I left some comments to be addressed.

Once everything is ready, we have a few steps to take before merging. For example,

make fix-copies

is required. But we can do this at the final commit(s).

src/transformers/models/ernie/modeling_ernie.py

src/transformers/models/auto/modeling_auto.py

src/transformers/models/__init__.py

src/transformers/__init__.py

src/transformers/models/auto/tokenization_auto.py

src/transformers/models/ernie/configuration_ernie.py

nghuyong · 2022-09-07T18:31:27Z

@ydshieh OK, thanks!!
Of course, I did not submit the change to setup.py.

ydshieh · 2022-09-07T18:51:14Z

You can ignore the test failure in ci/circleci: run_tests_hub.

nghuyong · 2022-09-08T02:35:11Z

@ydshieh OK, run_tests_hub can be ignored, so the CI should have no remaining problems now.

ydshieh · 2022-09-08T08:44:36Z

Hi @nghuyong, I pushed a commit that adds several # copied from ... comments.

nghuyong · 2022-09-08T08:58:35Z

@ydshieh OK, thanks a lot

ydshieh · 2022-09-08T09:22:37Z

@sgugger This model is just BERT, but with a new argument task_type_ids, which is used to look up a new embedding that is summed with the others. Time for you to have a final review 🙏
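To make the "new embedding to be summed" idea concrete, here is a minimal pure-Python sketch with toy lookup tables. It is not the code in modeling_ernie.py: the function name `embed`, the table layout, and all the numbers are assumptions for illustration, and the normalization and dropout that follow the sum in BERT are left out.

```python
def embed(input_ids, token_type_ids, task_type_ids, tables, use_task_id=True):
    """Sum word, position, and token-type vectors per token, plus an optional task-type vector."""
    outputs = []
    for pos, (tok, seg, task) in enumerate(zip(input_ids, token_type_ids, task_type_ids)):
        # Plain BERT-style sum of the three standard embeddings.
        vec = [
            w + p + s
            for w, p, s in zip(
                tables["word"][tok], tables["position"][pos], tables["token_type"][seg]
            )
        ]
        # ERNIE-style extra term, looked up with task_type_ids.
        if use_task_id:
            vec = [v + t for v, t in zip(vec, tables["task_type"][task])]
        outputs.append(vec)
    return outputs


# Toy tables with hidden size 2; task_type has 3 rows to mirror task_type_vocab_size = 3.
tables = {
    "word": [[1, 0], [0, 1], [2, 2]],
    "position": [[10, 10], [20, 20]],
    "token_type": [[0, 0], [100, 100]],
    "task_type": [[0, 0], [1000, 1000], [2000, 2000]],
}

with_task = embed([2, 0], [0, 1], [1, 1], tables)
without_task = embed([2, 0], [0, 1], [1, 1], tables, use_task_id=False)
print(with_task)
print(without_task)
```

Setting `use_task_id=False` reduces the sketch to the ordinary BERT sum, which is why the two models can share almost all of their code.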

sgugger

Thanks for your PR! I left a couple of comments, mainly around documentation.
Also please rename everywhere ErnieLMHeadModel to ErnieForCausalLM (we can't do the rename for BERT for backward compatibility reasons, but for new models we prefer this terminology).

src/transformers/__init__.py

src/transformers/models/ernie/modeling_ernie.py

docs/source/en/model_doc/ernie.mdx

src/transformers/models/ernie/modeling_ernie.py

do not expose ErnieLayer update doc

sgugger · 2022-09-08T17:09:05Z

Thanks again for your contribution! @ydshieh, this can't be merged until you change your "Request changes" to an approval.

ydshieh · 2022-09-09T08:41:07Z

Thank you @nghuyong! I pushed a final commit to remove the remaining ErnieLMHeadModel in a log message.

@sgugger I approved :-)

ydshieh · 2022-09-09T09:15:11Z

The failed test is unrelated to this PR.

nghuyong changed the title ~~add task_type_id to BERT to support ERNIE model~~ add task_type_embedding to BERT to support ERNIE model Aug 18, 2022

nghuyong changed the title ~~add task_type_embedding to BERT to support ERNIE model~~ add task_type_id to BERT to support ERNIE-3.0 models Aug 23, 2022

nghuyong changed the title ~~add task_type_id to BERT to support ERNIE-3.0 models~~ add task_type_id to BERT to support ERNIE-2.0 and ERNIE-3.0 models Aug 23, 2022

nghuyong force-pushed the add_task_type_id branch 2 times, most recently from 11c5c73 to 1e954cd Compare August 27, 2022 15:33

LysandreJik requested a review from ydshieh August 30, 2022 12:27

ydshieh self-assigned this Aug 30, 2022