
Add RocBert #20013

Merged: 18 commits into huggingface:main on Nov 8, 2022

Conversation

@sww9370 (Contributor) commented Nov 2, 2022

This PR adds the RocBert model.

RocBert is a pre-trained Chinese language model that is designed from the ground up to be robust against maliciously crafted adversarial texts such as misspellings, homograph attacks, and other forms of deception.
This property is crucial in downstream applications like content moderation.

RocBert differs from the classic Bert architecture in the following ways:

  • besides token ids, the model also takes phonetic features and glyph features as input (see the sketch below)
  • the model is also pre-trained with a contrastive learning objective that stabilizes the feature space against synthetic attacks

Since the model structure and tokenizer are quite different from existing implementations, we would like to submit this PR to add a new model class.
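For illustration, a minimal sketch of how these extra inputs flow through the model (the class names and checkpoint are the ones introduced in this PR; treat the exact call signatures as assumptions until the docs are finalized):

from transformers import RoCBertModel, RoCBertTokenizer

tokenizer = RoCBertTokenizer.from_pretrained("weiweishi/roc-bert-base-zh")
model = RoCBertModel.from_pretrained("weiweishi/roc-bert-base-zh")

# Besides input_ids, the tokenizer also returns input_shape_ids (glyph features)
# and input_pronunciation_ids (phonetic features); all three feed the embedding layer.
inputs = tokenizer("巴黎是法国的首都", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)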

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev commented Nov 2, 2022

The documentation is not available anymore as the PR was closed or merged.

@sgugger (Collaborator) left a comment

Thanks a lot for adding this new model! This is very clean, and a nice use of our Copied from mechanism!
Most of my comments are around the name: it seems the paper names the model RoCBert so let's respect the casing :-)

Review threads were opened on (all now resolved):

  • src/transformers/__init__.py
  • src/transformers/models/auto/configuration_auto.py
  • src/transformers/models/roc_bert/configuration_roc_bert.py
  • src/transformers/models/roc_bert/modeling_roc_bert.py
  • src/transformers/models/roc_bert/tokenization_roc_bert.py
  • tests/models/roc_bert/test_modeling_roc_bert.py
  • tests/models/roc_bert/test_tokenization_roc_bert.py
@sww9370 (Contributor, Author) commented Nov 3, 2022

@sgugger Thanks for your suggestions, I have fixed them ~

@sww9370 (Contributor, Author) commented Nov 7, 2022

@ArthurZucker Hi, I have updated the code according to sgugger's advice. Could you please review it? Thanks!

@ArthurZucker (Collaborator):

Yes! Doing this asap 🤗 sorry for the delay

@ArthurZucker (Collaborator) left a comment

Hey! Really nice model! Great work, it's clean and interesting!
A few comments here and there; make sure the doctests pass, and I would love to see a more detailed generation test to make sure that the generate function works properly in an integration test.

PS: really loved the use of Copied from, thanks for your hard work 😄

src/transformers/models/roc_bert/configuration_roc_bert.py (two resolved threads)
device = labels_input_ids.device

target_inputs = torch.clone(labels_input_ids)
target_inputs[target_inputs == -100] = 0
Collaborator:
Since we have a config.pad_token_id, let's use it (unless it is a different padding token).

Collaborator:
Not sure if targets use this? Usually we rely on the -100 index since it's ignored by PyTorch loss functions.

Contributor (Author):
> Not sure if targets use this? Usually we rely on the -100 index since it's ignored by PyTorch loss functions.

In the RoCBertForPreTraining model, when computing the sim_matrix between (labels_input_ids, attack_ids), we turn -100 into config.pad_token_id so we can get its pooled_embed from roc_bert.
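For context, a minimal sketch of the pattern under discussion (the helper name is hypothetical):

import torch

def prepare_target_inputs(labels_input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    # -100 positions are loss-ignore markers, not valid vocabulary ids, so they
    # must be replaced before the ids go through an embedding lookup.
    target_inputs = labels_input_ids.clone()
    target_inputs[target_inputs == -100] = pad_token_id  # rather than the hard-coded 0
    return target_inputs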

src/transformers/models/roc_bert/modeling_roc_bert.py (resolved)
src/transformers/models/roc_bert/tokenization_roc_bert.py (resolved)
with open(word_pronunciation_file, "r", encoding="utf8") as in_file:
    self.word_pronunciation = json.load(in_file)

self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])
Collaborator:
Most probably a nit, cc @sgugger: not sure how we feel about the collections dependency, as we usually do this with native Python.

Collaborator:
collections is in the standard library, so not a problem for me.
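For reference, a minimal sketch of the two equivalent spellings (toy vocab purely for illustration):

import collections

vocab = {"[PAD]": 0, "[UNK]": 1, "巴": 2, "黎": 3}

# As written in the PR, using the standard library:
ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in vocab.items()])

# Native-dict equivalent: built-in dicts preserve insertion order since Python 3.7.
ids_to_tokens_plain = {ids: tok for tok, ids in vocab.items()}

assert list(ids_to_tokens.items()) == list(ids_to_tokens_plain.items())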


expected_slice = torch.tensor([[[0.6248, 0.3013, 0.3739], [0.3544, 0.8086, 0.2427], [0.3244, 0.6589, 0.1711]]])

self.assertTrue(torch.allclose(output[:, :3, :3], expected_slice, atol=1e-4))
Collaborator:
Would be nice to have a few more tests. At least one with an expected Chinese generation text / one showing attack resistance. (Not sure if it makes sense, tell me if it doesn't.)

Contributor (Author):
I changed this test code. The input text is now "ba 里 系 [MASK] 国 的 首 都", which is an adversarial version of "巴 黎 是 [MASK] 国 的 首 都" ("Paris is the capital of [MASK]" in English).
We expect the model to learn:
"ba 里" => "巴黎" (Paris),
"系" => "是" (is),
and to infer the masked word: "[MASK] 国" => "法国" (France).
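A minimal sketch of that integration check (the masked-LM class name and checkpoint follow this PR; the decoding details are an assumption):

import torch
from transformers import RoCBertForMaskedLM, RoCBertTokenizer

tokenizer = RoCBertTokenizer.from_pretrained("weiweishi/roc-bert-base-zh")
model = RoCBertForMaskedLM.from_pretrained("weiweishi/roc-bert-base-zh")

# Adversarial spelling of "巴 黎 是 [MASK] 国 的 首 都" ("Paris is the capital of [MASK]").
inputs = tokenizer("ba 里 系 [MASK] 国 的 首 都", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring token at the [MASK] position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token = tokenizer.decode(logits[0, mask_pos].argmax(dim=-1))
print(predicted_token)  # a robust model should predict "法" (France)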

@ArthurZucker (Collaborator):

Last comment: it seems that the naming issue still persists; we should make sure to write either RoC or Roc consistently everywhere.

@sww9370 (Contributor, Author) commented Nov 8, 2022

@ArthurZucker I hadn't made weiweishi/roc-bert-base-zh public before; it's available now, and the other issues are resolved ~

@sgugger (Collaborator) left a comment

Thanks again!

@sgugger merged commit efa889d into huggingface:main on Nov 8, 2022
mpierrau pushed a commit to mpierrau/transformers that referenced this pull request Dec 15, 2022
* add roc_bert

* update roc_bert readme

* code style

* change name and delete unused file

* update model file

* delete unused log file

* delete tokenizer fast

* reformat code and change model file path

* add RocBertForPreTraining

* update docs

* delete wrong notes

* fix copies

* fix make repo-consistency error

* fix files are not present in the table of contents error

* change RocBert -> RoCBert

* add doc, add detail test

Co-authored-by: weiweishi <weiweishi@tencent.com>