
NLLB tokenizer #18126

Merged: 8 commits merged into main on Jul 18, 2022

Conversation

@LysandreJik (Member) commented Jul 13, 2022

Adds the NLLB tokenizer. To run it:

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

>>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

>>> translator = pipeline('translation', model=model, tokenizer=tokenizer, src_lang="eng_Latn", tgt_lang='ron_Latn')
>>> translator("UN Chief says there is no military solution in Syria")
[{'translation_text': 'Şeful ONU spune că nu există o soluţie militară în Siria'}]

Closes #18043

LysandreJik and others added 3 commits July 14, 2022 02:07
@LysandreJik LysandreJik marked this pull request as ready for review July 14, 2022 07:46
@LysandreJik (Member, Author)

All models are now public, feel free to try them out @stefan-it. The generation seems good; I have not tried fine-tuning yet.

@HuggingFaceDocBuilderDev commented Jul 14, 2022

The documentation is not available anymore as the PR was closed or merged.

@vmarsel commented Jul 14, 2022

I don't know of a better place to post (issue?), so I'll do it here :)

@LysandreJik Thank you so much for adding support for the NLLB dense models! I checked out this branch and tried all of them, and they work great!

The README contains the following note:
"This implementation contains dense models available in release. Let us know via GitHub if you want to see MoE models as well."

So it would be really great if you could add MoE models! I tried to figure out the original repo, but it turned out to be unexpectedly difficult. I couldn't get MoE to run. So if you add MoE models, I'm sure it will make a lot of people happier, at least me :)

@TonyMas commented Jul 14, 2022

@LysandreJik Thanks a lot for your prompt work! I tried using the NLLB model from Hugging Face and noticed one problem:

max_length is not set in config.json for any of the NLLB models, so the default value of max_length (20) is used.

max_length (`int`, *optional*, defaults to 20):

As a result, your example code cannot generate more than 20 tokens. It is possible to set max_length higher when calling the translation method, but it would be great to have a meaningful default as well.

For comparison, both the M2M and MBart50 models have max_length set to 200 in their config.json files.
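
For reference, max_length can be passed at call time as a workaround until a default lands in config.json (a minimal sketch; 200 here simply mirrors the M2M/MBart50 defaults and is not an official recommendation):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
translator = pipeline("translation", model=model, tokenizer=tokenizer, src_lang="eng_Latn", tgt_lang="ron_Latn")

# Override the low default (20) for this call only; 200 is an illustrative value.
print(translator("UN Chief says there is no military solution in Syria", max_length=200))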

@TroyZuroske

> @LysandreJik Thanks a lot for your prompt work! I tried using the NLLB model from Hugging Face and noticed one problem:
>
> max_length is not set in config.json for any of the NLLB models, so the default value of max_length (20) is used.
>
> max_length (`int`, *optional*, defaults to 20):
>
> As a result, your example code cannot generate more than 20 tokens. It is possible to set max_length higher when calling the translation method, but it would be great to have a meaningful default as well.
> For comparison, both the M2M and MBart50 models have max_length set to 200 in their config.json files.

How is the default max_length determined per model? Or is it documented in their white papers? With this PR, I have started evaluating the extremely large model (facebook/nllb-200-3.3B) against GCP translation, and so far it is doing really well despite the length of the text I give it, but I want to give it the best chance to perform, so knowing the ideal max_length would help.

@TonyMas commented Jul 15, 2022

> > @LysandreJik Thanks a lot for your prompt work! I tried using the NLLB model from Hugging Face and noticed one problem:
> > max_length is not set in config.json for any of the NLLB models, so the default value of max_length (20) is used.
> >
> > max_length (`int`, *optional*, defaults to 20):
> >
> > As a result, your example code cannot generate more than 20 tokens. It is possible to set max_length higher when calling the translation method, but it would be great to have a meaningful default as well.
> > For comparison, both the M2M and MBart50 models have max_length set to 200 in their config.json files.
>
> How is the default max_length determined per model? Or is it documented in their white papers? With this PR, I have started evaluating the extremely large model (facebook/nllb-200-3.3B) against GCP translation, and so far it is doing really well despite the length of the text I give it, but I want to give it the best chance to perform, so knowing the ideal max_length would help.

I think the usual default for max_length is to be equal to the maximum input length. The translation pipeline in transformers warns when the input length exceeds 90% of max_length:

def check_inputs(self, input_length: int, min_length: int, max_length: int):
    if input_length > 0.9 * max_length:
        logger.warning(
            f"Your input_length: {input_length} is bigger than 0.9 * max_length: {max_length}. You might consider "
            "increasing your max_length manually, e.g. translator('...', max_length=400)"
        )
    return True

@LysandreJik LysandreJik requested a review from sgugger July 18, 2022 06:41
@sgugger (Collaborator) left a comment


Thanks for adding this model! Would it make sense to add a default model for NLLB in the auto mappings?

Review comments (outdated, resolved) on:
README.md
docs/source/en/model_doc/nllb.mdx
src/transformers/models/nllb/tokenization_nllb.py
src/transformers/models/nllb/tokenization_nllb_fast.py
tests/models/nllb/test_tokenization_nllb.py (3 comments)
LysandreJik and others added 4 commits July 18, 2022 05:47
Co-authored-by: Stefan Schweter <stefan@schweter.it>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@LysandreJik LysandreJik merged commit c1c79b0 into main Jul 18, 2022
@LysandreJik LysandreJik deleted the nllb branch July 18, 2022 12:12
viclzhu pushed a commit to viclzhu/transformers that referenced this pull request Jul 18, 2022
* NLLB tokenizer

* Apply suggestions from code review - Thanks Stefan!

Co-authored-by: Stefan Schweter <stefan@schweter.it>

* Final touches

* Style :)

* Update docs/source/en/model_doc/nllb.mdx

Co-authored-by: Stefan Schweter <stefan@schweter.it>

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* PR reviews

* Auto models

Co-authored-by: Stefan Schweter <stefan@schweter.it>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@vince62s

As mentioned in #19943: where did you guys see that the "Langtoken" is added AFTER the tokens?
In the NLLB paper, it says the "Langtoken" is placed BEFORE the tokens (mBART does the opposite).

@stefan-it (Collaborator) commented Mar 18, 2023

I've just seen this example, where the lang token is prepended:

https://github.com/facebookresearch/fairseq/blob/nllb/fairseq/data/multilingual/multilingual_data_manager.py#L78-L101

from the original code base 🤔

@vince62s

Right. Also, I am wondering why they use "</s>", which is "eos", as the start token of the source sequence (in fact, the same for the target sequence). I would have expected:

SRC = LangTok + tokens
TGT = BOS + LangTok + tokens + EOS

It seems they use EOS instead of BOS, and that they put an EOS at the start of the SRC.
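
One quick way to check where the language code and EOS tokens actually land is to inspect the tokenizer output directly (a minimal sketch; the exact ordering depends on the tokenizer version installed):

from transformers import AutoTokenizer

# src_lang selects which language code token the tokenizer inserts.
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="eng_Latn")

ids = tokenizer("UN Chief says there is no military solution in Syria")["input_ids"]
# Printing the tokens shows whether "eng_Latn" and "</s>" come before or after the subword pieces.
print(tokenizer.convert_ids_to_tokens(ids))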

Successfully merging this pull request may close these issues: Add Support for "No Language Left Behind" (NLLB)