Add XLM-V to Model Doc #21498

Merged (6 commits) on Feb 7, 2023

Conversation

stefan-it
Collaborator

Hi,

as discussed in #21330, it would be good to have a separate entry for the new XLM-V model in the Model Doc.

This PR adds that entry, along with some additional information about the model and the experiments conducted with it.

@HuggingFaceDocBuilderDev
HuggingFaceDocBuilderDev commented Feb 7, 2023

The documentation is not available anymore as the PR was closed or merged.

Collaborator

@sgugger sgugger left a comment

Thanks for adding this!

Comment on lines 23 to 32
> Large multilingual language models typically rely on a single vocabulary shared across 100+ languages.
> As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged.
> This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R.
> In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by
> de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity
> to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically
> more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V,
> a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we
> tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and
> named entity recognition (WikiAnn) to low-resource tasks (Americas NLI, MasakhaNER).
Collaborator

Should be in italics.

Collaborator Author

Fixed :)

library had to be converted.
- The `XLMTokenizer` implementation is used to load the vocab and perform tokenization.

This model was contributed by [stefan-it](https://huggingface.co/stefan-it), including detailed experiments with XLM-V on downstream tasks.
Collaborator

Could you point to one canonical checkpoint on the Hub as well?

Collaborator Author

Thanks, I added a reference to facebook/xlm-v-base :)
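
For readers who want to try the checkpoint referenced here, a minimal sketch follows. It assumes the `transformers` library is installed and that `facebook/xlm-v-base` loads through the generic `fill-mask` pipeline with `<mask>` as its mask token; the helper names (`make_prompt`, `build_unmasker`, `demo`) are hypothetical and not part of this PR.

```python
# Illustrative sketch only: helper names are hypothetical, not part of the PR.
CHECKPOINT = "facebook/xlm-v-base"  # checkpoint identifier mentioned in this thread


def make_prompt(mask_token: str = "<mask>") -> str:
    """Build a one-sentence prompt containing exactly one mask token."""
    return f"Paris is the {mask_token} of France."


def build_unmasker(checkpoint: str = CHECKPOINT):
    """Create a fill-mask pipeline; the first call downloads the weights."""
    # Imported lazily so the module can be imported without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=checkpoint)


def demo() -> None:
    """Print the top predictions for the masked position."""
    unmasker = build_unmasker()
    # Use the tokenizer's own mask token rather than assuming its spelling.
    prompt = make_prompt(unmasker.tokenizer.mask_token)
    for prediction in unmasker(prompt):
        print(prediction["token_str"], round(prediction["score"], 3))
```

Calling `demo()` requires network access on first use, since the checkpoint is fetched from the Hub.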

@stefan-it
Collaborator Author

CI is failing; I'm going to read the Flan-T5 PR (#19892) to see how it should be done!

@sgugger
Collaborator

sgugger commented Feb 7, 2023

You just need to add the model type (same as what you picked for the page in the doc) and name in the configuration_auto file. The PR you mention also does it :-)
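
The change described above can be sketched roughly as follows. This is a simplified stand-in, not the actual contents of the configuration_auto file: the mapping name `MODEL_NAMES_MAPPING`, the neighbouring entries, and the `display_name` helper are assumptions made for illustration, and only the idea of pairing a model type with a display name comes from the comment.

```python
from collections import OrderedDict

# Simplified stand-in for a model-type -> display-name mapping.
# The real file holds many more entries; the names here are assumptions.
MODEL_NAMES_MAPPING = OrderedDict(
    [
        ("xlm-roberta", "XLM-RoBERTa"),
        ("xlm-roberta-xl", "XLM-RoBERTa-XL"),
        # New entry: the model type matches the name chosen for the doc page,
        # and the value is the human-readable model name.
        ("xlm-v", "XLM-V"),
    ]
)


def display_name(model_type: str) -> str:
    """Look up the human-readable name registered for a model type."""
    return MODEL_NAMES_MAPPING[model_type]
```

The point of the sketch is only that the doc page's model type needs a matching key in the auto mapping; the exact placement and ordering in the real file should follow whatever the repository's checks expect.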

@sgugger
Collaborator

sgugger commented Feb 7, 2023

Failures are unrelated to this PR so merging!

@sgugger sgugger merged commit 7e51a44 into huggingface:main Feb 7, 2023
@stefan-it stefan-it deleted the add-xlm-v-model-doc branch February 8, 2023 14:44
miyu386 pushed a commit to miyu386/transformers that referenced this pull request Feb 9, 2023
* doc: introduce new section for XLM-V model

* doc: mention more details for XLM-V integration

* docs: paper abstract in italics, model identifier for base model added

* doc: mention new XLM-V support

* auto: add XLM-V mapping

* doc: run make fix-copies ;)
ArthurZucker pushed a commit to ArthurZucker/transformers that referenced this pull request Mar 2, 2023 (same commit message as above)