Add XLM-V to Model Doc #21498
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thanks for adding this!
docs/source/en/model_doc/xlm-v.mdx
> Large multilingual language models typically rely on a single vocabulary shared across 100+ languages.
> As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged.
> This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R.
> In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by
> de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity
> to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically
> more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V,
> a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we
> tested, ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and
> named entity recognition (WikiAnn) to low-resource tasks (Americas NLI, MasakhaNER).
Should be in italics.
Fixed :)
library had to be converted.
- The `XLMTokenizer` implementation is used to load the vocab and performs tokenization.

This model was contributed by [stefan-it](https://huggingface.co/stefan-it), including detailed experiments with XLM-V on downstream tasks.
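Since the converted checkpoint lives on the Hub under the `facebook/xlm-v-base` identifier mentioned later in this thread, a minimal usage sketch might look like the following. This assumes the checkpoint resolves through the standard `Auto*` classes (as XLM-R-style masked language models usually do); it is an illustration, not part of the added doc page.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Assumed Hub identifier, taken from this PR discussion.
checkpoint = "facebook/xlm-v-base"

# AutoTokenizer resolves the concrete tokenizer class from the checkpoint config.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Tokenize a sentence and run a forward pass.
inputs = tokenizer("Multilingual models share one vocabulary.", return_tensors="pt")
outputs = model(**inputs)

# The last logits dimension equals the (roughly one million token) vocabulary size.
print(outputs.logits.shape)
```

The large vocabulary is visible directly in the logits: the final dimension matches `model.config.vocab_size`, which is what distinguishes XLM-V from the much smaller XLM-R vocabulary.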
Could you point to one canonical checkpoint on the Hub as well?
Thanks, I added a reference to facebook/xlm-v-base :)
CI is failing, I'm going to read the Flan-T5 PR (#19892) to see how it should be done!
You just need to add the model type (same as what you picked for the page in the doc) and name in the configuration_auto file. The PR you mention also does it :-)
Failures are unrelated to this PR, so merging!
* doc: introduce new section for XLM-V model
* doc: mention more details for XLM-V integration
* docs: paper abstract in italics, model identifier for base model added
* doc: mention new XLM-V support
* auto: add XLM-V mapping
* doc: run make fix-copies ;)
Hi,
as discussed in #21330 it would be good to have an extra entry for the new XLM-V model in the Model Doc.
This PR adds that entry, along with additional information about the model and the experiments conducted with it.