Wrongly mask out eos tokens during training #949

Open
jxmsML opened this issue May 7, 2024 · 2 comments
Labels: bug Something isn't working

jxmsML commented May 7, 2024

I notice that TikTokenTokenizer ALWAYS masks out eos tokens in tokenize_messages, which risks the llama3 models NEVER learning when to STOP.

This also differs from what the official llama-recipes does:
https://github.com/meta-llama/llama-recipes/blob/5f11aeb88ab87a5258112a6d5e5b41de93f705c3/src/llama_recipes/datasets/alpaca_dataset.py#L53-L62
where EOS is NOT set to IGNORE_INDEX.
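
To make the difference concrete, here is a minimal sketch of the two label-construction behaviors being contrasted. The IGNORE_INDEX value, the token ids, and the prompt/response split are assumptions for illustration, not the actual torchtune or llama-recipes code:

```python
# Sketch only: contrasts the two masking behaviors described above.
# IGNORE_INDEX, the token ids, and the prompt/response split are assumptions.
IGNORE_INDEX = -100

def build_labels(prompt_ids, response_ids, eos_id, mask_eos):
    """Build (unshifted) labels for a prompt + response training example."""
    tokens = prompt_ids + response_ids + [eos_id]
    labels = list(tokens)
    # Prompt tokens never contribute to the loss.
    labels[: len(prompt_ids)] = [IGNORE_INDEX] * len(prompt_ids)
    if mask_eos:
        # Behavior reported for TikTokenTokenizer.tokenize_messages:
        # EOS is always masked, so the model gets no loss signal on it.
        labels[-1] = IGNORE_INDEX
    # llama-recipes' alpaca_dataset leaves EOS unmasked (mask_eos=False),
    # so the model is trained to emit EOS at the end of the response.
    return tokens, labels

tokens, labels = build_labels([1, 2, 3], [4, 5], eos_id=9, mask_eos=True)
print(tokens)   # [1, 2, 3, 4, 5, 9]
print(labels)   # [-100, -100, -100, 4, 5, -100]
```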

@jxmsML jxmsML changed the title mask out eos tokens during training Wrongly mask out eos tokens during training May 7, 2024
@joecummings joecummings added the bug Something isn't working label May 7, 2024
@ebsmothers (Contributor)

Thanks for opening this issue! We are following up with the Llama3 authors to confirm whether this is the intended behavior and will post an update here once we have one.

@ebsmothers (Contributor)

To summarize the discussion, I believe we are handling this correctly. There are still EOT and EOM tokens corresponding to end-of-turn and end-of-message, and the EOT token is marked as a stop token for generation (see here). During training, these are not always masked the way EOS is (see here). I can see that always masking EOS like this is a bit unintuitive, and care needs to be taken to ensure our TikTokenTokenizer's tokenize_messages API is not used out of context or mixed and matched with other tokenizers that expect different handling of EOS.
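
For anyone who wants to verify this behavior on their own data, here is a minimal sketch. It assumes tokenize_messages returns parallel (tokens, mask) lists where True means the position is excluded from the loss, and the tokenizer attribute names in the usage comment (eos_id, eot_id) are assumptions for illustration:

```python
# Sketch only: assumes tokenize_messages returns parallel (tokens, mask)
# lists where mask=True means "exclude this position from the loss".
def report_special_token_masking(tokens, mask, special_ids):
    """Print whether each special-token occurrence is masked from the loss."""
    for pos, (tok, masked) in enumerate(zip(tokens, mask)):
        if tok in special_ids:
            state = "masked (no loss)" if masked else "unmasked (trained on)"
            print(f"pos {pos}: {special_ids[tok]} -> {state}")

# Hypothetical usage with a llama3 tokenizer; the attribute names
# (eos_id, eot_id) are assumptions for illustration.
# tokens, mask = tokenizer.tokenize_messages(messages)
# report_special_token_masking(
#     tokens, mask, {tokenizer.eos_id: "EOS", tokenizer.eot_id: "EOT"}
# )
```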

@jxmsML please let me know if you still have concerns around the usage here. Also open to any recommendations on how we can make this a bit clearer. Thanks again!
