I notice that TikTokenTokenizer ALWAYS masks out EOS tokens in tokenize_messages, which could mean the Llama3 models NEVER learn when to STOP. This is also different from what the official llama-recipes does:
https://github.com/meta-llama/llama-recipes/blob/5f11aeb88ab87a5258112a6d5e5b41de93f705c3/src/llama_recipes/datasets/alpaca_dataset.py#L53-L62
where EOS is NOT set to IGNORE_INDEX.
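For illustration, here is a minimal sketch of the two label-masking behaviors being compared. The token IDs, the `build_labels` helper, and the prompt/response split are hypothetical and are not the actual torchtune or llama-recipes implementations; the point is only that when EOS is mapped to IGNORE_INDEX the loss never rewards emitting it, whereas keeping EOS as a target trains the model to stop.

```python
# Minimal sketch of the two masking behaviors (hypothetical IDs and helper,
# not the actual torchtune or llama-recipes code).
IGNORE_INDEX = -100  # positions with this label are excluded from the loss
EOS_ID = 2           # hypothetical EOS token id

def build_labels(token_ids, prompt_len, mask_eos):
    """Build next-token-prediction labels from tokenized input.

    Prompt positions are always ignored; `mask_eos` controls whether the
    EOS position also receives IGNORE_INDEX.
    """
    labels = list(token_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    if mask_eos:
        labels = [IGNORE_INDEX if t == EOS_ID else t for t in labels]
    return labels

tokens = [10, 11, 12, 20, 21, EOS_ID]  # prompt (3 tokens) + response + EOS

# llama-recipes-style: EOS stays in the loss, so the model learns to emit it.
print(build_labels(tokens, prompt_len=3, mask_eos=False))
# [-100, -100, -100, 20, 21, 2]

# Always-mask-EOS behavior described in this issue: no gradient signal on EOS.
print(build_labels(tokens, prompt_len=3, mask_eos=True))
# [-100, -100, -100, 20, 21, -100]
```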
Thanks for opening this issue! We are following up with the Llama3 authors to confirm whether this is the intended behavior, and will post here once we have an update.
To summarize the discussion, I believe we are handling this correctly. There are still EOT and EOM tokens corresponding to end-of-turn and end-of-message, and the EOT token is marked as a stop token for generation (see here). During training, these are not always masked like EOS is (see here). I can see that it is a bit unintuitive to always mask EOS like this, and I think care needs to be taken to make sure our TikTokenTokenizer's tokenize_messages API is not used out of context or mixed and matched with other tokenizers that expect different handling of EOS.
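To make the stop-token point concrete, here is a rough sketch of how a stop set is typically consulted during decoding. This is illustrative only: `EOT_ID`, `STOP_TOKENS`, and `sample_next_token` are placeholder names, not torchtune APIs, and the real generation utilities may differ.

```python
# Illustrative stop-token handling at generation time (not torchtune code).
EOT_ID = 128009         # illustrative end-of-turn token id
STOP_TOKENS = {EOT_ID}  # decoding halts on any id in this set

def generate(prompt_ids, sample_next_token, max_new_tokens=256):
    """Step-wise generation that stops when a stop token is produced.

    `sample_next_token` is any callable mapping the running sequence of
    token ids to the next token id.
    """
    out = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = sample_next_token(out)
        out.append(next_id)
        if next_id in STOP_TOKENS:  # EOT ends the turn even if EOS was masked in training
            break
    return out
```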
@jxmsML please let me know if you still have concerns around the usage here. I'm also open to any recommendations on how we can make this a bit clearer. Thanks again!