I notice that TikTokenTokenizer ALWAYS masks out EOS tokens in tokenize_messages, which could mean the Llama3 models NEVER learn when to STOP. This is also different from what the official llama-recipes does:
https://github.com/meta-llama/llama-recipes/blob/5f11aeb88ab87a5258112a6d5e5b41de93f705c3/src/llama_recipes/datasets/alpaca_dataset.py#L53-L62
where EOS is NOT set to IGNORE_INDEX.
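For illustration, here is a minimal sketch of the two label-masking behaviors being compared. The token IDs, the `build_labels` helper, and the prompt/response split are hypothetical and are not the actual torchtune or llama-recipes implementations; the point is only that when EOS is mapped to IGNORE_INDEX the loss never rewards emitting it, whereas keeping EOS as a target trains the model to stop.

```python
# Minimal sketch of the two masking behaviors (hypothetical IDs and helper,
# not the actual torchtune or llama-recipes code).
IGNORE_INDEX = -100  # positions with this label are excluded from the loss
EOS_ID = 2           # hypothetical EOS token id

def build_labels(token_ids, prompt_len, mask_eos):
    """Build next-token-prediction labels from tokenized input.

    Prompt positions are always ignored; `mask_eos` controls whether the
    EOS position also receives IGNORE_INDEX.
    """
    labels = list(token_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    if mask_eos:
        labels = [IGNORE_INDEX if t == EOS_ID else t for t in labels]
    return labels

tokens = [10, 11, 12, 20, 21, EOS_ID]  # prompt (3 tokens) + response + EOS

# llama-recipes-style: EOS stays in the loss, so the model learns to emit it.
print(build_labels(tokens, prompt_len=3, mask_eos=False))
# [-100, -100, -100, 20, 21, 2]

# Always-mask-EOS behavior described in this issue: no gradient signal on EOS.
print(build_labels(tokens, prompt_len=3, mask_eos=True))
# [-100, -100, -100, 20, 21, -100]
```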
Thanks for opening this issue! We are following up with the Llama3 authors to confirm whether this is the intended behavior, and will post here once we have an update.
To summarize the discussion, I believe we are handling this correctly. There are still EOT and EOM tokens corresponding to end-of-turn and end-of-message, and the EOT token is marked as a stop token for generation (see here). During training, these are not always masked like EOS is (see here). I can see that it is a bit unintuitive to always mask EOS like this, and I think care needs to be taken to make sure our TikTokenTokenizer's tokenize_messages API is not used out of context or mixed and matched with other tokenizers that expect different handling of EOS.
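To make the stop-token point concrete, here is a rough sketch of how a stop set is typically consulted during decoding. This is illustrative only: `EOT_ID`, `STOP_TOKENS`, and `sample_next_token` are placeholder names, not torchtune APIs, and the real generation utilities may differ.

```python
# Illustrative stop-token handling at generation time (not torchtune code).
EOT_ID = 128009         # illustrative end-of-turn token id
STOP_TOKENS = {EOT_ID}  # decoding halts on any id in this set

def generate(prompt_ids, sample_next_token, max_new_tokens=256):
    """Step-wise generation that stops when a stop token is produced.

    `sample_next_token` is any callable mapping the running sequence of
    token ids to the next token id.
    """
    out = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = sample_next_token(out)
        out.append(next_id)
        if next_id in STOP_TOKENS:  # EOT ends the turn even if EOS was masked in training
            break
    return out
```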
@jxmsML please let me know if you still have concerns around the usage here. I'm also open to any recommendations on how we can make this a bit clearer. Thanks again!