
[FR] Sample Packing with correct attention mask #892

Closed
DreamGenX opened this issue Apr 28, 2024 · 3 comments

@DreamGenX

Sample packing with a correct attention mask (where the model can't attend to other examples packed into the same sequence) and ideally a correct RoPE offset would be extremely beneficial. In SFT, examples tend to be highly correlated, so there's an opportunity for the model to cheat during training.

Packing also significantly improves training speed when training on examples with diverse lengths and a large maximum sequence length.
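To make the mask shape concrete, here is a minimal sketch (not from this issue or any particular library) of a block-diagonal causal mask for one packed sequence, where `True` means a position may be attended to:

```python
import torch

def packed_causal_mask(seq_lens: list[int]) -> torch.Tensor:
    """Boolean attention mask for a single packed sequence.

    Position i may attend to position j only if j <= i and both tokens
    come from the same original example (True = may attend).
    """
    total = sum(seq_lens)
    # Label every token with the index of the example it belongs to,
    # e.g. seq_lens=[3, 2] -> doc_ids = [0, 0, 0, 1, 1].
    doc_ids = torch.repeat_interleave(
        torch.arange(len(seq_lens)), torch.tensor(seq_lens)
    )
    causal = torch.tril(torch.ones(total, total, dtype=torch.bool))
    same_example = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_example

# Two examples of lengths 3 and 2 packed into a sequence of length 5:
# the block for tokens 3-4 cannot see tokens 0-2, and vice versa.
mask = packed_causal_mask([3, 2])
```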

Known existing implementations:

This kind of mask is supported by FlashAttention (FA):

One thing to consider is how the examples should be packed -- e.g. naive greedy packing vs. a more elaborate bin-packing algorithm. I think even a naive greedy approach would bring a lot of benefit.
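As an illustration of the naive greedy option, a sketch under the assumption that examples are already tokenized into lists of ids and a new pack is started whenever the next example would overflow `max_seq_len` (function name is made up for this sketch):

```python
def greedy_pack(examples: list[list[int]], max_seq_len: int) -> list[list[list[int]]]:
    """Naive greedy packing: walk the examples in order and start a new
    pack whenever adding the next example would exceed max_seq_len.
    Examples longer than max_seq_len end up in a pack of their own here;
    a real implementation would likely truncate or split them.
    """
    packs, current, current_len = [], [], 0
    for example in examples:
        if current and current_len + len(example) > max_seq_len:
            packs.append(current)
            current, current_len = [], 0
        current.append(example)
        current_len += len(example)
    if current:
        packs.append(current)
    return packs
```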


RdoubleA commented Apr 28, 2024

Thanks for opening this feature request. Indeed, this very thing is being worked on in #875. I am currently investigating how to make the sample masking work with flash attention (we currently use SDPA, which does not support arbitrary masks, so we may have to use Tri Dao's implementation as you pointed out). If you have thoughts on this, we would love your feedback on the PR.
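For reference, Tri Dao's FlashAttention handles packed samples through its variable-length interface, which takes cumulative sequence lengths instead of an explicit mask. A hedged sketch, assuming the flash-attn 2.x `flash_attn_varlen_func` signature and fp16/bf16 CUDA inputs:

```python
import torch
# Assumes the flash-attn package (Tri Dao's implementation) is installed;
# the import path and signature below are those of flash-attn 2.x.
from flash_attn import flash_attn_varlen_func

def packed_attention(q, k, v, seq_lens):
    """q, k, v: (total_tokens, n_heads, head_dim) fp16/bf16 CUDA tensors
    holding all packed examples back to back; seq_lens: per-example
    lengths summing to total_tokens."""
    # Example boundaries are passed as cumulative lengths, int32 on device:
    # seq_lens=[3, 2] -> cu_seqlens = [0, 3, 5].
    cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32, device=q.device)
    cu_seqlens[1:] = torch.cumsum(torch.tensor(seq_lens, device=q.device), dim=0)
    max_len = max(seq_lens)
    # causal=True applies causal masking per example; tokens never attend
    # across example boundaries.
    return flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_len, max_seqlen_k=max_len,
        causal=True,
    )
```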

> ideally correct RoPE offset would be extremely beneficial

Do you mind elaborating on this?

@RdoubleA RdoubleA self-assigned this Apr 28, 2024

DreamGenX commented May 3, 2024

What I meant by the RoPE comment -- and maybe this is already handled automatically -- is that if we just concatenate examples as with naive packing, e.g. in HF transformers, each token's positional embedding will not reflect its actual position within its own example, but rather its position in the concatenated sequence.
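A small sketch of the distinction (illustrative only; names are not from any particular library):

```python
import torch

def packed_position_ids(seq_lens: list[int]) -> torch.Tensor:
    """Position ids that restart at 0 at every example boundary.

    seq_lens=[3, 2] -> tensor([0, 1, 2, 0, 1]) instead of the naive
    tensor([0, 1, 2, 3, 4]) you get from plain concatenation.
    """
    return torch.cat([torch.arange(n) for n in seq_lens])
```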

@RdoubleA

Completed by #875
