
[FR] Sample Packing with correct attention mask #892

Closed
DreamGenX opened this issue Apr 28, 2024 · 3 comments

@DreamGenX

Sample packing with a correct attention mask (where the model can't attend to other examples packed into the same sequence) and ideally a correct RoPE offset would be extremely beneficial. In SFT, examples tend to be highly correlated, so there's an opportunity for the model to cheat during training.

Packing also significantly improves training speed when training on examples with diverse lengths and a large maximum sequence length.
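To make the mask shape concrete, here is a minimal sketch (not from this issue or any particular library) of a block-diagonal causal mask for one packed sequence, where `True` means a position may be attended to:

```python
import torch

def packed_causal_mask(seq_lens: list[int]) -> torch.Tensor:
    """Boolean attention mask for a single packed sequence.

    Position i may attend to position j only if j <= i and both tokens
    come from the same original example (True = may attend).
    """
    total = sum(seq_lens)
    # Label every token with the index of the example it belongs to,
    # e.g. seq_lens=[3, 2] -> doc_ids = [0, 0, 0, 1, 1].
    doc_ids = torch.repeat_interleave(
        torch.arange(len(seq_lens)), torch.tensor(seq_lens)
    )
    causal = torch.tril(torch.ones(total, total, dtype=torch.bool))
    same_example = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_example

# Two examples of lengths 3 and 2 packed into a sequence of length 5:
# the block for tokens 3-4 cannot see tokens 0-2, and vice versa.
mask = packed_causal_mask([3, 2])
```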

Known existing implementations:

This kind of mask is supported by FlashAttention (FA):

One thing to consider is how the examples should be packed -- e.g. naive greedy packing vs. a more elaborate bin-packing algorithm. I think even a naive greedy approach would bring a lot of benefit.
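As an illustration of the naive greedy option, a sketch under the assumption that examples are already tokenized into lists of ids and a new pack is started whenever the next example would overflow `max_seq_len` (function name is made up for this sketch):

```python
def greedy_pack(examples: list[list[int]], max_seq_len: int) -> list[list[list[int]]]:
    """Naive greedy packing: walk the examples in order and start a new
    pack whenever adding the next example would exceed max_seq_len.
    Examples longer than max_seq_len end up in a pack of their own here;
    a real implementation would likely truncate or split them.
    """
    packs, current, current_len = [], [], 0
    for example in examples:
        if current and current_len + len(example) > max_seq_len:
            packs.append(current)
            current, current_len = [], 0
        current.append(example)
        current_len += len(example)
    if current:
        packs.append(current)
    return packs
```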


RdoubleA commented Apr 28, 2024

Thanks for opening this feature request. Indeed, this very thing is being worked on in #875. I am currently investigating how to make the sample masking work with flash attention (we currently use SDPA, which does not support arbitrary masks, so we may have to use Tri Dao's implementation as you pointed out). If you have thoughts on this, we would love your feedback on the PR.
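For reference, Tri Dao's FlashAttention handles packed samples through its variable-length interface, which takes cumulative sequence lengths instead of an explicit mask. A hedged sketch, assuming the flash-attn 2.x `flash_attn_varlen_func` signature and fp16/bf16 CUDA inputs:

```python
import torch
# Assumes the flash-attn package (Tri Dao's implementation) is installed;
# the import path and signature below are those of flash-attn 2.x.
from flash_attn import flash_attn_varlen_func

def packed_attention(q, k, v, seq_lens):
    """q, k, v: (total_tokens, n_heads, head_dim) fp16/bf16 CUDA tensors
    holding all packed examples back to back; seq_lens: per-example
    lengths summing to total_tokens."""
    # Example boundaries are passed as cumulative lengths, int32 on device:
    # seq_lens=[3, 2] -> cu_seqlens = [0, 3, 5].
    cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32, device=q.device)
    cu_seqlens[1:] = torch.cumsum(torch.tensor(seq_lens, device=q.device), dim=0)
    max_len = max(seq_lens)
    # causal=True applies causal masking per example; tokens never attend
    # across example boundaries.
    return flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_len, max_seqlen_k=max_len,
        causal=True,
    )
```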

> ideally correct RoPE offset would be extremely beneficial

Do you mind elaborating on this?

@RdoubleA RdoubleA self-assigned this Apr 28, 2024

DreamGenX commented May 3, 2024

What I meant by the RoPE comment -- and maybe this is already handled automatically -- is that if we just concatenate examples as with naive packing, e.g. in HF transformers, each token's positional embedding will not reflect its actual position within its own example, but rather its position in the concatenated sequence.
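A small sketch of the distinction (illustrative only; names are not from any particular library):

```python
import torch

def packed_position_ids(seq_lens: list[int]) -> torch.Tensor:
    """Position ids that restart at 0 at every example boundary.

    seq_lens=[3, 2] -> tensor([0, 1, 2, 0, 1]) instead of the naive
    tensor([0, 1, 2, 3, 4]) you get from plain concatenation.
    """
    return torch.cat([torch.arange(n) for n in seq_lens])
```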

@RdoubleA

Completed by #875
