DataFrame P2P Shuffling with stable ordering #8421

Closed
fjetter opened this issue Dec 19, 2023 · 1 comment · Fixed by #8453
Labels: enhancement, shuffle

Comments

@fjetter
Member

fjetter commented Dec 19, 2023

The P2P shuffle algorithm currently does not strictly guarantee row ordering. This can be problematic for order-sensitive operations such as groupby + first (dask/dask#10034) or drop_duplicates with keep (dask/dask#10708).

It will take a bit of work, but it should be possible to make P2P stable.
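To illustrate why ordering matters here, the following is a minimal sketch in plain pandas (not Dask, and not the P2P code itself). The frame `df` and the reordered copy are made-up examples; the reordering merely stands in for what an unstable shuffle might do to the rows of a partition.

```python
import pandas as pd

# Hypothetical toy data: two rows per key.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Same rows, but with the rows of each key arriving in a different order,
# as could happen if a shuffle does not preserve input order.
reordered = df.iloc[[2, 3, 0, 1]]

# groupby + first picks whichever row of each key comes first.
first_original = df.groupby("key")["value"].first().to_dict()
first_reordered = reordered.groupby("key")["value"].first().to_dict()

# drop_duplicates(keep="first") has the same dependence on row order.
dedup_original = df.drop_duplicates(subset="key", keep="first")["value"].tolist()
dedup_reordered = reordered.drop_duplicates(subset="key", keep="first")["value"].tolist()

print(first_original, first_reordered)
print(dedup_original, dedup_reordered)
```

Both operations return different rows for the same data once the within-key order changes, which is why a shuffle feeding them needs to be stable.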

@hendrikmakait
Member

For the sake of documentation and precision: when talking about stable ordering, the only thing we can guarantee with P2P is stable ordering between rows that share the same shuffle key (i.e., the combination of values in the columns/index we shuffle on). Because shuffling is a hash-based operation, no particular ordering between different keys can be guaranteed.
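The guarantee described above can be sketched with a toy, single-process hash shuffle. This is an illustration under simplifying assumptions, not the distributed P2P implementation: `stable_hash_shuffle` and `values_for` are hypothetical helpers, rows are plain `(key, value)` tuples, and `zlib.crc32` stands in for whatever hash function is actually used.

```python
import zlib
from collections import defaultdict


def stable_hash_shuffle(partitions, npartitions):
    """Toy hash shuffle over lists of (key, value) rows.

    Input partitions are visited in order and rows are appended in
    arrival order, so rows sharing a key keep their relative order.
    """
    out = defaultdict(list)
    for part in partitions:
        for key, value in part:
            # The output partition depends only on the key's hash.
            target = zlib.crc32(key.encode()) % npartitions
            out[target].append((key, value))
    return [out[i] for i in range(npartitions)]


inputs = [
    [("a", 1), ("b", 2), ("c", 3)],
    [("a", 4), ("c", 5), ("b", 6)],
]
outputs = stable_hash_shuffle(inputs, npartitions=2)


def values_for(key):
    """Collect the values of one key in output order."""
    return [v for part in outputs for k, v in part if k == key]


print(values_for("a"), values_for("b"), values_for("c"))
```

Within each key the values come out in input order, whereas which keys end up next to each other in an output partition is decided purely by the hash, so no ordering between keys follows from the input.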
