Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support shuffle-based groupby aggregations in dask_cudf #11800

Merged
merged 22 commits into from Sep 28, 2022

Conversation

rjzamora
Copy link
Member

@rjzamora rjzamora commented Sep 28, 2022

Description

This PR corresponds to the dask_cudf version of dask/dask#9302 (adding a shuffle-based algorithm for high-cardinality groupby aggregations). The benefits of this algorithm are most significant for cases where split_out>1 is necessary:

agg = ddf.groupby("id").agg({"x": "mean", "y": "max"}, split_out=4, shuffle=True)

NOTES:

  • shuffle="explicit-comms" is also supported (when dask_cuda is installed)
  • It should be possible to refactor remove some of this code in the future. However, due to some subtle differences between the groupby code in dask.dataframe and dask_cudf, the specialized _shuffle_aggregate is currently necessary.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added 2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 28, 2022
@github-actions github-actions bot added the cuDF (Python) Affects Python cuDF API. label Sep 28, 2022
@rjzamora rjzamora marked this pull request as ready for review September 28, 2022 02:58
@rjzamora rjzamora requested a review from a team as a code owner September 28, 2022 02:58
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, sorry, all of my suggestions have resulted in disaster. Not sure if the better option given we're close to release is go back to the old code and go round again, or to try and fix things up.

I'm inclined perhaps to revert and try again more systematically, WDYT?

python/dask_cudf/dask_cudf/core.py Outdated Show resolved Hide resolved
@rjzamora
Copy link
Member Author

I'm inclined perhaps to revert and try again more systematically, WDYT?

I think the safest path forward is to stick with the ugly shuffle work-around poposed in this PR (for the 22.10 release), but to rip it out as soon as something like dask/dask#9521 is supported upstream.

What we want for the impending release is to support ddf.shuffle/sort_values/set_index/merge/join(..., shuffle="explicit-comms") and ddf.groupby(...).agg(..., shuffle="explicit-comms"). The current design ended up being uglier than we wanted (and not particularly extensible), but it does "accomplish" this. Therefore, I think we should target a more-extensible redesign for 22.12 (to provide enough time for upstream buy-in and code review).

@rjzamora
Copy link
Member Author

One more thing we can do for 22.10 is to improve the explicit-comms shuffle logic a bit to avoid creating/removing a "_partitions" column when it already exists.

@codecov
Copy link

codecov bot commented Sep 28, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@bcf361f). Click here to learn what that means.
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.10   #11800   +/-   ##
===============================================
  Coverage                ?   87.52%           
===============================================
  Files                   ?      133           
  Lines                   ?    21801           
  Branches                ?        0           
===============================================
  Hits                    ?    19081           
  Misses                  ?     2720           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@rjzamora
Copy link
Member Author

Update: All changes related to explicit-comms have been removed from this PR - The plan for 22.10 is to support shuffle=True or shuffle="tasks" (and to leave out shuffle="explicit-comms").

rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this pull request Sep 28, 2022
Reverts #992, which had led to unexpected issues. See rapidsai/cudf#11800 (review)

cc @wence-

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #1001
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, and sorry for all the going around in circles

@shwina
Copy link
Contributor

shwina commented Sep 28, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 5a4afec into rapidsai:branch-22.10 Sep 28, 2022
@rjzamora rjzamora deleted the shuffle-groupby branch September 28, 2022 20:48
@rjzamora
Copy link
Member Author

Thanks @wence- and @quasiben for helping to get this in!

raydouglass pushed a commit that referenced this pull request Sep 30, 2022
## Description
This PR fixes a subtle bug introduced in #11800.  While working on the corresponding dask-cuda benchmark for that PR rapidsai/dask-cuda#979, we discovered that non-deterministic column ordering in `_groupby_partition_agg` and `_tree_node_agg` can trigger metadata-enforcement errors in follow-up operations.  This PR simply sorts the output column ordering in those functions (so that the column ordering is always deterministic).

Note that this bug is difficult to reproduce in a pytest, because it rarely occurs with a smaller number of devices (I need to use a full dgx machine to consistently trigger the error).

## Checklist
- [ ] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [ ] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.

Authors:
   - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
   - GALI PREM SAGAR (https://github.com/galipremsagar)
   - Ashwin Srinath (https://github.com/shwina)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress cuDF (Python) Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

4 participants