Fix temporary API breakage in dask_cudf groupby #9403

rjzamora · 2022-08-18T16:44:08Z

Minor follow-up to #9302

That PR modifies the aggregate method that is inherited in dask_cudf. This PR adds minimal changes to avoid breaking dask_cudf CI.

rjzamora · 2022-08-18T16:45:10Z

dask/dataframe/groupby.py

@@ -2223,9 +2222,9 @@ def aggregate(self, arg, split_every=None, split_out=1, shuffle=None):
        )

    @_aggregate_docstring(based_on="pd.core.groupby.DataFrameGroupBy.agg")
-    def agg(self, arg, split_every=None, split_out=1, shuffle=None):
+    def agg(self, arg, split_every=None, split_out=1, **kwargs):


Using kwargs here should be fine, since we are just passing everything through to self.aggregate (which will raise if unexpected kwargs are detected).

I prefer to avoid **kwargs unless absolutely necessary (especially when they mask self-documentation for function signatures). Can you say a bit more about why we need this?

especially when they mask self-documentation for function signatures

Good call, I changed this a bit to avoid kwargs

Can you say a bit more about why we need this?

Attempted to explain here: #9403 (comment)
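For readers following the signature concern in this thread, here is a minimal sketch of how `**kwargs` hides a parameter from introspection. The two functions are hypothetical stand-ins that only mirror the two `agg` signatures in the diff above; they are not dask code.

```python
import inspect


def agg_explicit(arg, split_every=None, split_out=1, shuffle=None):
    """Hypothetical stand-in for the original signature (explicit shuffle)."""
    return arg


def agg_kwargs(arg, split_every=None, split_out=1, **kwargs):
    """Hypothetical stand-in for the proposed signature (shuffle hidden in kwargs)."""
    return arg


# Parameter names as seen by help(), IDEs, and doc generators.
explicit_params = list(inspect.signature(agg_explicit).parameters)
kwargs_params = list(inspect.signature(agg_kwargs).parameters)

print(explicit_params)  # includes "shuffle"
print(kwargs_params)    # "shuffle" is replaced by a generic "kwargs" entry
```

With the `**kwargs` form, a caller can no longer discover from the signature alone that `shuffle` is accepted.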

jrbourbeau

Thanks @rjzamora -- could you say a little more about the API breakage?

rjzamora · 2022-08-18T16:49:36Z

dask/dataframe/groupby.py

@@ -1765,7 +1765,6 @@ def aggregate(self, arg, split_every=None, split_out=1, shuffle=None):
            chunk=_groupby_apply_funcs,
            chunk_kwargs=dict(
                funcs=chunk_funcs,
-                sort=self.sort,


There is technically nothing wrong with this line, but it adds little-to-no benefit, and happened to break a fragile test in dask-cudf. Therefore, I'd prefer to remove it for now.

No strong thoughts on this from me. Could we update the dask-cudf test to be less fragile?

Could we update the dask-cudf test to be less fragile?

That is certainly the plan - I am working on a dask-cudf groupby overhaul that will change the test anyway.

rjzamora · 2022-08-18T16:56:19Z

could you say a little more about the API breakage?

Yes - In dask_cudf, CudfDataFrameGroupBy inherits from DataFrameGroupBy, and overrides DataFrameGroupBy.aggregate (but not DataFrameGroupBy.agg). Therefore, calling CudfDataFrameGroupBy.agg falls back to DataFrameGroupBy.agg, which then calls CudfDataFrameGroupBy.aggregate with the unrecognized shuffle kwarg (and the same goes for SeriesGroupBy).

I will also add handling for shuffle in dask_cudf, but it seems unnecessary to introduce a version barrier here.
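The failure mode described above can be reproduced with a small sketch. `UpstreamGroupBy` and `DownstreamGroupBy` are hypothetical, simplified stand-ins for DataFrameGroupBy and CudfDataFrameGroupBy; the real classes do far more, and only the method signatures are modeled here.

```python
class UpstreamGroupBy:
    """Hypothetical stand-in for DataFrameGroupBy after shuffle was added."""

    def aggregate(self, arg, split_every=None, split_out=1, shuffle=None):
        return ("upstream", arg, shuffle)

    def agg(self, arg, split_every=None, split_out=1, shuffle=None):
        # Dispatches through self, so a subclass override of aggregate is used.
        return self.aggregate(
            arg, split_every=split_every, split_out=split_out, shuffle=shuffle
        )


class DownstreamGroupBy(UpstreamGroupBy):
    """Hypothetical stand-in for a subclass that overrides aggregate only,
    using the older signature without shuffle."""

    def aggregate(self, arg, split_every=None, split_out=1):
        return ("downstream", arg)


def call_agg(groupby):
    """Call the inherited agg and report a TypeError instead of raising."""
    try:
        return groupby.agg("sum")
    except TypeError as err:
        return f"TypeError: {err}"


upstream_result = call_agg(UpstreamGroupBy())
downstream_result = call_agg(DownstreamGroupBy())
print(upstream_result)
print(downstream_result)
```

The inherited `agg` always forwards `shuffle`, so the subclass breaks even when the caller never passes it.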

jrbourbeau · 2022-08-18T18:49:08Z

dask/dataframe/groupby.py

@@ -1765,7 +1765,6 @@ def aggregate(self, arg, split_every=None, split_out=1, shuffle=None):
            chunk=_groupby_apply_funcs,
            chunk_kwargs=dict(
                funcs=chunk_funcs,
-                sort=self.sort,


No strong thoughts on this from me. Could we update the dask-cudf test to be less fragile?

jrbourbeau · 2022-08-18T18:50:30Z

dask/dataframe/groupby.py

+            arg,
+            split_every=split_every,
+            split_out=split_out,
+            **({} if shuffle is None else {"shuffle": shuffle}),


IIUC this can go back to shuffle=shuffle once the relevant updates have been made in dask-cudf. If that's the case, could we open an issue in dask/dask so we remember to follow up, and add a comment here linking to the issue?
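As a rough illustration of the conditional forwarding in the hunk above, the sketch below only passes `shuffle` along when the caller set it. `NewGroupBy` and `LegacyGroupBy` are hypothetical stand-ins, not the dask or dask_cudf classes.

```python
class NewGroupBy:
    """Hypothetical stand-in for the upstream class with the shuffle argument."""

    def aggregate(self, arg, split_every=None, split_out=1, shuffle=None):
        return {"arg": arg, "shuffle": shuffle}

    def agg(self, arg, split_every=None, split_out=1, shuffle=None):
        return self.aggregate(
            arg,
            split_every=split_every,
            split_out=split_out,
            # Forward shuffle only when it was explicitly provided.
            **({} if shuffle is None else {"shuffle": shuffle}),
        )


class LegacyGroupBy(NewGroupBy):
    """Hypothetical subclass whose aggregate predates the shuffle argument."""

    def aggregate(self, arg, split_every=None, split_out=1):
        return {"arg": arg}


# Default call: shuffle is not forwarded, so the legacy override still works.
legacy_default = LegacyGroupBy().agg("sum")

# Explicit shuffle reaches an aggregate that understands it.
new_explicit = NewGroupBy().agg("sum", shuffle="tasks")

# Explicit shuffle on the legacy override still raises, which surfaces the gap.
try:
    LegacyGroupBy().agg("sum", shuffle="tasks")
    legacy_explicit_failed = False
except TypeError:
    legacy_explicit_failed = True
```

The trade-off is that the workaround keeps existing downstream calls working while still failing loudly if someone requests `shuffle` on a subclass that cannot honor it.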

rjzamora · 2022-08-19T13:21:32Z

Thanks for looking at this @ian-r-rose and @jrbourbeau - I decided it was better to just make the necessary changes in dask_cudf, since the previous release of cudf (22.08) is pinned to an earlier release of dask anyway. Sorry for adding to the PR pile.

jrbourbeau · 2022-08-19T15:29:51Z

No worries! Thanks for handling all the necessary downstream changes : )

rjzamora added 3 commits August 18, 2022 09:36

avoid passing shuffle to inherited class

401c4d9

avoid breaking dask_cudf ci

086430e

roll back multi-line change

67c7b64

rjzamora added the dataframe label Aug 18, 2022

rjzamora marked this pull request as ready for review August 18, 2022 16:44

jrbourbeau reviewed Aug 18, 2022

avoid kwargs

cbe5e43

jrbourbeau reviewed Aug 18, 2022

jrbourbeau mentioned this pull request Aug 18, 2022

Release 2022.8.1 dask/community#268

Closed

4 tasks

rjzamora closed this Aug 19, 2022
