Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix temporary API breakage in dask_cudf groupby #9403

Closed
wants to merge 4 commits into from

Conversation

rjzamora
Copy link
Member

Minor follow-up to #9302

That PR modifies the aggregate method that is inherited in dask_cudf. This PR adds minimal changes to avoid breaking dask_cudf CI.

@rjzamora rjzamora marked this pull request as ready for review August 18, 2022 16:44
@@ -2223,9 +2222,9 @@ def aggregate(self, arg, split_every=None, split_out=1, shuffle=None):
)

@_aggregate_docstring(based_on="pd.core.groupby.DataFrameGroupBy.agg")
def agg(self, arg, split_every=None, split_out=1, shuffle=None):
def agg(self, arg, split_every=None, split_out=1, **kwargs):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using kwargs here should be fine since we are just passing everything through to self.aggregate (which will raise if unexpected kwargs are detected)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to avoid **kwargs unless absolutely necessary (especially when they mask self-documentation for function signatures). Can you say a bit more about why we need this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

especially when they mask self-documentation for function signatures

Good call, I changed this a bit to avoid kwargs

Can you say a bit more about why we need this?

Attempted to explain here: #9403 (comment)

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rjzamora -- could you say a little more about the API breakage?

@@ -1765,7 +1765,6 @@ def aggregate(self, arg, split_every=None, split_out=1, shuffle=None):
chunk=_groupby_apply_funcs,
chunk_kwargs=dict(
funcs=chunk_funcs,
sort=self.sort,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is technically nothing wrong with this line, but it adds little-to-no benefit, and happened to break a fragile test in dask-cudf. Therefore, I'd prefer to remove it for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So strong thoughts from this on me. Could we update the dask-cudf test to be less fragile?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we update the dask-cudf test to be less fragile?

That is certainly the plan - I am working on a dask-cudf groupby overhaul that will change the test anyway.

@rjzamora
Copy link
Member Author

rjzamora commented Aug 18, 2022

could you say a little more about the API breakage?

Yes - In dask_cudf, CudfDataFrameGroupBy inherits from DataFrameGroupBy, and overrides DataFrameGroupBy.aggregate (but not DataFrameGroupBy.agg). Therefore, calling CudfDataFrameGroupBy.agg falls back to DataFrameGroupBy.aggregate, which then calls CudfDataFrameGroupBy.aggregate with the unrecognized shuffle kwarg (and same for SeriesGroupBy).

I will also add handling for shuffle in dask_cudf, but it seems unnecessary to introduce a version barrier here.

@@ -1765,7 +1765,6 @@ def aggregate(self, arg, split_every=None, split_out=1, shuffle=None):
chunk=_groupby_apply_funcs,
chunk_kwargs=dict(
funcs=chunk_funcs,
sort=self.sort,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So strong thoughts from this on me. Could we update the dask-cudf test to be less fragile?

arg,
split_every=split_every,
split_out=split_out,
**({} if shuffle is None else {"shuffle": shuffle}),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this can go back to shuffle=shuffle once relevant updates have been made in dask-cudf -- if that's the case could we open an issue in dask/dask so we remember to follow-up and add a comment here linking to the issue?

@jrbourbeau jrbourbeau mentioned this pull request Aug 18, 2022
4 tasks
@rjzamora
Copy link
Member Author

Thanks for looking at this @ian-r-rose and @jrbourbeau - I decided it was better to just make the necessary changes in dask_cudf, since the previous release of cudf (22.08) is pinned to an earlier release of dask anyway. Sorry for adding to the PR pile.

@rjzamora rjzamora closed this Aug 19, 2022
@jrbourbeau
Copy link
Member

No worries! Thanks for handling all the necessary downstream changes : )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants