New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix temporary API breakage in dask_cudf groupby #9403
Conversation
dask/dataframe/groupby.py
Outdated
@@ -2223,9 +2222,9 @@ def aggregate(self, arg, split_every=None, split_out=1, shuffle=None): | |||
) | |||
|
|||
@_aggregate_docstring(based_on="pd.core.groupby.DataFrameGroupBy.agg") | |||
def agg(self, arg, split_every=None, split_out=1, shuffle=None): | |||
def agg(self, arg, split_every=None, split_out=1, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using kwargs here should be fine since we are just passing everything through to self.aggregate
(which will raise if unexpected kwargs are detected)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to avoid **kwargs
unless absolutely necessary (especially when they mask self-documentation for function signatures). Can you say a bit more about why we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
especially when they mask self-documentation for function signatures
Good call, I changed this a bit to avoid kwargs
Can you say a bit more about why we need this?
Attempted to explain here: #9403 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @rjzamora -- could you say a little more about the API breakage?
@@ -1765,7 +1765,6 @@ def aggregate(self, arg, split_every=None, split_out=1, shuffle=None): | |||
chunk=_groupby_apply_funcs, | |||
chunk_kwargs=dict( | |||
funcs=chunk_funcs, | |||
sort=self.sort, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is technically nothing wrong with this line, but it adds little-to-no benefit, and happened to break a fragile test in dask-cudf. Therefore, I'd prefer to remove it for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So strong thoughts from this on me. Could we update the dask-cudf
test to be less fragile?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we update the dask-cudf test to be less fragile?
That is certainly the plan - I am working on a dask-cudf groupby overhaul that will change the test anyway.
Yes - In dask_cudf, I will also add handling for |
@@ -1765,7 +1765,6 @@ def aggregate(self, arg, split_every=None, split_out=1, shuffle=None): | |||
chunk=_groupby_apply_funcs, | |||
chunk_kwargs=dict( | |||
funcs=chunk_funcs, | |||
sort=self.sort, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So strong thoughts from this on me. Could we update the dask-cudf
test to be less fragile?
arg, | ||
split_every=split_every, | ||
split_out=split_out, | ||
**({} if shuffle is None else {"shuffle": shuffle}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this can go back to shuffle=shuffle
once relevant updates have been made in dask-cudf
-- if that's the case could we open an issue in dask/dask
so we remember to follow-up and add a comment here linking to the issue?
Thanks for looking at this @ian-r-rose and @jrbourbeau - I decided it was better to just make the necessary changes in dask_cudf, since the previous release of cudf (22.08) is pinned to an earlier release of dask anyway. Sorry for adding to the PR pile. |
No worries! Thanks for handling all the necessary downstream changes : ) |
Minor follow-up to #9302
That PR modifies the
aggregate
method that is inherited in dask_cudf. This PR adds minimal changes to avoid breaking dask_cudf CI.