New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Partial functions in aggs may have arguments #9724
Conversation
Can one of the admins verify this patch? Admins can comment |
add to allowlist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @j-bennet!
I see this is marked as a draft still -- are there additional changes you'd still like to make?
@@ -1118,7 +1146,8 @@ def _finalize_mean(df, sum_column, count_column): | |||
return df[sum_column] / df[count_column] | |||
|
|||
|
|||
def _finalize_var(df, count_column, sum_column, sum2_column, ddof=1): | |||
def _finalize_var(df, count_column, sum_column, sum2_column, **kwargs): | |||
ddof = kwargs.get("ddof", 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC it looks like even though we're forwarding all kwargs
specified by the user, we're only using ddof
. Is that the case? If so, it'd be good to raise an error if a user passes in a kwarg
that isn't supported. Rather raise an informative error than silently ignore the kwarg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is one of my TODOs and a reason this PR is a draft. I have to see what else I need to pass down, and figure out how to handle the rest. I think I need to see what Pandas does if I pass unexpected parameters, and match the behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added handling for unexpected args, and a test for that. numeric_only
kwarg is an interesting one - Pandas supports it, and I'm not sure why Dask doesn't, and whether it should.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numeric_only
is a tricky one, as pandas behavior is changing. See #9471 (comment) . For now we have been filtering the warnings that pandas raises, see this PR #9496 .
@jrbourbeau I can't recall where we are in the discussion of this kwarg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ncclementi Thank you, I looked over the PRs, and it looks to me that numeric_only
would have to be tackled separately, since it would be a pretty large change, and there's no clearly defined path yet.
Yes, I left several "note-to-self TODO" checkboxes on this PR, and I didn't have the chance to work on those yet. |
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
@jrbourbeau I believe this is ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@j-bennet This is looking good! I left some comments and suggestions for when you have time to look at it.
dask/dataframe/tests/test_groupby.py
Outdated
) | ||
ddf = dd.from_pandas(pdf, npartitions=2) | ||
|
||
with pytest.raises((FutureWarning, TypeError)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can avoid testing pandas here, as this should be covered on their end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was actually intentional. I was thinking along the lines "if Pandas changes their behavior, we should catch it here and change accordingly". If you think this is an overkill, I'll remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the idea.
In general, when there is a pandas change we want to be aware and we usually identify these ones when we run the upstream workflow. Then we can make the modifications catching the right message.
I would recommend removing testing pandas here.
@j-bennet Thank you for your work, this LGTM . @jrbourbeau do you have any more comments here ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @j-bennet @ncclementi. I left a few final comments. They're all minor, overall this PR looks great. Looking forward to seeing it merged
dask/dataframe/groupby.py
Outdated
for arg in unexpected_kwargs: | ||
raise TypeError( | ||
f"aggregate function '{func}' got an unexpected keyword argument '{arg}'" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For multiple unexpected kwargs
, it'd probably be a bit more informative to just let users see all the invalid kwargs
at once. Otherwise, it may take some iteration to get rid of them all.
for arg in unexpected_kwargs: | |
raise TypeError( | |
f"aggregate function '{func}' got an unexpected keyword argument '{arg}'" | |
) | |
if unexpected_kwargs: | |
raise TypeError( | |
f"The aggregate function '{func}' supports {expected_kwargs} keyword arguments, but got {unexpected_kwargs}" | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note we'll need to make a corresponding change to test_groupby_aggregate_partial_function_unexpected_kwargs
since we're updating the error message
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
Fixed all the fine points from the review. Thanks for the suggestions! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @j-bennet! Will merge once CI finishes
Also, I noticed this is your first code contribution to this repository. Welcome! |
Glad to see this in. Welcome @j-bennet ! |
Second one! But the first one was a long time ago :). This: #3828 |
Ah, in that case -- welcome back! : ) |
When the user passes a partial function into an aggregation, parameters should not be lost.
pre-commit run --all-files
TODOs:
known_np_funcs
being used for?ddof
thatstd
needs?This fixes the incorrect calculation, but may need a follow-up. In Pandas std and var accept another parameter that I'm not handling here:
numeric_only
. I'm not sure how to handle it here - would need more digging into what Pandas does with it.