Partial functions in aggs may have arguments #9724

Merged
merged 8 commits into from Dec 9, 2022

j-bennet
Contributor

@j-bennet j-bennet commented Dec 7, 2022

When the user passes a partial function into an aggregation, the arguments bound to it should not be lost.

TODOs:

  • make sure documentation matches the behavior
  • handle positional args
  • what is known_np_funcs being used for?
  • anything else from kwargs besides ddof that std needs?
  • do other agg functions need args passed down?

This fixes the incorrect calculation, but may need a follow-up. In Pandas, std and var accept another parameter that I'm not handling here: numeric_only. I'm not sure how to handle it; it would need more digging into what Pandas does with it.
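
To illustrate the problem described above: a partial object stores its bound arguments separately from the wrapped function, so code that keeps only the wrapped function silently falls back to the default ddof. This is a minimal pure-Python sketch, not the Dask implementation; `std` is a hypothetical stand-in for the real std aggregation.

```python
import functools
import math


def std(values, ddof=1):
    """Hypothetical stand-in for a std aggregation with a ddof parameter."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - ddof))


# The user asks for the population standard deviation (ddof=0).
std_pop = functools.partial(std, ddof=0)

# A partial exposes the wrapped function and the bound keyword arguments.
wrapped = std_pop.func
bound_kwargs = std_pop.keywords

data = [1.0, 2.0, 3.0, 4.0]
with_ddof = std_pop(data)   # honours ddof=0
lost_ddof = wrapped(data)   # bound kwargs dropped, default ddof=1 is used
```

The two results differ, which is the kind of incorrect calculation the description refers to when the bound parameters are lost.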

@GPUtester
Collaborator

Can one of the admins verify this patch?

Admins can comment ok to test to allow this one PR to run or add to allowlist to allow all future PRs from the same author to run.

@j-bennet j-bennet marked this pull request as draft December 7, 2022 05:23
@ncclementi
Member

add to allowlist

Member

@jrbourbeau jrbourbeau left a comment

Thanks @j-bennet!

I see this is marked as a draft still -- are there additional changes you'd still like to make?

dask/dataframe/tests/test_groupby.py (outdated, resolved)
@@ -1118,7 +1146,8 @@ def _finalize_mean(df, sum_column, count_column):
     return df[sum_column] / df[count_column]


-def _finalize_var(df, count_column, sum_column, sum2_column, ddof=1):
+def _finalize_var(df, count_column, sum_column, sum2_column, **kwargs):
+    ddof = kwargs.get("ddof", 1)
Member

IIUC, even though we're forwarding all kwargs specified by the user, we're only using ddof. Is that the case? If so, it'd be good to raise an error if a user passes in a kwarg that isn't supported. Better to raise an informative error than to silently ignore the kwarg.
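
For readers following the hunk above, here is a minimal sketch of how a variance can be finalized from a count, a sum, and a sum of squares, with ddof taken from kwargs. It works on plain numbers rather than DataFrame columns; `finalize_var` and its argument names are hypothetical and are not the Dask function.

```python
import math
import statistics


def finalize_var(count, total, total_sq, **kwargs):
    """Variance from an aggregated count, sum, and sum of squares."""
    ddof = kwargs.get("ddof", 1)  # same default as in the hunk above
    numerator = total_sq - total**2 / count
    denominator = count - ddof
    return numerator / denominator


# Two hypothetical partitions of the same column.
partitions = [[1.0, 2.0], [3.0, 4.0]]
count = sum(len(p) for p in partitions)
total = sum(sum(p) for p in partitions)
total_sq = sum(v * v for p in partitions for v in p)

sample_var = finalize_var(count, total, total_sq)              # ddof=1
population_var = finalize_var(count, total, total_sq, ddof=0)  # ddof=0

# Cross-check against the standard library on the unpartitioned data.
flat = [v for p in partitions for v in p]
assert math.isclose(sample_var, statistics.variance(flat))
assert math.isclose(population_var, statistics.pvariance(flat))
```

Note that this sketch would accept an unrelated keyword such as `numeric_only=True` and silently ignore it, which is exactly the concern raised in the comment above.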

Contributor Author

Yes, this is one of my TODOs and a reason this PR is a draft. I have to see what else I need to pass down, and figure out how to handle the rest. I think I need to see what Pandas does if I pass unexpected parameters, and match the behavior.

Contributor Author

@j-bennet j-bennet Dec 8, 2022

I added handling for unexpected args, and a test for that. numeric_only kwarg is an interesting one - Pandas supports it, and I'm not sure why Dask doesn't, and whether it should.

Member

numeric_only is a tricky one, as pandas behavior is changing. See #9471 (comment). For now we have been filtering the warnings that pandas raises; see PR #9496.

@jrbourbeau I can't recall where we are in the discussion of this kwarg.

Contributor Author

@ncclementi Thank you, I looked over the PRs, and it looks to me that numeric_only would have to be tackled separately, since it would be a pretty large change, and there's no clearly defined path yet.

@j-bennet
Contributor Author

j-bennet commented Dec 7, 2022

> I see this is marked as a draft still -- are there additional changes you'd still like to make?

Yes, I left several "note-to-self TODO" checkboxes on this PR, and I didn't have the chance to work on those yet.

j-bennet and others added 2 commits December 7, 2022 11:08
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
@j-bennet j-bennet changed the title [WIP] Partial functions in aggs may have arguments. Partial functions in aggs may have arguments. Dec 8, 2022
@j-bennet j-bennet marked this pull request as ready for review December 8, 2022 01:28
@j-bennet
Contributor Author

j-bennet commented Dec 8, 2022

@jrbourbeau I believe this is ready for review.

Member

@ncclementi ncclementi left a comment

@j-bennet This is looking good! I left some comments and suggestions for when you have time to look at it.

    )
    ddf = dd.from_pandas(pdf, npartitions=2)

    with pytest.raises((FutureWarning, TypeError)):
Member

We can avoid testing pandas here, as this should be covered on their end.

Contributor Author

This was actually intentional. I was thinking along the lines of "if Pandas changes their behavior, we should catch it here and change accordingly". If you think this is overkill, I'll remove it.

Member

I see the idea.

In general, when there is a pandas change we want to be aware of it, and we usually identify these when we run the upstream workflow. Then we can make the modifications to catch the right message.

I would recommend removing testing pandas here.

dask/dataframe/tests/test_groupby.py (outdated, resolved)
dask/dataframe/tests/test_groupby.py (outdated, resolved)
dask/dataframe/tests/test_groupby.py (outdated, resolved)
dask/dataframe/groupby.py (outdated, resolved)
@j-bennet j-bennet requested review from ncclementi and removed request for jrbourbeau December 9, 2022 05:55
@ncclementi
Member

@j-bennet Thank you for your work, this LGTM.

@jrbourbeau do you have any more comments here?

Member

@jrbourbeau jrbourbeau left a comment

Thanks @j-bennet @ncclementi. I left a few final comments. They're all minor; overall this PR looks great. Looking forward to seeing it merged.

dask/dataframe/groupby.py (outdated, resolved)
dask/dataframe/groupby.py (outdated, resolved)
dask/dataframe/tests/test_groupby.py (outdated, resolved)
dask/dataframe/tests/test_groupby.py (outdated, resolved)
dask/dataframe/tests/test_groupby.py (outdated, resolved)
dask/dataframe/tests/test_groupby.py (outdated, resolved)
Comment on lines 951 to 954
for arg in unexpected_kwargs:
    raise TypeError(
        f"aggregate function '{func}' got an unexpected keyword argument '{arg}'"
    )
Member

For multiple unexpected kwargs, it'd probably be a bit more informative to just let users see all the invalid kwargs at once. Otherwise, it may take some iteration to get rid of them all.

Suggested change
-for arg in unexpected_kwargs:
-    raise TypeError(
-        f"aggregate function '{func}' got an unexpected keyword argument '{arg}'"
-    )
+if unexpected_kwargs:
+    raise TypeError(
+        f"The aggregate function '{func}' supports {expected_kwargs} keyword arguments, but got {unexpected_kwargs}"
+    )

Member

Note we'll need to make a corresponding change to test_groupby_aggregate_partial_function_unexpected_kwargs, since we're updating the error message.
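
The suggested check, and the kind of message assertion the test would need, can be sketched in isolation as follows. This is a self-contained illustration under assumptions: `check_kwargs`, its parameters, and the example keyword names are hypothetical, and only the message wording is taken from the suggestion above.

```python
def check_kwargs(func_name, kwargs, expected_kwargs=("ddof",)):
    """Raise one TypeError listing every unsupported keyword at once."""
    # Sorted so the message is deterministic regardless of dict order.
    unexpected_kwargs = sorted(set(kwargs) - set(expected_kwargs))
    if unexpected_kwargs:
        raise TypeError(
            f"The aggregate function '{func_name}' supports {list(expected_kwargs)} "
            f"keyword arguments, but got {unexpected_kwargs}"
        )


# Supported keyword only: no error is raised.
check_kwargs("std", {"ddof": 0})

# Two unsupported keywords: both are reported in a single error.
message = None
try:
    check_kwargs("std", {"ddof": 0, "unexpected_arg": 1, "another": 2})
except TypeError as exc:
    message = str(exc)
```

Reporting all invalid keywords in one error means a user fixes them in a single pass, rather than rerunning once per bad keyword as with the original loop.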

dask/dataframe/groupby.py (outdated, resolved)
j-bennet and others added 4 commits December 9, 2022 12:01
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
@j-bennet
Contributor Author

j-bennet commented Dec 9, 2022

Fixed all the fine points from the review. Thanks for the suggestions!

@jrbourbeau jrbourbeau changed the title Partial functions in aggs may have arguments. Partial functions in aggs may have arguments Dec 9, 2022
Member

@jrbourbeau jrbourbeau left a comment

Thanks @j-bennet! Will merge once CI finishes

@jrbourbeau jrbourbeau merged commit 159948e into dask:main Dec 9, 2022
@jrbourbeau
Member

Also, I noticed this is your first code contribution to this repository. Welcome!

@mrocklin
Member

mrocklin commented Dec 9, 2022

Glad to see this in. Welcome, @j-bennet!

@j-bennet
Contributor Author

@jrbourbeau

> Also, I noticed this is your first code contribution to this repository. Welcome!

Second one! But the first one was a long time ago :). This: #3828

@j-bennet j-bennet deleted the j-bennet/9615-fix-partial-aggs branch December 10, 2022 00:05
@jrbourbeau
Member

Ah, in that case -- welcome back! : )

5 participants