
Fully support partial functions as aggregations #9615

Closed
ChrisJar opened this issue Nov 1, 2022 · 4 comments
Labels: bug (Something is broken), dataframe

Comments

@ChrisJar
Contributor

ChrisJar commented Nov 1, 2022

Currently in dask.dataframe, Dask ignores any arguments given to a function in a partial when performing an aggregation.

For example:

import pandas as pd
import numpy as np
import dask.dataframe as dd
from functools import partial

df = pd.DataFrame({
    "a": [5, 4, 3, 5, 4, 2, 3, 2],
    "b": [1, 2, 5, 6, 9, 2, 6, 8],
})
ddf = dd.from_pandas(df, npartitions=1)
ddf.groupby("a").agg(partial(np.std, ddof=-2)).compute()

returns

          b
a          
5  3.535534
4  4.949747
3  0.707107
2  4.242641

whereas in pandas:

df.groupby("a").agg(partial(np.std, ddof=-2))

returns:

          b
a          
2  2.121320
3  0.353553
4  2.474874
5  1.767767

The discrepancy is because Dask ignores the ddof argument and defaults to ddof=1. It'd be great if there were some way for Dask to recognize and take those arguments into account like pandas does.

@github-actions github-actions bot added the needs triage Needs a response from a contributor label Nov 1, 2022
@phobson phobson added dataframe bug Something is broken and removed needs triage Needs a response from a contributor labels Nov 1, 2022
@phobson
Contributor

phobson commented Nov 1, 2022

Thanks for the report, @ChrisJar. I can replicate this on the latest release of dask and distributed.

To be crystal clear, I modified your reproducer slightly to put the discrepancies front-and-center:

import pandas as pd
import numpy as np
import dask.dataframe as dd
from functools import partial

df = pd.DataFrame({
    "a": [5, 4, 3, 5, 4, 2, 3, 2],
    "b": [1, 2, 5, 6, 9, 2, 6, 8],
})
ddf = dd.from_pandas(df, npartitions=1)


pd.concat([
    df.groupby("a").agg(partial(np.std, ddof=1))["b"],
    df.groupby("a").agg(partial(np.std, ddof=-2))["b"],
    ddf.groupby("a").agg(partial(np.std, ddof=1))["b"].compute(),
    ddf.groupby("a").agg(partial(np.std, ddof=-2))["b"].compute(),
], axis=1, keys=[ ("pandas", "ddof=1"), ("pandas", "ddof=-2"), ("dask", "ddof=1"), ("dask", "ddof=-2")])
   (pandas, ddof=1)  (pandas, ddof=-2)  (dask, ddof=1)  (dask, ddof=-2)
a
2           4.24264            2.12132         4.24264          4.24264
3          0.707107           0.353553        0.707107         0.707107
4           4.94975            2.47487         4.94975          4.94975
5           3.53553            1.76777         3.53553          3.53553

@ncclementi
Member

I double-checked that this wasn't a problem with choosing ddof=-2, because I'm not sure in which case you would use ddof=-2. The ddof argument sets the degrees of freedom in the std equation; this value is usually 0 or 1. From the numpy docs:

The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean(x)), where x = abs(a - a.mean())**2.

The average squared deviation is typically calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.

That being said, I agree that this is a bug in how the partial's arguments are passed through.

I replicated this with ddof=0 and 1.

     pandas             dask          
     ddof=1 ddof=0    ddof=1    ddof=0
a                                     
2  4.242641    3.0  4.242641  4.242641
3  0.707107    0.5  0.707107  0.707107
4  4.949747    3.5  4.949747  4.949747
5  3.535534    2.5  3.535534  3.535534
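The ddof arithmetic itself can be verified by hand for one of these groups; for example, group a == 2 holds the b values [2, 8], and dividing the sum of squared deviations by N - ddof reproduces each pandas column above:

```python
import numpy as np

x = np.array([2.0, 8.0])          # column "b" where a == 2
ss = np.sum((x - x.mean()) ** 2)  # sum of squared deviations: 18.0
n = x.size                        # 2

# np.std divides the sum of squared deviations by (n - ddof):
std_ddof1 = np.sqrt(ss / (n - 1))   # 4.242641 -> pandas ddof=1 column
std_ddof0 = np.sqrt(ss / (n - 0))   # 3.0      -> pandas ddof=0 column
std_ddofm2 = np.sqrt(ss / (n + 2))  # 2.121320 -> pandas ddof=-2 value

assert np.isclose(np.std(x, ddof=1), std_ddof1)
assert np.isclose(np.std(x, ddof=0), std_ddof0)
assert np.isclose(np.std(x, ddof=-2), std_ddofm2)
```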

@j-bennet
Contributor

May be able to close this via #9724.

@ncclementi
Member

Thanks @j-bennet. For some reason, the issue didn't close itself when the PR was merged. Closing now!
