
Fully support partial functions as aggregations #9615

Closed
ChrisJar opened this issue Nov 1, 2022 · 4 comments
Labels: bug (Something is broken), dataframe

Comments

@ChrisJar
Contributor

ChrisJar commented Nov 1, 2022

Currently in dask.dataframe, Dask ignores any arguments given to a function in a partial when performing an aggregation.

For example:

import pandas as pd
import numpy as np
import dask.dataframe as dd
from functools import partial

df = pd.DataFrame({
    "a": [5, 4, 3, 5, 4, 2, 3, 2],
    "b": [1, 2, 5, 6, 9, 2, 6, 8],
})
ddf = dd.from_pandas(df, npartitions=1)
ddf.groupby("a").agg(partial(np.std, ddof=-2)).compute()

returns

          b
a          
5  3.535534
4  4.949747
3  0.707107
2  4.242641

whereas in pandas:

df.groupby("a").agg(partial(np.std, ddof=-2))

returns:

          b
a          
2  2.121320
3  0.353553
4  2.474874
5  1.767767

The discrepancy is because Dask ignores the ddof argument and defaults to ddof=1. It'd be great if there were some way for Dask to recognize and take those arguments into account like pandas does.

@github-actions github-actions bot added the needs triage Needs a response from a contributor label Nov 1, 2022
@phobson phobson added dataframe bug Something is broken and removed needs triage Needs a response from a contributor labels Nov 1, 2022
@phobson
Contributor

phobson commented Nov 1, 2022

Thanks for the report, @ChrisJar. I can replicate this on the latest release of dask and distributed.

To be crystal clear, I modified your reproducer slightly to put the discrepancies front-and-center:

import pandas as pd
import numpy as np
import dask.dataframe as dd
from functools import partial

df = pd.DataFrame({
    "a": [5, 4, 3, 5, 4, 2, 3, 2],
    "b": [1, 2, 5, 6, 9, 2, 6, 8],
})
ddf = dd.from_pandas(df, npartitions=1)


pd.concat([
    df.groupby("a").agg(partial(np.std, ddof=1))["b"],
    df.groupby("a").agg(partial(np.std, ddof=-2))["b"],
    ddf.groupby("a").agg(partial(np.std, ddof=1))["b"].compute(),
    ddf.groupby("a").agg(partial(np.std, ddof=-2))["b"].compute(),
], axis=1, keys=[ ("pandas", "ddof=1"), ("pandas", "ddof=-2"), ("dask", "ddof=1"), ("dask", "ddof=-2")])
   (pandas, ddof=1)  (pandas, ddof=-2)  (dask, ddof=1)  (dask, ddof=-2)
a
2           4.24264            2.12132         4.24264          4.24264
3          0.707107           0.353553        0.707107         0.707107
4           4.94975            2.47487         4.94975          4.94975
5           3.53553            1.76777         3.53553          3.53553

@ncclementi
Member

I double-checked that this wasn't a problem with choosing ddof=-2, because I'm not sure in which case you would use ddof=-2. The ddof argument sets the degrees of freedom in the std equation; this value is usually 0 or 1. From the numpy docs:

The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean(x)), where x = abs(a - a.mean())**2.

The average squared deviation is typically calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.

That being said, I agree that this is a bug in how the partial's arguments are passed through.

I replicated this with ddof=0 and 1.

     pandas             dask          
     ddof=1 ddof=0    ddof=1    ddof=0
a                                     
2  4.242641    3.0  4.242641  4.242641
3  0.707107    0.5  0.707107  0.707107
4  4.949747    3.5  4.949747  4.949747
5  3.535534    2.5  3.535534  3.535534
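The ddof arithmetic itself can be verified by hand for one of these groups; for example, group a == 2 holds the b values [2, 8], and dividing the sum of squared deviations by N - ddof reproduces each pandas column above:

```python
import numpy as np

x = np.array([2.0, 8.0])          # column "b" where a == 2
ss = np.sum((x - x.mean()) ** 2)  # sum of squared deviations: 18.0
n = x.size                        # 2

# np.std divides the sum of squared deviations by (n - ddof):
std_ddof1 = np.sqrt(ss / (n - 1))   # 4.242641 -> pandas ddof=1 column
std_ddof0 = np.sqrt(ss / (n - 0))   # 3.0      -> pandas ddof=0 column
std_ddofm2 = np.sqrt(ss / (n + 2))  # 2.121320 -> pandas ddof=-2 value

assert np.isclose(np.std(x, ddof=1), std_ddof1)
assert np.isclose(np.std(x, ddof=0), std_ddof0)
assert np.isclose(np.std(x, ddof=-2), std_ddofm2)
```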

@j-bennet
Contributor

May be able to close this via #9724.

@ncclementi
Member

Thanks @j-bennet. For some reason, the issue didn't close itself when the PR was merged. Closing now!
