
dask.dataframe quantile fails spectacularly in some edge cases #731

Open
shoyer opened this issue Sep 19, 2015 · 8 comments
Comments

@shoyer
Member

shoyer commented Sep 19, 2015

s = pd.Series([-1, 0, 0, 0, 1, 1])
print(s.median())  # 0.0
print(dd.from_pandas(s, 2).quantile(0.5).compute())  # 1.0

This is also true for arbitrarily large repetitions of this data, e.g.,

s = pd.Series([-1] * 1000 + [0, 0, 0] * 1000 + [1, 1] * 1000)
# also holds for all different chunk sizes that I tested other than 20
dd.from_pandas(s, 20).quantile(0.5).compute()  # 1.0
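
The failure mode looks consistent with recombining per-partition quantile summaries. As a minimal stdlib sketch (not dask's actual algorithm), here is how a naive "quantile of per-partition quantiles" misses the exact answer on exactly this data, split into the same two chunks as `from_pandas(s, 2)`:

```python
import statistics

data = [-1, 0, 0, 0, 1, 1]
partitions = [data[:3], data[3:]]  # two contiguous chunks: [-1, 0, 0] and [0, 1, 1]

# Naive recombination: compute the median of each partition,
# then take the median of those per-partition medians.
per_partition_medians = [statistics.median(p) for p in partitions]  # [0, 1]
naive = statistics.median(per_partition_medians)  # 0.5

exact = statistics.median(data)  # 0.0
print(naive, exact)  # the naive recombination disagrees with the exact median
```

Because the repeated zeros straddle the partition boundary, the per-partition summaries carry no information about how the ties are distributed, and the recombined estimate drifts away from the exact median.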

cc @ogrisel

@sinhrks
Member

sinhrks commented Sep 19, 2015

Is there a better approximation/interpolation method, or should we add a strict option?

@shoyer
Member Author

shoyer commented Sep 20, 2015

Yes, there are streaming algorithms designed to handle approximate quantiles. t-Digest is a good candidate: https://github.com/CamDavidsonPilon/tdigest
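
The key property t-Digest offers is that per-partition summaries can be merged and then queried for quantiles. As a stdlib illustration of that interface only (this toy version keeps every value, so it is exact but unbounded in memory; the real t-Digest keeps a bounded set of weighted centroids instead):

```python
import bisect

class ExactDigest:
    """Toy mergeable quantile summary: stores all values in sorted order.
    t-Digest exposes the same update/merge/quantile interface, but with
    bounded memory and approximate answers."""

    def __init__(self):
        self.values = []

    def update(self, x):
        bisect.insort(self.values, x)

    def merge(self, other):
        merged = ExactDigest()
        merged.values = sorted(self.values + other.values)
        return merged

    def quantile(self, q):
        # Nearest-rank quantile, no interpolation.
        idx = min(int(q * len(self.values)), len(self.values) - 1)
        return self.values[idx]

# One digest per partition, merged before querying:
left, right = ExactDigest(), ExactDigest()
for x in [-1, 0, 0]:
    left.update(x)
for x in [0, 1, 1]:
    right.update(x)
print(left.merge(right).quantile(0.5))
```

Merging before querying is what makes this correct: unlike taking a quantile of per-partition quantiles, the merged summary still sees the whole distribution.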

@ogrisel
Contributor

ogrisel commented Sep 20, 2015

tdigest's quantile seems to do interpolation by default, though. That's probably fine for large data, but it might be unexpected for users looking for a median.

@ispmarin

ispmarin commented Dec 5, 2015

Having a very similar problem with quantiles:

                GP      gameNum  startingPos       yearID
count  4993.000000  4993.000000  1560.000000  4993.000000
mean      0.775686     0.138995     5.044872  1975.845984
std       0.417172     0.464600     2.649853    23.386732
min       0.000000     0.000000     0.000000  1933.000000
25%       1.000000     0.000000          NaN  1958.000000
50%       1.000000     0.000000          NaN  1976.000000
75%       1.000000     0.000000          NaN  1997.000000
max       1.000000     2.000000    10.000000  2014.000000

while pandas can do it:

            yearID      gameNum           GP  startingPos
count  4993.000000  4993.000000  4993.000000  1560.000000
mean   1975.845984     0.138995     0.775686     5.044872
std      23.386732     0.464600     0.417172     2.649853
min    1933.000000     0.000000     0.000000     0.000000
25%    1958.000000     0.000000     1.000000     3.000000
50%    1976.000000     0.000000     1.000000     5.000000
75%    1997.000000     0.000000     1.000000     7.000000
max    2014.000000     2.000000     1.000000    10.000000

from the Lahman baseball AllstarFull data set.

@jrbourbeau
Member

These edge cases seem to be handled correctly using the new method="tdigest" option in quantile (added in #4677)

In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: s = pd.Series([-1, 0, 0, 0, 1, 1])
   ...: print(s.median())  # 0.0
   ...: print(dd.from_pandas(s, 2).quantile(0.5, method="tdigest").compute())
0.0
0.0

In [4]: s = pd.Series([-1] * 1000 + [0, 0, 0] * 1000 + [1, 1] * 1000)
   ...: # also holds for all different chunk sizes that I tested other than 20
   ...: dd.from_pandas(s, 20).quantile(0.5, method="tdigest").compute()
Out[4]: 0.0

@shoyer is this safe to close?

@shoyer
Member Author

shoyer commented Sep 24, 2019

Well, the default behavior has not changed. I still think that should be fixed. There are some cases when it's appropriate to treat each chunk as a set of independent, identically distributed samples, but that's a very strong assumption to make without user input. See also #3819 for more discussion.
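
That i.i.d. assumption is exactly what `from_pandas` on an ordered series violates: it splits the index into contiguous chunks, so each partition samples a different slice of the distribution. A small sketch contrasting contiguous chunks of sorted data with shuffled (i.i.d.-like) chunks:

```python
import random
import statistics

random.seed(0)
data = sorted(random.gauss(0, 1) for _ in range(10_000))

# from_pandas splits an ordered index into contiguous chunks,
# so partitions are NOT i.i.d. samples of the whole series:
contiguous = [data[i:i + 1000] for i in range(0, 10_000, 1000)]

# Shuffling first makes each chunk look like an i.i.d. sample:
shuffled = data[:]
random.shuffle(shuffled)
iid_like = [shuffled[i:i + 1000] for i in range(0, 10_000, 1000)]

def chunk_medians(chunks):
    return [statistics.median(c) for c in chunks]

print(chunk_medians(contiguous)[:3])  # per-chunk medians vary wildly
print(chunk_medians(iid_like)[:3])    # per-chunk medians cluster near the true median
```

Any recombination scheme that implicitly assumes the second picture will misbehave on the first, which is the common case for time-indexed or otherwise sorted data.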

@huard

huard commented Aug 18, 2020

Divide and recombine algorithm for quantile estimation: https://hammer.figshare.com/authors/Aritra_Chakravorty/5929565

@j-bennet
Contributor

j-bennet commented Apr 4, 2023

With the latest dask, this works as expected:

In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: s = pd.Series([-1, 0, 0, 0, 1, 1])

In [4]: s.median()
Out[4]: 0.0

In [5]: dd.from_pandas(s, 2).quantile(0.5).compute()
Out[5]: 0.0

Not sure when this changed, but perhaps we can close the issue now.
