
dask.dataframe quantile fails spectacularly in some edge cases #731

Open
shoyer opened this issue Sep 19, 2015 · 8 comments
Comments

@shoyer
Member

shoyer commented Sep 19, 2015

s = pd.Series([-1, 0, 0, 0, 1, 1])
print(s.median())  # 0.0
print(dd.from_pandas(s, 2).quantile(0.5).compute())  # 1.0

This is also true for arbitrarily large repetitions of this data, e.g.,

s = pd.Series([-1] * 1000 + [0, 0, 0] * 1000 + [1, 1] * 1000)
# also holds for all different chunk sizes that I tested other than 20
dd.from_pandas(s, 20).quantile(0.5).compute()  # 1.0
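
The failure mode looks consistent with recombining per-partition quantile summaries. As a minimal stdlib sketch (not dask's actual algorithm), here is how a naive "quantile of per-partition quantiles" misses the exact answer on exactly this data, split into the same two chunks as `from_pandas(s, 2)`:

```python
import statistics

data = [-1, 0, 0, 0, 1, 1]
partitions = [data[:3], data[3:]]  # two contiguous chunks: [-1, 0, 0] and [0, 1, 1]

# Naive recombination: compute the median of each partition,
# then take the median of those per-partition medians.
per_partition_medians = [statistics.median(p) for p in partitions]  # [0, 1]
naive = statistics.median(per_partition_medians)  # 0.5

exact = statistics.median(data)  # 0.0
print(naive, exact)  # the naive recombination disagrees with the exact median
```

Because the repeated zeros straddle the partition boundary, the per-partition summaries carry no information about how the ties are distributed, and the recombined estimate drifts away from the exact median.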

cc @ogrisel

@sinhrks
Member

sinhrks commented Sep 19, 2015

Is there a better approximation/interpolation method, or should we add a strict option?

@shoyer
Member Author

shoyer commented Sep 20, 2015

Yes, there are streaming algorithms designed to handle approximate quantiles. t-Digest is a good candidate: https://github.com/CamDavidsonPilon/tdigest
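
The key property t-Digest offers is that per-partition summaries can be merged and then queried for quantiles. As a stdlib illustration of that interface only (this toy version keeps every value, so it is exact but unbounded in memory; the real t-Digest keeps a bounded set of weighted centroids instead):

```python
import bisect

class ExactDigest:
    """Toy mergeable quantile summary: stores all values in sorted order.
    t-Digest exposes the same update/merge/quantile interface, but with
    bounded memory and approximate answers."""

    def __init__(self):
        self.values = []

    def update(self, x):
        bisect.insort(self.values, x)

    def merge(self, other):
        merged = ExactDigest()
        merged.values = sorted(self.values + other.values)
        return merged

    def quantile(self, q):
        # Nearest-rank quantile, no interpolation.
        idx = min(int(q * len(self.values)), len(self.values) - 1)
        return self.values[idx]

# One digest per partition, merged before querying:
left, right = ExactDigest(), ExactDigest()
for x in [-1, 0, 0]:
    left.update(x)
for x in [0, 1, 1]:
    right.update(x)
print(left.merge(right).quantile(0.5))
```

Merging before querying is what makes this correct: unlike taking a quantile of per-partition quantiles, the merged summary still sees the whole distribution.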

@ogrisel
Contributor

ogrisel commented Sep 20, 2015

tdigest's quantile seems to do interpolation by default, though. That's probably fine for large data, but it might be unexpected for users looking for a median.

@ispmarin

ispmarin commented Dec 5, 2015

Having a very similar problem with quantiles:

                GP      gameNum  startingPos       yearID
count  4993.000000  4993.000000  1560.000000  4993.000000
mean      0.775686     0.138995     5.044872  1975.845984
std       0.417172     0.464600     2.649853    23.386732
min       0.000000     0.000000     0.000000  1933.000000
25%       1.000000     0.000000          NaN  1958.000000
50%       1.000000     0.000000          NaN  1976.000000
75%       1.000000     0.000000          NaN  1997.000000
max       1.000000     2.000000    10.000000  2014.000000

while pandas can do it:

            yearID      gameNum           GP  startingPos
count  4993.000000  4993.000000  4993.000000  1560.000000
mean   1975.845984     0.138995     0.775686     5.044872
std      23.386732     0.464600     0.417172     2.649853
min    1933.000000     0.000000     0.000000     0.000000
25%    1958.000000     0.000000     1.000000     3.000000
50%    1976.000000     0.000000     1.000000     5.000000
75%    1997.000000     0.000000     1.000000     7.000000
max    2014.000000     2.000000     1.000000    10.000000

from the Lahman baseball AllstarFull data set.

@jrbourbeau
Member

These edge cases seem to be handled correctly using the new method="tdigest" option in quantile (added in #4677)

In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: s = pd.Series([-1, 0, 0, 0, 1, 1])
   ...: print(s.median())  # 0.0
   ...: print(dd.from_pandas(s, 2).quantile(0.5, method="tdigest").compute())
0.0
0.0

In [4]: s = pd.Series([-1] * 1000 + [0, 0, 0] * 1000 + [1, 1] * 1000)
   ...: # also holds for all different chunk sizes that I tested other than 20
   ...: dd.from_pandas(s, 20).quantile(0.5, method="tdigest").compute()
Out[4]: 0.0

@shoyer is this safe to close?

@shoyer
Member Author

shoyer commented Sep 24, 2019

Well, the default behavior has not changed. I still think that should be fixed. There are some cases when it's appropriate to treat each chunk as a set of independent, identically distributed samples, but that's a very strong assumption to make without user input. See also #3819 for more discussion.
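
That i.i.d. assumption is exactly what `from_pandas` on an ordered series violates: it splits the index into contiguous chunks, so each partition samples a different slice of the distribution. A small sketch contrasting contiguous chunks of sorted data with shuffled (i.i.d.-like) chunks:

```python
import random
import statistics

random.seed(0)
data = sorted(random.gauss(0, 1) for _ in range(10_000))

# from_pandas splits an ordered index into contiguous chunks,
# so partitions are NOT i.i.d. samples of the whole series:
contiguous = [data[i:i + 1000] for i in range(0, 10_000, 1000)]

# Shuffling first makes each chunk look like an i.i.d. sample:
shuffled = data[:]
random.shuffle(shuffled)
iid_like = [shuffled[i:i + 1000] for i in range(0, 10_000, 1000)]

def chunk_medians(chunks):
    return [statistics.median(c) for c in chunks]

print(chunk_medians(contiguous)[:3])  # per-chunk medians vary wildly
print(chunk_medians(iid_like)[:3])    # per-chunk medians cluster near the true median
```

Any recombination scheme that implicitly assumes the second picture will misbehave on the first, which is the common case for time-indexed or otherwise sorted data.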

@huard

huard commented Aug 18, 2020

Divide and recombine algorithm for quantile estimation: https://hammer.figshare.com/authors/Aritra_Chakravorty/5929565

@j-bennet
Contributor

j-bennet commented Apr 4, 2023

With the latest dask, this works as expected:

In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: s = pd.Series([-1, 0, 0, 0, 1, 1])

In [4]: s.median()
Out[4]: 0.0

In [5]: dd.from_pandas(s, 2).quantile(0.5).compute()
Out[5]: 0.0

Not sure when this changed, but perhaps we can close the issue now.
