dask.dataframe quantile fails spectacularly in some edge cases #731
Is there a better approximate/interpolation method, or should one be added? |
Yes, there are streaming algorithms designed to handle approximate quantiles. t-Digest is a good candidate: https://github.com/CamDavidsonPilon/tdigest |
tdigest's quantile seems to do interpolation by default, though. That's probably fine for large data, but it might be unexpected for a user looking for a median. |
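To illustrate the pitfall being discussed (a minimal sketch, not dask's actual merge algorithm): naively combining per-chunk medians mis-handles duplicate-heavy data like the series from this issue.

```python
from statistics import median

# Duplicate-heavy data: the true median is 0.0.
data = [-1, 0, 0, 0, 1, 1]

# Split into two chunks, as dask.dataframe would with npartitions=2.
chunks = [data[:3], data[3:]]  # [-1, 0, 0] and [0, 1, 1]

# Naive merge: take each chunk's median, then the median of those.
# (An illustration of the failure mode, not what dask implements.)
chunk_medians = [median(c) for c in chunks]  # [0, 1]
naive = median(chunk_medians)                # 0.5

exact = median(data)                         # 0.0
print(naive, exact)
```

Interpolation between the chunk summaries lands at 0.5, which is not even a value that occurs in the data, which is exactly the surprise a user asking for a median would hit.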
I'm having a very similar problem with quantiles on the Lahman baseball AllstarFull data set: pandas can compute them, but dask cannot. |
These edge cases seem to be handled correctly using the new tdigest method:

In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: s = pd.Series([-1, 0, 0, 0, 1, 1])
   ...: print(s.median())  # 0.0
   ...: print(dd.from_pandas(s, 2).quantile(0.5, method="tdigest").compute())
0.0
0.0

In [4]: s = pd.Series([-1] * 1000 + [0, 0, 0] * 1000 + [1, 1] * 1000)
   ...: # also holds for all different chunk sizes that I tested other than 20
   ...: dd.from_pandas(s, 20).quantile(0.5, method="tdigest").compute()
Out[4]: 0.0

@shoyer is this safe to close? |
Well, the default behavior has not changed. I still think that should be fixed. There are some cases when it's appropriate to treat each chunk as a set of independent, identically distributed samples, but that's a very strong assumption to make without user input. See also #3819 for more discussion. |
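A minimal sketch of why the i.i.d.-chunks assumption is strong: if the input happens to be sorted, each chunk sees only one slice of the distribution, so no single chunk's quantile summary approximates the global quantile (illustration only, not dask's code path):

```python
from statistics import median

# Sorted input: partitions are contiguous slices, not random samples.
data = list(range(100))                        # true median 49.5
chunks = [data[i:i + 25] for i in range(0, 100, 25)]

# If chunks really were i.i.d. samples of the whole series, any one
# chunk's median would estimate the global median. With sorted data
# the first chunk sees only 0..24:
print(median(chunks[0]))   # 12
print(median(data))        # 49.5
```

This is why merging per-chunk summaries without user input about how the data is partitioned can silently produce biased answers.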
Divide and recombine algorithm for quantile estimation: https://hammer.figshare.com/authors/Aritra_Chakravorty/5929565 |
With latest dask, this works as expected:

In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: s = pd.Series([-1, 0, 0, 0, 1, 1])

In [4]: s.median()
Out[4]: 0.0

In [5]: dd.from_pandas(s, 2).quantile(0.5).compute()
Out[5]: 0.0

Not sure when this changed, but perhaps we can close the issue now.
This is also true for arbitrarily large repetitions of this data.
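For reference, the exact median is invariant under repetition, so a correct quantile implementation should keep returning 0.0 however many times the series is tiled (pure-Python check; the dask call from the comment above is omitted):

```python
from statistics import median

base = [-1, 0, 0, 0, 1, 1]

# Repeating the data scales every value's count equally, so the
# exact median stays 0.0 for any repetition factor k.
for k in (1, 10, 1000):
    assert median(base * k) == 0.0
```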
cc @ogrisel