
Possible docstring error for DataFrame.quantile #8065

Closed

ianozsvald opened this issue Aug 19, 2021 · 2 comments

Comments

@ianozsvald
Out of curiosity I decided to see how approximate quantiles are calculated in Dask. I've followed the past bug reports and the move to tdigest so I can see that there's some flux. I think that the DataFrame.quantile docstring might be in error, but I can't easily follow the code to check.

Specifically, the DataFrame.quantile docstring states "Approximate row-wise and precise column-wise quantiles of DataFrame", which feels back to front: aren't the column-wise quantiles calculated across partitions, and therefore approximate, while the row-wise quantiles are complete within a partition, and therefore precise? If that's the case, then the fix is to swap "approximate" and "precise".
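As a toy sketch of the intuition (this is not Dask's actual merge algorithm, which uses an approximate percentile merge, or optionally t-digest): combining per-partition quantiles generally does not reproduce the exact global quantile.

```python
import numpy as np

# Toy illustration only: averaging per-partition medians is
# generally not the exact global median.
data = np.arange(0, 100)
exact = np.quantile(data, 0.5)  # exact global median: 49.5

# Split into two unequal "partitions" and average the partition medians.
parts = [data[:30], data[30:]]
merged = np.mean([np.quantile(p, 0.5) for p in parts])  # 39.5

print(exact, merged)
```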

This is referenced here on the current version of Dask:

The equivalent Series method notes "Approximate quantiles of Series."

Apologies if I'm missing something and causing noise!

This worked example shows that a Pandas DataFrame and a Dask DataFrame with npartitions=1 for column-wise quantile give the same result, but for npartitions=2 (and higher) we get a slightly different, approximate, result. The final part shows the same for row-wise quantiles and at least in this case the result is identical.

import pandas as pd
import dask.dataframe as dd
import numpy as np

ser = pd.Series(np.arange(0, 100))
df = pd.concat([ser] * 10, axis=1)
df.shape  # (100, 10): 10 columns of identical numbers

df.quantile(0.5)  # 10 rows of identical results:
# 0    49.5
# ...
# 9    49.5

ddf_p1 = dd.from_pandas(df, npartitions=1)
ddf_p1.quantile(0.5).compute()  # same result as for Pandas:
# 0    49.5
# ...
# 9    49.5

ddf_p2 = dd.from_pandas(df, npartitions=2)
ddf_p2.quantile(0.5).compute()  # approximate column-wise result with >1 partition:
# 0    49.0
# ...
# 9    49.0
If we calculate row-wise, I see the same result with 1 or 2 partitions:
(df.quantile(0.5, axis=1) == ddf_p1.quantile(0.5, axis=1).compute()).all() # True
(df.quantile(0.5, axis=1) == ddf_p2.quantile(0.5, axis=1).compute()).all() # True
@jsignell
Member

I think you have the notions of row-wise and column-wise flipped. I was not at all confident of this, so I looked up the pandas docs, which say:

[screenshot of the pandas DataFrame.quantile documentation: axis : {0 or 'index', 1 or 'columns'} -- 0 or 'index' for row-wise, 1 or 'columns' for column-wise]
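A quick check of that convention (my own example, not from the thread): in pandas, axis=0 reduces along the index and yields one value per column, which the docs call "row-wise", while axis=1 yields one value per row ("column-wise").

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# axis=0 ("row-wise" in the pandas docs): one quantile per column
per_column = df.quantile(0.5, axis=0)

# axis=1 ("column-wise"): one quantile per row
per_row = df.quantile(0.5, axis=1)

print(per_column["a"], per_column["b"])  # 2.5 25.0
print(list(per_row))                     # [5.5, 11.0, 16.5, 22.0]
```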

@ianozsvald
Author

@jsignell thanks for checking the Pandas docs (and yes, if I were smarter I would have started there!). I've googled through the numpy docs, and the only places I see references to column-wise and row-wise are in relation to stacking of arrays; everywhere else the axis discussion doesn't say whether to interpret an axis as columns or rows.
Being in agreement with the Pandas docs is definitely the right choice.
20 years in and I still don't know my columns from my rows :-) Apologies for the noise!
