
Possible docstring error for DataFrame.quantile #8065

Closed

ianozsvald opened this issue Aug 19, 2021 · 2 comments

Comments

@ianozsvald
Out of curiosity I decided to see how approximate quantiles are calculated in Dask. I've followed the past bug reports and the move to tdigest so I can see that there's some flux. I think that the DataFrame.quantile docstring might be in error, but I can't easily follow the code to check.

Specifically, the DataFrame.quantile docstring states "Approximate row-wise and precise column-wise quantiles of DataFrame", which feels back to front: aren't the column-wise quantiles calculated across partitions, and therefore approximate, while the row-wise quantiles are complete within a partition, and therefore precise? If that's the case, then the fix is to swap "approximate" and "precise".
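As a toy sketch of the intuition (this is not Dask's actual merge algorithm, which uses an approximate percentile merge, or optionally t-digest): combining per-partition quantiles generally does not reproduce the exact global quantile.

```python
import numpy as np

# Toy illustration only: averaging per-partition medians is
# generally not the exact global median.
data = np.arange(0, 100)
exact = np.quantile(data, 0.5)  # exact global median: 49.5

# Split into two unequal "partitions" and average the partition medians.
parts = [data[:30], data[30:]]
merged = np.mean([np.quantile(p, 0.5) for p in parts])  # 39.5

print(exact, merged)
```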

This is referenced here on the current version of Dask:

The equivalent Series method notes "Approximate quantiles of Series."

Apologies if I'm missing something and causing noise!

This worked example shows that a Pandas DataFrame and a Dask DataFrame with npartitions=1 for column-wise quantile give the same result, but for npartitions=2 (and higher) we get a slightly different, approximate, result. The final part shows the same for row-wise quantiles and at least in this case the result is identical.

import pandas as pd
import dask.dataframe as dd
import numpy as np

ser = pd.Series(np.arange(0, 100))
df = pd.concat([ser] * 10, axis=1)
df.shape  # (100, 10): 10 columns of identical numbers

df.quantile(0.5)  # 10 rows of identical results:
# 0    49.5
# ...
# 9    49.5

ddf_p1 = dd.from_pandas(df, npartitions=1)
ddf_p1.quantile(0.5).compute()  # same result as for Pandas:
# 0    49.5
# ...
# 9    49.5

ddf_p2 = dd.from_pandas(df, npartitions=2)
ddf_p2.quantile(0.5).compute()  # approximate column-wise result with >1 partition:
# 0    49.0
# ...
# 9    49.0
If we calculate row-wise, I see the same result with 1 or 2 partitions:
(df.quantile(0.5, axis=1) == ddf_p1.quantile(0.5, axis=1).compute()).all() # True
(df.quantile(0.5, axis=1) == ddf_p2.quantile(0.5, axis=1).compute()).all() # True
@jsignell
Member

I think you have the notions of row-wise and column-wise flipped. I was not at all confident of this, so I looked up the pandas docs, which say:

[screenshot of the pandas DataFrame.quantile documentation: axis : {0 or 'index', 1 or 'columns'} -- 0 or 'index' for row-wise, 1 or 'columns' for column-wise]
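A quick check of that convention (my own example, not from the thread): in pandas, axis=0 reduces along the index and yields one value per column, which the docs call "row-wise", while axis=1 yields one value per row ("column-wise").

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# axis=0 ("row-wise" in the pandas docs): one quantile per column
per_column = df.quantile(0.5, axis=0)

# axis=1 ("column-wise"): one quantile per row
per_row = df.quantile(0.5, axis=1)

print(per_column["a"], per_column["b"])  # 2.5 25.0
print(list(per_row))                     # [5.5, 11.0, 16.5, 22.0]
```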

@ianozsvald
Author

@jsignell thanks for checking the Pandas docs (and yes, if I were smarter I would have started there!). I've googled through the numpy docs, and the only places I see references to column-wise and row-wise are in relation to stacking of arrays; everywhere else the axis discussion doesn't say whether to interpret an axis as columns or rows.
Being in agreement with the Pandas docs is definitely the right choice.
20 years in and I still don't know my columns from my rows :-) Apologies for the noise!
