Out of curiosity I decided to see how approximate quantiles are calculated in Dask. I've followed the past bug reports and the move to tdigest so I can see that there's some flux. I think that the DataFrame.quantile docstring might be in error, but I can't easily follow the code to check.
Specifically, the DataFrame.quantile docstring states Approximate row-wise and precise column-wise quantiles of DataFrame, which feels back to front: aren't the column-wise quantiles calculated across partitions, and so approximate, while the row-wise quantiles are complete within a partition, and so precise? If that's the case, the fix is to swap "approximate" and "precise".
This is referenced here on the current version of Dask:
The equivalent Series method's docstring notes Approximate quantiles of Series.
Apologies if I'm missing something and causing noise!
This worked example shows that a Pandas DataFrame and a Dask DataFrame with npartitions=1 give the same column-wise quantile, but with npartitions=2 (and higher) we get a slightly different, approximate result. The final part does the same comparison for row-wise quantiles, and at least in this case the results are identical.
import pandas as pd
import dask.dataframe as dd
import numpy as np

ser = pd.Series(np.arange(0, 100))
df = pd.concat([ser] * 10, axis=1)
df.shape  # (100, 10), 10 columns of numbers

df.quantile(0.5)  # 10 rows of identical results
# 0    49.5
# ...
# 9    49.5

ddf_p1 = dd.from_pandas(df, npartitions=1)
ddf_p1.quantile(0.5).compute()  # same result as for Pandas
# 0    49.5
# ...
# 9    49.5

ddf_p2 = dd.from_pandas(df, npartitions=2)
ddf_p2.quantile(0.5).compute()  # approximate result with >1 partition column-wise
# 0    49.0
# ...
# 9    49.0
If we calculate row-wise then I see the same result with 1 or 2 partitions:
(df.quantile(0.5, axis=1) == ddf_p1.quantile(0.5, axis=1).compute()).all() # True
(df.quantile(0.5, axis=1) == ddf_p2.quantile(0.5, axis=1).compute()).all() # True
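The partition effect above can be sketched with a toy merge rule (averaging per-partition medians; this is not Dask's actual algorithm, which combines smarter per-partition summaries, or t-digest): once each partition only sees part of a column, combining the partition-level results cannot in general reproduce the exact global quantile.

```python
import numpy as np

data = np.arange(0, 100)

# Exact column-wise quantile over all the data:
exact = np.quantile(data, 0.5)  # 49.5

# Split into two unequal "partitions" and naively merge their medians.
# Each partition's median is exact for its own slice, but the merged
# value drifts from the true global median:
parts = [data[:30], data[30:]]
merged = np.mean([np.quantile(p, 0.5) for p in parts])  # (14.5 + 64.5) / 2 = 39.5
```

With equal-sized partitions of symmetric data the error can cancel out, which is why the worked example above only drifts slightly (49.0 vs 49.5); uneven partitions make the approximation more visible.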
@jsignell thanks for checking the Pandas docs (and yes, if I was smarter I should have started there!). I've googled through the numpy docs, and the only place I see references to column- and row-wise is in relation to stacking of arrays; everywhere else the axis discussion doesn't mention whether you should interpret that as columns or rows.
Being in agreement with the Pandas docs is definitely the right choice.
20 years in and I still don't know my columns from my rows :-) Apologies for the noise!