
Default "nunique" function vs custom aggregation "nunique" function performance #10589

Closed
frbelotto opened this issue Oct 23, 2023 · 4 comments
Labels
needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. needs triage Needs a response from a contributor

Comments

@frbelotto

frbelotto commented Oct 23, 2023

Hello guys,
I've been using pandas regularly for several projects. Recently I started moving some code from pandas to Dask due to growing data sizes. One of my issues has been the DataFrame "nunique" function, which is currently not available in Dask. Searching a little, I found that I could use a custom aggregation function for it, so I did:

# Custom nunique aggregation, following the dask documentation example
cnunique = dd.Aggregation(
    name="cnunique",
    chunk=lambda s: s.apply(lambda x: list(set(x))),
    agg=lambda s0: s0.obj.groupby(
        level=list(range(s0.obj.index.nlevels)), observed=True
    ).sum(),
    finalize=lambda s1: s1.apply(lambda final: len(set(final))),
)
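To make the three stages concrete, here is a plain-Python sketch (not the dask API; the partitions, keys, and values are made up) of the tree reduction this aggregation performs. The sketch uses sets where the aggregation above uses lists, purely for readability:

```python
# Plain-Python illustration of the chunk/agg/finalize stages of a
# distributed nunique. Partition contents below are invented.
part1 = {"a": [1, 1, 2], "b": [2]}   # values per group key, partition 1
part2 = {"a": [2], "b": [3]}         # values per group key, partition 2

# chunk: per partition, reduce each group's values to its set of distinct values
chunk1 = {k: set(v) for k, v in part1.items()}
chunk2 = {k: set(v) for k, v in part2.items()}

# agg: union the per-partition sets for each group key
agg = {k: chunk1.get(k, set()) | chunk2.get(k, set())
       for k in chunk1.keys() | chunk2.keys()}

# finalize: the distinct count per group is the size of the combined set
result = {k: len(v) for k, v in agg.items()}
print(result)  # both groups end up with 2 distinct values
```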

But after that, I noticed a case where my custom function performs much better than the default series.nunique function, so I am trying to understand what's wrong!

Creating a sample dataframe

from datetime import datetime
import pandas as pd
import dask.dataframe as dd
import numpy as np

num_variables = 1_000_000
rng = np.random.default_rng()

data = pd.DataFrame({
    'id' :  np.random.randint(1,99999,num_variables),
    'date' : [np.random.choice(pd.date_range(datetime(2021,1,1),datetime(2022,12,31))) for i in range(num_variables)],
    'product' : [np.random.choice(['giftcards', 'afiliates']) for i in range(num_variables)],
    'brand' : [np.random.choice(['brand_1', 'brand_2', 'brand_4', 'brand_6', np.nan]) for i in range(num_variables)],
    'gmv' : rng.random(num_variables) * 100,
    'revenue' : rng.random(num_variables) * 100,})

data = data.astype({'product': 'category', 'brand':'category'})
ddf = dd.from_pandas(data, npartitions=5)

Now I try the default "nunique":

dfnunique = ddf.groupby([ddf.date.dt.to_period('M'), 'product','brand'], dropna=False, observed=True)['id'].nunique().reset_index()
df = dfnunique.compute() 
df

It took 0.7 seconds.

Now let's try the custom nunique:

df = ddf.groupby([ddf.date.dt.to_period('M'), 'product','brand'], dropna=False, observed=True).aggregate({'id' : cnunique}).reset_index()
df = df.compute()
df

It took 0.2 seconds!

If you increase the dataframe size, the difference persists!
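For anyone reproducing these numbers, a small stopwatch helper (the name `timed` is just illustrative) around `time.perf_counter` keeps the two measurements comparable outside a notebook:

```python
import time

def timed(label, fn):
    """Run fn(), print its elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result

# Usage, assuming ddf and cnunique are defined as above:
# grouped = ddf.groupby([ddf.date.dt.to_period('M'), 'product', 'brand'],
#                       dropna=False, observed=True)
# timed("default nunique", lambda: grouped['id'].nunique().compute())
# timed("custom nunique", lambda: grouped.aggregate({'id': cnunique}).compute())
```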

Environment:

  • Dask version: 2023.9.3
  • Python version: 3.11.6
  • Operating System: W11
  • Install method (conda, pip, source): PIP
@github-actions github-actions bot added the needs triage Needs a response from a contributor label Oct 23, 2023
@github-actions github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Nov 27, 2023
@phofl
Collaborator

phofl commented Mar 4, 2024

We recently merged a bunch of improvements to nunique; it now runs in 0.2 seconds on my machine, while the custom variant takes 0.3.

It was previously blocking the GIL, which made it pretty slow.

@phofl phofl closed this as completed Mar 4, 2024
@frbelotto
Author

We recently merged a bunch of improvements to nunique, it now runs in 0.2 seconds on my machine while the custom variant takes 0.3

It was previously blocking the GIL, which made it pretty slow

Hello!
I think something else might have changed, and nunique is not working as expected anymore.

teste = base_consumo = gerado.loc[(gerado['produto'] != 'Recargas') & (gerado['data'] >= datetime(2024,1,1))]
teste = teste.groupby(['mci', 'marca'], dropna=False, observed=True)['marca'].count()
teste.compute()


teste = base_consumo = gerado.loc[(gerado['produto'] != 'Recargas') & (gerado['data'] >= datetime(2024,1,1))]
teste = teste.groupby(['mci', 'marca'], dropna=False, observed=True)['marca'].nunique()
teste.compute()

This does not work:

KeyError                                  Traceback (most recent call last)
File c:\Users\fabio\AppData\Local\Programs\Python\Python312\Lib\site-packages\dask\dataframe\utils.py:194, in raise_on_meta_error(funcname, udf)
    193 try:
--> 194     yield
    195 except Exception as e:

File c:\Users\fabio\AppData\Local\Programs\Python\Python312\Lib\site-packages\dask\dataframe\core.py:7174, in _emulate(func, udf, *args, **kwargs)
   7173 with raise_on_meta_error(funcname(func), udf=udf), check_numeric_only_deprecation():
-> 7174     return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))

File c:\Users\fabio\AppData\Local\Programs\Python\Python312\Lib\site-packages\dask\dataframe\groupby.py:781, in _nunique_df_aggregate(df, levels, name, sort)
    780 def _nunique_df_aggregate(df, levels, name, sort=False):
--> 781     return df.groupby(level=levels, sort=sort, observed=True)[name].nunique()

File c:\Users\fabio\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\groupby\generic.py:1951, in DataFrameGroupBy.__getitem__(self, key)
   1947     raise ValueError(
   1948         "Cannot subset columns with a tuple with more than one element. "
   1949         "Use a list instead."
   1950     )
-> 1951 return super().__getitem__(key)

File c:\Users\fabio\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\base.py:244, in SelectionMixin.__getitem__(self, key)
    243 if key not in self.obj:
--> 244     raise KeyError(f"Column not found: {key}")
...
    return super().__getitem__(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\fabio\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\base.py", line 244, in __getitem__
    raise KeyError(f"Column not found: {key}")
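The failing pattern can be reduced to a self-contained sketch with synthetic data (column names mirror the report; values are invented). Plain pandas shows the expected behavior: selecting a grouping column and calling nunique on it is valid, and on the affected Dask version the same pattern via dd.from_pandas reportedly raises the KeyError above:

```python
import pandas as pd

# Synthetic stand-in for the reported data: 'marca' is categorical,
# as in the original groupby with observed=True.
df = pd.DataFrame({
    "mci":   [1, 1, 2, 2, 2],
    "marca": pd.Categorical(["x", "y", "y", "y", "z"]),
})

# Grouping on a column and then selecting that same column is valid pandas;
# within each (mci, marca) group 'marca' is constant, so nunique is 1.
out = df.groupby(["mci", "marca"], dropna=False, observed=True)["marca"].nunique()
print(out)
```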

@phofl
Collaborator

phofl commented Mar 6, 2024

Can you provide a reproducible example that we can copy paste?

@frbelotto
Author

frbelotto commented Mar 7, 2024

Can you provide a reproducible example that we can copy paste?

I managed to isolate the issue; it's related to the Dask version. I've created a new topic for it:

Issue
