
Default "nunique" function vs custom aggregation "nunique" function performance #10589

Closed
frbelotto opened this issue Oct 23, 2023 · 4 comments
Labels
needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. needs triage Needs a response from a contributor

Comments

@frbelotto

frbelotto commented Oct 23, 2023

Hello guys,
I've been using pandas regularly for several projects. Recently I started moving some code from pandas to Dask due to growing data sizes. One of my issues has been the DataFrame "nunique" function, which is currently not available in Dask. Searching a little, I found that I could use a custom aggregation function for it, so I did:

# Custom nunique aggregation, following the dask documentation example
cnunique = dd.Aggregation(
    name="cnunique",
    chunk=lambda s: s.apply(lambda x: list(set(x))),
    agg=lambda s0: s0.obj.groupby(
        level=list(range(s0.obj.index.nlevels)), observed=True
    ).sum(),
    finalize=lambda s1: s1.apply(lambda final: len(set(final))),
)
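To make the three stages concrete, here is a plain-Python sketch (not the dask API; the partitions, keys, and values are made up) of the tree reduction this aggregation performs. The sketch uses sets where the aggregation above uses lists, purely for readability:

```python
# Plain-Python illustration of the chunk/agg/finalize stages of a
# distributed nunique. Partition contents below are invented.
part1 = {"a": [1, 1, 2], "b": [2]}   # values per group key, partition 1
part2 = {"a": [2], "b": [3]}         # values per group key, partition 2

# chunk: per partition, reduce each group's values to its set of distinct values
chunk1 = {k: set(v) for k, v in part1.items()}
chunk2 = {k: set(v) for k, v in part2.items()}

# agg: union the per-partition sets for each group key
agg = {k: chunk1.get(k, set()) | chunk2.get(k, set())
       for k in chunk1.keys() | chunk2.keys()}

# finalize: the distinct count per group is the size of the combined set
result = {k: len(v) for k, v in agg.items()}
print(result)  # both groups end up with 2 distinct values
```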

But after that, I noticed a case where my custom function performs much better than the default series.nunique function, so I am trying to understand what's wrong!

Creating a sample dataframe

from datetime import datetime
import pandas as pd
import dask.dataframe as dd
import numpy as np

num_variables = 1_000_000
rng = np.random.default_rng()

data = pd.DataFrame({
    'id' :  np.random.randint(1,99999,num_variables),
    'date' : [np.random.choice(pd.date_range(datetime(2021,1,1),datetime(2022,12,31))) for i in range(num_variables)],
    'product' : [np.random.choice(['giftcards', 'afiliates']) for i in range(num_variables)],
    'brand' : [np.random.choice(['brand_1', 'brand_2', 'brand_4', 'brand_6', np.nan]) for i in range(num_variables)],
    'gmv' : rng.random(num_variables) * 100,
    'revenue' : rng.random(num_variables) * 100,})

data = data.astype({'product': 'category', 'brand':'category'})
ddf = dd.from_pandas(data, npartitions=5)

Now I try the default "nunique":

dfnunique = ddf.groupby([ddf.date.dt.to_period('M'), 'product','brand'], dropna=False, observed=True)['id'].nunique().reset_index()
df = dfnunique.compute() 
df

It took 0.7 seconds.

Now let's try the custom nunique:

df = ddf.groupby([ddf.date.dt.to_period('M'), 'product','brand'], dropna=False, observed=True).aggregate({'id' : cnunique}).reset_index()
df = df.compute()
df

It took 0.2 seconds!

If you increase the dataframe size, the difference persists!
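For anyone reproducing these numbers, a small stopwatch helper (the name `timed` is just illustrative) around `time.perf_counter` keeps the two measurements comparable outside a notebook:

```python
import time

def timed(label, fn):
    """Run fn(), print its elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result

# Usage, assuming ddf and cnunique are defined as above:
# grouped = ddf.groupby([ddf.date.dt.to_period('M'), 'product', 'brand'],
#                       dropna=False, observed=True)
# timed("default nunique", lambda: grouped['id'].nunique().compute())
# timed("custom nunique", lambda: grouped.aggregate({'id': cnunique}).compute())
```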

Environment:

  • Dask version: 2023.9.3
  • Python version: 3.11.6
  • Operating System: W11
  • Install method (conda, pip, source): PIP
@github-actions github-actions bot added the needs triage Needs a response from a contributor label Oct 23, 2023
@github-actions github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Nov 27, 2023
@phofl
Collaborator

phofl commented Mar 4, 2024

We recently merged a bunch of improvements to nunique; it now runs in 0.2 seconds on my machine, while the custom variant takes 0.3.

It was previously blocking the GIL, which made it pretty slow.

@phofl phofl closed this as completed Mar 4, 2024
@frbelotto
Author

We recently merged a bunch of improvements to nunique, it now runs in 0.2 seconds on my machine while the custom variant takes 0.3

It was previously blocking the GIL, which made it pretty slow

Hello!
I think something else might have changed, and nunique is not working as expected anymore.

teste = base_consumo = gerado.loc[(gerado['produto'] != 'Recargas') & (gerado['data'] >= datetime(2024,1,1))]
teste = teste.groupby(['mci', 'marca'], dropna=False, observed=True)['marca'].count()
teste.compute()


teste = base_consumo = gerado.loc[(gerado['produto'] != 'Recargas') & (gerado['data'] >= datetime(2024,1,1))]
teste = teste.groupby(['mci', 'marca'], dropna=False, observed=True)['marca'].nunique()
teste.compute()

This does not work:

KeyError                                  Traceback (most recent call last)
File c:\Users\fabio\AppData\Local\Programs\Python\Python312\Lib\site-packages\dask\dataframe\utils.py:194, in raise_on_meta_error(funcname, udf)
    193 try:
--> 194     yield
    195 except Exception as e:

File c:\Users\fabio\AppData\Local\Programs\Python\Python312\Lib\site-packages\dask\dataframe\core.py:7174, in _emulate(func, udf, *args, **kwargs)
   7173 with raise_on_meta_error(funcname(func), udf=udf), check_numeric_only_deprecation():
-> 7174     return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))

File c:\Users\fabio\AppData\Local\Programs\Python\Python312\Lib\site-packages\dask\dataframe\groupby.py:781, in _nunique_df_aggregate(df, levels, name, sort)
    780 def _nunique_df_aggregate(df, levels, name, sort=False):
--> 781     return df.groupby(level=levels, sort=sort, observed=True)[name].nunique()

File c:\Users\fabio\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\groupby\generic.py:1951, in DataFrameGroupBy.__getitem__(self, key)
   1947     raise ValueError(
   1948         "Cannot subset columns with a tuple with more than one element. "
   1949         "Use a list instead."
   1950     )
-> 1951 return super().__getitem__(key)

File c:\Users\fabio\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\base.py:244, in SelectionMixin.__getitem__(self, key)
    243 if key not in self.obj:
--> 244     raise KeyError(f"Column not found: {key}")
...
    return super().__getitem__(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\fabio\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\base.py", line 244, in __getitem__
    raise KeyError(f"Column not found: {key}")
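The failing pattern can be reduced to a self-contained sketch with synthetic data (column names mirror the report; values are invented). Plain pandas shows the expected behavior: selecting a grouping column and calling nunique on it is valid, and on the affected Dask version the same pattern via dd.from_pandas reportedly raises the KeyError above:

```python
import pandas as pd

# Synthetic stand-in for the reported data: 'marca' is categorical,
# as in the original groupby with observed=True.
df = pd.DataFrame({
    "mci":   [1, 1, 2, 2, 2],
    "marca": pd.Categorical(["x", "y", "y", "y", "z"]),
})

# Grouping on a column and then selecting that same column is valid pandas;
# within each (mci, marca) group 'marca' is constant, so nunique is 1.
out = df.groupby(["mci", "marca"], dropna=False, observed=True)["marca"].nunique()
print(out)
```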

@phofl
Collaborator

phofl commented Mar 6, 2024

Can you provide a reproducible example that we can copy paste?

@frbelotto
Author

frbelotto commented Mar 7, 2024

Can you provide a reproducible example that we can copy paste?

I managed to isolate the issue; it's related to the Dask version. I've created a new topic for it:

Issue
