Default "nunique" function vs custom aggregation "nunique" function performance #10589
We recently merged a bunch of improvements to nunique; it now runs in 0.2 seconds on my machine, while the custom variant takes 0.3. It was previously holding the GIL, which made it pretty slow.
Hello!
It does not work.
Can you provide a reproducible example that we can copy-paste?
I was able to isolate the issue. It's related to the Dask version. I've created a new topic for it.
Hello guys,
I've been using Pandas regularly for quite a few projects. Recently I've started moving some code from Pandas to Dask due to growing data sizes. One of my issues has been the DataFrame "nunique" function, which is currently not available in Dask. Searching a bit, I found that I could use a custom aggregation function instead, so I did:
```python
# Custom nunique function, following the example in the Dask documentation
cnunique = dd.Aggregation(
    name="cnunique",
    chunk=lambda s: s.apply(lambda x: list(set(x))),
    agg=lambda s0: s0.obj.groupby(
        level=list(range(s0.obj.index.nlevels)), observed=True
    ).sum(),
    finalize=lambda s1: s1.apply(lambda final: len(set(final))),
)
```

But after that, I noticed a case where my custom function performs much better than the default series.nunique function. So I am trying to understand what's wrong!
Creating a sample dataframe
Now I try with the default "nunique":
It took 0.7 seconds.
Now let's try the custom nunique:
It took 0.2 seconds!
If you increase the dataframe size, the difference persists!