Remove unnecessary first aggregation from groupby agg code #951
Conversation
# Fix the column names and the order of them, as this was messed with during the aggregations
df_agg.columns = df_agg.columns.get_level_values(-1)
# SQL does not care about the index, but if group columns were specified we'll want to keep those
df_agg = df_agg.reset_index(drop=(not group_columns))
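A minimal sketch of what the two lines above do, using made-up data (the actual `df_agg` and `group_columns` come from dask-sql's groupby plugin):

```python
import pandas as pd

# Hypothetical stand-in for the frame being aggregated
df = pd.DataFrame({"name": ["a", "a", "b"], "x": [1.0, 2.0, 3.0]})

# An agg with a list of functions produces MultiIndex columns like ("x", "sum")
df_agg = df.groupby("name").agg({"x": ["sum"]})

# Keep only the innermost column level, as in the snippet above
df_agg.columns = df_agg.columns.get_level_values(-1)

# Group columns were specified, so keep the group keys as real columns
df_agg = df_agg.reset_index(drop=False)
print(list(df_agg.columns))  # ['name', 'sum']
```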
In cases where we have multiple groupby columns but are selecting only 1, does it make sense to handle the logic of dropping the rest here, or do we rely on the limit_to projection mechanisms downstream to achieve this?
Yeah, right now we're relying on limit_to to exclude the unnecessary columns; not sure if it's worth it to manually drop these columns after the reset_index.
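To illustrate the behavior under discussion with made-up data (limit_to itself is a dask-sql internal; the final line just mimics the projection it performs):

```python
import pandas as pd

# Hypothetical query shape: GROUP BY id, name, but only `id` and the
# aggregate are selected
df = pd.DataFrame({"id": [1, 1, 2], "name": ["a", "b", "b"], "x": [1.0, 2.0, 3.0]})
df_agg = df.groupby(["id", "name"]).agg({"x": "sum"}).reset_index(drop=False)

# reset_index surfaces BOTH group columns, selected or not
print(list(df_agg.columns))  # ['id', 'name', 'x']

# so a downstream projection (analogous to limit_to) is still needed
# to keep only the selected columns
selected = df_agg[["id", "x"]]
```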
Codecov Report
@@ Coverage Diff @@
## main #951 +/- ##
==========================================
+ Coverage 75.08% 77.51% +2.42%
==========================================
Files 75 75
Lines 4231 4220 -11
Branches 771 767 -4
==========================================
+ Hits 3177 3271 +94
+ Misses 886 775 -111
- Partials 168 174 +6
I tested this on a sample dataset:

df = timeseries(freq="10ms")
c.create_table("df", df, persist=True)
%timeit c.sql("SELECT id, SUM(x) from df GROUP BY id").compute()
%timeit c.sql("SELECT id, SUM(x) from df GROUP BY id, name").compute()

Main vs This PR
Right now, if group columns are specified for a groupby aggregation (e.g. select name, sum(x) from df group by name), we include a first aggregation to grab the relevant row values for each grouping. This shouldn't be necessary, as Dask groupby aggregations already provide this information in the resulting index. This PR removes the first aggregation and retains that information by instead resetting the groupby agg index with reset_index(drop=False) when group columns are specified.
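The idea can be sketched with a toy pandas example (made-up data; the same holds for Dask groupby aggregations):

```python
import pandas as pd

# The group keys already live in the aggregation result's index
df = pd.DataFrame({"name": ["a", "a", "b"], "x": [1.0, 2.0, 3.0]})
agg = df.groupby("name").agg({"x": "sum"})
print(list(agg.index))  # ['a', 'b'] -- the group values, no `first` needed

# reset_index(drop=False) is enough to surface them as ordinary columns
out = agg.reset_index(drop=False)
print(list(out.columns))  # ['name', 'x']
```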