Remove unnecessary `first` aggregation from groupby agg code #951

charlesbluca · 2022-12-02T16:32:23Z

Right now, if group columns are specified for a groupby aggregation (e.g. `select name, sum(x) from df group by name`), we include a `first` aggregation to grab the relevant row values for each grouping. This shouldn't be necessary, as Dask groupby aggregations already provide this info in the resulting index.

This PR removes this `first` aggregation and retains the otherwise lost information by moving the groupby agg index back into columns with `reset_index(drop=False)` when group columns are specified.
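To illustrate why the `first` aggregation is redundant, here is a minimal sketch using pandas, whose groupby API Dask mirrors. The toy DataFrame, the `name_copy` helper column, and the output column names are assumptions made for illustration only and are not taken from the dask-sql code.

```python
import pandas as pd

# Toy data (hypothetical, for illustration only)
df = pd.DataFrame({"name": ["a", "b", "a", "b"], "x": [1, 2, 3, 4]})

# Roughly the old idea: carry the group key through an extra `first` aggregation.
# A copy of the key column is used so that it can be aggregated alongside `x`.
with_first = (
    df.assign(name_copy=df["name"])
    .groupby("name")
    .agg(name_first=("name_copy", "first"), x_sum=("x", "sum"))
)

# The new idea: aggregate only `x`; the group keys already live in the index,
# so reset_index(drop=False) turns them back into a regular column.
without_first = df.groupby("name").agg(x_sum=("x", "sum")).reset_index(drop=False)

print(without_first)
```

Both variants end up with the same key values per group, but the second one computes one aggregation fewer.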

ayushdg · 2022-12-02T16:47:17Z

dask_sql/physical/rel/logical/aggregate.py

-        # Fix the column names and the order of them, as this was messed with during the aggregations
-        df_agg.columns = df_agg.columns.get_level_values(-1)
+        # SQL does not care about the index, but if group columns were specified we'll want to keep those
+        df_agg = df_agg.reset_index(drop=(not group_columns))
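The `drop=(not group_columns)` toggle in the changed line can be sketched in pandas as follows. The helper `sum_x`, the `_dummy` placeholder column, and the assumption that a global aggregate is computed by grouping on a constant column are all hypothetical and only serve to show both branches of the toggle.

```python
import pandas as pd

def sum_x(df, group_columns):
    """Hypothetical helper: sum column `x`, optionally grouped."""
    if group_columns:
        grouped = df.groupby(list(group_columns))
    else:
        # Assumed stand-in for a global aggregate: group on a constant column
        grouped = df.assign(_dummy=1).groupby("_dummy")
    agg = grouped.agg(x_sum=("x", "sum"))
    # Keep the index as columns only if it carries real group keys
    return agg.reset_index(drop=(not group_columns))

df = pd.DataFrame({"name": ["a", "b", "a", "b"], "x": [1, 2, 3, 4]})

grouped_result = sum_x(df, ["name"])  # keys are kept as a `name` column
global_result = sum_x(df, [])         # placeholder index is discarded
```

With group columns the keys survive as ordinary columns; without them the index carries no information, so dropping it leaves only the aggregate.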


In cases where we have multiple groupby columns but are selecting only one, does it make sense to handle the logic of dropping the rest here, or do we rely on the `limit_to` projection mechanisms downstream to achieve this?

charlesbluca · 2022-12-02

Yeah, right now we're relying on `limit_to` to exclude the unnecessary columns. Not sure if it's worth it to manually drop these columns after the `reset_index`.
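The situation discussed here can be sketched in pandas: after `reset_index`, every group key comes back as a column, and a later projection keeps only the ones the query selected. The plain column selection below is a stand-in for that downstream projection and is not dask-sql's actual `limit_to` implementation; the data is made up.

```python
import pandas as pd

# Toy data (hypothetical): two group keys, one value column
df = pd.DataFrame(
    {
        "id": [1, 1, 2, 2],
        "name": ["a", "b", "a", "a"],
        "x": [1.0, 2.0, 3.0, 4.0],
    }
)

# Equivalent in spirit to: SELECT id, SUM(x) FROM df GROUP BY id, name
agg = df.groupby(["id", "name"]).agg(x_sum=("x", "sum")).reset_index(drop=False)

# Both keys are present after reset_index ...
all_columns = list(agg.columns)

# ... and a downstream projection (stand-in for limit_to) drops the unused one
selected = agg[["id", "x_sum"]]
```

Dropping `name` manually right after the `reset_index` would give the same final result, so leaving it to the projection step mainly avoids duplicating that logic.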

codecov-commenter · 2022-12-02T21:39:17Z

Codecov Report

Merging #951 (6ba3997) into main (0ca12a1) will increase coverage by 2.42%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #951      +/-   ##
==========================================
+ Coverage   75.08%   77.51%   +2.42%     
==========================================
  Files          75       75              
  Lines        4231     4220      -11     
  Branches      771      767       -4     
==========================================
+ Hits         3177     3271      +94     
+ Misses        886      775     -111     
- Partials      168      174       +6

| Impacted Files | Coverage Δ | |
|---|---|---|
| dask_sql/physical/rel/logical/aggregate.py | `89.50% <100.00%> (-0.61%)` | ⬇️ |
| dask_sql/physical/utils/groupby.py | `80.00% <0.00%> (-20.00%)` | ⬇️ |
| dask_sql/_version.py | `35.31% <0.00%> (+1.41%)` | ⬆️ |
| dask_sql/input_utils/hive.py | `100.00% <0.00%> (+81.74%)` | ⬆️ |

ayushdg · 2022-12-07T15:31:48Z

I tested this on a sample dataset:

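# Setup assumed: Dask's demo timeseries dataset and a dask-sql Context
from dask.datasets import timeseries
from dask_sql import Context
c = Context()
df = timeseries(freq="10ms")
c.create_table("df", df, persist=True)
%timeit c.sql("SELECT id, SUM(x) FROM df GROUP BY id").compute()
%timeit c.sql("SELECT id, SUM(x) FROM df GROUP BY id, name").compute()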

Main vs. this PR:

GROUP BY id: 1.37 s vs. 970 ms
GROUP BY id, name: 9.04 s vs. 6.35 s

Remove first aggregation from groupby agg code

c881c60

ayushdg reviewed Dec 2, 2022

Remove custom null handling (seems work without now?)

a50120c

charlesbluca changed the title ~~[WIP] Remove unnecessary first aggregation from groupby agg code~~ Remove unnecessary first aggregation from groupby agg code Dec 5, 2022

charlesbluca marked this pull request as ready for review December 5, 2022 14:21

charlesbluca requested a review from galipremsagar as a code owner December 5, 2022 14:21

charlesbluca commented Dec 5, 2022

charlesbluca and others added 4 commits December 5, 2022 10:26

Merge branch 'main' into remove-first-agg

47b2a8b

Merge branch 'main' into remove-first-agg

2f9555b

Merge remote-tracking branch 'origin/main' into remove-first-agg

25b5b96

Merge branch 'main' into remove-first-agg

500812e

charlesbluca and others added 4 commits December 7, 2022 13:27

Remove comment referring to fast_groupby

181c813

Merge branch 'main' into remove-first-agg

70d88e9

Merge branch 'main' into remove-first-agg

6ba3997

rerun tests

17d5e3c

ayushdg approved these changes Dec 13, 2022

ayushdg merged commit 6810f1a into dask-contrib:main Dec 13, 2022

charlesbluca deleted the remove-first-agg branch March 19, 2024 16:31
