Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove unnecessary first aggregation from groupby agg code #951

Merged
merged 10 commits into from
Dec 13, 2022

Conversation

charlesbluca
Copy link
Collaborator

Right now, if group columns are specified for a groupby aggregation (e.g. select name, sum(x) from df group by name), we include a first aggregation to grab the relevant row values for each grouping. This shouldn't be necessary, as Dask groupby aggregations already provide this info in the resulting index.

This PR removes this first aggregation and retains the lost information by instead dropping the groupby agg index with reset_index(drop=False) when group columns are specified.

# Fix the column names and the order of them, as this was messed with during the aggregations
df_agg.columns = df_agg.columns.get_level_values(-1)
# SQL does not care about the index, but if group columns were specified we'll want to keep those
df_agg = df_agg.reset_index(drop=(not group_columns))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In cases where we have multiple groupby columns but are selecting only 1 does it make sense to handle that logic of dropping the rest here or do we rely on the limit_to projection mechanisms downstream to achieve this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah right now we're relying on limit_to to exclude the unnecessary columns, not sure if it's worth it to manually drop these columns after the reset_index

@codecov-commenter
Copy link

codecov-commenter commented Dec 2, 2022

Codecov Report

Merging #951 (6ba3997) into main (0ca12a1) will increase coverage by 2.42%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #951      +/-   ##
==========================================
+ Coverage   75.08%   77.51%   +2.42%     
==========================================
  Files          75       75              
  Lines        4231     4220      -11     
  Branches      771      767       -4     
==========================================
+ Hits         3177     3271      +94     
+ Misses        886      775     -111     
- Partials      168      174       +6     
Impacted Files Coverage Δ
dask_sql/physical/rel/logical/aggregate.py 89.50% <100.00%> (-0.61%) ⬇️
dask_sql/physical/utils/groupby.py 80.00% <0.00%> (-20.00%) ⬇️
dask_sql/_version.py 35.31% <0.00%> (+1.41%) ⬆️
dask_sql/input_utils/hive.py 100.00% <0.00%> (+81.74%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@charlesbluca charlesbluca changed the title [WIP] Remove unnecessary first aggregation from groupby agg code Remove unnecessary first aggregation from groupby agg code Dec 5, 2022
@charlesbluca charlesbluca marked this pull request as ready for review December 5, 2022 14:21
@ayushdg
Copy link
Collaborator

ayushdg commented Dec 7, 2022

I tested this on a sample dataset:

df = timeseries(freq="10ms")
c.create_table("df",df, persist=True)
 %timeit c.sql("SELECT id, SUM(x) from df GROUP BY id").compute()
%timeit c.sql("SELECT id, SUM(x) from df GROUP BY id, name").compute()

Main vs This PR

1.37s vs 970ms
9.04s vs 6.35s

@ayushdg ayushdg merged commit 6810f1a into dask-contrib:main Dec 13, 2022
@charlesbluca charlesbluca deleted the remove-first-agg branch March 19, 2024 16:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants