make common expression alias human-readable #10333

Merged
merged 9 commits into from
May 9, 2024

MohamedAbdeen21
Contributor

@MohamedAbdeen21 MohamedAbdeen21 commented May 1, 2024

Which issue does this PR close?

Closes #10280.

Thanks @JasonLi-cn for pointing out the relevant code snippet and the detailed description, saved me some time ❤️

Rationale for this change

Making the plans human-readable when common subexpression elimination is applied (check example in ticket and edited tests in this PR).

What changes are included in this PR?

Changing alias_symbol from curr_expr_identifier + sub_expr_identifier to #{curr_expr_identifier}, making it more readable.

Also, since we no longer need the sub_expr for the alias, we can remove the entire visit stack + visit record enum.

Are these changes tested?

Changed existing tests to reflect the change.

Are there any user-facing changes?

Readable plans when CSE is applied

@github-actions github-actions bot added the optimizer Optimizer rules label May 1, 2024
@MohamedAbdeen21 MohamedAbdeen21 marked this pull request as draft May 1, 2024 19:46
@github-actions github-actions bot added sql logical-expr Logical plan and expressions core Core datafusion crate sqllogictest substrait labels May 4, 2024
@github-actions github-actions bot removed sql logical-expr Logical plan and expressions core Core datafusion crate substrait labels May 4, 2024
@MohamedAbdeen21 MohamedAbdeen21 changed the title remove expr node accumulation make common expression alias human-readable May 4, 2024
@MohamedAbdeen21 MohamedAbdeen21 marked this pull request as ready for review May 5, 2024 09:21
@MohamedAbdeen21
Contributor Author

I don't think removing the accumulation and a couple of clones will increase perf, but let's see

/benchmark

github-actions bot commented May 5, 2024

Benchmark results

Benchmarks comparing b412dba (main) and dc5cf09 (PR)
Comparing b412dba and dc5cf09
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  b412dba ┃  dc5cf09 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 285.83ms │ 288.67ms │     no change │
│ QQuery 2     │  41.73ms │  40.62ms │     no change │
│ QQuery 3     │  58.43ms │  59.84ms │     no change │
│ QQuery 4     │  74.13ms │  82.33ms │  1.11x slower │
│ QQuery 5     │ 101.34ms │ 103.10ms │     no change │
│ QQuery 6     │  15.83ms │  15.85ms │     no change │
│ QQuery 7     │ 235.81ms │ 225.12ms │     no change │
│ QQuery 8     │  41.77ms │  40.85ms │     no change │
│ QQuery 9     │ 120.96ms │ 117.94ms │     no change │
│ QQuery 10    │ 110.24ms │ 105.75ms │     no change │
│ QQuery 11    │  76.23ms │  74.64ms │     no change │
│ QQuery 12    │  60.00ms │  59.89ms │     no change │
│ QQuery 13    │ 110.87ms │ 113.69ms │     no change │
│ QQuery 14    │  18.27ms │  18.63ms │     no change │
│ QQuery 15    │  31.41ms │  30.55ms │     no change │
│ QQuery 16    │  45.93ms │  45.72ms │     no change │
│ QQuery 17    │ 149.03ms │ 147.66ms │     no change │
│ QQuery 18    │ 498.69ms │ 528.42ms │  1.06x slower │
│ QQuery 19    │  61.05ms │  60.85ms │     no change │
│ QQuery 20    │ 119.30ms │ 107.69ms │ +1.11x faster │
│ QQuery 21    │ 319.51ms │ 319.68ms │     no change │
│ QQuery 22    │  29.00ms │  28.43ms │     no change │
└──────────────┴──────────┴──────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (b412dba)   │ 2605.35ms │
│ Total Time (dc5cf09)   │ 2615.92ms │
│ Average Time (b412dba) │  118.42ms │
│ Average Time (dc5cf09) │  118.91ms │
│ Queries Faster         │         1 │
│ Queries Slower         │         2 │
│ Queries with No Change │        19 │
└────────────────────────┴───────────┘
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃  b412dba ┃  dc5cf09 ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 428.55ms │ 429.60ms │    no change │
│ QQuery 2     │  56.11ms │  56.12ms │    no change │
│ QQuery 3     │ 143.48ms │ 141.43ms │    no change │
│ QQuery 4     │  86.21ms │  86.00ms │    no change │
│ QQuery 5     │ 200.36ms │ 196.78ms │    no change │
│ QQuery 6     │ 106.66ms │ 105.24ms │    no change │
│ QQuery 7     │ 270.62ms │ 289.56ms │ 1.07x slower │
│ QQuery 8     │ 187.43ms │ 184.49ms │    no change │
│ QQuery 9     │ 285.36ms │ 289.37ms │    no change │
│ QQuery 10    │ 225.58ms │ 230.45ms │    no change │
│ QQuery 11    │  42.72ms │  41.21ms │    no change │
│ QQuery 12    │ 127.18ms │ 127.13ms │    no change │
│ QQuery 13    │ 176.86ms │ 176.04ms │    no change │
│ QQuery 14    │ 124.67ms │ 124.03ms │    no change │
│ QQuery 15    │ 188.81ms │ 183.89ms │    no change │
│ QQuery 16    │  50.40ms │  48.29ms │    no change │
│ QQuery 17    │ 298.97ms │ 302.80ms │    no change │
│ QQuery 18    │ 432.58ms │ 450.10ms │    no change │
│ QQuery 19    │ 226.26ms │ 226.74ms │    no change │
│ QQuery 20    │ 187.81ms │ 184.51ms │    no change │
│ QQuery 21    │ 315.67ms │ 314.31ms │    no change │
│ QQuery 22    │  40.97ms │  39.12ms │    no change │
└──────────────┴──────────┴──────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (b412dba)   │ 4203.25ms │
│ Total Time (dc5cf09)   │ 4227.21ms │
│ Average Time (b412dba) │  191.06ms │
│ Average Time (dc5cf09) │  192.15ms │
│ Queries Faster         │         0 │
│ Queries Slower         │         1 │
│ Queries with No Change │        21 │
└────────────────────────┴───────────┘
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃   b412dba ┃   dc5cf09 ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1     │ 4203.72ms │ 4191.48ms │ no change │
│ QQuery 2     │  487.60ms │  507.13ms │ no change │
│ QQuery 3     │ 1702.79ms │ 1687.60ms │ no change │
│ QQuery 4     │  803.77ms │  818.32ms │ no change │
│ QQuery 5     │ 2192.67ms │ 2146.59ms │ no change │
│ QQuery 6     │ 1038.22ms │ 1008.99ms │ no change │
│ QQuery 7     │ 3504.91ms │ 3500.76ms │ no change │
│ QQuery 8     │ 2448.41ms │ 2428.53ms │ no change │
│ QQuery 9     │ 3999.62ms │ 3898.60ms │ no change │
│ QQuery 10    │ 2486.28ms │ 2513.15ms │ no change │
│ QQuery 11    │  349.37ms │  342.49ms │ no change │
│ QQuery 12    │ 1198.27ms │ 1216.85ms │ no change │
│ QQuery 13    │ 2292.78ms │ 2315.45ms │ no change │
│ QQuery 14    │ 1266.50ms │ 1267.54ms │ no change │
│ QQuery 15    │ 1948.21ms │ 1910.24ms │ no change │
│ QQuery 16    │  496.71ms │  509.15ms │ no change │
│ QQuery 17    │ 5150.71ms │ 5092.54ms │ no change │
│ QQuery 18    │ 6569.11ms │ 6690.56ms │ no change │
│ QQuery 19    │ 2256.56ms │ 2230.59ms │ no change │
│ QQuery 20    │ 2539.69ms │ 2512.55ms │ no change │
│ QQuery 21    │ 4329.77ms │ 4335.05ms │ no change │
│ QQuery 22    │  464.62ms │  475.21ms │ no change │
└──────────────┴───────────┴───────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (b412dba)   │ 51730.28ms │
│ Total Time (dc5cf09)   │ 51599.36ms │
│ Average Time (b412dba) │  2351.38ms │
│ Average Time (dc5cf09) │  2345.43ms │
│ Queries Faster         │          0 │
│ Queries Slower         │          0 │
│ Queries with No Change │         22 │
└────────────────────────┴────────────┘

Contributor

@alamb alamb left a comment

Thank you @MohamedAbdeen21. I went through this code carefully and it seems really nice to me -- not only is it more efficient (less copying of Identifier), the code is also simpler.

I think it is good to go from my perspective given all the existing tests pass 🚀

cc @wiedld as I think you have worked on this code recently and @waynexia and @peter-toth who I think is working to improve this code in terms of copies

datafusion/optimizer/src/common_subexpr_eliminate.rs Outdated Show resolved Hide resolved
@MohamedAbdeen21
Contributor Author

MohamedAbdeen21 commented May 6, 2024

Thanks for the review @alamb.

I tried to implement it DuckDB-style with #0, #1, etc., but as I'm not very familiar with the optimizer, I failed to do so. Maybe someone can help.

The biggest problem I found was that the rule is applied twice, meaning that we need a shared/global mapping of expr -> id between rule calls.

Also, the ordering of the created projections should be incremental (#1 shouldn't appear before #0 in the innermost projection), meaning that aliases should be assigned after the projection is added, which I found difficult to do.
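To make the shared-mapping idea concrete, here is a minimal sketch of an identifier -> numeric alias table that could be reused across rule invocations. `CommonExprAliases` and `alias_for` are hypothetical names for illustration, not DataFusion APIs, and the sketch does not attempt to solve the projection-ordering problem described above.

```rust
use std::collections::HashMap;

/// Hypothetical alias table: maps an expression identifier to a small integer.
/// Reusing one instance across both applications of the rule keeps aliases stable.
#[derive(Default)]
struct CommonExprAliases {
    ids: HashMap<String, usize>,
}

impl CommonExprAliases {
    /// Returns the alias for `identifier`, assigning the next free number
    /// the first time the identifier is seen.
    fn alias_for(&mut self, identifier: &str) -> String {
        let next = self.ids.len();
        let id = *self.ids.entry(identifier.to_string()).or_insert(next);
        format!("#{id}")
    }
}

fn main() {
    let mut aliases = CommonExprAliases::default();

    // First application of the rule sees two common expressions.
    println!("{}", aliases.alias_for("{a + b|{b}|{a}}"));
    println!("{}", aliases.alias_for("{c * d|{d}|{c}}"));

    // Second application sees the first expression again and reuses its alias.
    println!("{}", aliases.alias_for("{a + b|{b}|{a}}"));
}
```

Because numbers are handed out in first-seen order, this alone gives no guarantee that #0 ends up in the innermost projection; that would still need the aliases to be assigned after the projections are built.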

@peter-toth
Contributor

Can we please hold off on merging this PR until we sort out the issues with the rule in #10396?

@alamb
Contributor

alamb commented May 8, 2024

Per the discussion on #10396, we merged that one first, and now this one needs to be rebased / resolved. Marking it as draft until we can do that.

I believe @MohamedAbdeen21 plans to do it this weekend (per #10396 (comment))

@alamb alamb marked this pull request as draft May 8, 2024 17:22
@MohamedAbdeen21 MohamedAbdeen21 marked this pull request as ready for review May 8, 2024 21:10
Contributor

@alamb alamb left a comment

Makes sense (again) to me -- thank you @MohamedAbdeen21 and @peter-toth

@alamb alamb merged commit 2a15614 into apache:main May 9, 2024
23 checks passed
let mut desc = String::new();

while let Some(item) = self.visit_stack.pop() {
fn pop_enter_mark(&mut self) -> Option<usize> {
Contributor

@peter-toth peter-toth May 9, 2024

We shouldn't change this part.
The logic that builds up an identifier using visit_stack and the 3 kinds of VisitRecord is necessary, and is actually a very clever way to build up an identifier from the current node and its sub-identifiers. (Making an identifier a String was not such a clever decision and will be fixed in #10426, but that's a different issue.)

This PR shouldn't change what an identifier is or how it is built up; otherwise we end up with identifier collision bugs again. The IdArray, ExprStats and CommonExprs data structures require an identifier to represent a full expression subtree. This means that:

fn expr_identifier(expr: &Expr) -> Identifier {
    format!("#{{{expr}}}")
}

would cause bugs as shown in 1. of #10396.

I.e. if we encountered both col("a") + col("b") and col("a + b") in the expression list to be CSEd, and we used "{expr}" (the non-unique stringified representation) as the identifier, then the identical identifier ("a + b") of those 2 different expressions would collide: we would count 2 occurrences for one of the 2 expressions (and the other expression's count would be lost), resulting in wrong CSE.

Please note that currently the identifier of col("a") + col("b") is "{a + b|b|a}" so it doesn't collide with col("a + b")'s identifier: "{a + b}".
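The collision can be reproduced outside DataFusion with a toy expression type. The sketch below is an illustration only: `Expr`, `display_identifier` and `subtree_identifier` are stand-ins defined here, not the real DataFusion types or functions, and the subtree format merely imitates the nested "{node|{child}|{child}}" shape seen in the existing test expectations.

```rust
use std::fmt;

/// Toy stand-in for an expression tree (not DataFusion's `Expr`).
enum Expr {
    Column(String),
    Add(Box<Expr>, Box<Expr>),
}

impl fmt::Display for Expr {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Expr::Column(name) => write!(f, "{name}"),
            Expr::Add(left, right) => write!(f, "{left} + {right}"),
        }
    }
}

/// Identifier built only from the stringified expression (the problematic form).
fn display_identifier(expr: &Expr) -> String {
    format!("#{{{expr}}}")
}

/// Identifier built from the current node plus the identifiers of its children,
/// so that it represents the full expression subtree.
fn subtree_identifier(expr: &Expr) -> String {
    match expr {
        Expr::Column(name) => format!("{{{name}}}"),
        Expr::Add(left, right) => format!(
            "{{{expr}|{}|{}}}",
            subtree_identifier(right),
            subtree_identifier(left)
        ),
    }
}

fn main() {
    // col("a") + col("b")
    let sum = Expr::Add(
        Box::new(Expr::Column("a".to_string())),
        Box::new(Expr::Column("b".to_string())),
    );
    // col("a + b"): a single column whose name happens to be "a + b"
    let column = Expr::Column("a + b".to_string());

    println!("{} vs {}", display_identifier(&sum), display_identifier(&column));
    println!("{} vs {}", subtree_identifier(&sum), subtree_identifier(&column));
}
```

In this toy model both expressions get the display-only identifier `#{a + b}`, while the subtree-based identifiers differ (`{a + b|{b}|{a}}` vs `{a + b}`), which is why a map keyed by the display-only form would merge their occurrence counts.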

Again, this is hard to test now because of the resolution bug: #10413.
I.e. if we write a test where we have

select a + b, "a + b" from (
   select 1 as a, 2 as b, 1 as "a + b"
)

then currently it gets resolved as

select "a + b", "a + b" from (
   select 1 as a, 2 as b, 1 as "a + b"
)

and this prevents us from creating a test case for CSE identifier collision.
(Please note that I'm simplifying the identifier collision example, as simple columns (col("a + b")) are not subject to CSE.)

What this PR can do is change the aliases (use something other than identifiers) to make the plans more readable.

Contributor

@peter-toth -- thank you -- I see now the danger of the lack of test coverage in this area.

I am still a little unclear on if the bug you described above is something that can actually be hit in practice (due to lack of a test case), or if it will be masked by #10413

What shall we do now? I can see three possibilities:

  1. Revert this PR and make a follow-on PR that doesn't change the identifier
  2. Create a PR to revert just the ID change
  3. You can fix the ID change in your fix for "Make CommonSubexprEliminate faster by stop copying so many strings" (#10426)

Please let me know your thoughts. I can potentially help with 1 or 2.

cc @MohamedAbdeen21

Contributor

@peter-toth peter-toth May 9, 2024

> I am still a little unclear on if the bug you described above is something that can actually be hit in practice (due to lack of a test case), or if it will be masked by #10413

That's a good question. I think once #10413 gets resolved we can hit the identifier collision issue due to this PR. But actually, let's try to add an identifier collision test case after #10413.

I would suggest option 1: revert the commit and open a new version of the PR in which only the aliases are made simpler.

Contributor

Revert PR is merged: #10436

And I am pretty stoked that @jonahgao is looking at #10413 (comment)

@@ -885,15 +876,15 @@ mod test {
)?;

let expected = vec![
(8, "{(SUM(a + Int32(1)) - AVG(c)) * Int32(2)|{Int32(2)}|{SUM(a + Int32(1)) - AVG(c)|{AVG(c)|{c}}|{SUM(a + Int32(1))|{a + Int32(1)|{Int32(1)}|{a}}}}}"),
(6, "{SUM(a + Int32(1)) - AVG(c)|{AVG(c)|{c}}|{SUM(a + Int32(1))|{a + Int32(1)|{Int32(1)}|{a}}}}"),
(8, "#{(SUM(a + Int32(1)) - AVG(c)) * Int32(2)}"),
Contributor

@peter-toth peter-toth May 9, 2024

IdArray shouldn't change in this PR. Actually, new fields can be added to the tuples but the identifier field should remain as it is.
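One possible reading of this suggestion, as a rough sketch: keep the identifier field untouched and add a separate, display-only alias field to each tuple. The `IdArrayEntry` alias and `find_alias` helper below are hypothetical and only assume the (number, identifier) tuple shape visible in the test expectation above; they are not the actual DataFusion definitions.

```rust
/// Identifier of a full expression subtree; kept exactly as it is built today.
type Identifier = String;

/// Hypothetical widened entry: (series number, identifier, human-readable alias).
type IdArrayEntry = (usize, Identifier, String);

/// Looks entries up by identifier (which stays unique per subtree) and returns
/// the readable alias that would be shown in the plan.
fn find_alias<'a>(id_array: &'a [IdArrayEntry], identifier: &str) -> Option<&'a str> {
    id_array
        .iter()
        .find(|(_, id, _)| id.as_str() == identifier)
        .map(|(_, _, alias)| alias.as_str())
}

fn main() {
    let id_array: Vec<IdArrayEntry> = vec![(
        1,
        "{a + b|{b}|{a}}".to_string(), // identifier: used for counting and lookup
        "#{a + b}".to_string(),        // alias: used only for display
    )];

    println!("{:?}", find_alias(&id_array, "{a + b|{b}|{a}}"));
    println!("{:?}", find_alias(&id_array, "{a + b}"));
}
```

With this split, collisions between readable aliases would affect only how the plan is printed, not how occurrences are counted.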

@MohamedAbdeen21
Contributor Author

At this point I think we'd better revert this, close it and revisit it after #10413. @alamb @peter-toth

@alamb
Contributor

alamb commented May 9, 2024

Sounds like the consensus is to revert this PR -- could you possibly make a revert PR, @MohamedAbdeen21?

MohamedAbdeen21 added a commit to MohamedAbdeen21/arrow-datafusion that referenced this pull request May 9, 2024
MohamedAbdeen21 added a commit to MohamedAbdeen21/arrow-datafusion that referenced this pull request May 9, 2024
@alamb
Contributor

alamb commented May 9, 2024

Revert PR: #10436

alamb pushed a commit that referenced this pull request May 10, 2024
* Revert "make common expression alias human-readable (#10333)"

This reverts commit 2a15614.

* keep some ok parts
Successfully merging this pull request may close these issues.

Make alias_symbol more human-readable
3 participants