New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix aggregate type coercion bug #3710
Conversation
3edc1cd
to
b486da7
Compare
@@ -178,16 +178,15 @@ impl Optimizer { | |||
F: FnMut(&LogicalPlan, &dyn OptimizerRule), | |||
{ | |||
let mut new_plan = plan.clone(); | |||
debug!("Input logical plan:\n{}\n", plan.display_indent()); | |||
trace!("Full input logical plan:\n{:?}", plan); | |||
log_plan("Optimizer input", plan); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a driveby cleanup to improve logging (specifically, also add trace!
to log schema)
@@ -147,8 +147,8 @@ order by s_acctbal desc, n_name, s_name, p_partkey;"#; | |||
Inner Join: #supplier.s_nationkey = #nation.n_nationkey | |||
Inner Join: #partsupp.ps_suppkey = #supplier.s_suppkey | |||
Inner Join: #part.p_partkey = #partsupp.ps_partkey | |||
Filter: #part.p_size = Int32(15) AND #part.p_type LIKE Utf8("%BRASS") | |||
TableScan: part projection=[p_partkey, p_mfgr, p_type, p_size], partial_filters=[#part.p_size = Int32(15), #part.p_type LIKE Utf8("%BRASS")] | |||
Filter: CAST(#part.p_size AS Int64) = Int64(15) AND #part.p_type LIKE Utf8("%BRASS") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be honest, I am not sure why this has changed (aka the filters are no longer simplified). I will look into that in the morning
@@ -97,7 +97,18 @@ fn optimize(plan: &LogicalPlan) -> Result<LogicalPlan> { | |||
let new_exprs = plan | |||
.expressions() | |||
.into_iter() | |||
.map(|expr| expr.rewrite(&mut expr_rewriter)) | |||
.map(|expr| { | |||
let original_name = expr.name()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the root cause issue is that the UnwrapCastInComparison
can add potentially change the expression but it doesn't add an alias so the output name changes
Bigger picture there are at least three places that we have rediscovered this same problem when rewriting expressions --#3555 and https://github.com/apache/arrow-datafusion/blob/master/datafusion/optimizer/src/simplify_expressions.rs#L316 I will try and make a follow on PR to clean them all up. In particular, I think this is something from_plan
could potentially handle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expr.name()
here hides the casts that are added by this optimization rule, so expr.name()
is the same as the original name (even though the expression is now different), and the alias does not get added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this optimizer rule may actually change the name (e.g. from Int64(0)
to Int32(0)
) which i think is the root cause of the issue in this bug
fn case_when_aggregate() -> Result<()> { | ||
let sql = "SELECT col_utf8, SUM(CASE WHEN col_int32 > 0 THEN 1 ELSE 0 END) AS n FROM test GROUP BY col_utf8"; | ||
let plan = test_sql(sql)?; | ||
let expected = "Projection: #test.col_utf8, #SUM(CASE WHEN test.col_int32 > Int64(0) THEN Int64(1) ELSE Int64(0) END) AS n\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you merge latest from master - we should not include #
before column names now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in e3830a7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @alamb
f96e318
to
a9d6f8c
Compare
I tested this with the Dask SQL test suite and it fixed 5 queries 🚀 .. more than I expected |
.collect::<Result<Vec<_>>>()?; | ||
|
||
from_plan(plan, new_exprs.as_slice(), new_inputs.as_slice()) | ||
} | ||
|
||
fn name_for_alias(expr: &Expr) -> Result<String> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan to make this easier on the eyes as a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follow on #3727
#[test] | ||
fn case_when() -> Result<()> { | ||
let sql = "SELECT CASE WHEN col_int32 > 0 THEN 1 ELSE 0 END FROM test"; | ||
let plan = test_sql(sql)?; | ||
let expected = | ||
"Projection: CASE WHEN test.col_int32 > Int32(0) THEN Int64(1) ELSE Int64(0) END\ | ||
\n TableScan: test projection=[col_int32]"; | ||
"Projection: CASE WHEN test.col_int32 > Int32(0) THEN Int64(1) ELSE Int64(0) END AS CASE WHEN test.col_int32 > Int64(0) THEN Int64(1) ELSE Int64(0) END\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI @andygrove the alias was added to this as well
Thanks again @alamb. This was a huge improvement. |
I am pretty excited where the datafusion optimizer is headign |
Benchmark runs are scheduled for baseline = 965133c and contender = 64669e9. 64669e9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
DRAFT -- I am just pushing it up in case anyone else finds the debugging via logging helpfulWhich issue does this PR close?
Closes #3704
Rationale for this change
There is a planning regression
What changes are included in this PR?
Fix bug (will comment inline)
Are there any user-facing changes?