Add optimizer rule for type coercion (binary operations only) #3222

andygrove · 2022-08-22T17:37:25Z

Which issue does this PR close?

Part of #3221

There is a follow on PR #3353 to fix a test that is ignored here due to an existing bug that was exposed by changes in this PR.

Rationale for this change

I would like type coercion to happen in the logical plan. I would also like to match the behavior of Postgres and Spark where CAST does not appear in the field names in the schema (and this happens a lot more because of the new type coercion rule).

DataFusion currently does a lot of type coercion in the physical plan (which is unaffected by this change, although some of the code may now be redundant).

See #3031 (comment) for a discussion about adding casts to the logical plan.

What changes are included in this PR?

New optimizer rule
Update expected plans in tests now that CAST is added to query plans
Update expected results in tests now that CAST no longer appears in field names

Are there any user-facing changes?

Yes, optimized logical plans may now include CAST expressions that were not previously there (they were added in the physical plan)
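To make the idea concrete, here is a minimal sketch of coercing the two sides of a binary operation to a common type by inserting casts. It is a toy model, not the code in this PR: `DataType`, `Expr`, `common_type`, `cast_to` and `coerce` are all hypothetical names, and the widening table is an assumption chosen for illustration.

```rust
// Toy model only: these are NOT DataFusion's `Expr` or `DataType`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Float64,
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String, DataType),
    Cast(Box<Expr>, DataType),
    Binary(Box<Expr>, char, Box<Expr>),
}

impl Expr {
    fn data_type(&self) -> DataType {
        match self {
            Expr::Column(_, t) => *t,
            Expr::Cast(_, t) => *t,
            // Toy assumption: arithmetic only, and both sides already share a type.
            Expr::Binary(l, _, _) => l.data_type(),
        }
    }
}

/// Hypothetical coercion table: pick the wider of two numeric types.
fn common_type(l: DataType, r: DataType) -> DataType {
    use DataType::*;
    match (l, r) {
        (Float64, _) | (_, Float64) => Float64,
        (Int64, _) | (_, Int64) => Int64,
        _ => Int32,
    }
}

/// Wrap `e` in a cast only when its type differs from `t`.
fn cast_to(e: Expr, t: DataType) -> Expr {
    if e.data_type() == t {
        e
    } else {
        Expr::Cast(Box::new(e), t)
    }
}

/// Recursively rewrite binary expressions so both operands share one type.
fn coerce(expr: Expr) -> Expr {
    match expr {
        Expr::Binary(l, op, r) => {
            let l = coerce(*l);
            let r = coerce(*r);
            let t = common_type(l.data_type(), r.data_type());
            Expr::Binary(Box::new(cast_to(l, t)), op, Box::new(cast_to(r, t)))
        }
        Expr::Cast(inner, t) => Expr::Cast(Box::new(coerce(*inner)), t),
        other => other,
    }
}
```

In this sketch, coercing `a + b` with `a: Int32` and `b: Int64` yields `CAST(a AS Int64) + b`, which is the kind of cast that now shows up in the optimized logical plan.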

andygrove · 2022-08-22T17:38:15Z

Note that this is related to #3185 from @liukun4515

andygrove · 2022-08-22T18:51:34Z

datafusion/core/tests/sql/mod.rs

@@ -752,29 +752,31 @@ async fn try_execute_to_batches(
 /// Execute query and return results as a Vec of RecordBatches
 async fn execute_to_batches(ctx: &SessionContext, sql: &str) -> Vec<RecordBatch> {


This test method was optimizing the plan twice, so I fixed that.

codecov-commenter · 2022-08-22T19:33:22Z

Codecov Report

Merging #3222 (54bf82f) into master (751cbc8) will increase coverage by 0.00%.
The diff coverage is 95.69%.

@@           Coverage Diff           @@
##           master    #3222   +/-   ##
=======================================
  Coverage   85.57%   85.57%           
=======================================
  Files         295      296    +1     
  Lines       54111    54173   +62     
=======================================
+ Hits        46304    46361   +57     
- Misses       7807     7812    +5

Impacted Files	Coverage Δ
datafusion/core/tests/dataframe_functions.rs	`100.00% <ø> (ø)`
datafusion/core/tests/parquet_pruning.rs	`99.43% <ø> (ø)`
datafusion/core/tests/sql/decimal.rs	`100.00% <ø> (ø)`
datafusion/core/tests/sql/functions.rs	`100.00% <ø> (ø)`
datafusion/core/tests/sql/joins.rs	`99.33% <ø> (ø)`
datafusion/core/tests/sql/parquet.rs	`100.00% <ø> (ø)`
datafusion/core/tests/sql/predicates.rs	`100.00% <ø> (ø)`
datafusion/core/tests/sql/subqueries.rs	`94.24% <ø> (-0.09%)`	⬇️
datafusion/core/tests/sql/timestamp.rs	`99.65% <ø> (ø)`
datafusion/core/tests/sql/window.rs	`95.87% <ø> (ø)`
... and 16 more

andygrove · 2022-08-23T14:34:49Z

@liukun4515 This PR is ready for review now

andygrove · 2022-08-23T17:41:05Z

@liukun4515 This PR is ready for review now

Never mind, I need to have full recursion for the expression rewriting and we seem to be missing some infrastructure for that, or I am just not finding it.

jdye64

I like the idea of having the type coercion in an optimizer since that allows it to be either enabled or disabled on a per dialect basis. Learned a few things from some interesting snippets in there as well. Seems good to me

jdye64 · 2022-08-24T23:25:00Z

datafusion/core/src/execution/context.rs

@@ -1377,6 +1378,8 @@ impl SessionState {
        }
        rules.push(Arc::new(ReduceOuterJoin::new()));
        rules.push(Arc::new(FilterPushDown::new()));
+        // we do type coercion after filter push down so that we don't push CAST filters to Parquet


smart move, that would have been a hard bug to find!

I am confused by this comment. Can you explain why type coercion is done after the filter push down optimizer rule?

I think the type coercion rule should run at an earlier stage.

For example, take the filter expr FLOAT32(C1) < FLOAT64(16). We should do type coercion first, convert the filter expr to CAST(FLOAT32(C1) AS FLOAT64) < FLOAT64(16), and then try to push the new filter expr down to the table scan operation.

If you don't do the type coercion first, you will push the expr FLOAT32(C1) < FLOAT64(16) to the table scan. Can this be applied to the parquet filter or pruning filter?

@liukun4515 This PR is ready for review now

Yes, this is ready for review now.

I filed #3289 for applying TypeCoercion before FilterPushDown. I think the PR would get too large to review if I make those changes here.
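One way to picture the ordering concern is a pruning check that only understands a bare `column < literal` predicate. The sketch below is a toy under that assumption: `Expr` and `can_skip_row_group` are hypothetical and are not the Parquet pruning code in DataFusion.

```rust
// Toy model only: a pruning check that recognises `column < literal`
// and gives up on anything else, including predicates wrapped in a cast.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Cast(Box<Expr>),
    Lt(Box<Expr>, Box<Expr>),
}

/// Returns `Some(true)` when every row in the row group fails the predicate
/// (the column's minimum is already >= the literal), `Some(false)` when the
/// group must be read, and `None` when the predicate shape is not understood.
fn can_skip_row_group(pred: &Expr, column: &str, column_min: i64) -> Option<bool> {
    match pred {
        Expr::Lt(l, r) => match (l.as_ref(), r.as_ref()) {
            (Expr::Column(c), Expr::Literal(v)) if c.as_str() == column => {
                Some(column_min >= *v)
            }
            // e.g. CAST(c1) < 100: this toy pruner cannot reason about it.
            _ => None,
        },
        _ => None,
    }
}
```

With this toy pruner, `c1 < 100` can skip a row group whose minimum is 200, while `CAST(c1) < 100` returns `None` and the group has to be read. That is the kind of lost opportunity the code comment is guarding against, and teaching the pruning logic about casts would remove the need for the ordering constraint.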

jdye64 · 2022-08-24T23:36:01Z

datafusion/expr/src/expr_fn.rs

@@ -259,6 +260,23 @@ pub fn cast(expr: Expr, data_type: DataType) -> Expr {
    }
 }

+/// Create a cast expression
+pub fn cast_if_needed(


Makes sense

jdye64 · 2022-08-25T00:36:10Z

datafusion/optimizer/src/type_coercion.rs

+        let new_expr = plan
+            .expressions()
+            .into_iter()
+            .map(|expr| expr.rewrite(&mut expr_rewrite))


Learned something new here today about the expr rewriter

liukun4515 · 2022-08-25T09:53:30Z

@liukun4515 This PR is ready for review now

Never mind, I need to have full recursion for the expression rewriting and we seem to be missing some infrastructure for that, or I am just not finding it.

@andygrove is it ready to review?

liukun4515 · 2022-08-25T09:56:22Z

benchmarks/src/bin/tpch.rs

    }
-    let physical_plan = ctx.create_physical_plan(&plan).await?;
+    // note that `create_physical_plan` will optimize the plan so we pass the unoptimized plan


Is there an error or a different result for your plan if it is optimized by some rules more than once?
I also saw this in my previous PR.

For example:
For A+B, we will get the coercion type C. After the first optimization, we will get CAST(A AS C) + CAST(B AS C).
After the second optimization, we may get the coercion type D from CAST(A AS C) + CAST(B AS C).

The original code was running the optimizer twice, which was not necessary. With the new rule there was a problem optimizing twice. I will look at this again today and write up an issue or make it safe to optimize twice.

I reverted this change
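The concern about running the rule more than once can be illustrated with a small sketch. It assumes a helper that only adds a cast when the expression does not already have the target type; `toy_cast_if_needed` and `always_cast` are hypothetical names for this illustration and do not claim to match the signature of the helper added in this PR.

```rust
// Toy model only: not DataFusion's `Expr`, `DataType`, or helper functions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int32,
    Int64,
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String, DataType),
    Cast(Box<Expr>, DataType),
}

impl Expr {
    fn data_type(&self) -> DataType {
        match self {
            Expr::Column(_, t) | Expr::Cast(_, t) => *t,
        }
    }
}

/// Naive helper: wraps unconditionally, so every pass adds another cast.
fn always_cast(expr: Expr, to: DataType) -> Expr {
    Expr::Cast(Box::new(expr), to)
}

/// Hypothetical helper: a no-op when the expression already has type `to`,
/// which makes repeated application idempotent.
fn toy_cast_if_needed(expr: Expr, to: DataType) -> Expr {
    if expr.data_type() == to {
        expr
    } else {
        Expr::Cast(Box::new(expr), to)
    }
}
```

Applying `toy_cast_if_needed` twice to an `Int32` column with target `Int64` gives the same expression as applying it once, whereas `always_cast` produces a nested `CAST(CAST(..))`. This only covers the redundant-cast case; it does not address a second pass picking a different coercion type.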

liukun4515 · 2022-08-25T10:16:38Z

datafusion/optimizer/src/type_coercion.rs

+                let left_type = left.get_type(&self.schema)?;
+                let right_type = right.get_type(&self.schema)?;
+                match right_type {
+                    DataType::Interval(_) => {


Can you explain why we skip this datatype like left op Interval? I can't get the point.
If we leave a lot of special code, it's difficult to maintain them.

Removing this causes one test failure:

---- sql::timestamp::timestamp_array_add_interval stdout ---- thread 'sql::timestamp::timestamp_array_add_interval' panicked at 'called `Result::unwrap()` on an `Err` value: "Internal(\"Unsupported CAST from Interval(DayTime) to Timestamp(Nanosecond, None)\") at Creating physical plan for 'SELECT ts, ts - INTERVAL '8' MILLISECONDS FROM table_a': Projection: #table_a.ts, #table_a.ts - CAST(IntervalDayTime(\"8\") AS Timestamp(Nanosecond, None))\n TableScan: table_a projection=[ts]"', datafusion/core/tests/sql/mod.rs:773:10

~~Arrow does not support CAST from Interval(DayTime) to Timestamp(Nanosecond, None). I think this could be added so I filed apache/arrow-rs#2606. Once this is implemented, we can remove this code.~~

I agree it is strange -- maybe it is worth a ticket to investigate further (or maybe @waitingkuo is already tracking it)

FWIW I think this special case is no longer necessary -- I tried removing it in #3379 and all the tests still pass.

Good spot @liukun4515

Once this PR is merged, I'll get #3379 ready for review

andygrove · 2022-09-03T18:00:57Z

@liukun4515 @alamb @jdye64 This PR is finally ready for review.

alamb · 2022-09-04T13:59:14Z

I plan to review this tomorrow

andygrove · 2022-09-04T15:17:26Z

datafusion/core/tests/sql/explain_analyze.rs

@@ -694,7 +694,7 @@ async fn test_physical_plan_display_indent() {
        "            RepartitionExec: partitioning=Hash([Column { name: \"c1\", index: 0 }], 9000)",
        "              AggregateExec: mode=Partial, gby=[c1@0 as c1], aggr=[MAX(aggregate_test_100.c12), MIN(aggregate_test_100.c12)]",
        "                CoalesceBatchesExec: target_batch_size=4096",
-        "                  FilterExec: c12@1 < CAST(10 AS Float64)",
+        "                  FilterExec: c12@1 < 10",


This is the physical plan, which no longer contains a cast here because the logical plan optimized out the cast of a literal value.

🎉 -- which I think is a good example of the value of this pass
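A rough sketch of why the cast disappears: once the cast is visible in the logical plan, a constant-folding step can evaluate a cast of a literal at plan time. The types and the `fold_literal_casts` function below are hypothetical and only model a cast to Float64; they are not DataFusion's simplifier.

```rust
// Toy model only: folds `CAST(<literal> AS Float64)` into a Float64 literal.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Int64Lit(i64),
    Float64Lit(f64),
    CastToFloat64(Box<Expr>),
    Lt(Box<Expr>, Box<Expr>),
}

fn fold_literal_casts(expr: Expr) -> Expr {
    match expr {
        Expr::CastToFloat64(inner) => match fold_literal_casts(*inner) {
            // A cast of a literal can be evaluated once, at plan time.
            Expr::Int64Lit(v) => Expr::Float64Lit(v as f64),
            Expr::Float64Lit(v) => Expr::Float64Lit(v),
            // Anything else (e.g. a column) still needs a runtime cast.
            other => Expr::CastToFloat64(Box::new(other)),
        },
        Expr::Lt(l, r) => Expr::Lt(
            Box::new(fold_literal_casts(*l)),
            Box::new(fold_literal_casts(*r)),
        ),
        other => other,
    }
}
```

In this model `c12 < CAST(10 AS Float64)` becomes `c12 < 10.0`, so no cast is left for the physical plan to evaluate per batch.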

liukun4515 · 2022-09-05T01:59:20Z

@liukun4515 @alamb @jdye64 This PR is finally ready for review.

I will review it today, but I missed the PR for issue #3330

alamb

Looks good to me -- I think the only thing that needs to be fixed prior to merge is related to the test for timestamp being ignored. Otherwise I think this looks great. Thank you @andygrove

alamb · 2022-09-05T13:20:24Z

datafusion/core/src/execution/context.rs

@@ -1401,6 +1402,9 @@ impl SessionState {
        }
        rules.push(Arc::new(ReduceOuterJoin::new()));
        rules.push(Arc::new(FilterPushDown::new()));
+        // we do type coercion after filter push down so that we don't push CAST filters to Parquet


I have a partially written ticket (I will post later this week) related to supporting CAST in pruning logic (which is part of what is pushed to parquet). Perhaps this is also related

alamb · 2022-09-05T13:22:26Z

datafusion/core/tests/sql/explain_analyze.rs

@@ -694,7 +694,7 @@ async fn test_physical_plan_display_indent() {
        "            RepartitionExec: partitioning=Hash([Column { name: \"c1\", index: 0 }], 9000)",
        "              AggregateExec: mode=Partial, gby=[c1@0 as c1], aggr=[MAX(aggregate_test_100.c12), MIN(aggregate_test_100.c12)]",
        "                CoalesceBatchesExec: target_batch_size=4096",
-        "                  FilterExec: c12@1 < CAST(10 AS Float64)",
+        "                  FilterExec: c12@1 < 10",


🎉 -- which I think is a good example of the value of this pass

alamb · 2022-09-05T13:23:16Z

datafusion/core/tests/sql/joins.rs

@@ -1438,9 +1438,9 @@ async fn reduce_left_join_1() -> Result<()> {
        "Explain [plan_type:Utf8, plan:Utf8]",
        "  Projection: #t1.t1_id, #t1.t1_name, #t1.t1_int, #t2.t2_id, #t2.t2_name, #t2.t2_int [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N, t2_id:UInt32;N, t2_name:Utf8;N, t2_int:UInt32;N]",
        "    Inner Join: #t1.t1_id = #t2.t2_id [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N, t2_id:UInt32;N, t2_name:Utf8;N, t2_int:UInt32;N]",
-        "      Filter: #t1.t1_id < Int64(100) [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",
+        "      Filter: CAST(#t1.t1_id AS Int64) < Int64(100) [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",


I think it is much clearer that the casts are now visible in the explain plan (as it is clear what is going on)

alamb · 2022-09-05T13:25:11Z

datafusion/core/tests/sql/timestamp.rs

@@ -1398,6 +1398,7 @@ async fn timestamp_sub_interval_days() -> Result<()> {
 }

 #[tokio::test]
+#[ignore] // https://github.com/apache/arrow-datafusion/issues/3327


Should this test be ignored? Maybe this is a merge conflict -- I think @HaoYang670 has fixed this test in #3337 so it no longer needs to be ignored

Thanks for catching that. Yes, this must have been some kind of merge conflict. I have unignored this.

alamb · 2022-09-05T13:26:27Z

datafusion/expr/src/expr.rs

@@ -1101,6 +1102,20 @@ mod test {
        Ok(())
    }

+    #[test]


alamb · 2022-09-05T13:26:51Z

datafusion/physical-expr/src/planner.rs

@@ -47,7 +47,13 @@ pub fn create_physical_expr(
    input_schema: &Schema,
    execution_props: &ExecutionProps,
 ) -> Result<Arc<dyn PhysicalExpr>> {
-    assert_eq!(input_schema.fields.len(), input_dfschema.fields().len());
+    if input_schema.fields.len() != input_dfschema.fields().len() {


alamb · 2022-09-05T13:27:47Z

datafusion/physical-expr/src/planner.rs

+        return Err(DataFusionError::Internal(
+            "create_physical_expr passed Arrow schema and DataFusion \
+            schema with different number of fields"
+                .to_string(),
+        ));


Suggested change

-        return Err(DataFusionError::Internal(
-            "create_physical_expr passed Arrow schema and DataFusion \
-            schema with different number of fields"
-                .to_string(),
-        ));
+        return Err(DataFusionError::Internal(
+            format!("create_physical_expr passed Arrow schema and DataFusion \
+            schema with different number of fields, {} vs {}",
+            input_schema.fields.len(), input_dfschema.fields().len()
+            ),
+        ));
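The pattern in this suggestion, returning an error that carries both counts instead of panicking in an `assert_eq!`, can be sketched in isolation. `PlanError` and `check_field_counts` are hypothetical stand-ins for illustration, not DataFusion's `DataFusionError` or `create_physical_expr`.

```rust
// Toy model only: a stand-in error type and a standalone check.
#[derive(Debug, PartialEq)]
enum PlanError {
    Internal(String),
}

/// Report a schema mismatch as a recoverable error that includes both
/// field counts, rather than aborting the process with an assertion.
fn check_field_counts(arrow_fields: usize, df_fields: usize) -> Result<(), PlanError> {
    if arrow_fields != df_fields {
        return Err(PlanError::Internal(format!(
            "Arrow schema and DataFusion schema have different number of fields, {} vs {}",
            arrow_fields, df_fields
        )));
    }
    Ok(())
}
```

Including the two counts in the message means a caller who hits the error can see which side is short without attaching a debugger.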

alamb · 2022-09-05T13:29:26Z

datafusion/optimizer/src/type_coercion.rs

+                let left_type = left.get_type(&self.schema)?;
+                let right_type = right.get_type(&self.schema)?;
+                match right_type {
+                    DataType::Interval(_) => {


I agree it is strange -- maybe it is worth a ticket to investigate further (or maybe @waitingkuo is already tracking it)

alamb · 2022-09-05T13:31:18Z

datafusion/optimizer/src/type_coercion.rs

+                    }
+                    _ => {
+                        let coerced_type = coerce_types(&left_type, op, &right_type)?;
+                        let left = left.clone().cast_to(&coerced_type, &self.schema)?;


I think it is a minor point and could be done in a follow on PR, but since this function gets an owned expr it might be possible to match expr rather than match &expr and save these clones
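The difference between matching on a borrowed and an owned expression can be shown with a small sketch. The `Expr` type and both functions are hypothetical; they only illustrate the general Rust pattern, not the rewriter in this PR.

```rust
// Toy model only: compares rewriting through a reference with rewriting
// an owned value.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Cast(Box<Expr>),
    Add(Box<Expr>, Box<Expr>),
}

/// Matching on `&Expr` only gives borrowed children, so building the new
/// node requires cloning each subtree.
fn wrap_by_ref(expr: &Expr) -> Expr {
    match expr {
        Expr::Add(l, r) => Expr::Add(
            Box::new(Expr::Cast(l.clone())),
            Box::new(Expr::Cast(r.clone())),
        ),
        other => other.clone(),
    }
}

/// Matching on the owned `Expr` moves the boxed children out of the old
/// node, so they can be reused without any clone.
fn wrap_owned(expr: Expr) -> Expr {
    match expr {
        Expr::Add(l, r) => Expr::Add(Box::new(Expr::Cast(l)), Box::new(Expr::Cast(r))),
        other => other,
    }
}
```

Both functions build the same tree; the owned version just avoids deep-copying the operands, which matters more as expressions get larger.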

alamb

LGTM

I filed #3377 to track pruning expressions with casts that can't be removed at plan time

alamb · 2022-09-06T11:12:44Z

In order to help this PR along, I took the liberty of merging from master and then resolved a logical test conflict introduced with #3359 in 54bf82f

alamb · 2022-09-06T11:13:09Z

Once this PR passes CI, I plan to merge it in (and I will make a follow on PR with my suggested improvements)

alamb · 2022-09-06T11:19:28Z

datafusion/core/tests/sql/expr.rs

@@ -247,8 +247,8 @@ async fn query_not() -> Result<()> {
 async fn csv_query_sum_cast() {
    let ctx = SessionContext::new();
    register_aggregate_csv_by_sql(&ctx).await;
-    // c8 = i32; c9 = i64
-    let sql = "SELECT c8 + c9 FROM aggregate_test_100";
+    // c8 = i32; c6 = i64


I made this change due to the fact that #3359 changed the type of c9 so it was no longer i64 but u64

liukun4515 · 2022-09-06T11:22:50Z

datafusion/core/tests/dataframe_functions.rs

-        "| 100                     |",
-        "+-------------------------+",
+        "+--------+",
+        "| test.b |",


Isn't the original header better?
@alamb @andygrove

I personally don't think seeing the cast in the column name adds much value. Also no cast in the subject is consistent with postgres:

alamb=# select cast(1 as int);
 int4
------
    1
(1 row)

alamb=# select cast(i as int) from foo;
 i
---
 1
 2
 0
(3 rows)
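The naming behaviour being discussed, where a cast around a column does not change the output field name, can be sketched as follows. `Expr` and `output_name` are hypothetical and only model casts over columns; the real naming logic covers many more expression kinds.

```rust
// Toy model only: derive an output field name that ignores casts.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column { relation: String, name: String },
    Cast(Box<Expr>, String),
}

/// A cast is transparent for naming: the field keeps the name of the
/// expression being cast.
fn output_name(expr: &Expr) -> String {
    match expr {
        Expr::Column { relation, name } => format!("{}.{}", relation, name),
        Expr::Cast(inner, _target_type) => output_name(inner),
    }
}
```

Under this model, `CAST(test.b AS Int32)` is reported as `test.b`, matching the new header in the test output above.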

liukun4515

LGTM.
I also have comments about the header, but we can address them in a follow-up PR or issue.

ursabot · 2022-09-06T12:22:26Z

Benchmark runs are scheduled for baseline = 9b546e7 and contender = 191d8b7. 191d8b7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

andygrove · 2022-09-06T13:55:17Z

Thank you for the review @alamb and @liukun4515

andygrove changed the title ~~Add optimizer rule for type coercion~~ WIP: Add optimizer rule for type coercion Aug 22, 2022

github-actions bot added core Core datafusion crate logical-expr Logical plan and expressions optimizer Optimizer rules labels Aug 22, 2022

andygrove changed the title ~~WIP: Add optimizer rule for type coercion~~ Add optimizer rule for type coercion (binary operations only) Aug 22, 2022

andygrove marked this pull request as ready for review August 22, 2022 18:50

andygrove commented Aug 22, 2022

andygrove requested review from yjshen, liukun4515 and tustvold August 22, 2022 18:51

andygrove mentioned this pull request Aug 22, 2022

Support "IS TRUE/FALSE" syntax #3189

Closed

andygrove marked this pull request as draft August 22, 2022 22:10

andygrove marked this pull request as ready for review August 23, 2022 14:33

sarahyurick mentioned this pull request Aug 23, 2022

Support "IS TRUE/FALSE" syntax (redo) #3235

Merged

andygrove marked this pull request as draft August 23, 2022 17:40

github-actions bot added the physical-expr Physical Expressions label Aug 23, 2022

andygrove mentioned this pull request Aug 24, 2022

Add type checking to logical plan for IsDistinctFrom / IsNotDistinctFrom expressions #3247

Open

andygrove marked this pull request as ready for review August 24, 2022 14:53

This was referenced Aug 24, 2022

[DF] Remaining DataFusion / sqlparser work for the GA release dask-contrib/dask-sql#706

Closed

Add type coercion for UDFs in logical plan #3254

Merged

Add SQL planner support for Like, ILike and SimilarTo, with optional escape character #3101

Merged

jdye64 approved these changes Aug 25, 2022

liukun4515 reviewed Aug 25, 2022

github-actions bot removed the physical-expr Physical Expressions label Aug 25, 2022

Add binary type coercion to logical plan and do not allow CAST to change an expression name

f5fbe25

andygrove force-pushed the binary-type-coercion branch from ec5d32a to f5fbe25 Compare September 3, 2022 17:40

andygrove marked this pull request as ready for review September 3, 2022 17:42

andygrove mentioned this pull request Sep 3, 2022

Fix csv_query_approx_count once binary type coercion PR is merged #3353

Open

fix tests

0178435

update avro tests

01f12d9

andygrove mentioned this pull request Sep 3, 2022

add SQL support for tinyint and all unsigned INTs #3309

Closed

andygrove added 2 commits September 4, 2022 09:08

Merge remote-tracking branch 'apache/master' into binary-type-coercion

4b09cb8

add reference to GitHub issue

d14d0e5

andygrove commented Sep 4, 2022

alamb approved these changes Sep 5, 2022

andygrove and others added 2 commits September 5, 2022 09:47

unignore timestamp_add_interval_months

be3ec2e

Merge remote-tracking branch 'apache/master' into binary-type-coercion

30c2057

alamb approved these changes Sep 6, 2022

fix: update tests to use correct column types

54bf82f

alamb reviewed Sep 6, 2022

liukun4515 reviewed Sep 6, 2022

alamb mentioned this pull request Sep 6, 2022

Minor: Improvements to type coercion rule #3379

Merged

liukun4515 approved these changes Sep 6, 2022

alamb merged commit 191d8b7 into apache:master Sep 6, 2022

andygrove deleted the binary-type-coercion branch September 6, 2022 13:55

andygrove mentioned this pull request Sep 7, 2022

Add TypeCoercion optimizer rule #3221

Closed

alamb mentioned this pull request Sep 8, 2022

Support casting expressions in pruning logic #3377

Closed
