Remove type coercions from ScalarValue and aggregation function code #3705

ozankabak · 2022-10-04T13:34:53Z

Which issue does this PR close?

Improves the situation on #2447.

Rationale for this change

There is an ongoing effort to reduce/eliminate type coercions in places like ScalarValue code, aggregation code and other downstream places (see #2355). The situation we want obtain is to move as many coercions as possible upstream to the extent that we can (see discussions in PR #3570).

This PR improves the current situation by sanitizing ScalarValue code and implementation of aggregation functions to remove the coercions therein. For the former, there is only one coercion left (in try_from_value function), the need for which arises due to an upstream hardcoding (which we will attempt to fix in the near future).

In many cases, this PR makes the code raise a compile-time error when performing "unsafe" operations that require a coercion (and forces the programmer to add the coercion explicitly). For example, one will get a compile time error if one tries to subtract an unsigned value from an uninitialized ScalarValue, or divide such a value to a decimal value -- this wasn't the case before.

What changes are included in this PR?

As we were removing unnecessary coercions from the codebase, we encountered a "silent upcasting" behavior in the sum function, which promotes narrow types to 64-bit types unconditionally:

https://github.com/apache/arrow-datafusion/blob/57312284c082c914dd5e1edaa5c1fe3dbe4f222d/datafusion/physical-expr/src/aggregate/sum.rs#L217-L235

We are not 100% sure why this behavior was there, but removing it does not seem to result in any ill-effects. If anyone can tell us why such coercions are necessary, we can add them back in.

Are there any user-facing changes?

No.

ozankabak · 2022-10-04T13:40:55Z

@alamb, this is one of our promised follow-up PR's to analyze ScalarValue-related coercions in the code and remove the ones we can.

alamb · 2022-10-05T19:55:11Z

As we were removing unnecessary coercions from the codebase, we encountered a "silent upcasting" behavior in the sum function, which promotes narrow types to 64-bit types unconditionally:

We are not 100% sure why this behavior was there, but removing it does not seem to result in any ill-effects. If anyone can tell us why such coercions are necessary, we can add them back in.

This is common when computing sums to avoid overflow, I think -- like the type of SUMing Int8 types can quickly end up as a Int16 or larger.

That being said, the fact there is no test is somewhat concerning

alamb

Thank you very much for this PR. Very nice 🏅 I find it especially impressive that you have returned to help improve the codebase.

The only thing I am slightly worried about is the change to casting in aggregates. I am going to write a quick test to make sure everything is fine

alamb · 2022-10-05T19:57:05Z

datafusion/common/src/scalar.rs

            }
-            _ => impl_common_symmetric_cases_op!(


alamb · 2022-10-05T19:57:31Z

datafusion/common/src/scalar.rs

-            || value_type == DataType::UInt32
-            || value_type == DataType::UInt16
-            || value_type == DataType::UInt8
+        matches!(


alamb · 2022-10-05T21:06:06Z

datafusion/physical-expr/src/aggregate/min_max.rs

-    }};
-
-    ($VALUES:expr, $ARRAYTYPE:ident, $SCALAR:ident, $OP:ident, $TZ:expr) => {{
+    ($VALUES:expr, $ARRAYTYPE:ident, $SCALAR:ident, $OP:ident $(, $EXTRA_ARGS:ident)*) => {{


alamb · 2022-10-05T21:09:27Z

datafusion/physical-expr/src/aggregate/min_max.rs

-    }};
-
-    ($VALUE:expr, $DELTA:expr, $SCALAR:ident, $OP:ident, $TZ:expr) => {{
+    ($VALUE:expr, $DELTA:expr, $SCALAR:ident, $OP:ident $(, $EXTRA_ARGS:ident)*) => {{


these are drive by cleanups to reduce duplication in the macros, right?

(BTW if you submit these as individual free standing PRs you might find the reviews are faster -- finding enough contiguous time to review large PR changes can be challenging at times)

alamb · 2022-10-05T21:11:33Z

datafusion/physical-expr/src/aggregate/sum.rs

-            sum_row!(index, accessor, rhs, f64)
-        }
-        (DataType::Float64, ScalarValue::UInt8(rhs)) => {
+    match s {


I think this is a good change -- to use the type of the input to the type of the accumulator.

alamb

use std::sync::Arc;

use datafusion::arrow::array::{Int8Array};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

/// This example demonstrates what happens summing a small int type
#[tokio::main]
async fn main() -> Result<()> {
    // define data that purposely will overflow an int8
    let array: Int8Array = (1..100).collect();
    let batch = RecordBatch::try_from_iter(vec![
        ("c", Arc::new(array) as _)
    ]).unwrap();

    let ctx = SessionContext::new();
    ctx.register_batch("t", batch).unwrap();

    println!("Count is:");
    ctx.sql("SELECT count(c) from t")
        .await
        .unwrap()
        .show()
        .await
        .unwrap();

    // Can't fit the sum in int8
    println!("Sum is:");
    ctx.sql("SELECT sum(c) from t")
        .await
        .unwrap()
        .show()
        .await
        .unwrap();

    Ok(())
}

It worked on both master and this branch 👍

+------------+
| COUNT(t.c) |
+------------+
| 99         |
+------------+
Sum is:
+----------+
| SUM(t.c) |
+----------+
| 4950     |
+----------+

alamb · 2022-10-05T21:22:35Z

The runtime coercion in ScalarValue is pretty old, it might predate when we had better type coercion support in the optimizer. Maybe @andygrove remembers

ozankabak · 2022-10-05T21:48:37Z

Yes, I did a bunch of drive-by cleanups while removing coercions :) The example you posted works because coercions are already done before we ever call add_to_row. My first thoughts regarding overflow were similar to yours, but then I realized by the time we perform additions we already have everything casted to wider types -- so the coercions I removed really seemed unnecessary.

liukun4515 · 2022-10-06T04:35:39Z

I think the type coercion should be done in the TypeCoercion in the logical phase.
There are some TODO works for agg/buildin function/udf in the issue #3582 (comment)

alamb · 2022-10-06T12:42:27Z

I think the type coercion should be done in the TypeCoercion in the logical phase.

I agree - and I think this PR is a step towards that goal (removes some unneeded logic in the physical phase as it is now done in the logical phase)

alamb · 2022-10-06T12:43:53Z

I will plan to merge this PR later today unless anyone else objects or would like more time to review

alamb · 2022-10-06T19:27:53Z

I merged this PR with master locally and ran the tests to ensure there are no logical conflicts. Looks good.

alamb · 2022-10-06T19:28:00Z

Thanks again @ozankabak

ursabot · 2022-10-06T19:32:30Z

Benchmark runs are scheduled for baseline = 88eadc4 and contender = 8dcef91. 8dcef91 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sanitize ScalarValue and aggregation code from type coercions

205c728

github-actions bot added the physical-expr Physical Expressions label Oct 4, 2022

andygrove self-requested a review October 4, 2022 15:24

Remove forced type cast from sum_row! macro used in SumRowAccumulator

50d18e9

alamb reviewed Oct 5, 2022

View reviewed changes

alamb approved these changes Oct 5, 2022

View reviewed changes

alamb merged commit 8dcef91 into apache:master Oct 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove type coercions from ScalarValue and aggregation function code #3705

Remove type coercions from ScalarValue and aggregation function code #3705

ozankabak commented Oct 4, 2022 •

edited

ozankabak commented Oct 4, 2022

alamb commented Oct 5, 2022

alamb left a comment

alamb Oct 5, 2022

alamb Oct 5, 2022

alamb Oct 5, 2022

alamb Oct 5, 2022

alamb Oct 5, 2022

alamb left a comment

alamb commented Oct 5, 2022

ozankabak commented Oct 5, 2022

liukun4515 commented Oct 6, 2022

alamb commented Oct 6, 2022

alamb commented Oct 6, 2022

alamb commented Oct 6, 2022

alamb commented Oct 6, 2022

ursabot commented Oct 6, 2022

Remove type coercions from ScalarValue and aggregation function code #3705

Remove type coercions from ScalarValue and aggregation function code #3705

Conversation

ozankabak commented Oct 4, 2022 • edited

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

ozankabak commented Oct 4, 2022

alamb commented Oct 5, 2022

alamb left a comment

Choose a reason for hiding this comment

alamb Oct 5, 2022

Choose a reason for hiding this comment

alamb Oct 5, 2022

Choose a reason for hiding this comment

alamb Oct 5, 2022

Choose a reason for hiding this comment

alamb Oct 5, 2022

Choose a reason for hiding this comment

alamb Oct 5, 2022

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

alamb commented Oct 5, 2022

ozankabak commented Oct 5, 2022

liukun4515 commented Oct 6, 2022

alamb commented Oct 6, 2022

alamb commented Oct 6, 2022

alamb commented Oct 6, 2022

alamb commented Oct 6, 2022

ursabot commented Oct 6, 2022

ozankabak commented Oct 4, 2022 •

edited