Add `Scalar` / `Datum` support to compute kernels #1047

alamb · 2021-12-16T14:04:24Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When implementing analytics, users often want to do operations like array + constant or array + array

The Rust implementation of Arrow is strongly typed 🙌 , including the compute kernels, but often data is passed around in Arc<dyn Array> / ArrayRef so the types can be dynamic. This then places the burden on the user of the library to determine the exact types of the inputs in order to call the appropriate kernels

Let's take for example, trying to compare an array to find all elements that contain the number 5 and your input an array may have Int64 or Int32 values

In the current version of arrow, in order to do so you would need to write code like

use arrow::compute::eq_scalar;

fn find_5(array: ArrayRef) -> Result<BooleanArray>{
  match array.data_type() {
    DataType::Int64 => eq_scalar(array.as_any().downcast_ref::<Int64Array>().unwrap(), 5),
    DataType::UInt64 => eq_scalar(array.as_any().downcast_ref::<UInt64Array>().unwrap(), 5),
    ...
  }
}

This ends up being macroized and is non ideal because the user had to do dynamic type dispatch anyways, so there is no runtime performance benefit to strongly typed kernels

Describe the solution you'd like

It would be nice to be able to call a single function and let the rust library dynamically (at runtime) pick the correct kernel to call.

As described and suggested by @jorgecarleitao and @nevi-me in #984 (comment) this could go all the way and take the form of following the C++ Datum model

fn eq(left: Datum, right: Datum) -> BooleanArray {
...
}

Where Datum looks something like

enum Datum {
  Scalar(ScalarValue),
  Array(ArrayRef)
}

And ScalarValue which would look something like ScalarValue from DataFusion https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/scalar.rs or Scalar from arrow2 https://github.com/jorgecarleitao/arrow2/tree/main/src/scalar

Describe alternatives you've considered
The alternative is a continued proliferation of individual kernels such as eq, eq_utf8, etc

Additional context
There is a lot of additional discussion on #984

cc @matthewmturner @Dandandan @jimexist @houqp

The text was updated successfully, but these errors were encountered:

houqp · 2021-12-24T20:28:54Z

I support adding higher level easy to use API for non-performance critical code, but it would be good to still keep the specialized kernels around so users can still leverage them when type info is available at the call site.

alamb · 2021-12-26T13:47:20Z

but it would be good to still keep the specialized kernels around so users can still leverage them when type info is available at the call site.

The reason I was saying the specialized kernels might not be totally useful was, my mind, that the overhead of a type dispatch could be almost entirely amortized -- like if we are comparing 1000 elements in an array, an extra check on the array's type might not be a huge deal

I don't see anything wrong with leaving all the specialized kernels available too (maybe it would help inlining 🤔 )

tustvold · 2022-10-28T00:05:05Z

I believe this overlaps with #2842

tustvold · 2023-06-01T16:57:02Z

So I've been playing around with this and the major challenge is avoiding a huge amount of API churn / boilerplate

Take the signature

add_dyn(a: &dyn Array, b: &dyn Array) -> Result<ArrayRef>

Its not clear how to convert this to a Datum based model. One option would be

add_dyn(a: Datum<'_>, b: Datum<'_>) -> Result<ArrayRef>

Where Datum is something like

enum Datum<'a> {
    Array(&'a dyn Array),
    Scalar(&'a dyn Scalar)
}

But this has a couple of issues

Callsites now have to explicitly wrap their arguments in Datum
There is no way to return a scalar

Making Datum a trait doesn't help here either, because the specialization rules prevent blanked implementations for both T: Scalar and T: Array.

Another option would be to make the methods generic, with impl Into<Datum>, but this also has downsides of

Runs into same blanket impl issues as deriving Datum trait
Kernels are now generic resulting in significant, localised, additional codegen

Taking a step back I had a potentially controversial thought, why not just treat a single element array as a scalar array?

This would have some pretty compelling advantages:

No changes to type signatures necessary
Unary kernels like casting or aggregation just work with no modification
Complete type coverage for no effort

The obvious downside is the representation is not very memory efficient. I think the question boils down to what is the purpose of the scalar representation, is it:

To allow more efficient kernels where one side is known to be a scalar, e.g. scalar comparison, etc...
Provide an efficient type-erased representation for row-oriented operations like grouping
Provide efficient scalar operations

My 2 cents is that 2. is a use-case better served by the row representation, and 3. is beyond the scope of a vectorized execution engine, and therefore 1. is the target for this feature. As such I think this is perfectly acceptable approach. The overheads of the slightly less efficient representation will be more than outweighed by the costs of the dynamic dispatch alone.

What do people think?

alamb · 2023-06-01T17:35:37Z

Taking a step back I had a potentially controversial thought, why not just treat a single element array as a scalar array?

For what it is worth I think this is what DuckDB does (at least this is how I interpret this slide from the 22 - DuckDB Internals (CMU Advanced Databases / Spring 2023) lecture

I think the biggest potential downside of the "use a single row to mean a scalar value" could be that it is confusing as an API. The two input array sizes have to match except when one of them (would it always be the right argument?) had a single row, in which case it would have a special fast path 🤔 But it wouldn't be clear from the call signature if there was a special fast path or not.

Edit: I have another idea I am typing up

alamb · 2023-06-01T17:41:23Z

What if we made a 'new' set of kernels (that just dispatched to the existing ones, if they existed)

This seems like the ideal user experience from my perspective:

let scalar = Int64Scalar::new(42);
let array = Int64Array::from(....);

// Call the "add" kernels in a new module
let result: Datum = arrow::compute::datum::add(&scalar, &array);

Maybe the types could look like this:

enum Datum<'a> {
    Array(&'a dyn Array),
    Scalar(&'a dyn Scalar),
    OwnedArray(Arc<Array>),
    OwnedSclar(Arc<Scalar>),
}

impl Into Datum<..> for Int64Scalar {
..
}


fn add (left: impl Into<Datum>, right: impl Into<Datum>) -> Datum<'static> {
..
}

Perhaps?

tustvold · 2023-06-01T17:59:30Z

dispatched to the existing ones

The existing ones don't support scalars with logical types, that's the whole issue 😅. We have to change the way we support scalars, and I'm leaning towards the one that doesn't involve bifurcating the API...

except when one

There aren't a huge number of binary kernels, so I think we could get pretty good coverage quite quickly. Certainly orders of magnitude faster than porting all of the existing kernels to support some notion of scalar.

Is this not something that could be overcome with a bit of documentation, from the UX side I can see an argument that having to treat scalars differently is actually a worse UX... It has certainly led to a fairly substantial amount of logic in DF...

Into<Datum>

This runs into the blanket impl issue, there isn't a way to provide blanket impls for the concrete types which then forces peculiar things like (&array as &dyn Array)

Edit: done some further experiments and had a great chat with @alamb will write up more tomorrow, I think there is a nicer way to do this

alamb · 2023-06-01T18:19:31Z

This runs into the blanket impl issue, there isn't a way to provide blanket impls for the concrete types which then forces peculiar things like (&array as &dyn Array)

I was imagining implementing impl Into<Datum> explicitly for all the concrete types...

sundy-li · 2023-06-02T01:47:26Z

For what it is worth I think this is what DuckDB does (at least this is how I interpret this slide from the 22 - DuckDB Internals (CMU Advanced Databases / Spring 2023) lecture

If the array is already in Flat format, it needs to construct an extra Selection vector to iterate the array, maybe it's a little more overhead than iterating the array itself.

We used an enum to represent the vector, it's hard to add support for another dictionary vector.

#[derive(Debug, Clone, PartialEq, EnumAsInner)]
pub enum ValueRef<'a, T: ValueType> {
    Scalar(T::ScalarRef<'a>),
    Column(T::Column),
}

The main drawback is code flood, so we use the code generator to help write vectorized methods, the generated file is ~6000 LOC .

* Add Scalar/Datum abstraction (#1047) * Add dyn Array

alamb added the enhancement Any new improvement worthy of a entry in the changelog label Dec 16, 2021

alamb mentioned this issue Dec 16, 2021

Add scalar comparison kernels for DictionaryArray #984

Closed

tustvold self-assigned this May 9, 2023

tustvold mentioned this issue Jun 3, 2023

add min and max kernel #4355

Closed

tustvold added a commit to tustvold/arrow-rs that referenced this issue Jun 9, 2023

Add Scalar/Datum abstraction (apache#1047)

c577e76

tustvold added a commit to tustvold/arrow-rs that referenced this issue Jun 9, 2023

Add Scalar/Datum abstraction (apache#1047)

be967a2

tustvold mentioned this issue Jun 9, 2023

Add Scalar/Datum abstraction (#1047) #4393

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Jun 9, 2023

Add Scalar/Datum abstraction (apache#1047)

04cc386

tustvold mentioned this issue Jul 3, 2023

More frequent major releases for arrow-rs #1120

Closed

xxchan mentioned this issue Jul 3, 2023

expr: avoid repeating the same scalar into an array risingwavelabs/risingwave#9052

Open

tustvold mentioned this issue Jul 3, 2023

Incorrect results returned for TPC-H Query 8 apache/datafusion#6794

Closed

tustvold closed this as completed in #4393 Jul 4, 2023

tustvold added a commit that referenced this issue Jul 4, 2023

Add Scalar/Datum abstraction (#1047) (#4393)

9ee36b2

* Add Scalar/Datum abstraction (#1047) * Add dyn Array

tustvold mentioned this issue Jul 8, 2023

Interval Negation Incorrect #4488

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `Scalar` / `Datum` support to compute kernels #1047

Add `Scalar` / `Datum` support to compute kernels #1047

alamb commented Dec 16, 2021

houqp commented Dec 24, 2021

alamb commented Dec 26, 2021

tustvold commented Oct 28, 2022

tustvold commented Jun 1, 2023 •

edited

alamb commented Jun 1, 2023 •

edited

alamb commented Jun 1, 2023

tustvold commented Jun 1, 2023 •

edited

alamb commented Jun 1, 2023

sundy-li commented Jun 2, 2023 •

edited

Add Scalar / Datum support to compute kernels #1047

Add Scalar / Datum support to compute kernels #1047

Comments

alamb commented Dec 16, 2021

houqp commented Dec 24, 2021

alamb commented Dec 26, 2021

tustvold commented Oct 28, 2022

tustvold commented Jun 1, 2023 • edited

alamb commented Jun 1, 2023 • edited

alamb commented Jun 1, 2023

tustvold commented Jun 1, 2023 • edited

alamb commented Jun 1, 2023

sundy-li commented Jun 2, 2023 • edited

Add `Scalar` / `Datum` support to compute kernels #1047

Add `Scalar` / `Datum` support to compute kernels #1047

tustvold commented Jun 1, 2023 •

edited

alamb commented Jun 1, 2023 •

edited

tustvold commented Jun 1, 2023 •

edited

sundy-li commented Jun 2, 2023 •

edited