Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved Temporal and Decimal Arithmetic #3999

Closed
Tracked by #5753
tustvold opened this issue Apr 1, 2023 · 13 comments · Fixed by #4465
Closed
Tracked by #5753

Improved Temporal and Decimal Arithmetic #3999

tustvold opened this issue Apr 1, 2023 · 13 comments · Fixed by #4465
Assignees
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@tustvold
Copy link
Contributor

tustvold commented Apr 1, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The current support for arithmetic on temporal and decimal types is limited and fairly inconsistent:

Describe the solution you'd like

It is not a fully formed design, but I would like to implement something inspired by the std::ops traits used for Rust arithmetic.

In particular something like

trait Add<Rhs> {
    type Output = T;
    fn add(&self, rhs: &Rhs) -> Result<Self::Output, ArrowError>;

    fn add_checked(&self, rhs: &Rhs) -> Result<Self::Output, ArrowError>;
}

trait Scalar {
    fn data_type(&self) -> &DataType;

    fn as_any(&self) -> &dyn Any;
}

struct PrimitiveScalar<T: ArrowPrimitiveType>{
    value: T::Native,
    data_type: DataType,
}

impl <T: ArrowPrimitiveType> Scalar for PrimitiveScalar<T> {

}

impl <T: ArrowPrimitiveType> Add<PrimitiveArray<T>> for PrimitiveArray<T> {
    ...
}

impl <T: ArrowPrimitiveType> Add<PrimitiveScalar<T>> for PrimitiveArray<T> {
    ...
}

impl <T: ArrowPrimitiveType> Add<PrimitiveScalar<T>> for PrimitiveScalar<T> {
    ...
}

impl Add<dyn Array> for dyn Array {
    ...
}

impl Add<dyn Scalar> for dyn Array {
    ...
}

impl Add<dyn Scalar> for dyn Scalar {
    ...
}

Aside from solving the above problems, it would make for a more idiomatic design that is easier to extend and maintain. We would then deprecate and remove the old kernels.

This would also be a pattern that could be translated to support other kernels that run into similar issues with scalars, e.g. arrow-ord and arrow-string.

Describe alternatives you've considered

We could drop support for arithmetic with statically typed arguments, and add further kernels for performing arithmetic on dyn Scalar

Additional context

@tustvold tustvold added the enhancement Any new improvement worthy of a entry in the changelog label Apr 1, 2023
@tustvold tustvold self-assigned this Apr 1, 2023
@tustvold
Copy link
Contributor Author

tustvold commented Apr 1, 2023

@alamb @viirya I would be interested to hear your thoughts on this, I think trait-based dispatch would be a much cleaner way to handle method overloading than our current approach of increasingly arcane method names, but I'm not sure if I've just been marinating in Rust's generics for too long 😅

@alamb
Copy link
Contributor

alamb commented Apr 2, 2023

Having a kernel like add(array: &dyn Array, scalar: &Scalar) is certainly compelling

I agree traits would likely be nicer than arcane names.

One thing I think would be a step back was if the traits were complicated enough to express things like can't add PrimitiveType to TimestampType -- because I think that would generate very hard to understand compiler error messages.

why use a trait for Scalar rather than an enum?

The seems the main usescases for trait are:

  1. Users can supply their own implementations
  2. Can let the compiler generate specialized code.

Maybe it would more naturally extend to nested types 🤔 but I am not sure how one would add a StructArray ...

enum Scalar {
  Int8(..)
...}

Do we plan to remove all typed kernels?

I wonder if you are thinking about deprecating all the strongly typed kernels in favor of the dyn kernels?

I seems like the dyn kernels would be easier to use for general purpose computing, though would increase binary code size (e.g. anything linking with arrow-rs would get all the arithmetic kernels even if it only used UInt64 for example)

🤔

@tustvold
Copy link
Contributor Author

tustvold commented Apr 2, 2023

Having a kernel like add(array: &dyn Array, scalar: &Scalar) is certainly compelling

The eventual idea would be to not have such a kernel at all, instead you would just call array.add(scalar) and let trait-resolution work out what variant to call based on the types you give it.

One thing I think would be a step back was if the traits were complicated enough to express things like can't add PrimitiveType to TimestampType -- because I think that would generate very hard to understand compiler error messages.

What are your thoughts on these error messages?

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=e5e12eacb2b9041f988afe2012637a74

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=6550efd10b3bf3ec7e7806de71fc6d0e

why use a trait for Scalar rather than an enum

Mainly for consistency with the use of dyn Array. We need a strongly typed PrimitiveScalar or similar, but whether you wrap that up in a trait object or an enum for the dyn dispatch is largely immaterial, you will end up a large match block either way. Using dyn for both might allow you to share more of this though, and would likely be less code in general

Do we plan to remove all typed kernels

The intention is for the traits to be implemented for the concrete types, filling the role of the "typed kernels", whilst also being implemented for the trait objects filling the role of the "dyn kernels". The trait-dispatch machinery in Rust will automatically work out which one is appropriate to call.

@alamb
Copy link
Contributor

alamb commented Apr 2, 2023

The other thing I wanted to mention was that we already have some non trivial changes in flight (e.g ArrayData, etc) and one thing I think we should try to is wrap up outstanding refactorings before we take on another substantial refactor

@tustvold
Copy link
Contributor Author

tustvold commented Apr 2, 2023

The other thing I wanted to mention was that we already have some non trivial changes in flight

100% agree, I intend to have these done in time for the next release at the end of the coming week. I'm queuing this up as the next big ticket item 😄

@alamb
Copy link
Contributor

alamb commented Apr 2, 2023

What are your thoughts on these error messages?

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=e5e12eacb2b9041f988afe2012637a74

= note: expected reference `&PrimitiveArray<DurationType>`
              found reference `&PrimitiveArray<TimestampNanosecondType>`
note: associated function defined here
  --> src/main.rs:5:8
   |
5  |     fn add(&self, rhs: &Rhs) -> Result<Self::Output, String>;

✅ This one seems good

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=6550efd10b3bf3ec7e7806de71fc6d0e

= help: the following other types implement trait `Add<Rhs>`:
             <PrimitiveArray<TimestampNanosecondType> as Add<PrimitiveArray<DurationType>>>
             <PrimitiveArray<TimestampNanosecondType> as Add<PrimitiveArray<IntervalType>>>

This one is also ok given there are only two options and they are probably what the user wants. However, I think in reality there would be so many impls that the ones I cared about (for adding timestamps) would not make it into the actual message (they would be hidden behind ...) and that is 👎 as then I would have to go spelunking into the arrow type system to figure out what was going on.

Maybe this could be handled with sufficiently good and detailed examples and messages.

@tustvold
Copy link
Contributor Author

tustvold commented Apr 2, 2023

would be so many impls that the ones I cared about (for adding timestamps) would not make it into the actual message

It would appear the cut off is 8, which I don't think we will exceed for any type

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=dd81152086cab213a8ff1d3286aaa28c

Even then I think this is still better than the current approach of using method names to define overloads, at least the compiler tries to help you, instead of always forcing you to search the docs for the specific method signature you want?

@alamb
Copy link
Contributor

alamb commented Apr 2, 2023

Even then I think this is still better than the current approach of using method names to define overloads, at least the compiler tries to help you

I am not sure I agree that the compiler's help is better in this case, but that is a personal preference. I certainly don't oppose this proposal on this grounds, but I wanted to explore it

@tustvold
Copy link
Contributor Author

tustvold commented Apr 2, 2023

but that is a personal preference

Perhaps you might articulate what your personal preference would be for?

@alamb
Copy link
Contributor

alamb commented Apr 2, 2023

Perhaps you might articulate what your personal preference would be for?

I guess I would prefer a runtime error with a clear error message that "timestamp + timestamp is not supported"

@tustvold
Copy link
Contributor Author

tustvold commented Apr 2, 2023

Using the dyn variants will yield that behaviour

@viirya
Copy link
Member

viirya commented Apr 2, 2023

I think for a trait approach, the advantage would be a cleaner implementation. It might be easier to maintain for long term. The concerns I have are basically: 1. performance. I'm not sure if such trait-based dispatch could bring performance penalty even it is slight. 2. special needs. Adding arcane method names is because there are special needs for computation (e.g. checked, opt, overflow-allowing, etc). In order to match such requirements, I think one trait is not necessary but it will lead to some special traits for special need. So eventually we might still have arcane trait names 😂 .

@tustvold
Copy link
Contributor Author

tustvold commented Apr 3, 2023

I'm not sure if such trait-based dispatch could bring performance penalty even it is slight

Trait-based dispatch is zero-cost, and behaves the same as normal function dispatch. Trait-object dispatch, i.e. using dyn is not zero-cost, however, it is unavoidable if you don't know the types statically. TLDR it should be no different than currently, and for cases where the static types are known, it will be faster

special needs for computation

Yes, we will need to keep an eye out for this. I'm optimistic we can get by with traits for the type of operation, i.e. Divide, Add, Substract, etc... with checked and unchecked member functions. Divide might need a third for divide by zero => NULL, but that is perfectly fine imo

tustvold added a commit to tustvold/arrow-rs that referenced this issue Jul 4, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Jul 4, 2023
tustvold added a commit that referenced this issue Jul 8, 2023
* Add Datum based arithmetic kernels (#3999)

* Fix benchmark

* Review feedback
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants