Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace DecimalArray with PrimitiveArray #2637

Closed
tustvold opened this issue Sep 2, 2022 · 17 comments · Fixed by #2945
Closed

Replace DecimalArray with PrimitiveArray #2637

tustvold opened this issue Sep 2, 2022 · 17 comments · Fixed by #2945
Labels
arrow Changes to the arrow crate parquet Changes to the parquet crate question Further information is requested

Comments

@tustvold
Copy link
Contributor

tustvold commented Sep 2, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Following #2477 we now have Decimal128Type and Decimal256Type. I think it should be possible to make these impl ArrowPrimitiveType and consequently remove the need for DecimalArray entirely. Any logic for checked conversion (#2387) could be located and implemented in Decimal

This would provide a quick and easy way to increase the kernel support for decimals, as we would effectively get implementations "for free".

Describe the solution you'd like

I would like to treat decimals the same way we treat other constant width types

Describe alternatives you've considered

We could not do this, it was just an idea that occurred to me and may help to reduce the second-class citizen status that decimals currently have.

Additional context

Was noticed whilst working on #2635

FYI @liukun4515 @HaoYang670

@tustvold tustvold added the question Further information is requested label Sep 2, 2022
@liukun4515
Copy link
Contributor

liukun4515 commented Sep 5, 2022

@tustvold Do you means that you want to implement the ArrowPrimitiveType for Decimal128Type?
I go through the code,

/// Trait bridging the dynamic-typed nature of Arrow (via [`DataType`]) with the
/// static-typed nature of rust types ([`ArrowNativeType`]) for all types that implement [`ArrowNativeType`].
pub trait ArrowPrimitiveType: 'static {
    /// Corresponding Rust native type for the primitive type.
    type Native: ArrowNativeType;

    /// the corresponding Arrow data type of this primitive type.
    const DATA_TYPE: DataType;

    /// Returns the byte width of this primitive type.
    fn get_byte_width() -> usize {
        std::mem::size_of::<Self::Native>()
    }

    /// Returns a default value of this primitive type.
    ///
    /// This is useful for aggregate array ops like `sum()`, `mean()`.
    fn default_value() -> Self::Native {
        Default::default()
    }
}

But what's the const DATA_TYPE: DataType; for Decimal128Type?, just use the default decimal128 type

const DEFAULT_TYPE: DataType =
        DataType::Decimal128(DECIMAL128_MAX_PRECISION, DECIMAL_DEFAULT_SCALE);

@tustvold
Copy link
Contributor Author

tustvold commented Sep 5, 2022

Something like that, or we remove it. I suspect there will be some churn to making it work

@HaoYang670
Copy link
Contributor

HaoYang670 commented Sep 5, 2022

In arrow2, Decimal128 is based on PrimitiveArray<i128>, maybe we could learn from it. (I am not 100% sure, but in arrow2 Decimal seems like a logical type (a combination of PrimitiveArray and DataType::Decimal).)
jorgecarleitao/arrow2#1194

@liukun4515
Copy link
Contributor

In arrow2, Decimal128 is based on PrimitiveArray<i128>, maybe we could learn from it. (I am not 100% sure, but in arrow2 Decimal seems like a logical type (a combination of PrimitiveArray and DataType::Decimal).) jorgecarleitao/arrow2#1194

use i128 to represent the decimal128 is good way which is used in current kernel computing.
i128 can be compared and do arithmetic op easily, but it's not a generic way if we need to support decimal256.

Do you have any thoughts about this @viirya ? I know you implement the most feature about decimal256 and refactor some part of decimal.

@HaoYang670
Copy link
Contributor

HaoYang670 commented Sep 9, 2022

but it's not a generic way if we need to support decimal256.

What I find in arrow2 is that it defines its own i256: https://github.com/jorgecarleitao/arrow2/pull/1194/files#diff-ace1f1ec74768840fb47ce3992630c335ee710890f0b2898397731c055144774R507

Not sure if it is a good way, but it give us some inspiration.

@tustvold
Copy link
Contributor Author

tustvold commented Sep 9, 2022

Imo it would make sense to implement the relevant arithmetic, etc... traits for Decimal<DecimalType>, potentially using i128 under the covers, and then implement ArrowPrimitiveType for DecimalType with ArrowPrimitiveType::Native being Decimal<DecimalType>

Edit: oh wait that won't work as Decimal has additional fields... Darn...

@tustvold
Copy link
Contributor Author

tustvold commented Sep 24, 2022

I had a play around implementing this and it appears to be very tractable. What I did is as follows:

  • Define struct i256([u8; 32]) and implement ArrowNativeType for it
  • Implement ArrowPrimitiveType for Decimal128Type and Decimal256Type with Native of i128 and i256 respectively
  • Replace DecimalArray and DecimalBuilder with PrimitiveArray and PrimitiveBuilder
  • Add specialized impl PrimitiveArray<T: DecimalType> adding with_precision_and_scale
  • Add specialized impl PrimitiveBuilder<T: DecimalType> adding with_precision_and_scale
  • Add ArrayData::validate_decimal_values which validates values are within bounds of decimal precision
  • Remove old decimal logic, including specialized take, comparison kernels, etc...
  • Add public free function to format signed little endian byte arrays with a given precision and scale
  • Implement ArrowNativeOps for i128 and i256
  • Casting precision becomes purely a metadata operation, casting scale can use the normal arithmetic operations

This appears to work quite well, and I'm fairly happy this will achieve the stated aims of massively simplifying the decimal implementation, making it more consistent with other types such as timestamps, and provide an easy path forward to getting full kernel coverage.

There are, however, a couple of major implications of this change:

  • The values returned from PrimitiveArray::value will be the storage type, and not a semantic Decimal type. This is the same as timestamps, times, durations, etc... and I think is therefore fine, but is worth highlighting
  • Precision will not be validated by default, instead only via explicit calls to validate_decimal_values Decimal Precision Validation #2387. This makes the builders, FromIterator etc... consistently infallible (currently this is inconsistent)
  • Overflow for checked kernels will be at the boundary of the storage type, i.e. i128 or i256 and not based on precision, users who wish to catch precision-based overflow will need to use the checked calls and then make an explicit call to validate the result
  • Arithmetic, comparison, etc... across arrays with different scale will be non-trivial (although it would be complicated regardless)

If people are happy with this path forward, I will polish up this approach over the coming week, but I'm loathe to work on this further if the consensus isn't there. Even better would be if someone is willing to help out with this effort 😄?

Thoughts @alamb @liukun4515 @HaoYang670 @kmitchener @viirya?

@alamb
Copy link
Contributor

alamb commented Sep 25, 2022

Thank you @tustvold -- from my perspective it sounds like a good plan and I would be willing to help update datafusion to use this new approach.

That being said, however, I don't feel my vote I should have a huge bearing in the matter as I don't use the decimal type in our projects. I would be very interested to hear @liukun4515 and @viirya 's opinions on the matter

@liukun4515
Copy link
Contributor

I had a play around implementing this and it appears to be very tractable. What I did is as follows:

  • Define struct i256([u8; 32]) and implement ArrowNativeType for it

In order to do unified processing, I suggest that use the [u8;16] to present the i128 like the i256.

We can implement the operation like comparation, arithmetic operation based on the u8 array.

  • Implement ArrowPrimitiveType for Decimal128Type and Decimal256Type with Native of i128 and i256 respectively

Using the i128([u8;16]) and i128([u8;32]) as the Native is good point, My concern is the const DATA_TYPE: DataType for this.

The decimal data type is infinite and can't be enumerated. We can't enumerate all decimal type like time stamp with limited time unit.

How to determine the Data type for this ArrowPrimitiveType?

This is key point for me.

  • Replace DecimalArray and DecimalBuilder with PrimitiveArray and PrimitiveBuilder
  • Add specialized impl PrimitiveArray<T: DecimalType> adding with_precision_and_scale
  • Add specialized impl PrimitiveBuilder<T: DecimalType> adding with_precision_and_scale
  • Add ArrayData::validate_decimal_values which validates values are within bounds of decimal precision
  • Remove old decimal logic, including specialized take, comparison kernels, etc...
  • Add public free function to format signed little endian byte arrays with a given precision and scale
  • Implement ArrowNativeOps for i128 and i256
  • Casting precision becomes purely a metadata operation, casting scale can use the normal arithmetic operations

This appears to work quite well, and I'm fairly happy this will achieve the stated aims of massively simplifying the decimal implementation, making it more consistent with other types such as timestamps, and provide an easy path forward to getting full kernel coverage.

There are, however, a couple of major implications of this change:

  • The values returned from PrimitiveArray::value will be the storage type, and not a semantic Decimal type. This is the same as timestamps, times, durations, etc... and I think is therefore fine, but is worth highlighting
  • Precision will not be validated by default, instead only via explicit calls to validate_decimal_values Decimal Precision Validation #2387. This makes the builders, FromIterator etc... consistently infallible (currently this is inconsistent)
  • Overflow for checked kernels will be at the boundary of the storage type, i.e. i128 or i256 and not based on precision, users who wish to catch precision-based overflow will need to use the checked calls and then make an explicit call to validate the result
  • Arithmetic, comparison, etc... across arrays with different scale will be non-trivial (although it would be complicated regardless)

I think we don't need to consider operation with different scale.
We should add checker for this with the assumptions.

If the native for decimal has its data type(precision, scale), is this can be resolved?

If people are happy with this path forward, I will polish up this approach over the coming week, but I'm loathe to work on this further if the consensus isn't there. Even better would be if someone is willing to help out with this effort 😄?

Thoughts @alamb @liukun4515 @HaoYang670 @kmitchener @viirya?

I personally still want to implement decimal data type alone and don't combine it with all primitive data type.
But this thought is also what I have thought before, I basically agree with your proposal with some comments and questions.

@tustvold

@HaoYang670
Copy link
Contributor

Thank you, @tustvold, for your perseverance on the decimal implementation.

Personally, I prefer the idea of implementing our own i256 type, which makes the definition of decimal128 and decimal256 more consistent.

And, I agree with implementing Decimal as one of ArrowPrimitiveType (It could be even better if Decimal could be ArrowNumericType). Because DecimalArray has the same physic layout as other primitive arrays (one validate buffer + one value buffer + constant value length).

@HaoYang670
Copy link
Contributor

The decimal data type is infinite and can't be enumerated. We can't enumerate all decimal type like time stamp with limited time unit.

Hi @liukun4515, why do you concern the countability of Decimal? How does it impact the implementation?

@viirya
Copy link
Member

viirya commented Sep 26, 2022

Conceptually, I think DecimalArray should be treated as PrimitiveArray. That being said, it sounds like a good idea to me. In C++ Arrow, Decimal128Array and Decimal256Array both are PrimitiveArray too. So such a change also makes a more consistent Arrow implementation. Another big benefit is, this can simplify decimal kernels and from user perspective there is more consistent APIs including kernels.

Overall I'd vote for this direction. As we will use decimal types, kernels, etc. in our project, I will be sure to help on if we decide to move with this direction. So that being said, even we don't go for this change, I have been planning to complement decimal support.

The concern from my side is, if any, significant API change. That will be many breaking cases. And seems we need to hold on the effort to add decimal related kernels into this crate (there seems a PR doing that).

I think that this is good plan for long term, although in the short term, there will be pain from changing.

@kmitchener
Copy link

I don't know enough to express an opinion about the implementation details, but I'm happy to see that it seems there's agreement on a path forward toward a better decimal implementation.

tustvold added a commit to tustvold/arrow-rs that referenced this issue Sep 26, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Sep 26, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Sep 26, 2022
@tustvold
Copy link
Contributor Author

First part of this is in #2781 PTAL

I think this also serves to highlight why I don't wish to try to make decimal256 and decimal128 use the same generic implementation for arithmetic, specifically the performance of BigInt is ~25x worse than using fixed width native types such as i128, as it performs allocations and branching. I would instead propose only paying this cost for the rather esoteric decimal256, and leaving the door open to potentially optimise this in future.

tustvold added a commit that referenced this issue Oct 3, 2022
* Add i256 (#2637)

* RAT

* Fix doc comments

* Store as parts

* Custom multiply implementation

* Make from_parts public

* Document mulx

* Remove branch from to_i128

* Clippy
tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 13, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 13, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 14, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 14, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 14, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 14, 2022
tustvold added a commit that referenced this issue Oct 14, 2022
* Decimal cleanup (#2637)

* Format
tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 14, 2022
tustvold added a commit that referenced this issue Oct 14, 2022
* Don't validate decimal precision in ArrayData (#2637)

* Format
@tustvold
Copy link
Contributor Author

I think this is now complete, there is still work to flesh out support for decimals in the arithmetic, comparison, etc... kernels, but that can be tracked separately

tustvold added a commit that referenced this issue Oct 15, 2022
* Filter DecimalArray as PrimitiveArray (#2637)

* Add decimal filter benches

* Format
@alamb alamb added the arrow Changes to the arrow crate label Oct 28, 2022
@alamb
Copy link
Contributor

alamb commented Oct 28, 2022

label_issue.py automatically added labels {'arrow'} from #2818

@alamb
Copy link
Contributor

alamb commented Oct 28, 2022

label_issue.py automatically added labels {'parquet'} from #2857

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate parquet Changes to the parquet crate question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants