Replace DecimalArray with PrimitiveArray #2637

tustvold · 2022-09-02T13:47:47Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Following #2477 we now have Decimal128Type and Decimal256Type. I think it should be possible to make these impl ArrowPrimitiveType and consequently remove the need for DecimalArray entirely. Any logic for checked conversion (#2387) could be located and implemented in Decimal

This would provide a quick and easy way to increase the kernel support for decimals, as we would effectively get implementations "for free".

Describe the solution you'd like

I would like to treat decimals the same way we treat other constant width types

Describe alternatives you've considered

We could not do this, it was just an idea that occurred to me and may help to reduce the second-class citizen status that decimals currently have.

Additional context

Was noticed whilst working on #2635

FYI @liukun4515 @HaoYang670

The text was updated successfully, but these errors were encountered:

liukun4515 · 2022-09-05T02:30:05Z

@tustvold Do you means that you want to implement the ArrowPrimitiveType for Decimal128Type?
I go through the code,

/// Trait bridging the dynamic-typed nature of Arrow (via [`DataType`]) with the
/// static-typed nature of rust types ([`ArrowNativeType`]) for all types that implement [`ArrowNativeType`].
pub trait ArrowPrimitiveType: 'static {
    /// Corresponding Rust native type for the primitive type.
    type Native: ArrowNativeType;

    /// the corresponding Arrow data type of this primitive type.
    const DATA_TYPE: DataType;

    /// Returns the byte width of this primitive type.
    fn get_byte_width() -> usize {
        std::mem::size_of::<Self::Native>()
    }

    /// Returns a default value of this primitive type.
    ///
    /// This is useful for aggregate array ops like `sum()`, `mean()`.
    fn default_value() -> Self::Native {
        Default::default()
    }
}

But what's the const DATA_TYPE: DataType; for Decimal128Type?, just use the default decimal128 type

const DEFAULT_TYPE: DataType =
        DataType::Decimal128(DECIMAL128_MAX_PRECISION, DECIMAL_DEFAULT_SCALE);

tustvold · 2022-09-05T08:51:36Z

Something like that, or we remove it. I suspect there will be some churn to making it work

HaoYang670 · 2022-09-05T09:46:01Z

In arrow2, Decimal128 is based on PrimitiveArray<i128>, maybe we could learn from it. (I am not 100% sure, but in arrow2 Decimal seems like a logical type (a combination of PrimitiveArray and DataType::Decimal).)
jorgecarleitao/arrow2#1194

liukun4515 · 2022-09-09T01:52:30Z

In arrow2, Decimal128 is based on PrimitiveArray<i128>, maybe we could learn from it. (I am not 100% sure, but in arrow2 Decimal seems like a logical type (a combination of PrimitiveArray and DataType::Decimal).) jorgecarleitao/arrow2#1194

use i128 to represent the decimal128 is good way which is used in current kernel computing.
i128 can be compared and do arithmetic op easily, but it's not a generic way if we need to support decimal256.

Do you have any thoughts about this @viirya ? I know you implement the most feature about decimal256 and refactor some part of decimal.

HaoYang670 · 2022-09-09T02:00:56Z

but it's not a generic way if we need to support decimal256.

What I find in arrow2 is that it defines its own i256: https://github.com/jorgecarleitao/arrow2/pull/1194/files#diff-ace1f1ec74768840fb47ce3992630c335ee710890f0b2898397731c055144774R507

Not sure if it is a good way, but it give us some inspiration.

tustvold · 2022-09-09T02:55:50Z

Imo it would make sense to implement the relevant arithmetic, etc... traits for Decimal<DecimalType>, potentially using i128 under the covers, and then implement ArrowPrimitiveType for DecimalType with ArrowPrimitiveType::Native being Decimal<DecimalType>

Edit: oh wait that won't work as Decimal has additional fields... Darn...

tustvold · 2022-09-24T11:55:53Z

I had a play around implementing this and it appears to be very tractable. What I did is as follows:

Define struct i256([u8; 32]) and implement ArrowNativeType for it
Implement ArrowPrimitiveType for Decimal128Type and Decimal256Type with Native of i128 and i256 respectively
Replace DecimalArray and DecimalBuilder with PrimitiveArray and PrimitiveBuilder
Add specialized impl PrimitiveArray<T: DecimalType> adding with_precision_and_scale
Add specialized impl PrimitiveBuilder<T: DecimalType> adding with_precision_and_scale
Add ArrayData::validate_decimal_values which validates values are within bounds of decimal precision
Remove old decimal logic, including specialized take, comparison kernels, etc...
Add public free function to format signed little endian byte arrays with a given precision and scale
Implement ArrowNativeOps for i128 and i256
Casting precision becomes purely a metadata operation, casting scale can use the normal arithmetic operations

This appears to work quite well, and I'm fairly happy this will achieve the stated aims of massively simplifying the decimal implementation, making it more consistent with other types such as timestamps, and provide an easy path forward to getting full kernel coverage.

There are, however, a couple of major implications of this change:

The values returned from PrimitiveArray::value will be the storage type, and not a semantic Decimal type. This is the same as timestamps, times, durations, etc... and I think is therefore fine, but is worth highlighting
Precision will not be validated by default, instead only via explicit calls to validate_decimal_values Decimal Precision Validation #2387. This makes the builders, FromIterator etc... consistently infallible (currently this is inconsistent)
Overflow for checked kernels will be at the boundary of the storage type, i.e. i128 or i256 and not based on precision, users who wish to catch precision-based overflow will need to use the checked calls and then make an explicit call to validate the result
Arithmetic, comparison, etc... across arrays with different scale will be non-trivial (although it would be complicated regardless)

If people are happy with this path forward, I will polish up this approach over the coming week, but I'm loathe to work on this further if the consensus isn't there. Even better would be if someone is willing to help out with this effort 😄?

Thoughts @alamb @liukun4515 @HaoYang670 @kmitchener @viirya?

alamb · 2022-09-25T12:05:34Z

Thank you @tustvold -- from my perspective it sounds like a good plan and I would be willing to help update datafusion to use this new approach.

That being said, however, I don't feel my vote I should have a huge bearing in the matter as I don't use the decimal type in our projects. I would be very interested to hear @liukun4515 and @viirya 's opinions on the matter

liukun4515 · 2022-09-26T00:26:28Z

I had a play around implementing this and it appears to be very tractable. What I did is as follows:

Define struct i256([u8; 32]) and implement ArrowNativeType for it

In order to do unified processing, I suggest that use the [u8;16] to present the i128 like the i256.

We can implement the operation like comparation, arithmetic operation based on the u8 array.

Implement ArrowPrimitiveType for Decimal128Type and Decimal256Type with Native of i128 and i256 respectively

Using the i128([u8;16]) and i128([u8;32]) as the Native is good point, My concern is the const DATA_TYPE: DataType for this.

The decimal data type is infinite and can't be enumerated. We can't enumerate all decimal type like time stamp with limited time unit.

How to determine the Data type for this ArrowPrimitiveType?

This is key point for me.

Replace DecimalArray and DecimalBuilder with PrimitiveArray and PrimitiveBuilder

Add specialized impl PrimitiveArray<T: DecimalType> adding with_precision_and_scale

Add specialized impl PrimitiveBuilder<T: DecimalType> adding with_precision_and_scale

Add ArrayData::validate_decimal_values which validates values are within bounds of decimal precision

Remove old decimal logic, including specialized take, comparison kernels, etc...

Add public free function to format signed little endian byte arrays with a given precision and scale

Implement ArrowNativeOps for i128 and i256

Casting precision becomes purely a metadata operation, casting scale can use the normal arithmetic operations

This appears to work quite well, and I'm fairly happy this will achieve the stated aims of massively simplifying the decimal implementation, making it more consistent with other types such as timestamps, and provide an easy path forward to getting full kernel coverage.

There are, however, a couple of major implications of this change:

The values returned from PrimitiveArray::value will be the storage type, and not a semantic Decimal type. This is the same as timestamps, times, durations, etc... and I think is therefore fine, but is worth highlighting

Precision will not be validated by default, instead only via explicit calls to validate_decimal_values Decimal Precision Validation #2387. This makes the builders, FromIterator etc... consistently infallible (currently this is inconsistent)

Overflow for checked kernels will be at the boundary of the storage type, i.e. i128 or i256 and not based on precision, users who wish to catch precision-based overflow will need to use the checked calls and then make an explicit call to validate the result

Arithmetic, comparison, etc... across arrays with different scale will be non-trivial (although it would be complicated regardless)

I think we don't need to consider operation with different scale.
We should add checker for this with the assumptions.

If the native for decimal has its data type(precision, scale), is this can be resolved？

If people are happy with this path forward, I will polish up this approach over the coming week, but I'm loathe to work on this further if the consensus isn't there. Even better would be if someone is willing to help out with this effort 😄?

Thoughts @alamb @liukun4515 @HaoYang670 @kmitchener @viirya?

I personally still want to implement decimal data type alone and don't combine it with all primitive data type.
But this thought is also what I have thought before, I basically agree with your proposal with some comments and questions.

@tustvold

HaoYang670 · 2022-09-26T01:45:21Z

Thank you, @tustvold, for your perseverance on the decimal implementation.

Personally, I prefer the idea of implementing our own i256 type, which makes the definition of decimal128 and decimal256 more consistent.

And, I agree with implementing Decimal as one of ArrowPrimitiveType (It could be even better if Decimal could be ArrowNumericType). Because DecimalArray has the same physic layout as other primitive arrays (one validate buffer + one value buffer + constant value length).

HaoYang670 · 2022-09-26T01:52:52Z

The decimal data type is infinite and can't be enumerated. We can't enumerate all decimal type like time stamp with limited time unit.

Hi @liukun4515, why do you concern the countability of Decimal? How does it impact the implementation?

viirya · 2022-09-26T05:48:08Z

Conceptually, I think DecimalArray should be treated as PrimitiveArray. That being said, it sounds like a good idea to me. In C++ Arrow, Decimal128Array and Decimal256Array both are PrimitiveArray too. So such a change also makes a more consistent Arrow implementation. Another big benefit is, this can simplify decimal kernels and from user perspective there is more consistent APIs including kernels.

Overall I'd vote for this direction. As we will use decimal types, kernels, etc. in our project, I will be sure to help on if we decide to move with this direction. So that being said, even we don't go for this change, I have been planning to complement decimal support.

The concern from my side is, if any, significant API change. That will be many breaking cases. And seems we need to hold on the effort to add decimal related kernels into this crate (there seems a PR doing that).

I think that this is good plan for long term, although in the short term, there will be pain from changing.

kmitchener · 2022-09-26T12:35:32Z

I don't know enough to express an opinion about the implementation details, but I'm happy to see that it seems there's agreement on a path forward toward a better decimal implementation.

tustvold · 2022-09-26T19:18:31Z

First part of this is in #2781 PTAL

I think this also serves to highlight why I don't wish to try to make decimal256 and decimal128 use the same generic implementation for arithmetic, specifically the performance of BigInt is ~25x worse than using fixed width native types such as i128, as it performs allocations and branching. I would instead propose only paying this cost for the rather esoteric decimal256, and leaving the door open to potentially optimise this in future.

* Add i256 (#2637) * RAT * Fix doc comments * Store as parts * Custom multiply implementation * Make from_parts public * Document mulx * Remove branch from to_i128 * Clippy

…he#2637)

… (#2833)

* Decimal cleanup (#2637) * Format

* Don't validate decimal precision in ArrayData (#2637) * Format

tustvold · 2022-10-15T02:02:55Z

I think this is now complete, there is still work to flesh out support for decimals in the arithmetic, comparison, etc... kernels, but that can be tracked separately

* Filter DecimalArray as PrimitiveArray (#2637) * Add decimal filter benches * Format

alamb · 2022-10-28T13:07:22Z

label_issue.py automatically added labels {'arrow'} from #2818

alamb · 2022-10-28T13:07:23Z

label_issue.py automatically added labels {'parquet'} from #2857

tustvold added the question Further information is requested label Sep 2, 2022

liukun4515 mentioned this issue Sep 17, 2022

Dividing decimal type gives wrong error: "170141183460469231731687303715884105727 is too large to store in a Decimal128 apache/arrow-datafusion#3498

Closed

This was referenced Sep 23, 2022

add arithmetic and comparison kernels from DF + small modifications #2770

Closed

Remove ArrowNativeType::FromStr #2775

Merged

tustvold mentioned this issue Sep 24, 2022

Decimal Precision Validation #2387

Closed

tustvold added a commit to tustvold/arrow-rs that referenced this issue Sep 26, 2022

Add i256 (apache#2637)

5d445b2

tustvold added a commit to tustvold/arrow-rs that referenced this issue Sep 26, 2022

Add i256 (apache#2637)

093055c

tustvold added a commit to tustvold/arrow-rs that referenced this issue Sep 26, 2022

Add i256 (apache#2637)

db74e05

tustvold mentioned this issue Sep 26, 2022

Add i256 (#2637) #2781

Merged

tustvold mentioned this issue Sep 29, 2022

optimize: no validation for decimal array for cast/take #2551

Closed

tustvold added a commit that referenced this issue Oct 3, 2022

Add i256 (#2637) (#2781)

4df1f3c

* Add i256 (#2637) * RAT * Fix doc comments * Store as parts * Custom multiply implementation * Make from_parts public * Document mulx * Remove branch from to_i128 * Clippy

This was referenced Oct 3, 2022

Fix i256 checked multiplication #2818

Merged

Convert rows to arrays (#2677) #2826

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 6, 2022

Derive ArrowPrimitiveType for Decimal128Type and Decimal256Type (apac…

f1402ad

…he#2637)

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 6, 2022

Derive ArrowPrimitiveType for Decimal128Type and Decimal256Type (apac…

2b40538

…he#2637)

tustvold mentioned this issue Oct 6, 2022

Derive ArrowPrimitiveType for Decimal128Type and Decimal256Type (#2637) #2833

Merged

tustvold mentioned this issue Oct 6, 2022

Inconsistent Dyn Scalar Kernels #2837

Closed

tustvold added a commit that referenced this issue Oct 6, 2022

Derive ArrowPrimitiveType for Decimal128Type and Decimal256Type (#2637)…

37c8679

… (#2833)

viirya mentioned this issue Oct 11, 2022

Make DecimalArray as PrimitiveArray #2857

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 13, 2022

Decimal cleanup (apache#2637)

08569df

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 13, 2022

Decimal cleanup (apache#2637)

5bf214a

tustvold mentioned this issue Oct 13, 2022

Decimal cleanup (#2637) #2865

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 14, 2022

Take decimal as primitive (apache#2637)

a25af3d

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 14, 2022

Take decimal as primitive (apache#2637)

defc330

tustvold mentioned this issue Oct 14, 2022

Take decimal as primitive (#2637) #2869

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 14, 2022

Filter DecimalArray as PrimitiveArray (apache#2637)

f73aec4

tustvold mentioned this issue Oct 14, 2022

Filter DecimalArray as PrimitiveArray ~5x Faster (#2637) #2870

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 14, 2022

Filter DecimalArray as PrimitiveArray (apache#2637)

ba5b771

tustvold added a commit that referenced this issue Oct 14, 2022

Take decimal as primitive (#2637) (#2869)

d7f994c

tustvold added a commit that referenced this issue Oct 14, 2022

Decimal cleanup (#2637) (#2865)

d67d5fb

* Decimal cleanup (#2637) * Format

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 14, 2022

Don't validate decimal precision in ArrayData (apache#2637)

c92945c

tustvold mentioned this issue Oct 14, 2022

Don't validate decimal precision in ArrayData (#2637) #2873

Merged

tustvold added a commit that referenced this issue Oct 14, 2022

Don't validate decimal precision in ArrayData (#2637) (#2873)

4d7d411

* Don't validate decimal precision in ArrayData (#2637) * Format

tustvold closed this as completed Oct 15, 2022

tustvold added a commit that referenced this issue Oct 15, 2022

Filter DecimalArray as PrimitiveArray ~80% Faster (#2637) (#2870)

ede36d7

* Filter DecimalArray as PrimitiveArray (#2637) * Add decimal filter benches * Format

This was referenced Oct 27, 2022

Remove Unused NativeDecimalType #2945

Merged

Write Generic Code over [Large]BinaryArray and [Large]StringArray #2946

Closed

alamb added the arrow Changes to the arrow crate label Oct 28, 2022

alamb added the parquet Changes to the parquet crate label Oct 28, 2022

vvv mentioned this issue Jan 7, 2023

[doc] Fix broken URLs #3486

Merged

unconsolable mentioned this issue Jan 13, 2023

perf: store small decimal values as int risinglightdb/risinglight#768

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace DecimalArray with PrimitiveArray #2637

Replace DecimalArray with PrimitiveArray #2637

tustvold commented Sep 2, 2022 •

edited

liukun4515 commented Sep 5, 2022 •

edited

tustvold commented Sep 5, 2022

HaoYang670 commented Sep 5, 2022 •

edited

liukun4515 commented Sep 9, 2022

HaoYang670 commented Sep 9, 2022 •

edited

tustvold commented Sep 9, 2022 •

edited

tustvold commented Sep 24, 2022 •

edited

alamb commented Sep 25, 2022

liukun4515 commented Sep 26, 2022

HaoYang670 commented Sep 26, 2022

HaoYang670 commented Sep 26, 2022

viirya commented Sep 26, 2022

kmitchener commented Sep 26, 2022

tustvold commented Sep 26, 2022

tustvold commented Oct 15, 2022

alamb commented Oct 28, 2022

alamb commented Oct 28, 2022

Replace DecimalArray with PrimitiveArray #2637

Replace DecimalArray with PrimitiveArray #2637

Comments

tustvold commented Sep 2, 2022 • edited

liukun4515 commented Sep 5, 2022 • edited

tustvold commented Sep 5, 2022

HaoYang670 commented Sep 5, 2022 • edited

liukun4515 commented Sep 9, 2022

HaoYang670 commented Sep 9, 2022 • edited

tustvold commented Sep 9, 2022 • edited

tustvold commented Sep 24, 2022 • edited

alamb commented Sep 25, 2022

liukun4515 commented Sep 26, 2022

HaoYang670 commented Sep 26, 2022

HaoYang670 commented Sep 26, 2022

viirya commented Sep 26, 2022

kmitchener commented Sep 26, 2022

tustvold commented Sep 26, 2022

tustvold commented Oct 15, 2022

alamb commented Oct 28, 2022

alamb commented Oct 28, 2022

tustvold commented Sep 2, 2022 •

edited

liukun4515 commented Sep 5, 2022 •

edited

HaoYang670 commented Sep 5, 2022 •

edited

HaoYang670 commented Sep 9, 2022 •

edited

tustvold commented Sep 9, 2022 •

edited

tustvold commented Sep 24, 2022 •

edited