Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support decimal scalar value #1394

Merged
merged 2 commits into from Dec 6, 2021

Conversation

liukun4515
Copy link
Contributor

@liukun4515 liukun4515 commented Dec 3, 2021

Which issue does this PR close?

Closes #1393

From #122 In order to support the decimal data type in the datafusion, we should add decimal scalar value first.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Dec 3, 2021
@liukun4515 liukun4515 marked this pull request as ready for review December 3, 2021 03:31
@liukun4515
Copy link
Contributor Author

#[allow(clippy::box_vec)]
I am confused about the failed ci

@liukun4515
Copy link
Contributor Author

@Dandandan @alamb @houqp PTAL

@alamb
Copy link
Contributor

alamb commented Dec 3, 2021

I am confused about the failed ci

this is likely due to a new rust version being released with more stringent clippy lint rules

Thankfully @xudong963 has a PR up to get it working again: #1395

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good @liukun4515 -- thank you. I had a few suggestions, but overall this looks just about perfect.

@@ -43,6 +43,8 @@ pub enum ScalarValue {
Float32(Option<f32>),
/// 64bit float
Float64(Option<f64>),
/// 128bit decimal, using the i128 to represent the decimal
Decimal128(Option<i128>, Option<usize>, Option<usize>),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 it seems like we always need the precision and scale (to know what type the Decimal is).

Thus, perhaps this could be changed to be

Suggested change
Decimal128(Option<i128>, Option<usize>, Option<usize>),
Decimal128(Option<i128>, usize, usize),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice point.
I almost forgot this.

scale: usize,
) -> Result<Self> {
// make sure the precision and scale is valid
// TODO const the max precision and min scale
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The arrow spec doesn't seem to say anything about min/max valid scales: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L181-L190

I poked a little around in arrow-rs and I also didn't find any limits on the precision or scale either

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use 128bit to represent a decimal value, the max precision is 38.
The scale is always greater or equal to 0 and less than or equal to precision.
In the arrow-rs, the decimal is represented by i128 with precision and scale.
So we can add the MAX_PRECISION = 38 in the datafusion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure I got this correctly.
The spec doesn't say anything about the the precision/scale of DECIMAL, but we're implementing DECIMAL128 here so 38 should be the max precision.
After apache/arrow-rs#131 is implemented, we can implement DECIMAL256 type in datafusion and the max precision will be 76.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, now in the arrow-rs the decimal256 has not been implemented.
In the arrow-go, it is also not implemented too.
Just in the arrow-java and arrow-c++, the bitwith of 256 has been implemented.
@capkurmagati

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can make this const in other pull request

@@ -100,6 +102,12 @@ impl PartialEq for ScalarValue {
// any newly added enum variant will require editing this list
// or else face a compile error
match (self, other) {
(Decimal128(v1, p1, s1), Decimal128(v2, p2, s2)) => {
v1.eq(v2) && p1.eq(p2) && s1.eq(s2)
// TODO how to handle this case: decimal(123,10,1) with decimal(1230,10,2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Int8(Some(100)) is not equal to Int64(Some(100)) with this implementation either , so I think it is consistent with the rest of this comparison of decimal(123,10,1) and decimal(1230,10,2)` are different --

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep.
If we want to compare two values with the different data types, you should convert them to the same data type and then compare them.

@@ -171,6 +179,17 @@ impl PartialOrd for ScalarValue {
// any newly added enum variant will require editing this list
// or else face a compile error
match (self, other) {
// TODO decimal type, we just compare the values which have the same precision and scale.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code in this PR is correct -- two decimal's can be compared if they have the same precision / scale and they can't be compared (returns None) if they have different precision/scales).

Comment on lines +277 to +274
p.hash(state);
s.hash(state)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could probably skip hashing the precision and scale - and just hash the value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If two decimal values have the same value and diff precision or scale, the hash value is the same.
But if we hash all data in the decimal, the above case will not happen.
It is better to hash all data in decimal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point 👍

Comment on lines 515 to 518
ScalarValue::Decimal128(_, _, _) => {
// TODO add the default precision and scale for this case
// DataType::Decimal(38, 0)
panic!("The Decimal Scalar value with invalid precision or scale.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you changed Decimal as suggested above to always have precision and scale, this code would not be possible

ScalarValue::Decimal128(v1, _, _) => v1,
_ => unreachable!(),
})
.collect::<Vec<Option<i128>>>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using DecimalBuilder in this way is awkward. I wonder if we could add a function to create Decimal values from an array of i128 (likely it would go in arrow-rs). Something like

let arr = DecimalArray::from_iter_and_scale(array.into_iter(), precision, scale)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can add the from_iter_and_scale function in arrow-rs later and replace this in the followup pull request.
Do you agree this? @alamb

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related pr: apache/arrow-rs#1009

scale: &usize,
) -> ScalarValue {
let array = array.as_any().downcast_ref::<DecimalArray>().unwrap();
// TODO add checker: the precision and scale are same with array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would be a good assert! type check to add, though I don't think it would ever happen unless there is some bug

@liukun4515
Copy link
Contributor Author

@alamb I have addressed all comments.
If there are no other comments, please help to merge this pr.
I will propose other pull requests about decimal, for example aggregate with decimal data type.

@liukun4515 liukun4515 requested a review from alamb December 6, 2021 13:56
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great -- thank you @liukun4515

@@ -43,6 +43,8 @@ pub enum ScalarValue {
Float32(Option<f32>),
/// 64bit float
Float64(Option<f64>),
/// 128bit decimal, using the i128 to represent the decimal
Decimal128(Option<i128>, usize, usize),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines +277 to +274
p.hash(state);
s.hash(state)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point 👍

@alamb alamb merged commit 7f24a79 into apache:master Dec 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

support the decimal scalar value
3 participants