Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize the validation of Decimal256 #2320

Closed
liukun4515 opened this issue Aug 4, 2022 · 13 comments · Fixed by #2360
Closed

Optimize the validation of Decimal256 #2320

liukun4515 opened this issue Aug 4, 2022 · 13 comments · Fixed by #2360
Assignees
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@liukun4515
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When arrow construct a decimal array, we need to verify the data of decimal with the precision.

From current implemention

pub(crate) fn validate_decimal_precision(value: i128, precision: usize) -> Result<i128> {
and
pub(crate) fn validate_decimal256_precision(

We compare the decimal using i128 or string, it may be inefficient.

optimize the validation using the byte array instead of i128 or string.

Describe the solution you'd like

For exmaple, precision 3, The max value is i128.to_le_byte(999), and the min value if i128.to_le_byte(-999).
Comparing the decimal128(value,precision,scale) just need to compare the value ([u8;16]) with the min and max.

Describe alternatives you've considered

Additional context

@liukun4515 liukun4515 added the enhancement Any new improvement worthy of a entry in the changelog label Aug 4, 2022
@liukun4515
Copy link
Contributor Author

This also can speed up the construction of decimal arrays like #2313

@liukun4515
Copy link
Contributor Author

FYI @viirya @alamb

I will file a pr tomorrow to do this.

@liukun4515 liukun4515 self-assigned this Aug 4, 2022
@liukun4515
Copy link
Contributor Author

Maybe we should add iter<Option<[u8]>> for decimal array and reduce construct the decimal128 and decimal258 which contain the useless precision/scale for validation.

@liukun4515
Copy link
Contributor Author

Maybe we should add iter<Option<[u8]>> for decimal array and reduce construct the decimal128 and decimal258 which contain the useless precision/scale for validation.

Also just iterate the array like value_unchecked to get the raw_data

@alamb
Copy link
Contributor

alamb commented Aug 4, 2022

I think it is important to:

  1. ensure that some sort of validate is called for every insertion of a decimal value into a Decimal128Array or Decimal256Array
  2. ensure that the validation is only done once per value
  3. Allow some way to skip this validation if the user wishes (e.g. ArrayData::new_unchecked https://docs.rs/arrow/19.0.0/arrow/array/struct.ArrayData.html#method.new_unchecked)

Performance wise I haven't checked if we are validating things twice

@liukun4515
Copy link
Contributor Author

also add the benchmark result of this function.

@tustvold
Copy link
Contributor

tustvold commented Aug 6, 2022

I find it unexpected that a variable length byte array outperforms a fixed-length i128?

reduce construct the decimal128 and decimal258 which contain the useless precision/scale for validation.

I think it would help to take a step back here, as I at least am very confused as to what is being validated when. For most array types we have an invariant that ArrayData only contains valid data, this means we can elide validation on access. This does not appear to be the case for DecimalArray, which I personally think is a bug.

For example, BasicDecimal::new is not unsafe, BasicDecimalArray::from_fixed_size_binary_array doesn't perform validation, etc... If we can make there an invariant of valid data, and consistently enforce this, we can then look to elide validation checks on access

@liukun4515
Copy link
Contributor Author

liukun4515 commented Aug 6, 2022

I find it unexpected that a variable length byte array outperforms a fixed-length i128?

reduce construct the decimal128 and decimal258 which contain the useless precision/scale for validation.

@tustvold
It's not variable length byte array.
I think should be a fixed length byte array, for example decimal128 is [u8;16], decimal256 is [u8;32].

Have you benched these two implementation?

In current logic, we should get the byte array and convert the byte array to i128 or bigint/string first, and then compare with the bound min/max value.

I guess we can just compare the fixed length byte array with the bound min/max with the format of fixed length byte array.

@liukun4515
Copy link
Contributor Author

liukun4515 commented Aug 6, 2022

I think it would help to take a step back here, as I at least am very confused as to what is being validated when. For most array types we have an invariant that ArrayData only contains valid data, this means we can elide validation on access. This does not appear to be the case for DecimalArray, which I personally think is a bug.

We can get the decimal array from decimal array builder.
When appending a decimal value to the builder, we need to verify it.

But in the arrow-rs, I find some case in which we can skip the validation. For example, reading decimal value from parquet file(INT32,INT64, OR byte array), taking data from decimal array. We can refile these cases.

@viirya
Copy link
Member

viirya commented Aug 6, 2022

Currently we only do decimal validation where:

  1. Changing precision/scale by with_precision_scale. It is also selective since it does validation only new precision is smaller.
  2. Appending value to decimal builders
  3. Performing full validation on ArrayData
  4. Reading CSV file

I think it is somehow selective right now. Actually we do decimal validation less than necessary, as mentioned above from_fixed_size_binary_array doesn't validate input values. We just assume that the inputs are all valid if users use such API do build decimal array.

@liukun4515
Copy link
Contributor Author

Currently we only do decimal validation where:

  1. Changing precision/scale by with_precision_scale. It is also selective since it does validation only new precision is smaller.

When we create the decimal array from iter<Option>, and need to do validation. Because the default precision is the max precision(38).

  1. Appending value to decimal builders
  2. Performing full validation on ArrayData
  3. Reading CSV file

I think it is somehow selective right now. Actually we do decimal validation less than necessary, as mentioned above from_fixed_size_binary_array doesn't validate input values. We just assume that the inputs are all valid if users use such API do build decimal array.

@HaoYang670
Copy link
Contributor

Maybe we'd better update the title if we don't update the decimal128 validation.

@alamb alamb changed the title optimize the validation of decimal: compare data using the byte array instead of i128 or string optimize the validation of Decimal256: compare data using the byte array instead of i128 or string Aug 12, 2022
@alamb alamb changed the title optimize the validation of Decimal256: compare data using the byte array instead of i128 or string optimize the validation of Decimal256: compare data using the byte array instead of string Aug 12, 2022
@alamb
Copy link
Contributor

alamb commented Aug 12, 2022

Title is updated

@alamb alamb changed the title optimize the validation of Decimal256: compare data using the byte array instead of string Optimize the validation of Decimal256: compare data using the byte array instead of string Aug 17, 2022
@alamb alamb added the arrow Changes to the arrow crate label Aug 17, 2022
@alamb alamb changed the title Optimize the validation of Decimal256: compare data using the byte array instead of string Optimize the validation of Decimal256 Aug 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants