
[WIP]Change precision of decimal without validation #2357

Closed

Conversation

liukun4515
Contributor

Which issue does this PR close?

Closes #2313

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions bot added the arrow (Changes to the arrow crate) and parquet (Changes to the parquet crate) labels on Aug 8, 2022
@liukun4515 liukun4515 force-pushed the remove_unnecessary_decimal_valid branch from 39a19e1 to 97e1f26 on August 8, 2022 04:27
@liukun4515 liukun4515 requested review from tustvold, viirya and alamb and removed request for tustvold and viirya August 8, 2022 04:28
@liukun4515
Contributor Author

This just changes the API and updates some usages as examples.

Two examples where creating/changing the decimal does not need validation:

  1. reading decimal data from parquet
  2. casting one decimal to another. TODO: other primitive types to decimal

@liukun4515
Contributor Author

PTAL at this draft; it is just an example of optimizing the decimal array.

If you have any opinions, please feel free to let me know.

@alamb @tustvold @viirya

@liukun4515 liukun4515 changed the title Change precision of decimal without validation [WIP]Change precision of decimal without validation Aug 8, 2022
Contributor

@tustvold tustvold left a comment


I think this method needs to be unsafe if it is skipping validation without knowing it is safe to do so? At least that would be consistent with the rest of the codebase where it is common to have unsafe *_unchecked methods. I suspect there are many places for decimals where we aren't following this and should be

Thoughts @viirya ?

@viirya
Member

viirya commented Aug 8, 2022

Agreed that this should be unsafe.

I suspect there are many places for decimals where we aren't following this and should be

Do you mean that we should apply this unsafe-style precision change in many places where we currently don't?

Comment on lines +1283 to +1290
let output_array = match output_precision - output_scale >= input_precision - input_scale {
    true => {
        output_array.with_precision_and_scale(*output_precision, *output_scale, false)
    }
    false => {
        output_array.with_precision_and_scale(*output_precision, *output_scale, true)
    }
}?;
Member

Do we need this?

cast_decimal_to_decimal already adjusts the scale, so it seems only precision matters here. with_precision_and_scale also only does validation if the new precision is smaller.

Contributor Author

If we cast decimal(5,2) to decimal(5,3), a value of 123.45 will be converted to 123.450, but the underlying integer 123450 is out of range for decimal(5,3).
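
To make the arithmetic concrete, here is a small standalone check of that example in plain Rust (independent of arrow-rs; the names are purely illustrative):

    fn main() {
        let stored: i128 = 12345;                     // 123.45 stored at scale 2
        let rescaled = stored * 10;                   // 123450, now at scale 3
        let max_for_precision_5 = 10_i128.pow(5) - 1; // 99999
        assert!(rescaled > max_for_precision_5);      // out of range for decimal(5,3)
    }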

Contributor

I am still learning the logic. Here is just a nit:

Suggested change
let output_array = match output_precision - output_scale >= input_precision - input_scale {
    true => {
        output_array.with_precision_and_scale(*output_precision, *output_scale, false)
    }
    false => {
        output_array.with_precision_and_scale(*output_precision, *output_scale, true)
    }
}?;
let output_array = output_array.with_precision_and_scale(
    *output_precision,
    *output_scale,
    output_precision - output_scale < input_precision - input_scale,
)?;

Member

If we cast decimal(5,2) to decimal(5,3), a value of 123.45 will be converted to 123.450, but the underlying integer 123450 is out of range for decimal(5,3).

This out-of-range case should already be caught now, as with_precision_and_scale will do the validation.

But I get your point about the change here: with_precision_and_scale will do validation all the time, but in cases where output precision >= input precision + (output scale - input scale) we can skip the validation.

I'd suggest rewriting the condition as output precision >= input precision + (output scale - input scale).

@@ -208,7 +210,7 @@ where
            ))
        }
    }
-    .with_precision_and_scale(p, s)?;
+    .with_precision_and_scale(p, s, false)?;
Member

Hmm, how do we know that we don't need to do validation here? The decimal type can be specified, so what if the user specifies a smaller precision?

Contributor Author

From the codebase and the parquet arrow_reader, the Arrow decimal type is converted from the decimal type in the parquet schema/metadata.

The parquet physical type only contains the types below:

    PhysicalType::BOOLEAN => Arc::new(BooleanArray::from(array_data)) as ArrayRef,
    PhysicalType::INT32 => Arc::new(Int32Array::from(array_data)) as ArrayRef,
    PhysicalType::INT64 => Arc::new(Int64Array::from(array_data)) as ArrayRef,
    PhysicalType::FLOAT => Arc::new(Float32Array::from(array_data)) as ArrayRef,
    PhysicalType::DOUBLE => Arc::new(Float64Array::from(array_data)) as ArrayRef,
    PhysicalType::INT96
    | PhysicalType::BYTE_ARRAY
    | PhysicalType::FIXED_LEN_BYTE_ARRAY

The target_type Arrow type comes from the logical type of the parquet file, so when reading from the parquet file the data must be right or valid.

Contributor

the data must be right or valid.

I'm not sure we can make this assumption, we validate UTF-8 strings for example...

Contributor Author

I'm not sure we can make this assumption, we validate UTF-8 strings for example...

Do you mean that the parquet data may not match its metadata or schema?

Contributor

An API is unsound if you can violate invariants of a data structure, therefore potentially introducing UB, without using unsafe APIs to achieve this. As parquet IO is safe, if parquet did not validate the data on read it would be unsound.

This leads to a couple of options, and possibly more besides:

  1. Don't care about decimals exceeding the precision and don't make this an invariant
  2. We make this an invariant and validate on read
  3. We add an unsafe flag to skip validation on read

I want to take a step back before undertaking any of these, as frankly I am deeply confused by what this precision argument is actually for - why arbitrarily truncate your value space?

Contributor Author

Hmm, how do we know that we don't need to do validation here? The decimal type can be specified, so what if the user specifies a smaller precision?

Why can the user specify a new data type rather than the type from the parquet schema? If you can specify the type, there may be other errors when the specified type is not compatible with the data in the parquet file, even though the current interface does take a specified data type.

Contributor Author

@liukun4515 liukun4515 Aug 8, 2022

I want to take a step back before undertaking any of these, as frankly I am deeply confused by what this precision argument is actually for - why arbitrarily truncate your value space?

In order to represent a decimal, some systems just use a signed integer to store the data, with the scale/precision kept as metadata. We can recover the exact value from these two parts.

For example, with decimal(3,1) the value 10.1 will be stored/encoded as 101.
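
A small self-contained sketch of this scaled-integer representation in plain Rust (not arrow-rs code; the variable names are purely illustrative):

    fn main() {
        let scale: u32 = 1;
        let stored: i128 = 101; // decimal(3,1) encoding of 10.1

        // Recover the exact value from the stored integer plus the scale metadata.
        let factor = 10_i128.pow(scale);
        let integer_part = stored / factor;   // 10
        let fraction_part = stored % factor;  // 1
        println!("{}.{:0width$}", integer_part, fraction_part, width = scale as usize); // prints 10.1
    }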

@tustvold
Maybe the parquet doc can help you
https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L67
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

@HaoYang670
Contributor

I suspect there are many places for decimals where we aren't following this and should be

I am not sure whether this is related to this PR, but the new method should also be unsafe: https://github.com/apache/arrow-rs/blob/master/arrow/src/util/decimal.rs#L81-L86

@liukun4515
Contributor Author

I suspect there are many places for decimals where we aren't following this and should be

I am not sure whether this is related to this PR, but the new method should also be unsafe: https://github.com/apache/arrow-rs/blob/master/arrow/src/util/decimal.rs#L81-L86

Not the same issue.

There is a lot of unnecessary precision validation for decimal arrays, for example when reading parquet into arrow, or when casting a decimal arrow array to another decimal type with a larger precision.

The validation of a decimal array is heavy.

/// Returns an Error if:
/// 1. `precision` is larger than [`Self::MAX_PRECISION`]
/// 2. `scale` is larger than [`Self::MAX_SCALE`];
/// 3. `scale` is > `precision`
fn with_precision_and_scale(self, precision: usize, scale: usize) -> Result<U>

// proposed signature, with one additional error case:
/// 4. `need_validation` is `true`, but some values are out of range/bounds
fn with_precision_and_scale(self, precision: usize, scale: usize, need_validation: bool) -> Result<U>
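
To make the intent of the flag concrete, here is a minimal self-contained sketch of the behaviour it is meant to control. This is a hypothetical free function over a plain Vec<i128>, not the real arrow-rs trait method, and the MAX_PRECISION/MAX_SCALE constants are assumptions for illustration:

    const MAX_PRECISION: usize = 38;
    const MAX_SCALE: usize = 38;

    fn with_precision_and_scale(
        values: Vec<i128>,
        precision: usize,
        scale: usize,
        need_validation: bool,
    ) -> Result<(Vec<i128>, usize, usize), String> {
        // The cheap metadata checks always run.
        if precision > MAX_PRECISION {
            return Err(format!("precision {} exceeds max {}", precision, MAX_PRECISION));
        }
        if scale > MAX_SCALE {
            return Err(format!("scale {} exceeds max {}", scale, MAX_SCALE));
        }
        if scale > precision {
            return Err(format!("scale {} > precision {}", scale, precision));
        }
        // The expensive per-value bounds check only runs when requested.
        if need_validation {
            let max = 10_i128.pow(precision as u32) - 1;
            if values.iter().any(|v| *v > max || *v < -max) {
                return Err(format!("value out of range for precision {}", precision));
            }
        }
        Ok((values, precision, scale))
    }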
Contributor

need_validation is somewhat confusing to me.
Do we need to check the precision, or the scale?

Contributor Author

Just the precision, because the decimal is stored as a big integer without a fractional part, and we only need to check the bounds/ranges against the max/min allowed by the precision.
The max/min value is determined by the precision.
For example, for decimal(4,n) the max bigint is 9999 and the min is -9999; we should validate the input decimal, or the values in the decimal array, when needed.
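
For reference, the bounds implied by a precision can be computed like this (a hypothetical helper, not part of arrow-rs):

    fn bounds_for_precision(precision: u32) -> (i128, i128) {
        let max = 10_i128.pow(precision) - 1;
        (-max, max)
    }

    fn main() {
        assert_eq!(bounds_for_precision(4), (-9999, 9999)); // decimal(4,n)
    }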

Contributor

Then I'd prefer renaming it and adding more docs.

@tustvold
Contributor

tustvold commented Aug 8, 2022

Thank you for working on this. However, I have a feeling that there is something fundamentally incorrect, or at the very least inconsistent, about the way validation for Decimal types is currently handled, and that it warrants a more holistic review. In particular, there are a number of APIs that should be unsafe but aren't, the construction of BasicDecimal has unnecessary bounds checking, etc...

I have asked on the ASF slack for clarification of what the purpose of the precision argument actually is, in particular what are the expected overflow/underflow characteristics. Once I have some clarity on what the correct behaviour is, I'll take a stab at fixing/filing the issues that arise.

I'm sorry that this does mean there will be a slight delay, but I think it is better that we fix this properly, as I at least am extremely confused by what the current code does...

// do validation.
let output_array = match output_precision - output_scale >= input_precision - input_scale {
    true => {
        output_array.with_precision_and_scale(*output_precision, *output_scale, false)
Contributor

What does with_precision_and_scale check if you set need_validation to false?
It only checks:
1. output precision <= max precision?
2. output scale <= max scale?
3. output precision >= output scale, I guess.

If that is the case, why do we check these things after we do the casting? We should check the output precision and scale at the beginning of this method.

Contributor Author

What does with_precision_and_scale check if you set need_validation to false? It only checks: 1. output precision <= max precision? 2. output scale <= max scale? 3. output precision >= output scale, I guess.

If that is the case, why do we check these things after we do the casting? We should check the output precision and scale at the beginning of this method.

You missed one point of with_precision_and_scale: it also resets/sets the DataType for this array.

The decimal array is created from the case below:

    array
        .iter()
        .map(|v| v.map(|v| v.as_i128() * mul))
        .collect::<Decimal128Array>()

We then need to reset/set the precision/scale on the resulting decimal array.

Contributor

You missed one point of with_precision_and_scale: it also resets/sets the DataType for this array.

Could you please share the code link? I can't find it.

Contributor Author

@liukun4515 liukun4515 Aug 8, 2022

@HaoYang670
Create an array by reading data from parquet:

ArrowType::Decimal128(p, s) => {

Create a new decimal array from casting:

fn cast_decimal_to_decimal(

In the cast case, if the target decimal type has a larger range, we can skip the validation.
For example, casting decimal(10,3) to decimal(12,3) doesn't need validation, and even int32 to decimal(12,0) doesn't need validation.
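
A self-contained sketch of that rule: a cast cannot overflow when the capacity for integer digits does not shrink, i.e. output precision - output scale >= input precision - input scale (a hypothetical helper, not the arrow-rs cast kernel itself):

    fn cast_needs_validation(in_p: i32, in_s: i32, out_p: i32, out_s: i32) -> bool {
        // Equivalent to: out_p < in_p + (out_s - in_s)
        out_p - out_s < in_p - in_s
    }

    fn main() {
        assert!(!cast_needs_validation(10, 3, 12, 3)); // decimal(10,3) -> decimal(12,3): skip validation
        assert!(!cast_needs_validation(10, 0, 12, 0)); // int32-sized values -> decimal(12,0): skip validation
        assert!(cast_needs_validation(5, 2, 5, 3));    // decimal(5,2) -> decimal(5,3): must validate
    }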

@alamb
Contributor

alamb commented Aug 8, 2022

I am not sure whether this is related to this PR, but the new method should also be unsafe:
https://github.com/apache/arrow-rs/blob/master/arrow/src/util/decimal.rs#L81-L86

I would recommend ensuring new() creates valid DecimalArrays rather than marking it unsafe. Possibly related to this PR: #2360

@alamb
Contributor

alamb commented Aug 8, 2022

Possibly also related: #2362

@alamb
Contributor

alamb commented Aug 17, 2022

What is the status of this PR?

@liukun4515
Contributor Author

What is the status of this PR?

This PR will be closed.
I am preparing to file some smaller sub-PRs to push this forward.

Successfully merging this pull request may close these issues.

optimize decimal: reduce validation when construct the decimal array or cast to the decimal array