
Decimal Precision Validation #2387

Closed · tustvold opened this issue Aug 9, 2022 · 26 comments · Fixed by #2880
Labels: question (Further information is requested)

Comments

@tustvold (Contributor) commented Aug 9, 2022

Which part is this question about

Generally the approach taken by this crate is that a given ArrayData, and by extension Array, only contains valid data. For example, a StringArray is valid UTF-8 with each index at a codepoint boundary, a dictionary array only has valid indexes, etc. This allows eliding bounds checks on access within kernels.

However, in order for this to be sound, it must be impossible to create invalid ArrayData using safe APIs. This means that safe APIs must either:

  • Generate valid data by construction - e.g. the builder APIs
  • Validate data - e.g. ArrayData::try_new

For the examples above, incorrect validation can very clearly lead to UB. The situation for decimal values is more confused; in particular, I'm not really clear on what the implications of a value that exceeds the precision actually are. However, some notes:

  • As far as I can tell we don't protect against overflow of normal integer types
  • We don't have any decimal arithmetic kernels (yet)
  • The decimal types are fixed bit width, and so the precision isn't used to impact their representation (see the sketch below)
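To make the last point concrete, here is a minimal sketch, assuming a recent arrow crate with Decimal128Array and with_precision_and_scale (exact validation behavior has varied across versions, which is what this issue is about): the stored value is a fixed-width i128 regardless of the declared precision.

    use arrow::array::Decimal128Array;

    fn main() {
        // 9_999_999_999 needs 10 decimal digits, yet we declare precision 5.
        // The buffer holds a fixed-width i128 either way; the precision does
        // not change the representation, so only a validation step somewhere
        // can catch the mismatch.
        let array = Decimal128Array::from(vec![9_999_999_999_i128])
            .with_precision_and_scale(5, 0);
        println!("{:?}", array);
    }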

Describe your question

My question boils down to:

  • What is the purpose of the precision argument? Is it just for interoperability with other non-arrow representations?
  • Is there a requirement to saturate/error at the bounds of the precision, or can we simply overflow/saturate at the bounds of the underlying representation?
  • Does validating the precision on ingest to ArrayData actually allow eliding any checks when performing computation?

The answers to these will dictate whether we can just take a relaxed attitude to precision, let users opt into validation if they care, and otherwise simply ignore it.

I tried to understand what the C++ implementation is doing, but I honestly got lost. It almost looks like it is performing floating point operations and then rounding them back, which seems surprising...

Additional context

tustvold added the question (Further information is requested) label on Aug 9, 2022
@alamb (Contributor) commented Aug 9, 2022

cc @HaoYang670 @viirya @liukun4515

This is an excellent question -- "why do we need to validate Decimal Precision at all" -- it will likely drive design decisions such as what @HaoYang670 raised on #2362

@HaoYang670 (Contributor)

What is the purpose of the precision argument?

As far as I know, the precision is the decimal precision: how many digits the value has in its decimal representation.

The precision adds a runtime bound on the value (the underlying bit width, 128 or 256, is a compile-time bound). Whenever you change the precision at runtime, you may need to re-validate the values.

why do we need to validate Decimal Precision at all.

Because the Decimal type is somewhat a mixture of a static type (128 or 256 bits) and a dynamic type (precision and scale), we have to do validation at runtime to check for overflow.

I'm not really clear on what the implications of a value that exceeds the precision actually are.

I guess the behavior is undefined, because a false premise can imply anything.
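As a concrete sketch of the runtime bound described above (a hypothetical helper, not an arrow-rs API): a value fits precision p exactly when its magnitude is at most 10^p - 1.

    // Hypothetical helper illustrating the runtime bound that precision
    // imposes on a Decimal128 value: |value| <= 10^p - 1.
    fn fits_precision(value: i128, precision: u32) -> bool {
        assert!((1..=38).contains(&precision)); // 38 = max precision for 128 bits
        let max = 10_i128.pow(precision) - 1;   // e.g. precision 5 => 99_999
        (-max..=max).contains(&value)
    }

    fn main() {
        assert!(fits_precision(99_999, 5));
        assert!(!fits_precision(100_000, 5));
    }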

@tustvold (Contributor, Author) commented Aug 9, 2022

Which means we have to do validation at runtime to check overflow.

Given we don't care about this for other types, because it has serious performance implications, why would we care about it for decimals? My question can be phrased as: given that decimal overflow cannot lead to undefined behaviour, why do we need to check precision?

A somewhat related point: signed integer overflow in Rust is not undefined behaviour, and in release builds is explicitly defined to wrap as two's complement, because checking at runtime is prohibitively expensive. I don't see an obvious reason we should treat decimals any differently; the overflow behaviour is perfectly well defined...

@alamb (Contributor) commented Aug 9, 2022

the overflow behaviour is perfectly well defined...

Though definitely not ideal 😆 But if someone cares we can add Rust-style `checked_add()` etc. kernels and by default leave it unchecked 🤔
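For reference, plain i128 already offers exactly this split in the standard library; the checked kernels mentioned here are hypothetical at this point, but they would mirror it:

    fn main() {
        let a = i128::MAX;
        // Opt-in detection: checked_add reports overflow as None.
        assert_eq!(a.checked_add(1), None);
        // Unchecked path: wrapping_add is a defined two's-complement wrap.
        assert_eq!(a.wrapping_add(1), i128::MIN);
    }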

@viirya (Member) commented Aug 9, 2022

What is the purpose of the precision argument? Is it just for interoperability with other non-arrow representations?
Is there a requirement to saturate/error at the bounds of the precision, or can we simply overflow/saturate at the bounds of the underlying representation?
Does validating the precision on ingest to ArrayData actually allow eliding any checks when performing computation?

I think the precision for a decimal specifies the representation range, like other types do. It's useful because it tells us at which point an overflow can happen and what range of values can be used with the type without overflow.

As we don't have a native decimal type in Rust, it is internally represented by bytes with the maximum length for the maximum representation range. It can easily be given a value that exceeds its precision. For example, we can create a 128-bit decimal with precision 5 and give it a value that is only representable with precision 10 (i.e., 10 digits). For native types you would get an overflowed value, but in this case the 10-digit value still fits in the 128 bits, so it isn't an overflowed value at the representation level, I think.

I think that is why things become complicated. We may need to validate decimal values at certain moments. For now we throw errors for invalid (overflowed) values. Alternatively, I think it might also make sense to change invalid values to an overflow representation to simulate overflow, although for some systems overflow throws an error anyway. But as mentioned:

As far as I can tell we don't protect against overflow of normal integer types

I guess that decimal overflow should be no different.

I think interoperability is a concern, because I'm not sure how other systems will interpret decimal values from this crate. Assume we don't do any validation on decimals: if another system takes the values without validation, the behavior might be unexpected. Would it be an overflow in that system, or would it truncate the value?

@tustvold (Contributor, Author)

If another system takes the values without validation, the behavior might be unexpected. Would it be an overflow in that system, or would it truncate the value?

Unexpected, perhaps, but not undefined. I'm not saying we shouldn't provide methods to validate decimals, just that we make this explicitly opt-in?

@alamb (Contributor) commented Aug 11, 2022

I don't have a strong opinion on this question, for what it is worth. Ensuring that array data has reasonable values for decimals seems OK to me (to fail fast rather than at some much later point), but if there is some compelling reason not to validate, I could be convinced that way too.

@tustvold (Contributor, Author) commented Aug 23, 2022

I'm really struggling to review the decimal array work without an answer to this, as the current approach is wildly inconsistent to the point of being effectively meaningless. As it isn't consistently enforced, there is no reliable way to elide validation, because you cannot assume validation has actually been performed:

  • from_iter - no validation
  • with_precision_and_scale - validates only if the new precision is less than the existing one
  • DecimalBuilder - validation can be disabled via a safe API

The result is that it is perfectly possible to construct a DecimalArray with overflowed values using safe APIs. Consequently, it is unclear how you can then meaningfully elide validation, or reason about whether an operation will overflow, as you can't guarantee the input data wasn't already overflowed. It is also unclear what the meaning of validation is...

For example, if you use from_iter to collect an iterator into a DecimalArray and then use with_precision_and_scale to set the precision and scale, this will elide the validation if the new precision is greater than the default, despite the validation never having occurred...

I can see 3 possible paths forward:

Strict Validation

  • Invariant that no values in a DecimalArray exceed the precision
  • Always validate on construction and mutation

This is the most "correct" interpretation, but has significant performance drawbacks, and is much stronger than the guarantees we provide for other arithmetic types.

Loose Validation

  • Mutation operations on DecimalArray saturate at the bounds of the underlying stored type
  • Continue to provide APIs to validate no values have overflowed
  • Perfectly legal for a DecimalArray to contain overflowed values, and it must not cause UB for this to be the case

This would allow opt-in overflow detection, but would require the user to perform validation after every arithmetic operation.

No Validation

Continue to provide APIs to validate all values are within bounds, but don't attempt to detect or handle overflow within kernels. This is consistent with how we handle overflow elsewhere and is perfectly well defined.

I think that is why the things become complicated. We may need to validate decimal values at some moments

My 2 cents: we only need to validate the invariants that arrow-rs itself needs in order to prevent UB. If some other software component has additional invariants, it is on that component to verify them. Provided we clearly document the invariants we uphold, I don't think this is an issue.

My preference would therefore be option 3, as it is the simplest and fastest to implement, and is consistent with how we handle overflow elsewhere.
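For concreteness, option 3 from the user's side might look like the sketch below. The method name follows the opt-in validation API that eventually landed (validate_decimal_precision), but treat the exact signatures as illustrative rather than settled.

    use arrow::array::Decimal128Array;
    use arrow::error::ArrowError;

    fn example() -> Result<(), ArrowError> {
        // Construction does not check values against the declared precision...
        let array = Decimal128Array::from(vec![123_456_i128])
            .with_precision_and_scale(5, 0)?;
        // ...and callers who care opt in explicitly. 123_456 needs 6 digits,
        // so this call returns an error, which `?` propagates.
        array.validate_decimal_precision(5)?;
        Ok(())
    }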

@liukun4515 (Contributor) commented Aug 23, 2022

My preference would therefore be option 3, as it is the simplest and fastest to implement, and is consistent with how we handle overflow elsewhere.

In my opinion, the reason for doing validation on a decimal array is to make sure all the elements in the array are within the range implied by the precision.
If we don't care whether an element is out of bounds, then we don't need to do validation in the kernel.

For example, if I call the cast method to cast an int32 array to an int8 array, do we need to validate and check for overflow during the cast? If we need to do the check and return an error to the user, then we need validation in the kernel API.

But for the Decimal data type, if I cast a Decimal(8,0) array to a Decimal(12,0) array, no element can overflow or fall out of range for the new precision, so we don't need to validate, which improves performance.

@liukun4515 (Contributor)

My preference would therefore be to option 3, as it is the simplest and fastest to implement, and is consistent with how we handle overflow elsewhere.

Can you tell me where we handle the overflow? I can't follow your thinking.

@tustvold (Contributor, Author) commented Aug 23, 2022

Can you tell me where we handle the overflow? I can't follow your thinking.

In all the places where we are currently doing validation?

But for the Decimal data type, if I cast a Decimal(8,0) array to a Decimal(12,0) array, no element can overflow or fall out of range for the new precision, so we don't need to validate, which improves performance.

Only if you can assert that the data in the Decimal(8,0) array was within the bounds of precision 8 😁 That's what is so wildly inconsistent at the moment: if you don't consistently validate, you can't elide, as you don't know whether validation has actually occurred.

do we need to do validation and check the overflow for the casting

No, which is why I am trying to argue for not bothering to do the same for decimals... But currently we sort of do and sort of don't, and I don't really understand what is going on.

Edit: apparently we are doing checked casting for numerics, so I don't know anymore...

@liukun4515 (Contributor)


Another critical path for decimal: do we need to do validation when reading parquet data into decimal arrays? @tustvold

@liukun4515 (Contributor)

I have no preference, but I just don't want to do unnecessary validation that will impact the performance of reading or casting data.

@liukun4515 (Contributor)

Only if you can assert that the data in the Decimal(8,0) array was within the bounds of precision 8 😁

We can do strict validation, but skip it at the points where we can prove it isn't needed.

If we follow strict validation, we can guarantee that all elements in a Decimal(8,0) array are within the bounds of precision 8, so we don't need to validate when casting from Decimal(8,0) to Decimal(12,0).
That is what I am doing!

So far I have found only three cases where we don't need to do validation: reading decimal data from a parquet file (the schema comes from the parquet file metadata, so the data and metadata match); casting a decimal array from a smaller range to a bigger range; and the take operation.

@tustvold
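The widening-cast case is elidable because the precision bound is monotone; a minimal statement of that reasoning (hypothetical helper, and it only holds if the inputs really were validated, which is the caveat quoted above):

    fn max_for_precision(p: u32) -> i128 {
        10_i128.pow(p) - 1
    }

    fn main() {
        // Every value valid at precision 8 is automatically valid at
        // precision 12, so a Decimal(8,0) -> Decimal(12,0) cast cannot
        // create out-of-bounds values and needs no re-validation.
        assert!(max_for_precision(8) < max_for_precision(12));
    }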

@tustvold (Contributor, Author)

That is what I am doing!

I know; my point is that we aren't doing strict validation currently, so the optimisation is ill-formed until such time as we are, if we wish to go that route.

@liukun4515 (Contributor) commented Aug 24, 2022

I found another thing about casting.
I wrote the test code below to cast int32 to int8:

        use arrow::array::{ArrayRef, Int32Array, Int8Array};
        use arrow::compute::cast;
        use arrow::datatypes::DataType;
        use std::sync::Arc;

        let a = Int32Array::from(vec![10000, 17890]);
        let array = Arc::new(a) as ArrayRef;
        let b = cast(&array, &DataType::Int8).unwrap(); // safe cast: overflow becomes null
        let c = b.as_any().downcast_ref::<Int8Array>().unwrap();
        println!("{:?}", c); // prints [null, null]

All the values are out of range for int8, and we get [NULL, NULL]; there is no cast_options argument to control the behavior.

But for decimal data type conversion, we will get an Error.

I think the two results are inconsistent.
Which is the right behavior for a cast when overflow occurs? @alamb @tustvold

@liukun4515 (Contributor)

That is what I am doing!

I know; my point is that we aren't doing strict validation currently, so the optimisation is ill-formed until such time as we are, if we wish to go that route.

Maybe we can follow the behavior of Arrow C++ and make this consistent with the C++ version? I think that is a good way: if there is no validation in C++, we won't do validation in Rust either.

What do you think?
@alamb @tustvold

@liukun4515 (Contributor)

Any conclusion on this issue? @tustvold

@tustvold (Contributor, Author)

I thought you were going to take a look at the C++ implementation?

@liukun4515 (Contributor)

I thought you were going to take a look at the C++ implementation?

Yes, I think I will finish it today.

Maybe @viirya is familiar with the C++ code base. Do you have any opinion?

@viirya (Member) commented Aug 26, 2022

Not sure which part of decimal validation you mean?

If you are talking about the cast kernel: from a quick look, the C++ implementation provides an option to choose between truncating the decimal (it calls an unsafe upscale/downscale) and doing a safe rescale, which checks that the scaling is okay (no overflow/truncation) and that the rescaled value fits in the new precision.

@alamb (Contributor) commented Aug 31, 2022

I think it is a safe option to follow the C/C++ interface. 👍

I don't really have a strong opinion here, other than, like @tustvold, I would like a consistent rule that we follow.

Which is the right behavior for a cast when overflow occurs? @alamb @tustvold

I think the idea is that the user can call cast to get the "default" behavior, or cast_with_options to control whether they would like an error or a NULL when the cast isn't successful:

https://docs.rs/arrow/21.0.0/arrow/compute/kernels/cast/index.html

I suggest making Decimal behave the same with respect to casting (i.e. follow the default behavior of the other numeric types).
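A sketch of that knob, using the CastOptions shape as of roughly arrow 21 (later versions add further fields):

    use arrow::array::{ArrayRef, Int32Array};
    use arrow::compute::{cast_with_options, CastOptions};
    use arrow::datatypes::DataType;
    use std::sync::Arc;

    fn main() {
        let array = Arc::new(Int32Array::from(vec![10000])) as ArrayRef;
        // safe: true (the plain `cast` default) maps failed casts to NULL;
        // safe: false surfaces an error instead.
        let strict = CastOptions { safe: false };
        assert!(cast_with_options(&array, &DataType::Int8, &strict).is_err());
    }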

@tustvold (Contributor, Author) commented Sep 24, 2022

A proposal to handle this consistently, among other things, is here: #2637 (comment)

@tustvold (Contributor, Author)

With #2857, precision validation is explicitly opt-in. As we add support for checked arithmetic, it will check at the boundaries of the underlying i128 / i256, not according to the precision. This is fine, as these kernels will prevent truncation / overflow resulting in data loss, and if users wish to ensure the precision is respected they can explicitly make a call to validate it.

@piyushdubey

Is there a fix for this in C# as well? I am seeing the same issue while parsing decimals in C# and consistently getting the error:

Decimal scale cannot be greater than that in the Arrow vector: 4 != 3

@tustvold (Contributor, Author) commented Mar 24, 2024

You would need to ask on the main arrow repo, with a code reproducer. There is a good chance it isn't a bug.
