Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removes the crate's dependency on BigInt, BigUint #767

Merged
merged 13 commits into from
May 21, 2024
Merged

Removes the crate's dependency on BigInt, BigUint #767

merged 13 commits into from
May 21, 2024

Conversation

zslayton
Copy link
Contributor

@zslayton zslayton commented May 14, 2024

This PR limits the range of integers that can be read and written to those that can be represented by the i128 type.

This reduces the size of Int, UInt, Decimal, Timestamp, RawValueRef, and ValueRef. It also allows those types to implement Copy.

It also allows us to remove several custom methods to compare across mixed i64/BigInt and u64/BigUint representations for equality and ordering.

Here's the benchmark output following this change:

image

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Zack Slayton and others added 5 commits May 14, 2024 14:51
This PR limits the range of integers that can be read and written
to those that can be represented by the `i128` type.

This reduces the size of `Int`, `UInt`, `Decimal`, and `Timestamp`,
`RawValueRef`, and `ValueRef`. It also allows those types to
implement `Copy`.

It also allows us to remove several custom methods to compare
across mixed `i64`/`BigInt` and `u64`/`BigUint` representations
for equality and ordering.
Copy link
Contributor Author

@zslayton zslayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ PR Tour

# We need at least 1.65 for GATs[1] and 1.67 for `ilog`[2]
# [1] https://blog.rust-lang.org/2022/11/03/Rust-1.65.0.html
# [2] https://blog.rust-lang.org/2023/01/26/Rust-1.67.0.html#stabilized-apis
rust-version = "1.67"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Small bump to the minimum Rust version to enable us to use the stabilized ilog10 method for counting decimal digits in integers.

@@ -62,7 +63,6 @@ chrono = { version = "0.4", default-features = false, features = ["clock", "std"
delegate = "0.12.0"
thiserror = "1.0"
nom = "7.1.1"
num-bigint = "0.4.3"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ 👋

@@ -99,7 +99,7 @@ pub fn criterion_benchmark(c: &mut Criterion) {
b.iter(|| {
let mut encoded_length: usize = 0;
for value in &unsigned_values {
encoded_length += black_box(FlexUInt::write_u64(&mut output, *value).unwrap());
encoded_length += black_box(FlexUInt::write(&mut output, *value).unwrap());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ The FlexUInt::write method can now accept any int size. I ran the encoding_primitives benchmark and confirmed that this is the same or faster than the u64-specific method in all cases.

magnitude: UInt {
data: UIntData::U64(0),
},
magnitude: UInt::ZERO,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ I added ZERO and NEGATIVE_ZERO const values to the Decimal and Coefficient types for easier reference.

Comment on lines 45 to 51
// TODO: copy from the slice and use from_be_bytes
let mut magnitude: u128 = 0;
for &byte in uint_bytes {
let byte = u128::from(byte);
magnitude <<= 8;
magnitude |= byte;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ I moved this from the (now gone) BigUInt::from_bytes_be method. As the TODO notes, it could still be optimized, but that's for another PR.

Comment on lines +40 to +41
fn write_repr_decimal(&mut self, value: Option<Decimal>) -> IonResult<()>;
fn write_repr_timestamp(&mut self, value: Option<Timestamp>) -> IonResult<()>;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Several small changes to ion-hash based on this update.

self.push_byte(0xE3); // Biased FlexUInt follows
FlexUInt::encode_u64(self.encoding_buffer, symbol_id as u64 - 65_792)
}
let _ = FlexUInt::write(self.encoding_buffer, symbol_id - 65_792).unwrap();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This is encoding to a Vec, so it can't fail.

),
-40,
),
Decimal::new(Int::from(27182818284590452353602874713526624i128), -40),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ I switched this test value out for a marginally shorter one that fits in an i128.

Comment on lines -210 to -233
/// Extract the integer and fractional parts and the exponents of this as a `(BigInt, (BigInt, i64))`
///
/// ```ignore
/// # use num_bigint::BigInt;
/// # use ion_rs::Decimal;
/// let d1: Decimal = Decimal::new(123456789, -2);
/// let (int_part, (frac_part, exponent)) = d1.into_parts();
/// assert_eq!(int_part, BigInt::from(1234567u64));
/// assert_eq!(frac_part, BigInt::from(89));
/// assert_eq!(exponent, -2);
/// ```
pub(crate) fn into_parts(self) -> (BigInt, (BigInt, i64)) {
let magnitude: BigInt = self.coefficient.try_into().unwrap();
if self.exponent.is_zero() {
(magnitude, (BigInt::zero(), 0))
} else if self.exponent.is_negative() {
let divisor = BigInt::from(10u64).pow((-self.exponent) as u32);
let (i, f) = magnitude.div_rem(&divisor);
(i, (f, self.exponent))
} else {
let multiplicand = BigInt::from(10u64).pow(self.exponent as u32);
(magnitude * multiplicand, (BigInt::zero(), 0))
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This and related methods existed solely to enable conversion of decimal components to/from BigInt.

}

#[test]
fn test_difference_by_parts_lossy() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Removed tests for the also-removed difference_by_parts_lossy & co.

@zslayton zslayton marked this pull request as ready for review May 15, 2024 20:36
@zslayton zslayton requested review from popematt and nirosys May 15, 2024 20:37
src/binary/decimal.rs Outdated Show resolved Hide resolved
Comment on lines 180 to 194

#[test]
fn oversized_decimal() {
let mut buf = vec![];
// Even though our output stream is a growable Vec, the encoder uses a
// fixed-size, stack-allocated buffer to encode the body of the decimal.
// If it's asked to encode a decimal that won't fit (>32 bytes), it should
// return an error.
let precise_pi = Decimal::new(
BigUint::from_str("31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679").unwrap(),
-99
);
let result = buf.encode_decimal_value(&precise_pi);
assert!(result.is_err());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to keep this test and assert that it is an error? Do we elsewhere have a test that verifies the new encoding error behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sent me down a useful rabbit hole.

Because I had removed the constructor that accepted a BigUint coefficient, it became statically impossible to trigger this case. However, because the Coefficient type was a (UInt, Sign) tuple, it was possible for it to store magnitudes outside the i128 range but inside the u128 range.

In my latest commits I've modified Coefficient to an (Int, Sign) pair where the Int's sign information always agrees with Sign except when the coefficient is a negative zero. This is essentially the same logic as before but with a constructor that's now capped at Int::MAX. This had the nice side effect of also eliminating the need for constructors that take an explicit Sign parameter because that would only ever be necessary to construct a -0. There are now Coefficient::new(some_int_value) and Coefficient::negative_zero() constructors.

All of that too say, it's now statically impossible to programmatically construct a Decimal that's too big.

src/lazy/binary/immutable_buffer.rs Outdated Show resolved Hide resolved
src/lazy/binary/immutable_buffer.rs Outdated Show resolved Hide resolved
src/types/decimal/mod.rs Show resolved Hide resolved
src/types/decimal/coefficient.rs Outdated Show resolved Hide resolved
src/types/integer.rs Show resolved Hide resolved
src/types/timestamp.rs Show resolved Hide resolved
src/lazy/value_ref.rs Show resolved Hide resolved
src/lazy/encoder/binary/v1_1/value_writer.rs Show resolved Hide resolved
src/lazy/encoder/binary/v1_1/flex_uint.rs Outdated Show resolved Hide resolved
src/binary/uint.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@rmarrowstone rmarrowstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great simplification. Particularly for the Decimal and Timestamp implementations. Love it.

I'm not concerned for the potential limiting of decimals and timestamps, but for Ion Integers I don't have a clear picture of what a users options are (could be none) when they have a value stream with an "overlarge" integer. I'm good with an answer like "they can access a vec of the two's complement bytes from this API", and am likely fine with they can't. But I'm not familiar enough with ion-rust to see it from this PR. Thanks.

src/binary/int.rs Show resolved Hide resolved
@zslayton
Copy link
Contributor Author

This is a great simplification. Particularly for the Decimal and Timestamp implementations. Love it.

I'm not concerned for the potential limiting of decimals and timestamps, but for Ion Integers I don't have a clear picture of what a users options are (could be none) when they have a value stream with an "overlarge" integer. I'm good with an answer like "they can access a vec of the two's complement bytes from this API", and am likely fine with they can't. But I'm not familiar enough with ion-rust to see it from this PR. Thanks.

For the moment, there's not a (stable) option. An upcoming release will expose the span APIs, allowing users to access a value's underlying data.

Longer-term, I've been thinking about augmenting the IonError type with variants to expose values that were outside the happy path but potentially usable. For example, returning an IonError::OversizedInt(oversized_int) variant where oversized_int had methods to access its bytes and determine its byte order. (This is related to #759, where users may wish to access top-level containers that are found to be incomplete or contain invalid child data--CC @sajidanw.) I haven't fully baked this idea though; it seems like it would require adding a lifetime to IonError, which could be a big change. Another alternative would be adding a tolerant_next() method (name tbd) alongside next() which returned a different enum of its own, allowing users to opt into handling such cases.

@zslayton zslayton requested a review from popematt May 21, 2024 15:47
src/binary/int.rs Outdated Show resolved Hide resolved
Co-authored-by: Joshua Barr <70981087+jobarr-amzn@users.noreply.github.com>
@zslayton zslayton merged commit 92b5f86 into main May 21, 2024
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants