
Refactor integer type inference logic to fit smallest type #5406

Draft · wants to merge 2 commits into master

Conversation

jondo2010 (Contributor)

Which issue does this PR close?

Partially fixes #802

Rationale for this change

As explained in #802, the current infer_schema logic always returns DataType::Int64 for all integer types.

This PR adds additional inferring logic to find the smallest integer type that fits the data.

This PR also handles the case where the data does not in fact fit in a DataType::Int64 and instead requires a DataType::UInt64.
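A minimal, hypothetical sketch of what "smallest type that fits" inference can look like (the function name and the string type names here are illustrative only; the actual implementation works on arrow's DataType):

```rust
// Hypothetical sketch of smallest-fit integer inference; not the PR's
// actual code. Returns an illustrative type name instead of arrow's DataType.
fn infer_int_type(s: &str) -> Option<&'static str> {
    // Prefer signed types, widening until the value fits; fall back to
    // UInt64 for values above i64::MAX.
    if s.parse::<i8>().is_ok() {
        Some("Int8")
    } else if s.parse::<i16>().is_ok() {
        Some("Int16")
    } else if s.parse::<i32>().is_ok() {
        Some("Int32")
    } else if s.parse::<i64>().is_ok() {
        Some("Int64")
    } else if s.parse::<u64>().is_ok() {
        Some("UInt64")
    } else {
        None
    }
}

fn main() {
    assert_eq!(infer_int_type("120"), Some("Int8"));
    assert_eq!(infer_int_type("300"), Some("Int16"));
    assert_eq!(infer_int_type("18446744073709551615"), Some("UInt64"));
}
```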

Are there any user-facing changes?

Any user code loading CSV data using infer_schema() and always expecting to receive Int64 fields would be affected.

No API changes.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 17, 2024
Jefffrey (Contributor) left a comment

Will need to swap out usage of lexical_core

I'll take a closer look at the PR later, just noting this down first 👍

} else {
    match s.len() {
        1..=3 => {
            if lexical_core::parse::<u8>(s.as_bytes()).is_ok() {
Contributor

There was a recent PR which highlighted that lexical_core has a longstanding issue where it can parse overflow incorrectly: #5398

For example, given this example program:

fn main() {
    let a = lexical_core::parse::<u8>(b"999");
    println!("{:?}", a);
    let a = "999".parse::<u8>();
    println!("{:?}", a);
}

It returns:

Ok(231)
Err(ParseIntError { kind: PosOverflow })

Using lexical_core = "0.8.5".

I don't think we can rely on lexical_core anymore, at least not for integer parsing.
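One standard-library alternative (a hedged sketch, not code from this PR): std's `str::parse` reports overflow as an error rather than silently wrapping, at some performance cost relative to lexical_core:

```rust
// Hypothetical drop-in for lexical_core::parse::<u8> using only std.
// std's FromStr-based parse rejects out-of-range input instead of wrapping.
fn parse_u8(bytes: &[u8]) -> Option<u8> {
    std::str::from_utf8(bytes).ok()?.parse::<u8>().ok()
}

fn main() {
    assert_eq!(parse_u8(b"231"), Some(231));
    // lexical_core 0.8.5 returns Ok(231) for b"999"; std correctly rejects it.
    assert_eq!(parse_u8(b"999"), None);
}
```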

jondo2010 (Author)

Ah that's unfortunate.

@tustvold tustvold added the api-change Changes to the arrow API label Feb 18, 2024
tustvold (Contributor) left a comment

I worry that this will make schema inference very fragile: it is typically run on only a subset of the data, so with this change the inferred type may be overly restrictive on size or sign.

A few thoughts:

  • We should prefer signed types by default, as they are sufficient most of the time
  • The new behaviour should be opt-in
  • I think this will substantially regress performance as it now must parse integers

Perhaps we could see how other systems handle this, e.g. DuckDB or Spark?
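The sampling concern can be made concrete with a small hypothetical example: the rows seen during inference may all fit a narrow type that a later row overflows:

```rust
// Illustration of the fragility concern (values are hypothetical): a type
// inferred from a sample of rows can be too narrow for rows seen later.
fn main() {
    let sampled = ["1", "42", "200"]; // the rows inference happens to see
    let later = "70000";              // a row outside the sample
    let fits_sample = sampled.iter().all(|s| s.parse::<u8>().is_ok());
    let fits_later = later.parse::<u8>().is_ok();
    assert!(fits_sample);  // inference would pick UInt8 from the sample...
    assert!(!fits_later);  // ...and then fail on this row when reading the data
}
```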

@tustvold tustvold marked this pull request as draft February 24, 2024 02:20
tustvold commented Feb 24, 2024

Marking as draft as it is not waiting on review; please feel free to mark it as ready for review when you would like me to take another look.

Labels
api-change (Changes to the arrow API), arrow (Changes to the arrow crate)

Successfully merging this pull request may close these issues:
Improve CSV infer_field_schema to be more restrictive

3 participants