
Refactor integer type inference logic to fit smallest type #5406

Draft · wants to merge 2 commits into master

Conversation

jondo2010 (Contributor)

Which issue does this PR close?

Partially fixes #802

Rationale for this change

As explained in #802, the current infer_schema logic always returns DataType::Int64 for all integer types.

This PR adds additional inferring logic to find the smallest integer type that fits the data.

This PR also handles the case where the data does not in fact fit in a DataType::Int64 and instead requires a DataType::UInt64.
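A minimal, hypothetical sketch of what "smallest type that fits" inference can look like (the function name and the string type names here are illustrative only; the actual implementation works on arrow's DataType):

```rust
// Hypothetical sketch of smallest-fit integer inference; not the PR's
// actual code. Returns an illustrative type name instead of arrow's DataType.
fn infer_int_type(s: &str) -> Option<&'static str> {
    // Prefer signed types, widening until the value fits; fall back to
    // UInt64 for values above i64::MAX.
    if s.parse::<i8>().is_ok() {
        Some("Int8")
    } else if s.parse::<i16>().is_ok() {
        Some("Int16")
    } else if s.parse::<i32>().is_ok() {
        Some("Int32")
    } else if s.parse::<i64>().is_ok() {
        Some("Int64")
    } else if s.parse::<u64>().is_ok() {
        Some("UInt64")
    } else {
        None
    }
}

fn main() {
    assert_eq!(infer_int_type("120"), Some("Int8"));
    assert_eq!(infer_int_type("300"), Some("Int16"));
    assert_eq!(infer_int_type("18446744073709551615"), Some("UInt64"));
}
```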

Are there any user-facing changes?

Any user code loading CSV data using infer_schema() and always expecting to receive Int64 fields would be affected.

No API changes.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 17, 2024
Jefffrey (Contributor) left a comment

Will need to swap out usage of lexical_core

I'll take a closer look at the PR later, just noting this down first 👍

} else {
    match s.len() {
        1..=3 => {
            if lexical_core::parse::<u8>(s.as_bytes()).is_ok() {
Contributor

There was a recent PR which highlighted that lexical_core has a longstanding issue where it can parse overflow incorrectly: #5398

For example, given this example program:

fn main() {
    let a = lexical_core::parse::<u8>(b"999");
    println!("{:?}", a);
    let a = "999".parse::<u8>();
    println!("{:?}", a);
}

It returns:

Ok(231)
Err(ParseIntError { kind: PosOverflow })

Using lexical_core = "0.8.5".

I don't think we can rely on lexical_core anymore, at least not for integer parsing.
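One standard-library alternative (a hedged sketch, not code from this PR): std's `str::parse` reports overflow as an error rather than silently wrapping, at some performance cost relative to lexical_core:

```rust
// Hypothetical drop-in for lexical_core::parse::<u8> using only std.
// std's FromStr-based parse rejects out-of-range input instead of wrapping.
fn parse_u8(bytes: &[u8]) -> Option<u8> {
    std::str::from_utf8(bytes).ok()?.parse::<u8>().ok()
}

fn main() {
    assert_eq!(parse_u8(b"231"), Some(231));
    // lexical_core 0.8.5 returns Ok(231) for b"999"; std correctly rejects it.
    assert_eq!(parse_u8(b"999"), None);
}
```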

jondo2010 (Author)

Ah that's unfortunate.

@tustvold tustvold added the api-change Changes to the arrow API label Feb 18, 2024
tustvold (Contributor) left a comment

I worry that this will make schema inference very fragile: it is typically run on only a subset of the data, so with this change the inferred type may be overly restrictive on size or sign.

A few thoughts:

  • We should prefer signed types by default, as they are sufficient most of the time
  • The new behaviour should be opt-in
  • I think this will substantially regress performance as it now must parse integers

Perhaps we could see how other systems handle this, e.g. DuckDB or Spark?
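The sampling concern can be made concrete with a small hypothetical example: the rows seen during inference may all fit a narrow type that a later row overflows:

```rust
// Illustration of the fragility concern (values are hypothetical): a type
// inferred from a sample of rows can be too narrow for rows seen later.
fn main() {
    let sampled = ["1", "42", "200"]; // the rows inference happens to see
    let later = "70000";              // a row outside the sample
    let fits_sample = sampled.iter().all(|s| s.parse::<u8>().is_ok());
    let fits_later = later.parse::<u8>().is_ok();
    assert!(fits_sample);  // inference would pick UInt8 from the sample...
    assert!(!fits_later);  // ...and then fail on this row when reading the data
}
```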

@tustvold tustvold marked this pull request as draft February 24, 2024 02:20
tustvold commented Feb 24, 2024

Marking as draft as it is not waiting on review; please feel free to mark it as ready for review when you would like me to take another look.

Labels
api-change (Changes to the arrow API), arrow (Changes to the arrow crate)

Successfully merging this pull request may close these issues:
Improve CSV infer_field_schema to be more restrictive

3 participants