Refactor integer type inference logic to fit smallest type #5406
base: master
Conversation
Will need to swap out usage of lexical_core
I'll take a closer look at the PR later, just noting this down first 👍
```rust
} else {
    match s.len() {
        1..=3 => {
            if lexical_core::parse::<u8>(s.as_bytes()).is_ok() {
```
There was a recent PR which highlighted that lexical_core has a longstanding issue where it can silently mis-parse overflowing input: #5398

For example, given this example program:

```rust
fn main() {
    let a = lexical_core::parse::<u8>(b"999");
    println!("{:?}", a);
    let a = "999".parse::<u8>();
    println!("{:?}", a);
}
```

It returns:

```
Ok(231)
Err(ParseIntError { kind: PosOverflow })
```

Using lexical_core = "0.8.5".

I don't think we can rely on lexical_core anymore, at least not for integer parsing.
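If the standard library's overflow-checked `str::parse` were used instead, a width-inference helper might look like the sketch below. This is a minimal illustration, not arrow's actual code; `narrowest_unsigned` and the returned type names are hypothetical.

```rust
// Hypothetical helper: find the narrowest unsigned integer type that can
// hold `s`, using std's parse, which returns Err on overflow instead of
// silently wrapping like the buggy lexical_core behaviour shown above.
fn narrowest_unsigned(s: &str) -> Option<&'static str> {
    if s.parse::<u8>().is_ok() {
        Some("UInt8")
    } else if s.parse::<u16>().is_ok() {
        Some("UInt16")
    } else if s.parse::<u32>().is_ok() {
        Some("UInt32")
    } else if s.parse::<u64>().is_ok() {
        Some("UInt64")
    } else {
        None
    }
}

fn main() {
    // std rejects "999" as a u8, unlike lexical_core 0.8.5.
    assert!("999".parse::<u8>().is_err());
    assert_eq!(narrowest_unsigned("999"), Some("UInt16"));
    assert_eq!(narrowest_unsigned("-1"), None);
}
```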
Ah that's unfortunate.
I worry that this will make schema inference very fragile: it is typically only run on a subset of the data, so with this change the inferred type may be overly restrictive in size or sign.
A few thoughts:
- We should prefer signed types by default, as these are sufficient most of the time
- The new behaviour should be opt-in
- I think this will substantially regress performance, as it now must fully parse integers
Perhaps we could see how other systems handle this, e.g. DuckDB or Spark?
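The signed-by-default, widen-as-needed idea above could be sketched as follows. This is an illustrative sketch only, assuming a per-row candidate type that is merged by taking the widest type seen; `IntType` and `infer_row` are hypothetical names, not the PR's code.

```rust
// Candidate signed widths, ordered narrowest to widest so Ord gives widening.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum IntType {
    Int8,
    Int16,
    Int32,
    Int64,
}

// Infer the narrowest signed type that holds a single value, preferring
// signed types by default as suggested in the review comment.
fn infer_row(s: &str) -> Option<IntType> {
    if s.parse::<i8>().is_ok() {
        Some(IntType::Int8)
    } else if s.parse::<i16>().is_ok() {
        Some(IntType::Int16)
    } else if s.parse::<i32>().is_ok() {
        Some(IntType::Int32)
    } else if s.parse::<i64>().is_ok() {
        Some(IntType::Int64)
    } else {
        None
    }
}

fn main() {
    // The column type is the widest per-row candidate across all rows.
    let col = ["1", "300", "-5"].iter().filter_map(|s| infer_row(s)).max();
    assert_eq!(col, Some(IntType::Int16));
}
```

Note this illustrates the fragility concern too: inferring from only the first row (`"1"`) would yield `Int8` and fail on `"300"` later.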
Marking as draft as it is not waiting on review; please feel free to mark it as ready for review when you would like me to take another look.
Which issue does this PR close?

Partially fixes #802

Rationale for this change

As explained in #802, the current `infer_schema` logic always returns `DataType::Int64` for all integer types. This PR adds additional inference logic to find the smallest integer type that fits the data.

This PR also fixes an issue where the data in fact doesn't fit in a `DataType::Int64` and instead requires a `UInt64`.

Are there any user-facing changes?

Any user code loading CSV data using `infer_schema()` and always expecting to receive `Int64` fields would be affected. No API changes.