Fix page size on dictionary fallback #2854
@@ -551,7 +551,10 @@ where

         match &mut encoder.dict_encoder {
             Some(dict_encoder) => dict_encoder.encode(values, indices),
-            None => encoder.fallback.encode(values, indices),
+            None => {
+                encoder.num_values += indices.len();

Should we be doing this regardless of if we've fallen back? I think currently this will fail to flush a dictionary encoded data page even if it has reached sufficient size?

Maybe, when we do it that way it causes a panic which may also be a bug.

I think we need to reset num_values to 0 when we flush a data page

I think it already does that right?

+                encoder.fallback.encode(values, indices)
+            }
         }
     }
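
The last two comments above concern resetting num_values when a data page is flushed. The following is a minimal sketch of one way to do that, not the arrow-rs implementation: PageBuffer, FlushedPage, and their fields are hypothetical names invented for illustration. It uses std::mem::take so the counter and the byte buffer are handed to the flushed page and reset in the same step.

```rust
/// Hypothetical page buffer; names are illustrative, not arrow-rs types.
#[derive(Default)]
struct PageBuffer {
    /// Encoded bytes buffered for the current data page.
    bytes: Vec<u8>,
    /// Number of values buffered since the last flush.
    num_values: usize,
}

/// A flushed page: its encoded bytes and the number of values they hold.
struct FlushedPage {
    bytes: Vec<u8>,
    num_values: usize,
}

impl PageBuffer {
    /// Buffers the values and counts them.
    fn encode(&mut self, values: &[&str]) {
        self.num_values += values.len();
        for v in values {
            self.bytes.extend_from_slice(v.as_bytes());
        }
    }

    /// Returns the buffered page and leaves the buffer empty and the
    /// counter at zero, so the next page starts from a clean state.
    fn flush(&mut self) -> FlushedPage {
        FlushedPage {
            bytes: std::mem::take(&mut self.bytes),
            num_values: std::mem::take(&mut self.num_values),
        }
    }
}
```

Because the reset happens inside flush itself, a caller cannot flush a page and forget to zero the counter.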
I'm guessing the problem was that whilst the estimated_data_page_size would increase, the lack of any values would cause it to erroneously not try to flush the page?
In particular https://github.com/apache/arrow-rs/blob/master/parquet/src/column/writer/mod.rs#L567
yep, exactly
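
The exchange above suggests that the flush check looks at both the buffered value count and the estimated page size. The sketch below illustrates that interaction under stated assumptions: ToyEncoder, encode_fallback, should_flush, and the count_values flag are hypothetical, and the shape of the check is assumed from the discussion rather than copied from the linked line.

```rust
/// Hypothetical stand-in for an encoder that has already fallen back from
/// dictionary to plain encoding. Illustrative only, not the arrow-rs types.
#[derive(Default)]
struct ToyEncoder {
    /// Bytes buffered by the fallback path.
    fallback_bytes: Vec<u8>,
    /// Values buffered since the last flush.
    num_values: usize,
}

impl ToyEncoder {
    /// Encodes through the fallback path. `count_values` toggles the change
    /// made in the diff (incrementing `num_values` before encoding).
    fn encode_fallback(&mut self, values: &[&str], count_values: bool) {
        if count_values {
            self.num_values += values.len();
        }
        for v in values {
            self.fallback_bytes.extend_from_slice(v.as_bytes());
        }
    }

    fn estimated_data_page_size(&self) -> usize {
        self.fallback_bytes.len()
    }

    /// Assumed shape of the check: never flush a page that reports no
    /// values, otherwise compare the estimated size against the limit.
    fn should_flush(&self, page_size_limit: usize) -> bool {
        if self.num_values == 0 {
            return false;
        }
        self.estimated_data_page_size() >= page_size_limit
    }
}
```

With count_values set to false, the estimated size grows past the limit while num_values stays at zero, so should_flush keeps returning false. With it set to true, the same input triggers a flush.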