Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RowSelection::intersection Produces Invalid RowSelection #5036

Closed
tustvold opened this issue Nov 3, 2023 · 2 comments · Fixed by #5041
Closed

RowSelection::intersection Produces Invalid RowSelection #5036

tustvold opened this issue Nov 3, 2023 · 2 comments · Fixed by #5041
Assignees
Labels
bug parquet Changes to the parquet crate

Comments

@tustvold
Copy link
Contributor

tustvold commented Nov 3, 2023

Describe the bug

An invariant of RowSelection is that it alternates select and skip, and does not contain empty RowSelector.

This is typically enforced when a RowSelection is created from a slice (or vec) of RowSelector by from_selectors_and_combine.

When intersect_row_selections was imported from DataFusion in #3047 and subsequently exposed as a member function in https://github.com/apache/arrow-rs/pull/3084/files#diff-7638a63d118da0ac5321c1948eb9acfc59f7acee56598879eba8338b2c22ff9eR334 a subtle bug was introduced.

intersect_row_selections does not produce a Vec<RowSelector> that obey the invariants of RowSelection, and yet the member function doesn't call from_selectors_and_combine.

This results in RowSelection of the form [Skip(x), Skip(y)]. The async reader determines what data to fetch based on what rows are selected, however, when reading the data it performs each operation in turn. In order to perform the first skip, the reader must set up the decoders to the relevant position within the pages (as it doesn't know that the next operation is another skip). This in turn causes it to request data that wasn't fetched, and the reader bails out with an offset index error.

To Reproduce

#[test]
fn test_intersection() {
    let selection = RowSelection::from(vec![RowSelector::select(1048576)]);
    let result = selection.intersection(&selection);
    assert_eq!(result, selection);
}

Expected behavior

Additional context

@tustvold tustvold added the bug label Nov 3, 2023
@tustvold tustvold self-assigned this Nov 3, 2023
@alamb
Copy link
Contributor

alamb commented Nov 3, 2023

For context we found this in some internal data in IOx

tustvold added a commit to tustvold/arrow-rs that referenced this issue Nov 6, 2023
tustvold added a commit that referenced this issue Nov 7, 2023
* Fix RowSelection::intersection (#5036)

* Review feedback
alamb pushed a commit to alamb/arrow-rs that referenced this issue Nov 7, 2023
* Fix RowSelection::intersection (apache#5036)

* Review feedback
@tustvold tustvold added the parquet Changes to the parquet crate label Nov 7, 2023
@tustvold
Copy link
Contributor Author

tustvold commented Nov 7, 2023

label_issue.py automatically added labels {'parquet'} from #5049

@alamb alamb mentioned this issue Nov 8, 2023
6 tasks
alamb pushed a commit to alamb/arrow-rs that referenced this issue Nov 8, 2023
* Fix RowSelection::intersection (apache#5036)

* Review feedback
alamb added a commit that referenced this issue Nov 8, 2023
* Fix RowSelection::intersection (#5036)

* Review feedback

Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants