Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix record delimiting on row group boundaries (#2025) #2027

Merged
merged 3 commits into from Jul 8, 2022

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Jul 8, 2022

Which issue does this PR close?

Closes #2025

Rationale for this change

The previous behaviour was incorrect

What changes are included in this PR?

Fixes the delimiting logic in RecordReader

Are there any user-facing changes?

No

@@ -124,7 +124,7 @@ impl<OffsetSize: OffsetSizeTrait> ArrayReader for ListArrayReader<OffsetSize> {

// The output offsets for the computed ListArray
let mut list_offsets: Vec<OffsetSize> =
Vec::with_capacity(next_batch_array.len());
Vec::with_capacity(next_batch_array.len() + 1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive by fix

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 8, 2022
@@ -193,10 +180,7 @@ where
};

// Try to more value from parquet pages
let values_read = self.read_one_batch(batch_size)?;
if values_read < batch_size {
Copy link
Contributor Author

@tustvold tustvold Jul 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the cause of the bug, the end of the column chunk was being detected based on if read_on_batch returned less than the batch size. If the batch_size happens to exactly match the remaining records this would fail, and the reader would miss the last record in the chunk

@tustvold tustvold force-pushed the fix-list-row-group-boundary branch from 9472312 to 1fb75f8 Compare July 8, 2022 14:01
@github-actions github-actions bot added the arrow-flight Changes to the arrow-flight crate label Jul 8, 2022
@tustvold tustvold force-pushed the fix-list-row-group-boundary branch from 1fb75f8 to ad58424 Compare July 8, 2022 14:02
@tustvold tustvold removed the arrow-flight Changes to the arrow-flight crate label Jul 8, 2022
@tustvold tustvold force-pushed the fix-list-row-group-boundary branch from 7ffd362 to abefdd4 Compare July 8, 2022 14:44
@tustvold
Copy link
Contributor Author

tustvold commented Jul 8, 2022

I don't understand what the CI is doing, it seems to be checking out some ref that doesn't exist???

@tustvold tustvold force-pushed the fix-list-row-group-boundary branch 4 times, most recently from 5e12b71 to 63c7394 Compare July 8, 2022 15:45
@tustvold tustvold force-pushed the fix-list-row-group-boundary branch from 63c7394 to 7681047 Compare July 8, 2022 15:54
@tustvold tustvold force-pushed the fix-list-row-group-boundary branch from 7681047 to b9322fc Compare July 8, 2022 16:03
@tustvold
Copy link
Contributor Author

tustvold commented Jul 8, 2022

Turns out it was a combination of PEBCAC and a logical conflict when CI merged the lastest master in.

@@ -731,4 +739,48 @@ mod tests {
assert_eq!(5000, record_reader.num_values());
}
}

#[test]
fn test_row_group_boundary() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test passes on master, without the changes in this PR 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is somewhat expected, I'll see if I can make it fail

@@ -1528,4 +1528,57 @@ mod tests {

assert_eq!(total_rows, expected_rows);
}

#[test]
fn test_row_group_exact_multiple() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I remove the code in this PR this test fails like the following (I was expecting the reader to miss the last record)


---- arrow::arrow_reader::tests::test_row_group_exact_multiple stdout ----
thread 'arrow::arrow_reader::tests::test_row_group_exact_multiple' panicked at 'called `Result::unwrap()` on an `Err` value: ParquetError("Parquet error: Not all children array length are the same!")', parquet/src/arrow/arrow_reader.rs:1581:53
stack backtrace:
   0: rust_begin_unwind
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/panicking.rs:142:14
   2: core::result::unwrap_failed
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/result.rs:1785:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/result.rs:1078:23
   4: parquet::arrow::arrow_reader::tests::test_row_group_exact_multiple
             at ./src/arrow/arrow_reader.rs:1581:23
   5: parquet::arrow::arrow_reader::tests::test_row_group_exact_multiple::{{closure}}
             at ./src/arrow/arrow_reader.rs:1533:5
   6: core::ops::function::FnOnce::call_once
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/ops/function.rs:248:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/ops/function.rs:248:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only misses the last record for the repeated array, and so you end up with this error

@alamb
Copy link
Contributor

alamb commented Jul 8, 2022

@tustvold lint failed !

@tustvold tustvold force-pushed the fix-list-row-group-boundary branch from 0fe7e57 to e5f57df Compare July 8, 2022 18:01
@tustvold tustvold force-pushed the fix-list-row-group-boundary branch from e5f57df to 4ecf5d8 Compare July 8, 2022 18:04
@tustvold
Copy link
Contributor Author

tustvold commented Jul 8, 2022

So the batch_size bug was a bug, but would be masked by the MIN_BATCH_SIZE setting of 1024 and so would only occur with batch sizes >= 1024. The actual cause of the failure was more subtle. The bug was that once exhausted RecordReader would continue to re-read the last record on subsequent calls to read_records. This would only occur if it returned exactly the batch_size number of records, when reaching the end of a chunk. This wouldn't actually read any new data and so would end up actually returning less records than it claimed to have read.

So read_records would do the following

  • Read 8 records
  • Return the corresponding values to the caller
  • Read 1 phantom record
  • Read 7 records
  • Return this to the caller claiming it to be 8 records, but actually only being 7
  • Read the remaining 5 records

If only a single list column this somewhat funky behaviour would not be noticeable, as it would eventually read all the data, corresponding to the total number of rows. However, it would read batches of 8, 7, 5 instead of 8, 8, 4, which would throw off the StructArrayReader.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Array reader for list columns fails to decode if batches fall on row group boundaries
2 participants