Use offset index in ParquetRecordBatchStream #2526
Conversation
```
@@ -34,8 +34,12 @@ pub fn read_columns_indexes<R: ChunkReader>(
    let (offset, lengths) = get_index_offset_and_lengths(chunks)?;
    let length = lengths.iter().sum::<usize>();

    if length == 0 {
```
Not sure if this is right or we should return an empty vec?
parquet/src/arrow/async_reader.rs (Outdated)

```
/// Provides asynchronous access to the page index for each column chunk in a
/// row group. Will panic if `row_group_idx` is greater than or equal to `num_row_groups`
fn get_column_indexes(
```
Not sure if we should make this a separate constructor or just have it always use the index. If the index doesn't exist, the cost of determining that is minimal.
I'm not sure of the value of exposing these on AsyncFileReader rather than just handling the logic internally. Ultimately, if the implementer wants to override the way the index is fetched, they can just return ParquetMetadata from get_metadata with the index information already loaded.
I think I would opt to keep more of this logic hidden, i.e. not exposed on AsyncFileReader, and make use of get_byte_ranges to avoid making lots of separate small fetch requests.
object_store/src/local.rs (Outdated)
```
@@ -1068,6 +1068,7 @@ mod tests {
        integration.head(&path).await.unwrap();
    }

    #[ignore]
```
😄
parquet/src/arrow/async_reader.rs (Outdated)
```
@@ -218,6 +296,36 @@ impl<T: AsyncFileReader + Send + 'static> ArrowReaderBuilder<AsyncReader<T>> {
        Self::new_builder(AsyncReader(input), metadata, Default::default())
    }

    pub async fn new_with_index(mut input: T) -> Result<Self> {
```
I think it would be more consistent to have new_with_options accepting ArrowReaderOptions, which already has a field on it for the page index.
parquet/src/arrow/async_reader.rs (Outdated)
```
    }
}

let metadata = Arc::new(ParquetMetaData::new_with_page_index(
```
Not part of this PR, but I still feel something is off with the way the index information is located on ParquetMetadata...
Edit: filed #2530
Yeah, I agree.
```
@@ -64,6 +68,10 @@ pub fn read_pages_locations<R: ChunkReader>(
) -> Result<Vec<Vec<PageLocation>>, ParquetError> {
    let (offset, total_length) = get_location_offset_and_total_length(chunks)?;

    if total_length == 0 {
```
I think this might fix #2434
```
@@ -162,7 +162,12 @@ impl RowSelection {
                current_selector = selectors.next();
            }
        } else {
-           break;
+           if !(selector.skip || current_page_included) {
```
What are the implications of this change?
Before we would break if we were on the last page. So if you skipped from inside the second to last page into the last page, then this would short circuit and the last page range wouldn't be selected.
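To make the fixed behaviour concrete, here is a std-only toy model (hypothetical names, not the real `RowSelection` code) of mapping selectors onto pages. With the fix, a skip that crosses into the final page no longer prevents a trailing select from including that page:

```rust
#[derive(Clone, Copy)]
struct Selector {
    row_count: usize,
    skip: bool,
}

/// Toy model of what `scan_ranges` must compute: the indices of pages
/// that contain at least one selected row. A page is included if any
/// select interval overlaps its row range.
fn included_pages(page_rows: &[usize], selectors: &[Selector]) -> Vec<usize> {
    // Prefix sums: page i covers rows [starts[i], starts[i + 1])
    let mut starts = vec![0usize];
    for &n in page_rows {
        starts.push(starts.last().unwrap() + n);
    }
    let mut included = vec![false; page_rows.len()];
    let mut row = 0usize;
    for s in selectors {
        if !s.skip {
            for i in 0..page_rows.len() {
                // Overlap test between [row, row + row_count) and page i
                if starts[i] < row + s.row_count && starts[i + 1] > row {
                    included[i] = true;
                }
            }
        }
        row += s.row_count;
    }
    (0..page_rows.len()).filter(|&i| included[i]).collect()
}

fn main() {
    // Two pages of 100 rows; skip 150 rows (ending inside the last page),
    // then select 50. The last page must be fetched: this is exactly the
    // case the old `break` short-circuited.
    let sel = [
        Selector { row_count: 150, skip: true },
        Selector { row_count: 50, skip: false },
    ];
    assert_eq!(included_pages(&[100, 100], &sel), vec![1]);
}
```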
parquet/src/arrow/async_reader.rs (Outdated)
```
// read all needed data into buffer
let data = self
    .get_bytes(offset as usize..offset as usize + length)
```
This will perform separate get_bytes requests to fetch the page index and column index information for each column chunk. This is likely not a good idea, especially since the requests will be performed serially.
Ideally we would identify the ranges of all the index information and then call get_byte_ranges; this allows coalescing proximate requests to ObjectStore, parallel fetches, etc.
Yeah, that makes sense.
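As a sketch of the suggested approach (std-only; `coalesce_ranges` is a hypothetical helper here, while the real coalescing lives behind `get_byte_ranges` / `ObjectStore::get_ranges`):

```rust
use std::ops::Range;

/// Merge byte ranges whose gap is at most `max_gap`, so that many small
/// index reads can be served by a few larger fetches.
fn coalesce_ranges(mut ranges: Vec<Range<usize>>, max_gap: usize) -> Vec<Range<usize>> {
    ranges.sort_by_key(|r| r.start);
    let mut out: Vec<Range<usize>> = Vec::new();
    for r in ranges {
        match out.last_mut() {
            // Close enough to the previous range: extend it instead of
            // issuing a separate fetch
            Some(last) if r.start <= last.end + max_gap => {
                last.end = last.end.max(r.end);
            }
            _ => out.push(r),
        }
    }
    out
}

fn main() {
    // Column-index blobs for three column chunks; the first two are
    // nearly adjacent on disk, the third is far away.
    let ranges = vec![100..180, 200..260, 10_000..10_050];
    assert_eq!(coalesce_ranges(ranges, 64), vec![100..260, 10_000..10_050]);
}
```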
parquet/src/arrow/async_reader.rs (Outdated)
```
let mut offset_indexes = vec![];

for (idx, rg) in row_groups.iter_mut().enumerate() {
    let column_index = input.get_column_indexes(metadata.clone(), idx).await?;
```
We should check if the column index has already been fetched into the metadata, and not fetch it again if it is already present.
```
// then we need to also fetch a dictionary page.
let mut ranges = vec![];
let (start, _len) = chunk_meta.byte_range();
match page_locations[idx].first() {
```
Discovered this bug as well. We weren't fetching a dictionary page if it existed
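The fix can be sketched like this (std-only toy; `(offset, compressed_size)` pairs stand in for the crate's `PageLocation`): a gap between the column chunk's byte start and the first data page location implies a dictionary page that must also be fetched.

```rust
use std::ops::Range;

/// Compute the byte ranges to fetch for one column chunk, including any
/// dictionary page sitting between the chunk start and the first data page.
fn ranges_with_dictionary(
    chunk_start: usize,
    page_locations: &[(usize, usize)], // (offset, compressed_size) per data page
) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    if let Some(&(first_offset, _)) = page_locations.first() {
        // A gap in front of the first data page is the dictionary page;
        // the buggy code ignored it and never fetched those bytes.
        if first_offset > chunk_start {
            ranges.push(chunk_start..first_offset);
        }
    }
    for &(offset, len) in page_locations {
        ranges.push(offset..offset + len);
    }
    ranges
}

fn main() {
    // Dictionary page occupies bytes 1000..1100 before the first data page
    let ranges = ranges_with_dictionary(1000, &[(1100, 50), (1150, 60)]);
    assert_eq!(ranges, vec![1000..1100, 1100..1150, 1150..1210]);
    // No gap: no dictionary page to fetch
    assert_eq!(ranges_with_dictionary(500, &[(500, 40)]), vec![500..540]);
}
```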
```
@@ -312,7 +312,7 @@ where

    // If page has less rows than the remaining records to
    // be skipped, skip entire page
-   if metadata.num_rows < remaining {
+   if metadata.num_rows <= remaining {
```
Another bug I uncovered while testing: this caused the reader to unnecessarily try to fetch pages that were not pre-fetched.
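A small std-only model of the off-by-one (hypothetical names; `inclusive = true` mirrors the `<=` fix): when a page holds exactly the remaining records to skip, the old `<` forced that page to be decoded instead of skipped wholesale.

```rust
/// Count how many pages must actually be decoded (rather than skipped
/// wholesale) to skip `remaining` records.
fn pages_decoded_to_skip(page_rows: &[usize], mut remaining: usize, inclusive: bool) -> usize {
    let mut decoded = 0;
    for &num_rows in page_rows {
        if remaining == 0 {
            break;
        }
        // Fixed condition uses `<=`; the old one used `<`
        let skip_whole = if inclusive { num_rows <= remaining } else { num_rows < remaining };
        if skip_whole {
            remaining -= num_rows;
        } else {
            // A partially-skipped page must be fetched and decoded
            decoded += 1;
            remaining = 0;
        }
    }
    decoded
}

fn main() {
    // Skipping exactly one full page of 100 rows:
    // `<` decodes a page it could have skipped; `<=` decodes none.
    assert_eq!(pages_decoded_to_skip(&[100, 100], 100, false), 1);
    assert_eq!(pages_decoded_to_skip(&[100, 100], 100, true), 0);
}
```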
```
    return Self::new_builder(AsyncReader(input), metadata, options);
}

fetch_ranges.push(loc_offset as usize..loc_offset as usize + loc_length);
```
I think this will read one col_index and one page_location alternately, but they are written separately:
https://github.com/apache/parquet-format/blob/master/doc/images/PageIndexLayout.png
I think if we don't cache all the bytes in memory, we should read the whole col_index, then the page_location. Why not use 🤔
```
/// Read all column indexes of one row group and convert them into [`Index`].
/// If the format is not available, return an empty vector.
pub fn read_columns_indexes<R: ChunkReader>(
    reader: &R,
    chunks: &[ColumnChunkMetaData],
) -> Result<Vec<Index>, ParquetError> {
```
This can't use read_columns_indexes as this needs to asynchronously fetch the bytes
Oh! Thanks! But we can still try to combine all the col_index or page_location entries together separately.
The onus is on AsyncFileReader::get_ranges to handle this; e.g. ObjectStore::get_ranges does this already.
```
index_reader::get_location_offset_and_total_length(rg.columns())?;

let (idx_offset, idx_lengths) =
    index_reader::get_index_offset_and_lengths(rg.columns())?;
```
It occurs to me that this method is making a pretty strong assumption that the column index data is contiguous; I'm not sure that is actually guaranteed... Definitely a separate issue from this PR though.
```
rg.set_page_offset(offset_index.clone());
offset_indexes.push(offset_index);
```
Again this interface seems really confused - something for #2530
Looks good to me, thank you. There is definitely some cleanup to do with the page index plumbing, but that is beyond the scope of this PR.
Benchmark runs are scheduled for baseline = 0f45932 and contender = 1eb6c45. 1eb6c45 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #2430
Closes #2434
Rationale for this change
Leverage OffsetIndex (if available) to prune IO in ParquetRecordBatchStream
What changes are included in this PR?
Allow user to specify read options in ParquetRecordBatchStreamBuilder which will fetch index metadata when building.
Assorted bug fixes:

- `RowSelection::scan_ranges` when skipping past the final page boundary
- `SerializedPageReader` from `InMemoryColumnChunk` if we had a page index

Are there any user-facing changes?