
Add ParquetRecordBatchReaderBuilder (#2427) #2435

Merged

Conversation

tustvold (Contributor):

Which issue does this PR close?

Closes #2427

Rationale for this change

This standardises the configuration of the async and sync arrow parquet readers, helping to avoid inconsistency and reduce duplication.

What changes are included in this PR?

Adds a new ParquetRecordBatchReaderBuilder and deprecates the old APIs

Are there any user-facing changes?

This deprecates the old APIs; however, it doesn't remove any of them.
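For illustration, a minimal sketch of reading a file with the new builder; the file path and batch size here are placeholders, not part of the PR:

use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

// Hypothetical input file, for illustration only
let file = File::open("data.parquet").unwrap();

// Configure and build a reader via the new unified builder
let reader = ParquetRecordBatchReaderBuilder::try_new(file)
    .unwrap()
    .with_batch_size(1024)
    .build()
    .unwrap();

for batch in reader {
    println!("Read {} rows", batch.unwrap().num_rows());
}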

@tustvold (Contributor, author):

@Ted-Jiang could you perhaps take a look and make sure this makes sense?

github-actions bot added the parquet (Changes to the parquet crate) label on Aug 12, 2022.
#[derive(Debug, Clone, Default)]
pub struct ArrowReaderOptions {
skip_arrow_metadata: bool,
selection: Option<RowSelection>,
page_index: bool,
tustvold (Contributor, author):

There is a detail worth highlighting here: this forces decoding of the page index for all row groups, as the row group selection isn't known at the point the metadata is read. I experimented with APIs to allow for this, but they were very clunky; ultimately the index information should be relatively small and cheap to decode, so I didn't think it was worth it.
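As a hedged sketch, opting in might look like this, assuming the with_page_index and try_new_with_options methods as named here:

use std::fs::File;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

// Opt in to decoding the page index for all row groups up front;
// the exact method names here are assumptions for illustration
let options = ArrowReaderOptions::new().with_page_index(true);
let file = File::open("data.parquet").unwrap();
let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options).unwrap();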

@Ted-Jiang (Member) commented on Aug 13, 2022:

Sounds reasonable.
I think reading the page_index should belong with opening the file. Have you found out how much reading the page_index costs? 🤔

Contributor:

Suggested change
- page_index: bool,
+ /// if true, forces decoding of the page index for all row groups
+ /// as the group selection isn't known at the point the metadata is read
+ page_index: bool,

///
/// TODO: Revisit this API, as [`Self`] is provided before the file metadata is available
#[allow(unused)]
pub(crate) fn with_row_selection(self, selection: impl Into<RowSelection>) -> Self {
tustvold (Contributor, author):

This API is removed, as it was impossible to use

Contributor:

Isn't it moved to ArrowReaderBuilder::with_row_selection?


// Verify that the schema was correctly parsed
let original_schema = arrow_reader.get_schema().unwrap().fields().clone();
assert_eq!(original_schema, *record_batch_reader.schema().fields());
assert_eq!(original_schema.fields(), reader.schema().fields());
tustvold (Contributor, author):

I'm not really sure this test makes sense anymore, but I kept the spirit of it

}

/// Returns a reference to the [`ParquetMetaData`] for this parquet file
pub fn metadata(&self) -> &Arc<ParquetMetaData> {
tustvold (Contributor, author):

This logic is moved into ArrowReaderBuilder

/// * For an asynchronous API - [`ParquetRecordBatchStreamBuilder`]
///
/// [`ParquetRecordBatchStreamBuilder`]: [crate::arrow::async_reader::ParquetRecordBatchStreamBuilder]
pub struct ArrowReaderBuilder<T> {
tustvold (Contributor, author):

This is largely moved from ParquetRecordBatchStreamBuilder

@@ -194,112 +194,23 @@ impl<T: AsyncRead + AsyncSeek + Unpin + Send> AsyncFileReader for T {
}
}

#[doc(hidden)]
/// A newtype used within [`ReaderOptionsBuilder`] to distinguish sync readers from async
pub struct AsyncReader<T>(T);
tustvold (Contributor, author):

This is the type trickery that allows sharing the same builder for both the sync and async versions, whilst also not breaking the existing ParquetRecordBatchStreamBuilder API
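To make the trick concrete, here is a standalone sketch of the pattern with simplified names, not the crate's actual definitions: a single generic builder carries the shared configuration, while sync-specific (or async-specific) construction hangs off newtype-wrapped inputs.

// Marker newtypes wrapping the underlying input
pub struct SyncReader<T>(T);
pub struct AsyncReader<T>(T);

pub struct ArrowReaderBuilder<T> {
    input: T,
    batch_size: usize,
}

impl<T> ArrowReaderBuilder<T> {
    // Configuration shared by both the sync and async flavours
    pub fn with_batch_size(self, batch_size: usize) -> Self {
        Self { batch_size, ..self }
    }
}

// Construction logic only available for the sync flavour
impl<T: std::io::Read> ArrowReaderBuilder<SyncReader<T>> {
    pub fn build(self) { /* consume self.input.0 synchronously */ }
}

// The existing sync name can then be kept as an alias
pub type ParquetRecordBatchReaderBuilder<T> = ArrowReaderBuilder<SyncReader<T>>;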

@Ted-Jiang (Member) left a comment:

This change sounds reasonable to me! 👍
But I found something: I think only ParquetRecordBatchReader now has the ability to read the page index info near the footer.

Should we also support it in the async reader, or, because its cost is small, could we use the SyncReader before using the async one?

@tustvold (Contributor, author):

I filed #2430 to track adding page index support to the async reader. There is a slight additional complication as it needs to perform IO to read the corresponding bytes, but nothing intractable. Thinking a bit more, I wonder if this should be handled by AsyncFileReader? 🤔

@alamb (Contributor) left a comment:

I think this looks great -- thanks @tustvold

The only question I had was whether it made sense to put this API somewhere other than arrow, as I think everything (except RowFilter) is applicable to other uses as well.

Also, another check that might be worth doing is to make a draft PR to DataFusion to ensure this API can be used without issue

@@ -124,6 +124,49 @@ impl RowGroupCollection for Arc<dyn FileReader> {
}
}

pub(crate) struct FileReaderRowGroupCollection {
reader: Arc<dyn FileReader>,
row_groups: Option<Vec<usize>>,
Contributor:

I think it would help to document what usize means here -- I assume it is the index of the row group within the parquet file? And that if this is None, all row groups will be read?
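For illustration, the requested documentation might read something like this; a sketch of what the comment asks for, not committed code:

pub(crate) struct FileReaderRowGroupCollection {
    /// The underlying [`FileReader`]
    reader: Arc<dyn FileReader>,
    /// Indexes of the row groups within the parquet file to read,
    /// or `None` to read all row groups
    row_groups: Option<Vec<usize>>,
}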

let read_schema = arrow_reader.get_schema()?;
assert_eq!(schema, read_schema);

// read all fields by columns
Contributor:

Isn't the use case (and test) of reading a partial schema still valid?

tustvold (Contributor, author):

There is no separate get_schema_by_columns API anymore, and there is no difference between specifying ProjectionMask::all and not specifying a mask, so this additional bit of the test no longer makes sense. It dates from when get_schema_by_columns used completely different logic from the array reader.
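To illustrate the equivalence, a sketch using the builder's projection API; the file path is a placeholder:

use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ProjectionMask;

let file = File::open("data.parquet").unwrap();

// Explicitly selecting every leaf column via ProjectionMask::all yields
// the same schema as never calling with_projection at all
let builder = ParquetRecordBatchReaderBuilder::try_new(file)
    .unwrap()
    .with_projection(ProjectionMask::all());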

@@ -194,112 +194,23 @@ impl<T: AsyncRead + AsyncSeek + Unpin + Send> AsyncFileReader for T {
}
}

#[doc(hidden)]
/// A newtype used within [`ReaderOptionsBuilder`] to distinguish sync readers from async
Contributor:

Suggested change
- /// A newtype used within [`ReaderOptionsBuilder`] to distinguish sync readers from async
+ /// A newtype used within [`ReaderOptionsBuilder`] to distinguish sync readers from async
+ /// Allows sharing the same builder for both the sync and async versions, whilst also not
+ /// breaking the existing ParquetRecordBatchStreamBuilder API

/// * For a synchronous API - [`ParquetRecordBatchReaderBuilder`]
/// * For an asynchronous API - [`ParquetRecordBatchStreamBuilder`]
///
/// [`ParquetRecordBatchStreamBuilder`]: [crate::arrow::async_reader::ParquetRecordBatchStreamBuilder]
Contributor:

Eventually it would be great to update the examples to use this (much nicer) API as well: https://docs.rs/parquet/20.0.0/parquet/arrow/index.html

&self.schema
}

/// Set the size of [`RecordBatch`] to produce
Contributor:

Suggested change
- /// Set the size of [`RecordBatch`] to produce
+ /// Set the size of [`RecordBatch`] to produce. Defaults to 1024

}
}

/// Provide a [`RowFilter`] to skip decoding rows
Contributor:

Suggested change
- /// Provide a [`RowFilter`] to skip decoding rows
+ /// Provide a [`RowFilter`] to skip decoding rows. Row filters are applied
+ /// after row group selection and row selection

pub(crate) selection: Option<RowSelection>,
}

impl<T> ArrowReaderBuilder<T> {
Contributor:

This is looking like a very nice API 👌 👨‍🍳

🎩 tip to you @tustvold @Ted-Jiang and @thinkharderdev for this. Very cool

let file = File::open(&path).unwrap();
let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();

let mask = ProjectionMask::leaves(builder.parquet_schema(), [3, 8, 10]);
Contributor:

👌 very nice

@tustvold (Contributor, author):

> as I think everything (except RowFilter) is applicable to other uses as well.

I would rather leave this for when I eventually get to cleaning up the lower level APIs. This PR can be viewed as decoupling the arrow implementation from the other APIs, so that a subsequent PR can revisit them.

Comment on lines +150 to +156
#[allow(unused)]
pub(crate) fn with_row_selection(self, selection: RowSelection) -> Self {
Self {
selection: Some(selection),
..self
}
}
Contributor:

Hmm, would it make sense to collapse with_row_selection and with_row_filter? The API is a bit confusing with both. And you could always just define a RowSelection as an ArrowPredicate.

Edit: To clarify a bit, I'm not sure I understand the use case in which you would have the RowSelection when constructing the reader. Obviously defining the selection in terms of an ArrowPredicate is not ideal since it is only applied after decoding, which pretty much defeats the purpose.

tustvold (Contributor, author):

They are related but different; in particular, with_row_selection exists to allow you to specify a row selection before reading any data, e.g. based on information in the PageIndex.
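To illustrate the distinction, a sketch assuming the RowSelector and ArrowPredicateFn APIs as exposed by the crate; the predicate here is a trivial keep-everything placeholder:

use arrow::array::BooleanArray;
use parquet::arrow::arrow_reader::{ArrowPredicateFn, RowFilter, RowSelection, RowSelector};
use parquet::arrow::ProjectionMask;
use parquet::schema::types::SchemaDescriptor;

fn selection_and_filter(schema: &SchemaDescriptor) -> (RowSelection, RowFilter) {
    // A RowSelection is fixed before any data is read, e.g. derived
    // from the page index: skip the first 100 rows, read the next 50
    let selection = RowSelection::from(vec![
        RowSelector::skip(100),
        RowSelector::select(50),
    ]);

    // A RowFilter is evaluated against decoded batches of its projected
    // columns as the read progresses; this placeholder keeps every row
    let mask = ProjectionMask::leaves(schema, [0]);
    let filter = RowFilter::new(vec![Box::new(ArrowPredicateFn::new(
        mask,
        |batch| Ok(BooleanArray::from(vec![true; batch.num_rows()])),
    ))]);

    (selection, filter)
}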

Contributor:

Ah, yeah that makes sense.

Contributor:

Might be good to clarify this in the comments (as others will likely have the same question).

@tustvold merged commit 76cfe83 into apache:master on Aug 15, 2022.
@ursabot commented on Aug 15, 2022:

Benchmark runs are scheduled for baseline = 3f0e12d and contender = 76cfe83. 76cfe83 is a master commit associated with this PR. Benchmarking of arrow-rs-commits was skipped ⚠️ on all Conbench runners (ec2-t3-xlarge-us-east-2, test-mac-arm, ursa-i9-9960x, ursa-thinkcentre-m75q).
