[feat] Support using offset index in ParquetRecordBatchStream when pu… #3616

Ted-Jiang · 2022-09-26T03:35:49Z

…shing down RowFilter.

Signed-off-by: yangjiang yangjiang@ebay.com

Which issue does this PR close?

Closes #3456

Rationale for this change

Enable reading the page index through ParquetScanOptions.
If true, the reader reads the page index when it exists:

First, we can use it to create the RowSelector before reading the file (like row-group pruning, this avoids I/O).
Second, with the page index, skipping records is faster (no need to decode page headers) when reading values from a column chunk with a RowSelector.
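
To illustrate the second point, here is a minimal, self-contained sketch of how an offset index lets a reader decide which pages to fetch for a given row selection without decoding any page header. It is not the arrow-rs implementation: `PageLocation` here is a simplified stand-in whose field names follow the Parquet `PageLocation` struct, and `pages_to_fetch` is a hypothetical helper invented for this example.

```rust
use std::ops::Range;

/// Simplified stand-in for one offset-index entry (hypothetical type; the
/// field names follow the Parquet `PageLocation` struct).
#[derive(Debug, Clone, PartialEq)]
pub struct PageLocation {
    /// Byte offset of the page within the file.
    pub offset: u64,
    /// Size of the page in bytes, including its header.
    pub compressed_page_size: u32,
    /// Index of the first row stored in this page, relative to the row group.
    pub first_row_index: u64,
}

/// Returns the byte ranges of the pages that hold at least one selected row.
///
/// `total_rows` is the number of rows in the column chunk; `selected` holds
/// the row ranges the query still needs. Pages that overlap no selected range
/// are never fetched or decoded.
pub fn pages_to_fetch(
    pages: &[PageLocation],
    total_rows: u64,
    selected: &[Range<u64>],
) -> Vec<Range<u64>> {
    let mut out = Vec::new();
    for (i, page) in pages.iter().enumerate() {
        // A page covers rows from its first row up to the next page's first row.
        let start = page.first_row_index;
        let end = pages
            .get(i + 1)
            .map(|next| next.first_row_index)
            .unwrap_or(total_rows);
        let wanted = selected
            .iter()
            .any(|r| !r.is_empty() && r.start < end && start < r.end);
        if wanted {
            out.push(page.offset..page.offset + u64::from(page.compressed_page_size));
        }
    }
    out
}
```

For example, with three pages starting at rows 0, 100 and 200 in a 250-row chunk, selecting rows 120..130 and 240..250 touches only the second and third pages, so the first page's bytes are skipped entirely.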

What changes are included in this PR?

update submodule testing and parquet-testing

Are there any user-facing changes?

datafusion/core/src/datasource/file_format/parquet.rs

codecov-commenter · 2022-09-26T04:09:53Z

Codecov Report

Merging #3616 (875542d) into master (b54a56f) will decrease coverage by 0.01%.
The diff coverage is 90.24%.

@@            Coverage Diff             @@
##           master    #3616      +/-   ##
==========================================
- Coverage   86.03%   86.02%   -0.01%     
==========================================
  Files         300      300              
  Lines       56253    56456     +203     
==========================================
+ Hits        48395    48564     +169     
- Misses       7858     7892      +34

Impacted Files	Coverage Δ
...sion/core/src/physical_plan/file_format/parquet.rs	`94.33% <57.14%> (-0.32%)`	⬇️
...afusion/core/src/datasource/file_format/parquet.rs	`86.30% <97.05%> (+0.73%)`	⬆️
datafusion/physical-expr/src/regex_expressions.rs	`64.38% <0.00%> (-18.66%)`	⬇️
datafusion/core/src/physical_plan/sorts/cursor.rs	`57.14% <0.00%> (-7.15%)`	⬇️
datafusion/expr/src/operator.rs	`98.43% <0.00%> (-1.57%)`	⬇️
datafusion/core/src/execution/context.rs	`78.71% <0.00%> (-0.60%)`	⬇️
datafusion/physical-expr/src/expressions/binary.rs	`97.42% <0.00%> (-0.21%)`	⬇️
datafusion/optimizer/src/simplify_expressions.rs	`82.67% <0.00%> (-0.13%)`	⬇️
datafusion/physical-expr/src/functions.rs	`92.66% <0.00%> (-0.10%)`	⬇️
datafusion/core/tests/dataframe.rs	`98.56% <0.00%> (-0.06%)`	⬇️
... and 43 more


Ted-Jiang · 2022-09-26T06:00:16Z

@alamb @liukun4515 @thinkharderdev PTAL

thinkharderdev · 2022-09-26T11:10:54Z

Thanks @Ted-Jiang. I think it may be better to just pass the option to ParquetRecordBatchStreamBuilder. It will fetch the page indexes concurrently, and if users would like to cache the indexes (or fetch them from somewhere other than the file metadata), they can do so with a custom AsyncFileReader.

Ted-Jiang · 2022-09-26T14:12:47Z

pass the option to ParquetRecordBatchStreamBuilder

Thanks for your advice, @thinkharderdev! IMO, if we want to read the page index we must get its location from the metadata, so it must be read after the metadata. I think putting it directly after reading the metadata is better.

if users would like to cache the indexes other than the file metadata

The reason I put the index in the metadata is to reduce the code change in the repo. If users want to cache it anywhere else, they can use the public API in page_index::index_reader to read it with the metadata.

If I am wrong or have misunderstood, please correct me 😊

thinkharderdev · 2022-09-26T15:03:20Z

Correct. The advantage of what ParquetRecordBatchStreamBuilder does in arrow-rs is that it will fetch all the indexes concurrently using AsyncFileReader::get_ranges. So if you don't read the metadata up front, it will fetch it automatically for you. If we do want to grab it up front in ParquetExec, then we should copy the implementation in arrow-rs to grab everything concurrently.
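
As a rough sketch of the idea behind a batched fetch, the helper below gathers the byte range of every column index and offset index that is present, so that all of them could be handed to a single ranged-read call (such as the AsyncFileReader::get_ranges mentioned above) instead of reading the file piece by piece. `ChunkIndexPointers` and `index_ranges` are hypothetical names made up for this example; the field names mirror the index pointers in the Parquet `ColumnChunk` struct, and this is not the code arrow-rs uses.

```rust
use std::ops::Range;

/// Hypothetical, simplified view of the index pointers stored in a column
/// chunk's metadata (field names mirror the Parquet `ColumnChunk` struct).
pub struct ChunkIndexPointers {
    pub column_index_offset: Option<u64>,
    pub column_index_length: Option<u64>,
    pub offset_index_offset: Option<u64>,
    pub offset_index_length: Option<u64>,
}

/// Collects the byte range of every index that is present so the caller can
/// request them all in one batched (and potentially concurrent) read.
pub fn index_ranges(chunks: &[ChunkIndexPointers]) -> Vec<Range<u64>> {
    let mut ranges = Vec::new();
    for chunk in chunks {
        // The column index holds per-page min/max statistics.
        if let (Some(off), Some(len)) = (chunk.column_index_offset, chunk.column_index_length) {
            ranges.push(off..off + len);
        }
        // The offset index holds per-page byte locations and first-row indexes.
        if let (Some(off), Some(len)) = (chunk.offset_index_offset, chunk.offset_index_length) {
            ranges.push(off..off + len);
        }
    }
    ranges
}
```

Column chunks written without an index simply contribute no range, so a file with no page index results in an empty request list.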

Ted-Jiang · 2022-09-27T02:42:10Z

AsyncFileReader

@thinkharderdev Thanks! If I am right, we should make the index-reading API async, so I will file a ticket in arrow-rs to replace the code below with an async version based on AsyncFileReader::get_ranges, so that the indexes are fetched concurrently as you mentioned.

// TODO: add an async version in arrow-rs to avoid reading the whole file.
let bytes = store.get_range(&meta.location, 0..meta.size).await?;
let mut location_vec = vec![];
let mut index_vec = vec![];
for rg in result_meta.row_groups() {
    location_vec.push(index_reader::read_pages_locations(&bytes, rg.columns())?);
    index_vec.push(index_reader::read_columns_indexes(&bytes, rg.columns())?);
}

I prefer to keep reading the page index in ParquetFileReader::get_metadata and to save it in ParquetMetaData, which is already defined in arrow-rs:

/// Global Parquet metadata.
#[derive(Debug, Clone)]
pub struct ParquetMetaData {
    file_metadata: FileMetaData,
    row_groups: Vec<RowGroupMetaData>,
    /// Page index for all pages in each column chunk
    page_indexes: Option<ParquetColumnIndex>,
    /// Offset index for all pages in each column chunk
    offset_indexes: Option<ParquetOffsetIndex>,
}

because it reduces the code change.
Secondly, when opening a Parquet file the first thing we should do is read the file metadata; the subsequent build_row_filter and build_selection_base_on_index (todo) should depend on it. 🤔

thinkharderdev · 2022-09-27T10:42:59Z


I think we may be talking about different things :).

I'm saying the code to fetch the indexes already exists in arrow-rs, so we don't need to duplicate it in datafusion. You can just construct the ArrowReaderOptions with the page index enabled, and ParquetRecordBatchStreamBuilder will fetch the indexes (and do so concurrently); see https://github.com/apache/arrow-rs/blob/a7cf274765945af4111fddaeec26d672715de9d0/parquet/src/arrow/async_reader.rs#L225.

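// Start from the default options so the page index is only read on request.
let mut options = ArrowReaderOptions::new();

if enable_page_index {
    options = options.with_page_index(true);
}

// The builder loads the metadata (and the page index, if enabled) itself.
let builder =
    ParquetRecordBatchStreamBuilder::new_with_options(async_reader, options)
        .await?;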

Ted-Jiang · 2022-09-27T13:25:37Z


Oh! I missed this part 😂, I was using the older version of arrow-rs.


Ted-Jiang · 2022-09-28T02:49:00Z

@thinkharderdev Sorry for the mistake, I think it's ready.

tustvold · 2022-09-28T06:52:00Z

This is modifying submodules, is this intentional?

Ted-Jiang · 2022-09-28T07:22:46Z

This is modifying submodules, is this intentional?

Yes. I updated the submodules to use the newly added test Parquet files alltypes_tiny_pages.parquet and alltypes_tiny_pages_plain.parquet, which include a page index:
apache/parquet-testing#25

tustvold

Looks good to me, only a minor suggestion.

Thank you for working on this, do you intend to work on hooking up the column index as well?

datafusion/core/src/physical_plan/file_format/parquet.rs

Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>

Ted-Jiang · 2022-09-28T07:39:50Z

Looks good to me, only a minor suggestion.

Thank you for working on this, do you intend to work on hooking up the column index as well?

Yes, I plan to use the min/max stats and the filter to create the RowSelector 🤔
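
A minimal sketch of what such pruning could look like, assuming a single predicate of the form `col > threshold` on an integer column. `PageStats`, `Run` and `runs_for_greater_than` are hypothetical names for this example only, not the types or functions used in arrow-rs or DataFusion. A page whose maximum is not above the threshold cannot contain a matching row, so its rows become a skip run; adjacent pages with the same outcome are merged into one run.

```rust
/// One run of consecutive rows to read or to skip (hypothetical stand-in for
/// a row selector).
#[derive(Debug, Clone, PartialEq)]
pub enum Run {
    Select(usize),
    Skip(usize),
}

/// Per-page statistics (hypothetical simplification of a column index entry).
pub struct PageStats {
    pub min: i64,
    pub max: i64,
    pub row_count: usize,
}

/// Builds select/skip runs for the predicate `col > threshold`.
///
/// The result is conservative: a selected page may still contain rows that do
/// not match, so the filter itself must still be evaluated on selected rows.
pub fn runs_for_greater_than(pages: &[PageStats], threshold: i64) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for page in pages {
        // If even the largest value is not above the threshold, no row matches.
        let keep = page.max > threshold;
        let rows = page.row_count;
        // Extend the previous run when it has the same kind as this page.
        let merged = match runs.last_mut() {
            Some(Run::Select(n)) if keep => {
                *n += rows;
                true
            }
            Some(Run::Skip(n)) if !keep => {
                *n += rows;
                true
            }
            _ => false,
        };
        if !merged {
            runs.push(if keep { Run::Select(rows) } else { Run::Skip(rows) });
        }
    }
    runs
}
```

For pages with maxima 10, 20, 40 and 3 holding 100, 100, 50 and 50 rows, a threshold of 15 yields skip 100, select 150, skip 50.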

ursabot · 2022-09-28T09:52:51Z

Benchmark runs are scheduled for baseline = 451e441 and contender = 87faf86. 87faf86 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

alamb

This is great -- thank you @Ted-Jiang and @tustvold

github-actions bot added the core Core datafusion crate label Sep 26, 2022


Ted-Jiang closed this Sep 27, 2022

Ted-Jiang reopened this Sep 27, 2022

Ted-Jiang marked this pull request as draft September 27, 2022 13:25

[feat] Support using offset index in ParquetRecordBatchStream when pu…

c849a30

…shing down RowFilter. Signed-off-by: yangjiang <yangjiang@ebay.com>

Ted-Jiang force-pushed the ISSUE_3456 branch from 232429f to c849a30 Compare September 27, 2022 14:06

Ted-Jiang marked this pull request as ready for review September 27, 2022 16:04

tustvold approved these changes Sep 28, 2022


Update datafusion/core/src/physical_plan/file_format/parquet.rs

875542d

Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>

tustvold merged commit 87faf86 into apache:master Sep 28, 2022

alamb reviewed Sep 28, 2022
