
Add integration test for scan rows with selection #2158

Merged
merged 9 commits into from Jul 27, 2022

Conversation

Ted-Jiang (Member):

Which issue does this PR close?

Closes #2106 .

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions bot added the `parquet` (Changes to the parquet crate) label Jul 25, 2022

codecov-commenter commented Jul 25, 2022

Codecov Report

Merging #2158 (881752c) into master (1621c71) will increase coverage by 0.21%.
The diff coverage is 90.24%.

@@            Coverage Diff             @@
##           master    #2158      +/-   ##
==========================================
+ Coverage   82.85%   83.06%   +0.21%     
==========================================
  Files         237      237              
  Lines       61381    61620     +239     
==========================================
+ Hits        50856    51186     +330     
+ Misses      10525    10434      -91     
| Impacted Files | Coverage Δ |
| --- | --- |
| ...et/src/arrow/array_reader/byte_array_dictionary.rs | 85.76% <0.00%> (-1.21%) ⬇️ |
| parquet/src/arrow/array_reader/null_array.rs | 70.96% <0.00%> (-10.52%) ⬇️ |
| ...uet/src/arrow/array_reader/complex_object_array.rs | 94.02% <66.66%> (+0.82%) ⬆️ |
| parquet/src/arrow/array_reader/byte_array.rs | 86.20% <75.00%> (+0.41%) ⬆️ |
| parquet/src/arrow/array_reader/primitive_array.rs | 89.83% <75.00%> (+0.81%) ⬆️ |
| parquet/src/column/reader.rs | 68.79% <83.33%> (+5.92%) ⬆️ |
| parquet/src/arrow/arrow_reader.rs | 95.72% <97.08%> (+2.95%) ⬆️ |
| parquet/src/arrow/record_reader/mod.rs | 94.22% <100.00%> (+4.62%) ⬆️ |
| arrow/src/array/iterator.rs | 86.45% <0.00%> (-13.55%) ⬇️ |
| arrow/src/util/decimal.rs | 86.92% <0.00%> (-4.59%) ⬇️ |
| ... and 24 more | |


let to_read = match front.row_count.checked_sub(self.batch_size) {
    Some(remaining) => {
        selection.push_front(RowSelection::skip(remaining));
        // If the page row count is less than batch_size, we must set the
        // batch size to the page row count; this check avoids an infinite loop.
Ted-Jiang (Member, Author):

Fixed the wrong logic: the remaining records need to be read.
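The batching logic under discussion can be sketched with hypothetical, simplified stand-ins (`Sel` and `split_front` are not the real `RowSelection` API, which lives in parquet/src/arrow/arrow_reader.rs): split the front selection run against the batch size with `checked_sub`, push the remainder back, and never return a zero-length read, which is the dead loop the comment guards against.

```rust
// Hypothetical, simplified model of splitting a selection run against a batch.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum Sel {
    Read(usize),
    Skip(usize),
}

/// Split the front run of `row_count` rows against `batch_size`:
/// consume at most `batch_size` rows now and push the rest back.
fn split_front(row_count: usize, batch_size: usize, queue: &mut Vec<Sel>) -> usize {
    match row_count.checked_sub(batch_size) {
        // The run is larger than one batch: consume a full batch and
        // keep the remaining rows at the front of the queue.
        Some(remaining) if remaining > 0 => {
            queue.insert(0, Sel::Read(remaining));
            batch_size
        }
        // The run fits in one batch (row_count <= batch_size): consume the
        // whole run. Returning row_count rather than batch_size on a short
        // final page is what prevents an endless loop of empty reads.
        _ => row_count,
    }
}
```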

None => {
    // If we skip records before any read operation,
    // we need to set `column_reader` via `set_page_reader`
    if let Some(page_reader) = pages.next() {
Ted-Jiang (Member, Author):

Fixed skipping before any read operation: `column_reader` needs to be set.

@Ted-Jiang (Member, Author):

@tustvold @thinkharderdev PTAL😊

@tustvold (Contributor) left a comment:

Had a brief look, will review in more detail later (flying today)

parquet/src/arrow/arrow_reader.rs (review thread, resolved)
parquet/src/arrow/record_reader/mod.rs (review thread, resolved)
// we need to set `column_reader` via `set_page_reader`
if let Some(page_reader) = pages.next() {
    self.set_page_reader(page_reader?)?;
    false
tustvold (Contributor):

This is wrong, as it will now only mark end_of_column when it reaches the end of the file, instead of the end of a column chunk within a row group. This will break record delimiting for repeated fields.

Ted-Jiang (Member, Author):

@tustvold I moved it out to:

 fn skip_records(&mut self, num_records: usize) -> Result<usize> {
        if self.record_reader.column_reader().is_none() {
            // If we skip records before any read operation,
            // we need to set `column_reader` via `set_page_reader`
            if let Some(page_reader) = self.pages.next() {
                self.record_reader.set_page_reader(page_reader?)?;
            } else {
                return Ok(0);
            }
        }
        self.record_reader.skip_records(num_records)
    }

I think in this situation the `column_reader` is `None` only when we skip the first page without reading any records. Related: #2171. If we create it in the column chunk, we can remove this check.
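The lazy-initialization pattern in the snippet above can be sketched as follows (the types here are hypothetical simplifications, not the real `ColumnReader` or page iterator from the parquet crate):

```rust
// Hypothetical sketch: a reader whose inner column reader is created
// from the page iterator only on first use, so that a skip issued
// before any read still has a reader to work against.
struct LazyReader<I: Iterator<Item = usize>> {
    pages: I,
    // Stand-in for the real column reader; `None` until first use.
    column_reader: Option<usize>,
}

impl<I: Iterator<Item = usize>> LazyReader<I> {
    fn skip_records(&mut self, num_records: usize) -> usize {
        if self.column_reader.is_none() {
            match self.pages.next() {
                // Mirrors the `set_page_reader` call in the PR.
                Some(page) => self.column_reader = Some(page),
                // No pages at all: nothing can be skipped.
                None => return 0,
            }
        }
        // Pretend every requested record was skipped.
        num_records
    }
}
```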

@@ -312,13 +312,20 @@ where

// If page has less rows than the remaining records to
// be skipped, skip entire page
if metadata.num_rows < remaining {
while metadata.num_rows < remaining {
tustvold (Contributor):

Why is this necessary, there is already an outer while loop?

Ted-Jiang (Member, Author):

Because the following was added first:

// because self.num_buffered_values == self.num_decoded_values means
// we need to read a new page and set up the decoders for levels
  self.read_new_page()?;

If we still use `if`, we may read a needless page header.

tustvold (Contributor):

This while loop should result in the same behaviour as the previous continue??

Ted-Jiang (Member, Author):

Oh... it's a useless loop.

@alamb (Contributor) commented Jul 25, 2022:

Thank you @Ted-Jiang -- the project to add page index and skipping is really coming along very nicely. It is a very nice piece of work.

@@ -120,6 +120,15 @@ impl<I: OffsetSizeTrait + ScalarValue> ArrayReader for ByteArrayReader<I> {
}

fn skip_records(&mut self, num_records: usize) -> Result<usize> {
    if self.record_reader.column_reader().is_none() {
@tustvold (Contributor) commented Jul 26, 2022:

This now behaves differently from next_batch which will potentially read from multiple column chunks for the same "batch". Can we extract this logic into a free function, similar to read_records, that performs the same loop?

This would also avoid duplicating this code in every ArrayReader

self.batch_size
}
Some(_) => self.batch_size,
tustvold (Contributor):

Suggested change
Some(_) => self.batch_size,
_ => self.batch_size,

And remove the None case below. If remaining == 0 then front.row_count == self.batch_size
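The suggested collapse can be illustrated with a hypothetical free function (`to_read` is not the real code, just a model of the two arms): when `checked_sub` yields `Some(0)`, `front.row_count == batch_size`, so the `Some(0)` and `None` arms agree on reading everything that is left.

```rust
// Hypothetical illustration of the suggested match simplification.
fn to_read(row_count: usize, batch_size: usize) -> usize {
    match row_count.checked_sub(batch_size) {
        // More rows than one batch: read a full batch now
        // (the remainder is pushed back elsewhere).
        Some(remaining) if remaining != 0 => batch_size,
        // `Some(0)` and `None` collapse into one arm: either
        // row_count == batch_size or row_count < batch_size,
        // so read all remaining rows.
        _ => row_count,
    }
}
```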

@Ted-Jiang (Member, Author) commented Jul 27, 2022:

Yes, more elegant 👍

}
// because self.num_buffered_values == self.num_decoded_values means
// we need to read a new page and set up the decoders for levels
self.read_new_page()?;
tustvold (Contributor):

Perhaps we could check the return type of this, and short-circuit if it returns false?
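The proposed short-circuit could look roughly like this. `PageSource` is a hypothetical simplified type; the assumption (hedged, not confirmed by this thread) is that `read_new_page` reports exhaustion by returning `Ok(false)` at the end of a column chunk.

```rust
// Hypothetical page source modelling the short-circuit suggestion.
struct PageSource {
    pages_left: usize,
}

impl PageSource {
    /// Returns Ok(false) once there are no more pages, standing in for
    /// the real `read_new_page` at the end of a column chunk.
    fn read_new_page(&mut self) -> Result<bool, String> {
        if self.pages_left == 0 {
            return Ok(false);
        }
        self.pages_left -= 1;
        Ok(true)
    }

    /// Skip up to `n` pages, stopping early when the source runs dry
    /// instead of spinning in the loop.
    fn skip_pages(&mut self, n: usize) -> Result<usize, String> {
        let mut skipped = 0;
        for _ in 0..n {
            if !self.read_new_page()? {
                break; // short-circuit on Ok(false)
            }
            skipped += 1;
        }
        Ok(skipped)
    }
}
```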

@tustvold tustvold merged commit d10d962 into apache:master Jul 27, 2022
@ursabot commented Jul 27, 2022:

Benchmark runs are scheduled for baseline = e96ae8a and contender = d10d962. d10d962 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@Ted-Jiang (Member, Author):

@tustvold thanks for your patient review 👍

@tustvold (Contributor):

FYI I'm working on a follow-up PR to address some remaining items, e.g. getting this integrated into the fuzz tests.

Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add integration test for scan rows with selection
5 participants