Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate ArrayReader::next_batch with read_records and consume_batch #2237

Merged
merged 7 commits into from Aug 3, 2022

Conversation

Ted-Jiang
Copy link
Member

Which issue does this PR close?

Closes #2236.
Related #2197

Rationale for this change

Separate add read_records and consume_batch in ArrayReader, so we can read_records multi times in buffer and consume once (avoid small batch in skipping read).

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 30, 2022
@@ -80,8 +80,15 @@ where

/// Reads at most `batch_size` records into array.
fn next_batch(&mut self, batch_size: usize) -> Result<ArrayRef> {
read_records(&mut self.record_reader, self.pages.as_mut(), batch_size)?;
let size = self.read_records(batch_size)?;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, i think we can keep next_batch , it just use read_records and consume_batch in order.

@@ -349,30 +398,32 @@ mod tests {

let mut accu_len: usize = 0;

let array = array_reader.next_batch(values_per_page / 2).unwrap();
assert_eq!(array.len(), values_per_page / 2);
let len = array_reader.read_records(values_per_page / 2).unwrap();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, after consume_batch we will clean up the buffer which has been used.
So now we should get get_def_levels after read_records before consume_batch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I think removing batch_size from consume_batch allows preserving the existing behaviour

@codecov-commenter
Copy link

codecov-commenter commented Jul 30, 2022

Codecov Report

Merging #2237 (631f55a) into master (985760f) will decrease coverage by 0.21%.
The diff coverage is 92.24%.

@@            Coverage Diff             @@
##           master    #2237      +/-   ##
==========================================
- Coverage   82.53%   82.31%   -0.22%     
==========================================
  Files         239      241       +2     
  Lines       62304    62500     +196     
==========================================
+ Hits        51422    51447      +25     
- Misses      10882    11053     +171     
Impacted Files Coverage Δ
parquet/src/arrow/array_reader/struct_array.rs 84.84% <73.33%> (-1.71%) ⬇️
parquet/src/arrow/array_reader/map_array.rs 60.00% <80.00%> (+1.17%) ⬆️
...uet/src/arrow/array_reader/complex_object_array.rs 93.47% <94.64%> (-0.56%) ⬇️
parquet/src/arrow/array_reader/byte_array.rs 86.36% <100.00%> (+0.03%) ⬆️
...et/src/arrow/array_reader/byte_array_dictionary.rs 87.01% <100.00%> (+0.04%) ⬆️
parquet/src/arrow/array_reader/empty_array.rs 57.69% <100.00%> (+12.23%) ⬆️
parquet/src/arrow/array_reader/list_array.rs 92.77% <100.00%> (+0.08%) ⬆️
parquet/src/arrow/array_reader/mod.rs 93.33% <100.00%> (+0.74%) ⬆️
parquet/src/arrow/array_reader/null_array.rs 82.14% <100.00%> (+0.66%) ⬆️
parquet/src/arrow/array_reader/primitive_array.rs 90.22% <100.00%> (+0.05%) ⬆️
... and 24 more

Help us with your feedback. Take ten seconds to tell us how you rate us.

@Ted-Jiang
Copy link
Member Author

@tustvold PTAL😊

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, mostly minor nits, but I would remove batch_size from consume_batch.

This is for two major reasons:

parquet/src/arrow/array_reader/byte_array.rs Outdated Show resolved Hide resolved
parquet/src/arrow/array_reader/byte_array.rs Outdated Show resolved Hide resolved
parquet/src/arrow/array_reader/complex_object_array.rs Outdated Show resolved Hide resolved
@@ -349,30 +398,32 @@ mod tests {

let mut accu_len: usize = 0;

let array = array_reader.next_batch(values_per_page / 2).unwrap();
assert_eq!(array.len(), values_per_page / 2);
let len = array_reader.read_records(values_per_page / 2).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I think removing batch_size from consume_batch allows preserving the existing behaviour

parquet/src/arrow/array_reader/empty_array.rs Outdated Show resolved Hide resolved
parquet/src/arrow/array_reader/mod.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a still an issue with ComplexObjectArrayReader. I think it needs to keep the levels data around for longer to avoid breaking parent ArrayReader. Otherwise looking very nice 😃

Edit: I think ComplexObjectArrayReader might just be broken, I'll experiment when I get to a computer, and potentially accelerate plans to just remove it

@@ -160,6 +181,10 @@ where
array = arrow::compute::cast(&array, &self.data_type)?;
}

self.data_buffer = vec![];
self.def_levels_buffer = None;
self.rep_levels_buffer = None;
Copy link
Contributor

@tustvold tustvold Jul 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will break, as parent ArrayReader assume the definition levels live until the next call to consume_batch?

I'm not sure we actually have test coverage of say a nullable StructArray containing a DecimalArray 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean structArray which contains one child use complex_object_array reader? 🤔 why need the child re?

Copy link
Contributor

@tustvold tustvold Jul 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If StructArray reader is itself nullable it needs to read the definition levels read by its child in order to work out where it's NULLs are located. It's a similar story for ListArrayReader.

I'll try to rustle up some tests this evening, so that we can be confident this PR won't break anything. Longer term I want to remove ComplexObjectArrayReader as it is slow, complicated and largely replaced by the newer generics.

Edit: ran out of time today, will look into first thing tomorrow

Copy link
Contributor

@tustvold tustvold Aug 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I suspected the test added in #2254 is now failing, I think it should be simple enough to fix, but let me know if you get stuck 😄

Copy link
Member Author

@Ted-Jiang Ted-Jiang Aug 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tustvold Sorry for the late reply, fix at fix new ut

parquet/src/arrow/array_reader/complex_object_array.rs Outdated Show resolved Hide resolved
@tustvold
Copy link
Contributor

tustvold commented Aug 1, 2022

I took the liberty of merging in master to get #2254

@@ -1161,7 +1161,7 @@ mod tests {
Some(props),
)
.expect("Unable to write file");
writer.write(&expected_batch).unwrap();
writer.write(expected_batch).unwrap();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related, but got clippy error at local.

@Ted-Jiang Ted-Jiang requested a review from tustvold August 2, 2022 15:53
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor nits, looking good

@@ -206,11 +213,19 @@ where
}

fn get_def_levels(&self) -> Option<&[i16]> {
self.def_levels_buffer.as_deref()
if self.before_consume {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can just return self.def_levels_buffer.as_deref(), if you look at PrimitiveArrayReader it will only make the data available after the call to consume_batch

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without this, will fail in

failures:
    arrow::array_reader::complex_object_array::tests::test_complex_array_reader_def_and_rep_levels
    arrow::array_reader::complex_object_array::tests::test_complex_array_reader_dict_enc_string

this cause by get_level before consume complex_object_array (this is the common situation like other readers), but complex_object_array sometimes(self is nullable) need get_level after consume, so i think we should keep this check

parquet/src/arrow/array_reader/complex_object_array.rs Outdated Show resolved Hide resolved
@tustvold tustvold merged commit 1f9973c into apache:master Aug 3, 2022
@ursabot
Copy link

ursabot commented Aug 3, 2022

Benchmark runs are scheduled for baseline = 6b2c757 and contender = 1f9973c. 1f9973c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Separate ArrayReader::next_batchwith ArrayReader::read_records and ArrayReader::consume_batch
4 participants