Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add get_byte_ranges method to AsyncFileReader trait #2115

Merged
merged 4 commits into from Jul 21, 2022

Conversation

thinkharderdev
Copy link
Contributor

Which issue does this PR close?

Closes #2110

Rationale for this change

In certain cases it is better from a performance perspective to fetch data in parallel such as reading from object storage. This adds a hook into the AsyncFileReader trait to allow upstream consumers of this API to do that.

What changes are included in this PR?

Add get_byte_ranges(&mut self, ranges: Vec<Range<usize>>) method to AsyncFileReader trait with a default implementation that will fallback to calling get_bytes serially for the provided ranges.

Are there any user-facing changes?

No

@thinkharderdev
Copy link
Contributor Author

@tustvold @alamb

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 20, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jul 20, 2022

Codecov Report

Merging #2115 (76647dd) into master (3096591) will decrease coverage by 0.02%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #2115      +/-   ##
==========================================
- Coverage   83.76%   83.74%   -0.03%     
==========================================
  Files         225      225              
  Lines       59457    59473      +16     
==========================================
- Hits        49806    49805       -1     
- Misses       9651     9668      +17     
Impacted Files Coverage Δ
parquet/src/arrow/async_reader.rs 0.00% <0.00%> (ø)
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (-0.23%) ⬇️
parquet/src/encodings/encoding.rs 93.43% <0.00%> (-0.20%) ⬇️
arrow/src/datatypes/datatype.rs 64.41% <0.00%> (+0.35%) ⬆️

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only some minor nits

.get_byte_ranges(fetch_ranges)
.await?
.into_iter()
.enumerate()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.zip(update_chunks.iter_mut()) might be cleaner?

let mut fetch_ranges =
Vec::with_capacity(column_chunks.len());

let mut update_chunks: Vec<(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My gut says that it would be cleaner to just iterate through the column_chunks and use filter_map to extract the ranges, pass this to AsyncFileReader. Convert the result to an iterator and then iterate the column_chunks again, popping the next element from the iterator for each included column.

Not a big deal though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think this was cleaner

@tustvold tustvold merged commit be0d34d into apache:master Jul 21, 2022
@ursabot
Copy link

ursabot commented Jul 21, 2022

Benchmark runs are scheduled for baseline = 3096591 and contender = be0d34d. be0d34d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Permit parallel fetching of column chunks in ParquetRecordBatchStream
4 participants