Use Parquet OffsetIndex to prune IO with RowSelection #2473

Merged (8 commits) on Aug 17, 2022

thinkharderdev
Contributor

Which issue does this PR close?

Closes #2426.

Rationale for this change

When we have a RowSelection and an OffsetIndex we can reduce IO by fetching only the pages selected.

This also builds on #2464 to remove InMemoryColumnChunk and unify everything on SerializedPageReader.

What changes are included in this PR?

We can represent pre-fetched column chunks as either a "dense" encoding (just Bytes) or a "sparse" encoding which contains only the pages relevant to a given RowSelection.

Also remove InMemoryColumnChunk to help unify the sync and async parquet paths.
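
A minimal sketch of the two encodings, assuming `Vec<u8>` as a stand-in for `bytes::Bytes` so the example needs no external crates. The `Sparse` fields mirror the ones visible in the diff under review; the `Dense` fields and the `fetched_len` helper are hypothetical and exist only for illustration.

```rust
/// Pre-fetched data for one column chunk (illustrative, not the crate's definition).
#[allow(dead_code)]
#[derive(Debug)]
enum ColumnChunkData {
    /// The whole column chunk was fetched as one contiguous buffer.
    Dense {
        /// Offset of the first byte of the chunk (assumed field).
        offset: usize,
        data: Vec<u8>,
    },
    /// Only the pages selected by a `RowSelection` were fetched.
    Sparse {
        /// Length of the full column chunk, including pages that were skipped.
        length: usize,
        /// `(offset of page, page bytes)` pairs, sorted by offset.
        data: Vec<(usize, Vec<u8>)>,
    },
}

impl ColumnChunkData {
    /// Number of bytes actually held in memory (hypothetical helper).
    fn fetched_len(&self) -> usize {
        match self {
            ColumnChunkData::Dense { data, .. } => data.len(),
            ColumnChunkData::Sparse { data, .. } => {
                data.iter().map(|(_, page)| page.len()).sum()
            }
        }
    }
}
```

For a 70-byte chunk in which only two 10-byte pages are selected, the sparse form holds 20 bytes while the dense form holds all 70, which is where the IO saving comes from.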

Are there any user-facing changes?

github-actions bot added the object-store (Object Store Interface) and parquet (Changes to the parquet crate) labels on Aug 16, 2022
@thinkharderdev
Contributor Author

@tustvold

Leaving this in draft until #2464 is merged as this includes those changes.

@@ -65,7 +65,7 @@ pub fn read_pages_locations<R: ChunkReader>(
let (offset, total_length) = get_location_offset_and_total_length(chunks)?;

//read all need data into buffer
-let mut reader = reader.get_read(offset, reader.len() as usize)?;
+let mut reader = reader.get_read(offset, total_length)?;
Contributor Author

Pretty sure this was a bug

Contributor

Yup, I remember fixing it in a PR that got abandoned at some point
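
To illustrate why passing the reader's full length was a bug, here is a rough sketch of reading only the span covered by a set of entries. The `Entry` type and both functions are hypothetical stand-ins; the actual signature of `get_location_offset_and_total_length` is not shown in this thread, and the sketch assumes the entries are stored back to back.

```rust
/// One serialized entry in the file (hypothetical stand-in type).
#[derive(Debug, Clone, Copy)]
struct Entry {
    offset: usize,
    length: usize,
}

/// Returns the offset of the first entry and the summed length of all entries.
/// Assumes the entries are contiguous in the file.
fn offset_and_total_length(entries: &[Entry]) -> Option<(usize, usize)> {
    let first = entries.first()?;
    let total: usize = entries.iter().map(|e| e.length).sum();
    Some((first.offset, total))
}

/// Reads only the bytes covered by `entries` from an in-memory "file".
/// Returns `None` if there are no entries or the span is out of bounds.
fn read_entries<'a>(file: &'a [u8], entries: &[Entry]) -> Option<&'a [u8]> {
    let (offset, total) = offset_and_total_length(entries)?;
    file.get(offset..offset.checked_add(total)?)
}
```

With entries at offsets 20 and 25 of lengths 5 and 10, the read covers bytes 20..35 (15 bytes) rather than a length equal to the whole file.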

@tustvold
Contributor

Thank you for this, I'll review first thing tomorrow. I like that you've found a way to allow sharing the page muxing logic in SerializedPageReader 👍

Contributor

@tustvold tustvold left a comment

Looking good, will review again once rebased, but mostly minor nits

@@ -1068,6 +1068,7 @@ mod tests {
integration.head(&path).await.unwrap();
}

#[ignore]
Contributor

?

Contributor Author

Sorry, this test fails on my machine because of a permissions issue. Meant to revert before submitting.

(mask, ranges)
}

pub fn selectors(&self) -> &[RowSelector] {
Contributor

This doesn't appear to be used, so I think it can go. I've been trying to avoid exposing the internal layout of this type externally.

let (mask, ranges) = selection.page_mask(&index);

assert_eq!(mask, vec![false, true, true, false, true, true, false]);
assert_eq!(ranges, vec![10..20, 20..30, 40..50, 50..60]);
Contributor

Could we get a test where the final PageLocation is selected?

@@ -116,6 +118,62 @@ impl RowSelection {
Self { selectors }
}

/// Given an offset index, return a mask indicating which pages are selected along with their locations by `self`
pub fn page_mask(
Contributor

Suggested change
-pub fn page_mask(
+pub(crate) fn page_mask(

I don't think this is likely to be useful outside the crate.

pub fn page_mask(
&self,
page_locations: &[PageLocation],
) -> (Vec<bool>, Vec<Range<usize>>) {
Contributor

It seems strange to me that this method would return Vec<Range<usize>> when it is called page_mask, and the caller clearly already has &[PageLocation] that can easily be combined with the mask...

Contributor Author

The idea was to do it in one shot, to avoid iterating over the locations again to get the ranges, but perhaps it's better to avoid overloading the method.

Edit: Looking at this again, the mask was part of a previous design that is no longer relevant, so I think we can just rename this and only return the ranges.
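
A ranges-only variant could look roughly like the sketch below. The `PageLocation` and `RowSelector` structs here are simplified local stand-ins using `usize` fields, and `selected_page_ranges` is a hypothetical name rather than the method that was merged. The last page is treated as open-ended because the offset index alone does not say where it stops.

```rust
use std::ops::Range;

/// Simplified stand-in for a Parquet page location.
#[derive(Debug, Clone, Copy)]
struct PageLocation {
    offset: usize,
    compressed_page_size: usize,
    first_row_index: usize,
}

/// Simplified stand-in: select or skip `row_count` consecutive rows.
#[derive(Debug, Clone, Copy)]
struct RowSelector {
    row_count: usize,
    skip: bool,
}

/// Returns the byte range of every page containing at least one selected row.
fn selected_page_ranges(
    selectors: &[RowSelector],
    pages: &[PageLocation],
) -> Vec<Range<usize>> {
    // Turn the run-length selectors into absolute row ranges.
    let mut selected_rows = Vec::new();
    let mut row = 0;
    for s in selectors {
        if !s.skip && s.row_count > 0 {
            selected_rows.push(row..row + s.row_count);
        }
        row += s.row_count;
    }

    let mut ranges = Vec::new();
    for (i, page) in pages.iter().enumerate() {
        let first = page.first_row_index;
        // A page ends where the next one starts; the final page is open-ended.
        let end = pages
            .get(i + 1)
            .map(|p| p.first_row_index)
            .unwrap_or(usize::MAX);
        // Keep the page if any selected row range overlaps [first, end).
        if selected_rows.iter().any(|r| r.start < end && first < r.end) {
            ranges.push(page.offset..page.offset + page.compressed_page_size);
        }
    }
    ranges
}
```

With four 10-row, 10-byte pages and a selection of rows 10..15 and 35..40, this returns `[10..20, 30..40]`, which also exercises the case where the final page is selected.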

Sparse {
/// Length of the full column chunk
length: usize,
data: Vec<(usize, Bytes)>,
Contributor

A comment explaining what these are would go a long way

.find(|(offset, bytes)| {
*offset <= start as usize && (start as usize - *offset) < bytes.len()
})
.map(|(_, bytes)| bytes.slice(0..length).reader())
Contributor

The line above allows start to be greater than offset, but this won't return the correct slice in that case?

ColumnChunkData::Sparse { data, .. } => data
.iter()
.find(|(offset, bytes)| {
*offset <= start as usize && (start as usize - *offset) < bytes.len()
Contributor

Perhaps we should do an exact match? I think this should work?

Contributor Author

I wasn't sure whether there is ever a case in which we fetch some subset of the page. Thinking about it more I don't believe that would ever be a valid use case.
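
A sketch of the exact-match lookup discussed here, again with `Vec<u8>` standing in for `Bytes` and with `find_page` as a hypothetical free function rather than the crate's `get_read`. A request that starts in the middle of a fetched page, or that asks for more bytes than the page holds, yields `None` instead of a silently wrong slice.

```rust
/// Looks up a fetched page by the exact offset at which it starts and
/// returns its first `length` bytes.
fn find_page<'a>(
    pages: &'a [(usize, Vec<u8>)],
    start: usize,
    length: usize,
) -> Option<&'a [u8]> {
    pages
        .iter()
        // Exact match only: partial-page reads are treated as invalid.
        .find(|(offset, _)| *offset == start)
        // `get` returns None when `length` exceeds the page size.
        .and_then(|(_, bytes)| bytes.get(..length))
}
```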

self.offset += page_header.compressed_page_size as usize;
fn get_read(&self, start: u64, length: usize) -> Result<Self::T> {
match &self {
ColumnChunkData::Sparse { data, .. } => data
Contributor

As data is sorted, you could consider https://doc.rust-lang.org/std/primitive.slice.html#method.binary_search or friends
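
For reference, a binary-search lookup over the sorted `(offset, bytes)` pairs might look like the sketch below. It uses the standard library's `binary_search_by_key`; the function name and the `Vec<u8>` stand-in for `Bytes` are assumptions, and the input must already be sorted by offset.

```rust
/// Finds the page that starts exactly at `start` in O(log n).
/// `pages` must be sorted by offset.
fn find_page_sorted<'a>(pages: &'a [(usize, Vec<u8>)], start: usize) -> Option<&'a [u8]> {
    pages
        .binary_search_by_key(&start, |(offset, _)| *offset)
        .ok() // Err(_) means no page starts at `start`
        .map(|idx| pages[idx].1.as_slice())
}
```

Compared with a linear `find`, this only matters when a column chunk has many selected pages, but it also makes the exact-match requirement explicit.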

let page_header = read_page_header(&mut cursor)?;
self.offset += cursor.position() as usize;
self.offset += page_header.compressed_page_size as usize;
fn get_read(&self, start: u64, length: usize) -> Result<Self::T> {
Contributor

It is worth noting that this currently represents a performance regression, as the previous code avoided a copy - https://github.com/apache/arrow-rs/pull/2473/files#diff-f6b1a106d47a16504d4a16d57a6632872ddf596f337ac0640a13523dccc2d4d4L615

I will add a get_bytes method to ChunkReader to avoid this
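
The difference between a copying read and a shared-buffer read can be sketched as follows. `SharedBytes` is a hypothetical, simplified stand-in for a reference-counted buffer such as `bytes::Bytes`, and `get_read_copy` and `get_bytes` are illustrative free functions; the `get_bytes` method proposed above is not shown in this thread and may differ.

```rust
use std::sync::Arc;

/// A cheaply cloneable view into a shared allocation (illustrative only).
#[derive(Clone, Debug)]
struct SharedBytes {
    buf: Arc<[u8]>,
    start: usize,
    end: usize,
}

impl SharedBytes {
    fn new(data: Vec<u8>) -> Self {
        let end = data.len();
        Self { buf: data.into(), start: 0, end }
    }

    /// Returns a view of `length` bytes starting at `start` within this view.
    /// Only the reference count changes; no bytes are copied.
    fn slice(&self, start: usize, length: usize) -> Option<Self> {
        let new_start = self.start.checked_add(start)?;
        let new_end = new_start.checked_add(length)?;
        if new_end > self.end {
            return None;
        }
        Some(Self { buf: Arc::clone(&self.buf), start: new_start, end: new_end })
    }

    fn as_slice(&self) -> &[u8] {
        &self.buf[self.start..self.end]
    }
}

/// Copying read: allocates a new buffer for every page.
fn get_read_copy(chunk: &SharedBytes, start: usize, length: usize) -> Option<Vec<u8>> {
    chunk.slice(start, length).map(|s| s.as_slice().to_vec())
}

/// Zero-copy read: hands out a view that shares the chunk's allocation.
fn get_bytes(chunk: &SharedBytes, start: usize, length: usize) -> Option<SharedBytes> {
    chunk.slice(start, length)
}
```

Both functions return the same bytes; the second simply avoids allocating and copying for each page read.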

Contributor

@tustvold tustvold left a comment

I think this is good to go, thank you. There are still some minor nits, but I'm happy for these to be addressed in a follow-up. Once merged, I will rebase #2478 onto this.

github-actions bot removed the object-store (Object Store Interface) label on Aug 17, 2022
@tustvold
Contributor

Docs appear to be failing

tustvold merged commit 2185ce2 into apache:master on Aug 17, 2022
@ursabot

ursabot commented Aug 17, 2022

Benchmark runs are scheduled for baseline = 42e9531 and contender = 2185ce2. 2185ce2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

tustvold changed the title from "Use OffsetIndex to prune IO with RowSelection" to "Use Parquet OffsetIndex to prune IO with RowSelection" on Aug 18, 2022
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use OffsetIndex to Prune IO in ParquetRecordBatchStream
3 participants