Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid Allocating When Decoding Thrift #5777

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented May 16, 2024

This is a proof of concept, to show that this is possible

Which issue does this PR close?

Closes #5775

Rationale for this change

I haven't had time to properly benchmark this, but in a quick test this at least halved the number of allocations associated with reading a parquet file.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label May 16, 2024
};
use thrift::transport::TReadTransport;

pub trait TInputProtocolRef<'de>: TInputProtocol {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what allows the reader to borrow from the slice instead of allocating, fortunately #4892 already did the heavy lifting here

@@ -78,12 +78,22 @@ pub fn decode_metadata(buf: &[u8]) -> Result<ParquetMetaData> {
row_groups.push(RowGroupMetaData::from_thrift(schema_descr.clone(), rg)?);
}
let column_orders = parse_column_orders(t_file_metadata.column_orders, &schema_descr);
let kv_metadata = t_file_metadata.key_value_metadata.map(|x| {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This formulation is a little obtuse, we'd likely want to do something to make this better

@@ -210,6 +211,12 @@ impl<'a> From<&'a [u8]> for ByteArray {
}
}

impl<'a> From<Cow<'a, [u8]>> for ByteArray {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is worth highlighting that this does perform an allocation, and does mean that when decoding into the Rust versions of statistics, etc... we still allocate. However, we now only do this for ByteArray types, whereas previously all columns would have associated allocations, and theoretically the reader could perform projection pushdown at this point.

@tustvold
Copy link
Contributor Author

We don't have very much benchmark coverage of metadata parsing, #5770 will hopefully help address this, but what we have shows a non-trivial performance uplift

open(default)           time:   [13.672 µs 13.677 µs 13.682 µs]
                        change: [-12.499% -12.231% -11.980%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe

open(page index)        time:   [428.13 µs 428.27 µs 428.42 µs]
                        change: [-33.856% -33.806% -33.754%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

I'm confident it will be even more pronounced for wider schemas

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reduce Allocations When Reading Parquet Metadata
1 participant