Avoid Allocating When Decoding Thrift #5777

tustvold · 2024-05-16T17:51:48Z

This is a proof of concept to show that decoding Thrift without allocating is possible.

Which issue does this PR close?

Rationale for this change

I haven't had time to properly benchmark this, but in a quick test it at least halved the number of allocations associated with reading a Parquet file.

What changes are included in this PR?

Are there any user-facing changes?

tustvold · 2024-05-16T17:53:40Z

parquet/src/thrift.rs

 };
+use thrift::transport::TReadTransport;
+
+pub trait TInputProtocolRef<'de>: TInputProtocol {


This trait is what allows the reader to borrow from the slice instead of allocating. Fortunately, #4892 already did the heavy lifting here.
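
To illustrate the borrowing idea, here is a minimal sketch, not the code in this PR. The `ReadBytesRef` trait, the `SliceReader` type, and the one-byte length prefix are all assumptions made for the example (the Thrift compact protocol uses varint lengths). The point is only that a reader tied to the `'de` lifetime of the input buffer can hand out sub-slices rather than copying into a new `Vec`.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for a borrowing input protocol.
/// This is NOT the `TInputProtocolRef` trait from the PR.
pub trait ReadBytesRef<'de> {
    /// Read one length-prefixed byte string, borrowing from the input.
    fn read_bytes_ref(&mut self) -> Option<Cow<'de, [u8]>>;
}

/// Reader over an in-memory buffer that returns sub-slices of that buffer.
pub struct SliceReader<'de> {
    buf: &'de [u8],
}

impl<'de> SliceReader<'de> {
    pub fn new(buf: &'de [u8]) -> Self {
        Self { buf }
    }
}

impl<'de> ReadBytesRef<'de> for SliceReader<'de> {
    fn read_bytes_ref(&mut self) -> Option<Cow<'de, [u8]>> {
        // Copy the shared reference out so the result keeps the full `'de`
        // lifetime rather than the shorter lifetime of `&mut self`.
        let buf: &'de [u8] = self.buf;
        // Assumed framing: a single length byte followed by the payload.
        let (&len, rest) = buf.split_first()?;
        let len = len as usize;
        if rest.len() < len {
            return None; // truncated input
        }
        let (head, tail) = rest.split_at(len);
        self.buf = tail; // advance past the payload
        Some(Cow::Borrowed(head)) // no copy, no allocation
    }
}
```

With a buffer such as `[3, b'a', b'b', b'c']`, the returned `Cow` points into the original buffer, so decoding the field costs no heap allocation.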

tustvold · 2024-05-16T17:59:02Z

parquet/src/file/footer.rs

@@ -78,12 +78,22 @@ pub fn decode_metadata(buf: &[u8]) -> Result<ParquetMetaData> {
        row_groups.push(RowGroupMetaData::from_thrift(schema_descr.clone(), rg)?);
    }
    let column_orders = parse_column_orders(t_file_metadata.column_orders, &schema_descr);
+    let kv_metadata = t_file_metadata.key_value_metadata.map(|x| {


This formulation is a little obtuse; we'd likely want to do something to make it better.

tustvold · 2024-05-16T18:02:34Z

parquet/src/data_type.rs

@@ -210,6 +211,12 @@ impl<'a> From<&'a [u8]> for ByteArray {
    }
 }

+impl<'a> From<Cow<'a, [u8]>> for ByteArray {


It is worth highlighting that this conversion does perform an allocation, which means that we still allocate when decoding into the Rust versions of statistics, etc. However, we now only do this for ByteArray types, whereas previously all columns would have associated allocations, and theoretically the reader could perform projection pushdown at this point.
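
As a rough sketch of where that allocation happens: `OwnedBytes` below is a hypothetical stand-in for an owned value type, not the crate's `ByteArray`, and `to_owned_projected` is an assumed helper showing what deferring the copy to projected columns could look like. Neither is part of this PR.

```rust
use std::borrow::Cow;

/// Hypothetical owned byte container; NOT the parquet crate's `ByteArray`.
#[derive(Debug, PartialEq)]
pub struct OwnedBytes(Vec<u8>);

impl OwnedBytes {
    pub fn as_slice(&self) -> &[u8] {
        &self.0
    }
}

impl<'a> From<Cow<'a, [u8]>> for OwnedBytes {
    fn from(value: Cow<'a, [u8]>) -> Self {
        // `into_owned` copies into a fresh Vec when the Cow is borrowed,
        // and hands back the existing Vec when it is already owned.
        OwnedBytes(value.into_owned())
    }
}

/// Assumed helper: convert only the selected columns' borrowed values into
/// owned ones, so columns outside the projection never allocate.
pub fn to_owned_projected<'a>(
    stats: &[Cow<'a, [u8]>],
    projection: &[usize],
) -> Vec<OwnedBytes> {
    projection
        .iter()
        .filter_map(|&i| stats.get(i)) // skip out-of-range column indices
        .cloned() // cloning a borrowed Cow copies only the reference
        .map(OwnedBytes::from) // the payload copy happens here
        .collect()
}
```

Keeping the decoded form borrowed and converting at the last step is what would let a reader skip the copy for columns it never needs.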

tustvold · 2024-05-16T20:36:02Z

We don't have very much benchmark coverage of metadata parsing (#5770 will hopefully help address this), but what we have shows a non-trivial performance uplift:

open(default)           time:   [13.672 µs 13.677 µs 13.682 µs]
                        change: [-12.499% -12.231% -11.980%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe

open(page index)        time:   [428.13 µs 428.27 µs 428.42 µs]
                        change: [-33.856% -33.806% -33.754%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

I'm confident the improvement will be even more pronounced for wider schemas.

github-actions bot added the parquet Changes to the parquet crate label May 16, 2024

Support lifetimes in thrift format

d3e9150

tustvold force-pushed the thrift-borrow branch from 5ddbd83 to d3e9150 Compare May 16, 2024 17:58

tustvold mentioned this pull request May 16, 2024

Report / blog on parquet metadata sizes for "large" (1000+) numbers of columns #5770

Open

This was referenced May 25, 2024

Reduce Allocations When Reading Parquet Metadata #5775

Open

Release arrow-rs / parquet version 52.0.0 #5688

Closed

jhorstmann mentioned this pull request Jun 7, 2024

Use custom thrift decoder to improve speed of parsing parquet metadata #5854

Open

tustvold mentioned this pull request Jun 7, 2024

Implement selective decoding of a subset (e.g. columns or row groups) of parquet metadata #5855

Open

jhorstmann mentioned this pull request Jun 11, 2024

Benchmarks for custom parquet format #5869

Draft
