This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Reduced re-alloc in parquet #1337

Open. Wants to merge 2 commits into base: main.

Conversation

jorgecarleitao
Owner

Closes #1324

@codecov

codecov bot commented Dec 18, 2022

Codecov Report

Base: 83.63% // Head: 83.78% // Increases project coverage by +0.14% 🎉

Coverage data is based on head (ef34171) compared to base (3a8da98).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1337      +/-   ##
==========================================
+ Coverage   83.63%   83.78%   +0.14%     
==========================================
  Files         373      373              
  Lines       40284    40392     +108     
==========================================
+ Hits        33692    33841     +149     
+ Misses       6592     6551      -41     
| Impacted Files | Coverage Δ |
|---|---|
| src/io/parquet/read/deserialize/binary/basic.rs | 80.72% <100.00%> (+1.04%) ⬆️ |
| ...c/io/parquet/read/deserialize/binary/dictionary.rs | 89.87% <100.00%> (ø) |
| src/io/parquet/read/deserialize/binary/nested.rs | 80.70% <100.00%> (ø) |
| src/io/parquet/read/deserialize/binary/utils.rs | 67.92% <100.00%> (+2.61%) ⬆️ |
| src/io/parquet/read/deserialize/boolean/basic.rs | 92.91% <100.00%> (+0.11%) ⬆️ |
| src/io/parquet/read/deserialize/dictionary/mod.rs | 76.75% <100.00%> (+0.25%) ⬆️ |
| ...arquet/read/deserialize/fixed_size_binary/basic.rs | 95.05% <100.00%> (+0.13%) ⬆️ |
| src/io/parquet/read/deserialize/primitive/basic.rs | 95.58% <100.00%> (+0.04%) ⬆️ |
| ...c/io/parquet/read/deserialize/primitive/integer.rs | 85.92% <100.00%> (+0.65%) ⬆️ |
| src/io/parquet/read/deserialize/utils.rs | 82.23% <100.00%> (ø) |
| ... and 5 more | |


@ritchie46
Collaborator

I noticed a significant performance regression when reading utf8 data. I think it comes from now defaulting to a `values_capacity` of 0 instead of using `24 * capacity` as a reasonable default.
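To illustrate why a zero values capacity can hurt, here is a minimal sketch using only `std::vec::Vec`. It is not arrow2 code: `count_reallocations` is a hypothetical helper, and the row count and 24-byte string length are assumed numbers chosen for illustration. It counts how often the values buffer changes capacity when it starts empty versus when it is reserved up front.

```rust
/// Hypothetical helper (not part of arrow2): appends `n` values of `len` bytes
/// to a byte buffer and counts how many times the buffer's capacity changes.
fn count_reallocations(initial_capacity: usize, n: usize, len: usize) -> usize {
    let mut values: Vec<u8> = Vec::with_capacity(initial_capacity);
    let mut last_capacity = values.capacity();
    let mut reallocations = 0;
    let item = vec![b'a'; len];

    for _ in 0..n {
        values.extend_from_slice(&item);
        // A capacity change means the buffer had to grow (and likely move).
        if values.capacity() != last_capacity {
            reallocations += 1;
            last_capacity = values.capacity();
        }
    }
    reallocations
}

fn main() {
    // Assumed workload: 10_000 strings of 24 bytes each.
    let (n, len) = (10_000, 24);
    let from_zero = count_reallocations(0, n, len);
    let reserved = count_reallocations(n * len, n, len);
    println!("starting at 0: {from_zero} capacity changes");
    println!("reserved up front: {reserved} capacity changes");
}
```

The exact count for the zero-capacity case depends on `Vec`'s growth strategy, so it is printed rather than stated here; the reserved case never needs to grow.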

| State::OptionalDictionary(_, _)
| State::OptionalDelta(_, _)
| State::FilteredOptionalDelta(_, _) => (
Binary::<O>::with_capacity(capacity, 0),
Collaborator


This `values_capacity` of 0 is very costly in most cases.

| State::RequiredDictionary(_)
| State::Delta(_)
| State::FilteredDelta(_) => (
Binary::<O>::with_capacity(capacity, 0),
Collaborator


This `values_capacity` of 0 is very costly in most cases.

@@ -59,7 +59,7 @@ fn read_dict<O: Offset>(data_type: DataType, dict: &DictPage) -> Box<dyn Array>

let values = SizedBinaryIter::new(&dict.buffer, dict.num_values);

- let mut data = Binary::<O>::with_capacity(dict.num_values);
+ let mut data = Binary::<O>::with_capacity(dict.num_values, 0);
Collaborator


This `values_capacity` of 0 is very costly in most cases.

@@ -36,10 +36,10 @@ impl<O: Offset> Pushable<usize> for Offsets<O> {

impl<O: Offset> Binary<O> {
#[inline]
- pub fn with_capacity(capacity: usize) -> Self {
+ pub fn with_capacity(capacity: usize, values_capacity: usize) -> Self {
Collaborator


Maybe we should have this signature:

pub fn with_capacity(capacity: usize, values_capacity: Option<usize>) -> Self {

and then

let values_capacity = values_capacity.unwrap_or(capacity.min(100) * 24);
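Put together, the suggestion might look roughly like the sketch below. The `Binary` struct here is a simplified, hypothetical stand-in (plain `Vec<i64>` offsets and `Vec<u8>` values, no `Offset` generic), not arrow2's actual type; only the signature and the `unwrap_or` line come from the comment above.

```rust
/// Simplified, hypothetical stand-in for the internal `Binary<O>` builder.
pub struct Binary {
    pub offsets: Vec<i64>,
    pub values: Vec<u8>,
}

impl Binary {
    #[inline]
    pub fn with_capacity(capacity: usize, values_capacity: Option<usize>) -> Self {
        // When the caller has no estimate, assume 24 bytes per value,
        // capped at 100 values so the default never exceeds 2400 bytes.
        let values_capacity = values_capacity.unwrap_or(capacity.min(100) * 24);

        // Offsets hold one more entry than there are values, starting at 0.
        let mut offsets = Vec::with_capacity(capacity + 1);
        offsets.push(0);

        Self {
            offsets,
            values: Vec::with_capacity(values_capacity),
        }
    }
}

fn main() {
    // No estimate available: fall back to the default.
    let defaulted = Binary::with_capacity(1_000, None);
    // Exact size known (e.g. from a page buffer): pass it explicitly.
    let exact = Binary::with_capacity(1_000, Some(64 * 1024));
    println!("{} {}", defaulted.values.capacity(), exact.values.capacity());
}
```

With `Option`, call sites that know the byte size can still pass it, while the others get a non-zero default instead of 0.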

Collaborator


If `values_capacity` is zero by default, maybe we could use a `need_estimated` to reserve the space.

We use this in databend
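The comment does not spell out how `need_estimated` works, so the following is only one possible reading, not databend's or arrow2's implementation. It assumes a boolean flag that triggers a one-time `reserve` once a small sample of values has been seen. `EstimatingBinary`, `SAMPLE`, and `expected_rows` are all hypothetical names introduced for this sketch.

```rust
/// Hypothetical builder: starts with no values capacity and reserves once,
/// based on the average length of the first few values pushed.
pub struct EstimatingBinary {
    pub offsets: Vec<usize>,
    pub values: Vec<u8>,
    expected_rows: usize,
    pub need_estimated: bool,
}

impl EstimatingBinary {
    /// Assumed sample size used for the estimate.
    const SAMPLE: usize = 100;

    pub fn with_capacity(expected_rows: usize) -> Self {
        let mut offsets = Vec::with_capacity(expected_rows + 1);
        offsets.push(0);
        Self {
            offsets,
            values: Vec::new(),
            expected_rows,
            need_estimated: true,
        }
    }

    pub fn push(&mut self, value: &[u8]) {
        self.values.extend_from_slice(value);
        self.offsets.push(self.values.len());

        let seen = self.offsets.len() - 1;
        if self.need_estimated && seen == Self::SAMPLE.min(self.expected_rows) {
            // Estimate the remaining bytes from the average length so far,
            // rounding up by one byte per value, and reserve them once.
            let average = self.values.len() / seen;
            let remaining = self.expected_rows - seen;
            self.values.reserve(remaining * (average + 1));
            self.need_estimated = false;
        }
    }
}

fn main() {
    let mut builder = EstimatingBinary::with_capacity(1_000);
    for _ in 0..1_000 {
        builder.push(b"0123456789");
    }
    println!("values capacity: {}", builder.values.capacity());
}
```

The trade-off in this reading is that the first few pushes still reallocate, but the bulk of the data lands in a buffer sized from observed lengths rather than a fixed guess.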

@jorgecarleitao
Owner Author

Sorry for the delay on this one - if anyone would like to take an extra pass go ahead, otherwise I will merge it in.

Development

Successfully merging this pull request may close these issues.

Arrow2 read parquet file did not reuse the page decoder buffer to array
3 participants