
Optimized writing of byte array to parquet (#1764) (2x faster) #2221

Merged
merged 4 commits into apache:master on Aug 1, 2022

Conversation

@tustvold (Contributor) commented on Jul 29, 2022

Which issue does this PR close?

Part of #1764
Closes #1753

Rationale for this change

```
write_batch primitive/4096 values string
                        time:   [482.71 us 482.93 us 483.18 us]
                        thrpt:  [164.72 MiB/s 164.81 MiB/s 164.88 MiB/s]
                 change:
                        time:   [-51.256% -51.214% -51.172%] (p = 0.00 < 0.05)
                        thrpt:  [+104.80% +104.98% +105.16%]
                        Performance has improved.
write_batch primitive/4096 values string non-null
                        time:   [497.39 us 497.69 us 498.02 us]
                        thrpt:  [157.85 MiB/s 157.96 MiB/s 158.05 MiB/s]
                 change:
                        time:   [-51.444% -51.392% -51.343%] (p = 0.00 < 0.05)
                        thrpt:  [+105.52% +105.73% +105.95%]
                        Performance has improved.
```

And there is still low-hanging fruit for optimisation here.

What changes are included in this PR?

Switches encoding of arrow arrays to a specialized write path.

Are there any user-facing changes?

No
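
For context, a minimal sketch of the user-facing write path, which this PR leaves unchanged (file name and schema are illustrative); writing a string column now hits the specialized byte array path internally:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::StringArray;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("value", DataType::Utf8, true)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(StringArray::from(vec![Some("a"), None, Some("b")]))],
    )?;

    // Same public API as before; only the internal encoding path changed
    let file = File::create("strings.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```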

```rust
/// Returns the min and max values in this collection, skipping any NaN values
///
/// Returns `None` if no values found
fn min_max(&self, descr: &ColumnDescriptor) -> Option<(&Self::T, &Self::T)>;
```
@tustvold (author):

This is moved onto `Encoder` so that `ColumnValues` can be a type-erased type, e.g. `ArrayRef`. This will be critical to support dictionaries without needing GATs, as the `TypedDictionary` (#2136) contains a lifetime.
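
To illustrate the pattern, a simplified sketch (trait and struct names are stand-ins, not the crate's exact definitions): an owned, type-erased `ArrayRef` as the associated `Values` type keeps the trait free of lifetimes, with downcasting deferred to the implementation:

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, StringArray};

// Simplified stand-in for the ColumnValueEncoder trait: because `Values`
// is an owned, type-erased `ArrayRef`, the trait needs no generic
// associated types even when a concrete encoder later wants a borrowed
// view (such as TypedDictionary, which carries a lifetime).
trait ValueEncoder {
    type Values;
    fn write(&mut self, values: &Self::Values, indices: &[usize]);
}

struct ByteArrayEncoder {
    bytes_written: usize,
}

impl ValueEncoder for ByteArrayEncoder {
    type Values = ArrayRef; // type-erased: the downcast happens inside write

    fn write(&mut self, values: &Self::Values, indices: &[usize]) {
        let strings = values.as_any().downcast_ref::<StringArray>().unwrap();
        for &idx in indices {
            self.bytes_written += strings.value(idx).len();
        }
    }
}

fn main() {
    let array: ArrayRef = Arc::new(StringArray::from(vec!["hello", "parquet"]));
    let mut encoder = ByteArrayEncoder { bytes_written: 0 };
    encoder.write(&array, &[0, 1]);
    assert_eq!(encoder.bytes_written, 12);
}
```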

```rust
            _ => self.encoder.put(slice),
        }

    fn write_gather(&mut self, values: &Self::Values, indices: &[usize]) -> Result<()> {
        let slice: Vec<_> = indices.iter().map(|idx| values[*idx].clone()).collect();
```
@tustvold (author):

This is pushed down from `get_numeric_array_slice` in the arrow writer.
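
A hypothetical illustration of the gather step (helper name and types are my own, not the crate's): the column writer computes the indices of non-null slots once, and the encoder copies only those values:

```rust
// Compute the indices of valid (non-null) slots from a validity mask;
// these are the `indices` that write_gather receives.
fn valid_indices(validity: &[bool]) -> Vec<usize> {
    validity
        .iter()
        .enumerate()
        .filter_map(|(idx, &valid)| valid.then(|| idx))
        .collect()
}

fn main() {
    let validity = [true, false, true, true];
    assert_eq!(valid_indices(&validity), vec![0, 2, 3]);
}
```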

```rust
    mut valid: impl Iterator<Item = usize>,
) -> Option<(ByteArray, ByteArray)>
where
    T: ArrayAccessor,
```
@tustvold (author):

Using the new `ArrayAccessor` 😄
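
A simplified sketch of how `ArrayAccessor` enables a generic min/max over the valid indices of any byte-array-like array (the real code produces `ByteArray` statistics; plain byte-order comparison on `&str` shown here):

```rust
use arrow::array::{ArrayAccessor, StringArray};

// Track the extremes by byte order, visiting only the valid indices.
fn min_max<T>(array: T, valid: impl Iterator<Item = usize>) -> Option<(T::Item, T::Item)>
where
    T: ArrayAccessor,
    T::Item: AsRef<[u8]> + Copy,
{
    let mut out: Option<(T::Item, T::Item)> = None;
    for idx in valid {
        let val = array.value(idx);
        out = Some(match out {
            None => (val, val),
            Some((min, max)) => (
                if val.as_ref() < min.as_ref() { val } else { min },
                if val.as_ref() > max.as_ref() { val } else { max },
            ),
        });
    }
    out
}

fn main() {
    // Index 1 is null, so the caller passes only the valid indices
    let array = StringArray::from(vec![Some("b"), None, Some("a"), Some("c")]);
    let (min, max) = min_max(&array, [0, 2, 3].into_iter()).unwrap();
    assert_eq!((min, max), ("a", "c"));
}
```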

```rust
        } else {
            num_required_bits(num_entries as u64 - 1)
        }
        num_required_bits(self.num_entries().saturating_sub(1) as u64)
```
@tustvold (author):

This logic was previously incorrect: it would return a bit width of 1 for `num_entries == 1` when it only needed to be 0. This is largely harmless, but is worth fixing.
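
A sketch of the arithmetic (an illustrative re-implementation, not the crate's exact function): a dictionary with a single entry only ever writes index 0, which needs zero bits, hence `num_entries - 1` with saturation:

```rust
// Bits needed to represent `x`: position of the highest set bit.
fn num_required_bits(x: u64) -> u8 {
    (64 - x.leading_zeros()) as u8
}

fn main() {
    assert_eq!(num_required_bits(1u64.saturating_sub(1)), 0); // 1 entry -> 0 bits
    assert_eq!(num_required_bits(2 - 1), 1);                  // 2 entries -> 1 bit
    assert_eq!(num_required_bits(256 - 1), 8);                // 256 entries -> 8 bits
}
```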

@github-actions bot added the `parquet` (Changes to the parquet crate) label on Jul 29, 2022

```rust
impl ColumnValueEncoder for ByteArrayEncoder {
    type T = ByteArray;
    type Values = ArrayRef;
```
@tustvold (author):

Initially I had the concrete type here, i.e. `StringArray`. This works; however, it would present difficulties in adapting this to preserve dictionaries, as `TypedDictionary` (#2136) will contain a lifetime, which would then require GATs here.
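
For contrast, a sketch of what the concrete-type route would force once borrowed views like `TypedDictionary<'a>` enter the picture: a lifetime-parameterised associated type, i.e. a GAT, which was not yet stable at the time of this PR (GATs stabilised in Rust 1.65, November 2022):

```rust
// Hypothetical GAT-based alternative that type erasure avoids: `Values`
// would need its own lifetime parameter to hold a borrowed dictionary view.
trait ColumnValueEncoderGat {
    type Values<'a>;
    fn write_gather(&mut self, values: &Self::Values<'_>, indices: &[usize]);
}

fn main() {} // definition-only sketch; nothing to run
```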

@codecov-commenter commented on Jul 29, 2022

Codecov Report

Merging #2221 (22f52cd) into master (b879977) will decrease coverage by 0.01%.
The diff coverage is 81.61%.

```
@@            Coverage Diff             @@
##           master    #2221      +/-   ##
==========================================
- Coverage   82.29%   82.27%   -0.02%
==========================================
  Files         244      245       +1
  Lines       62443    62654     +211
==========================================
+ Hits        51386    51549     +163
- Misses      11057    11105      +48
```

| Impacted Files | Coverage Δ |
|---|---|
| parquet/src/arrow/arrow_writer/byte_array.rs | 76.72% <76.72%> (ø) |
| parquet/src/column/writer/mod.rs | 92.85% <84.61%> (-0.15%) ⬇️ |
| parquet/src/column/writer/encoder.rs | 89.01% <91.42%> (+1.35%) ⬆️ |
| parquet/src/arrow/arrow_writer/mod.rs | 97.66% <100.00%> (+0.01%) ⬆️ |
| parquet/src/encodings/encoding/dict_encoder.rs | 90.74% <100.00%> (-0.49%) ⬇️ |
| parquet/src/util/interner.rs | 91.66% <100.00%> (+0.75%) ⬆️ |
| parquet_derive/src/parquet_field.rs | 65.75% <0.00%> (ø) |
| parquet/src/data_type.rs | 74.62% <0.00%> (+0.21%) ⬆️ |
| arrow/src/datatypes/datatype.rs | 62.61% <0.00%> (+0.31%) ⬆️ |
| arrow/src/array/array_binary.rs | 95.45% <0.00%> (+0.64%) ⬆️ |
| ... and 2 more | |


```rust
}

// TODO: These methods don't handle non null indices correctly (#1753)
def_get_binary_array_fn!(get_binary_array, arrow_array::BinaryArray);
```
@tustvold (author):

Removing these fixes #1753

```rust
    T::Item: AsRef<[u8]>,
{
    self.num_values += indices.len();
    match &mut self.encoder {
```
@tustvold (author):

See https://github.com/apache/parquet-format/blob/master/Encodings.md for what the various encodings are. They are all relatively self-explanatory.
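
A hypothetical condensation of that dispatch, showing two of the byte-array encodings from Encodings.md (variant names and layout are illustrative, not the crate's exact code):

```rust
// The encoder enum selects among the byte-array encodings.
enum FallbackEncoder {
    Plain(Vec<u8>),
    DeltaLength { lengths: Vec<i32>, data: Vec<u8> },
    Delta { /* shared-prefix state elided */ },
}

impl FallbackEncoder {
    fn put(&mut self, value: &[u8]) {
        match self {
            // PLAIN: 4-byte little-endian length prefix, then the bytes
            FallbackEncoder::Plain(buf) => {
                buf.extend_from_slice(&(value.len() as u32).to_le_bytes());
                buf.extend_from_slice(value);
            }
            // DELTA_LENGTH_BYTE_ARRAY: lengths and bytes stored separately
            FallbackEncoder::DeltaLength { lengths, data } => {
                lengths.push(value.len() as i32);
                data.extend_from_slice(value);
            }
            FallbackEncoder::Delta { .. } => { /* DELTA_BYTE_ARRAY elided */ }
        }
    }
}

fn main() {
    let mut enc = FallbackEncoder::Plain(Vec::new());
    enc.put(b"parquet");
    if let FallbackEncoder::Plain(buf) = &enc {
        assert_eq!(buf.len(), 4 + 7); // length prefix + payload
    }
}
```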

@tustvold marked this pull request as ready for review on July 29, 2022 16:27
@tustvold merged commit 2c09ba4 into apache:master on Aug 1, 2022
@ursabot commented on Aug 1, 2022

Benchmark runs are scheduled for baseline = 42b15a8 and contender = 2c09ba4. 2c09ba4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
- ec2-t3-xlarge-us-east-2: Skipped ⚠️ (benchmarking of arrow-rs-commits is not supported)
- test-mac-arm: Skipped ⚠️ (benchmarking of arrow-rs-commits is not supported)
- ursa-i9-9960x: Skipped ⚠️ (benchmarking of arrow-rs-commits is not supported)
- ursa-thinkcentre-m75q: Skipped ⚠️ (benchmarking of arrow-rs-commits is not supported)

Buildkite builds:

Supported benchmarks:
- ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
- test-mac-arm: Supported benchmark langs: C++, Python, R
- ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
- ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels: parquet (Changes to the parquet crate)

Projects: None yet

Development

Successfully merging this pull request may close these issues:
- Arrow Parquet BinaryArray Writer Ignores LevelInfo Indices

4 participants