
Optimized writing of byte array to parquet (#1764) (2x faster) #2221

Merged
merged 4 commits into apache:master on Aug 1, 2022

Conversation

@tustvold (Contributor) commented on Jul 29, 2022

Which issue does this PR close?

Part of #1764
Closes #1753

Rationale for this change

```
write_batch primitive/4096 values string
                        time:   [482.71 us 482.93 us 483.18 us]
                        thrpt:  [164.72 MiB/s 164.81 MiB/s 164.88 MiB/s]
                 change:
                        time:   [-51.256% -51.214% -51.172%] (p = 0.00 < 0.05)
                        thrpt:  [+104.80% +104.98% +105.16%]
                        Performance has improved.
write_batch primitive/4096 values string non-null
                        time:   [497.39 us 497.69 us 498.02 us]
                        thrpt:  [157.85 MiB/s 157.96 MiB/s 158.05 MiB/s]
                 change:
                        time:   [-51.444% -51.392% -51.343%] (p = 0.00 < 0.05)
                        thrpt:  [+105.52% +105.73% +105.95%]
                        Performance has improved.
```

And there is still low-hanging fruit for optimisation here.

What changes are included in this PR?

Switches encoding of arrow arrays to a specialized write path.

Are there any user-facing changes?

No
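
For context, a minimal sketch of the user-facing write path, which this PR leaves unchanged (file name and schema are illustrative); writing a string column now hits the specialized byte array path internally:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::StringArray;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("value", DataType::Utf8, true)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(StringArray::from(vec![Some("a"), None, Some("b")]))],
    )?;

    // Same public API as before; only the internal encoding path changed
    let file = File::create("strings.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```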

```rust
/// Returns the min and max values in this collection, skipping any NaN values
///
/// Returns `None` if no values found
fn min_max(&self, descr: &ColumnDescriptor) -> Option<(&Self::T, &Self::T)>;
```
@tustvold (author):

This is moved onto `Encoder` so that `ColumnValues` can be a type-erased type, e.g. `ArrayRef`. This will be critical to support dictionaries without needing GATs, as the `TypedDictionary` (#2136) contains a lifetime.
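
To illustrate the pattern, a simplified sketch (trait and struct names are stand-ins, not the crate's exact definitions): an owned, type-erased `ArrayRef` as the associated `Values` type keeps the trait free of lifetimes, with downcasting deferred to the implementation:

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, StringArray};

// Simplified stand-in for the ColumnValueEncoder trait: because `Values`
// is an owned, type-erased `ArrayRef`, the trait needs no generic
// associated types even when a concrete encoder later wants a borrowed
// view (such as TypedDictionary, which carries a lifetime).
trait ValueEncoder {
    type Values;
    fn write(&mut self, values: &Self::Values, indices: &[usize]);
}

struct ByteArrayEncoder {
    bytes_written: usize,
}

impl ValueEncoder for ByteArrayEncoder {
    type Values = ArrayRef; // type-erased: the downcast happens inside write

    fn write(&mut self, values: &Self::Values, indices: &[usize]) {
        let strings = values.as_any().downcast_ref::<StringArray>().unwrap();
        for &idx in indices {
            self.bytes_written += strings.value(idx).len();
        }
    }
}

fn main() {
    let array: ArrayRef = Arc::new(StringArray::from(vec!["hello", "parquet"]));
    let mut encoder = ByteArrayEncoder { bytes_written: 0 };
    encoder.write(&array, &[0, 1]);
    assert_eq!(encoder.bytes_written, 12);
}
```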

```rust
            _ => self.encoder.put(slice),
        }

    fn write_gather(&mut self, values: &Self::Values, indices: &[usize]) -> Result<()> {
        let slice: Vec<_> = indices.iter().map(|idx| values[*idx].clone()).collect();
```
@tustvold (author):

This is pushed down from `get_numeric_array_slice` in the arrow writer.
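
A hypothetical illustration of the gather step (helper name and types are my own, not the crate's): the column writer computes the indices of non-null slots once, and the encoder copies only those values:

```rust
// Compute the indices of valid (non-null) slots from a validity mask;
// these are the `indices` that write_gather receives.
fn valid_indices(validity: &[bool]) -> Vec<usize> {
    validity
        .iter()
        .enumerate()
        .filter_map(|(idx, &valid)| valid.then(|| idx))
        .collect()
}

fn main() {
    let validity = [true, false, true, true];
    assert_eq!(valid_indices(&validity), vec![0, 2, 3]);
}
```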

```rust
    mut valid: impl Iterator<Item = usize>,
) -> Option<(ByteArray, ByteArray)>
where
    T: ArrayAccessor,
```
@tustvold (author):

Using the new `ArrayAccessor` 😄
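
A simplified sketch of how `ArrayAccessor` enables a generic min/max over the valid indices of any byte-array-like array (the real code produces `ByteArray` statistics; plain byte-order comparison on `&str` shown here):

```rust
use arrow::array::{ArrayAccessor, StringArray};

// Track the extremes by byte order, visiting only the valid indices.
fn min_max<T>(array: T, valid: impl Iterator<Item = usize>) -> Option<(T::Item, T::Item)>
where
    T: ArrayAccessor,
    T::Item: AsRef<[u8]> + Copy,
{
    let mut out: Option<(T::Item, T::Item)> = None;
    for idx in valid {
        let val = array.value(idx);
        out = Some(match out {
            None => (val, val),
            Some((min, max)) => (
                if val.as_ref() < min.as_ref() { val } else { min },
                if val.as_ref() > max.as_ref() { val } else { max },
            ),
        });
    }
    out
}

fn main() {
    // Index 1 is null, so the caller passes only the valid indices
    let array = StringArray::from(vec![Some("b"), None, Some("a"), Some("c")]);
    let (min, max) = min_max(&array, [0, 2, 3].into_iter()).unwrap();
    assert_eq!((min, max), ("a", "c"));
}
```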

```rust
        } else {
            num_required_bits(num_entries as u64 - 1)
        }
        num_required_bits(self.num_entries().saturating_sub(1) as u64)
```
@tustvold (author):

This logic was previously incorrect: it would return a bit width of 1 for `num_entries == 1` when it only needed to be 0. This is largely harmless, but is worth fixing.
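
A sketch of the arithmetic (an illustrative re-implementation, not the crate's exact function): a dictionary with a single entry only ever writes index 0, which needs zero bits, hence `num_entries - 1` with saturation:

```rust
// Bits needed to represent `x`: position of the highest set bit.
fn num_required_bits(x: u64) -> u8 {
    (64 - x.leading_zeros()) as u8
}

fn main() {
    assert_eq!(num_required_bits(1u64.saturating_sub(1)), 0); // 1 entry -> 0 bits
    assert_eq!(num_required_bits(2 - 1), 1);                  // 2 entries -> 1 bit
    assert_eq!(num_required_bits(256 - 1), 8);                // 256 entries -> 8 bits
}
```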

@github-actions bot added the `parquet` (Changes to the parquet crate) label on Jul 29, 2022

```rust
impl ColumnValueEncoder for ByteArrayEncoder {
    type T = ByteArray;
    type Values = ArrayRef;
```
@tustvold (author):

Initially I had the concrete type here, i.e. `StringArray`. This works; however, it would present difficulties in adapting this to preserve dictionaries, as `TypedDictionary` (#2136) will contain a lifetime, which would then require GATs here.
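
For contrast, a sketch of what the concrete-type route would force once borrowed views like `TypedDictionary<'a>` enter the picture: a lifetime-parameterised associated type, i.e. a GAT, which was not yet stable at the time of this PR (GATs stabilised in Rust 1.65, November 2022):

```rust
// Hypothetical GAT-based alternative that type erasure avoids: `Values`
// would need its own lifetime parameter to hold a borrowed dictionary view.
trait ColumnValueEncoderGat {
    type Values<'a>;
    fn write_gather(&mut self, values: &Self::Values<'_>, indices: &[usize]);
}

fn main() {} // definition-only sketch; nothing to run
```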

@codecov-commenter commented on Jul 29, 2022

Codecov Report

Merging #2221 (22f52cd) into master (b879977) will decrease coverage by 0.01%.
The diff coverage is 81.61%.

```
@@            Coverage Diff             @@
##           master    #2221      +/-   ##
==========================================
- Coverage   82.29%   82.27%   -0.02%
==========================================
  Files         244      245       +1
  Lines       62443    62654     +211
==========================================
+ Hits        51386    51549     +163
- Misses      11057    11105      +48
```

| Impacted Files | Coverage Δ |
|---|---|
| parquet/src/arrow/arrow_writer/byte_array.rs | 76.72% <76.72%> (ø) |
| parquet/src/column/writer/mod.rs | 92.85% <84.61%> (-0.15%) ⬇️ |
| parquet/src/column/writer/encoder.rs | 89.01% <91.42%> (+1.35%) ⬆️ |
| parquet/src/arrow/arrow_writer/mod.rs | 97.66% <100.00%> (+0.01%) ⬆️ |
| parquet/src/encodings/encoding/dict_encoder.rs | 90.74% <100.00%> (-0.49%) ⬇️ |
| parquet/src/util/interner.rs | 91.66% <100.00%> (+0.75%) ⬆️ |
| parquet_derive/src/parquet_field.rs | 65.75% <0.00%> (ø) |
| parquet/src/data_type.rs | 74.62% <0.00%> (+0.21%) ⬆️ |
| arrow/src/datatypes/datatype.rs | 62.61% <0.00%> (+0.31%) ⬆️ |
| arrow/src/array/array_binary.rs | 95.45% <0.00%> (+0.64%) ⬆️ |
| ... and 2 more | |


```rust
}

// TODO: These methods don't handle non null indices correctly (#1753)
def_get_binary_array_fn!(get_binary_array, arrow_array::BinaryArray);
```
@tustvold (author):

Removing these fixes #1753

```rust
    T::Item: AsRef<[u8]>,
{
    self.num_values += indices.len();
    match &mut self.encoder {
```
@tustvold (author):

See https://github.com/apache/parquet-format/blob/master/Encodings.md for what the various encodings are. They are all relatively self-explanatory.
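
A hypothetical condensation of that dispatch, showing two of the byte-array encodings from Encodings.md (variant names and layout are illustrative, not the crate's exact code):

```rust
// The encoder enum selects among the byte-array encodings.
enum FallbackEncoder {
    Plain(Vec<u8>),
    DeltaLength { lengths: Vec<i32>, data: Vec<u8> },
    Delta { /* shared-prefix state elided */ },
}

impl FallbackEncoder {
    fn put(&mut self, value: &[u8]) {
        match self {
            // PLAIN: 4-byte little-endian length prefix, then the bytes
            FallbackEncoder::Plain(buf) => {
                buf.extend_from_slice(&(value.len() as u32).to_le_bytes());
                buf.extend_from_slice(value);
            }
            // DELTA_LENGTH_BYTE_ARRAY: lengths and bytes stored separately
            FallbackEncoder::DeltaLength { lengths, data } => {
                lengths.push(value.len() as i32);
                data.extend_from_slice(value);
            }
            FallbackEncoder::Delta { .. } => { /* DELTA_BYTE_ARRAY elided */ }
        }
    }
}

fn main() {
    let mut enc = FallbackEncoder::Plain(Vec::new());
    enc.put(b"parquet");
    if let FallbackEncoder::Plain(buf) = &enc {
        assert_eq!(buf.len(), 4 + 7); // length prefix + payload
    }
}
```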

@tustvold marked this pull request as ready for review on July 29, 2022 16:27
@tustvold merged commit 2c09ba4 into apache:master on Aug 1, 2022
@ursabot commented on Aug 1, 2022

Benchmark runs are scheduled for baseline = 42b15a8 and contender = 2c09ba4. 2c09ba4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
- ec2-t3-xlarge-us-east-2: Skipped ⚠️ (benchmarking of arrow-rs-commits is not supported)
- test-mac-arm: Skipped ⚠️ (benchmarking of arrow-rs-commits is not supported)
- ursa-i9-9960x: Skipped ⚠️ (benchmarking of arrow-rs-commits is not supported)
- ursa-thinkcentre-m75q: Skipped ⚠️ (benchmarking of arrow-rs-commits is not supported)

Buildkite builds:

Supported benchmarks:
- ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
- test-mac-arm: Supported benchmark langs: C++, Python, R
- ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
- ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels: parquet (Changes to the parquet crate)

Projects: None yet

Development

Successfully merging this pull request may close these issues:
- Arrow Parquet BinaryArray Writer Ignores LevelInfo Indices

4 participants