
Automatically grow parquet BitWriter (#2226) (~10% faster) #2231

Merged: 2 commits merged into apache:master on Jul 31, 2022

Conversation

@tustvold (Contributor, author) commented Jul 29, 2022

Which issue does this PR close?

Closes #2226

Rationale for this change

This makes the BitWriter significantly easier to use, simplifies the implementation, and will allow removing a lot of fallibility from the encode path

What changes are included in this PR?

Alters BitWriter to use a growable Vec instead of a fixed-size Vec (a minimal sketch of the idea follows below)
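
For illustration, a minimal sketch of the growth-on-write idea; the type and method names here are illustrative assumptions, not the crate's actual implementation:

```rust
/// Minimal sketch of a writer whose buffer grows on demand.
/// The real BitWriter also packs partial bytes of bits; this only
/// shows why growth removes fallibility from the write path.
struct GrowableWriter {
    buffer: Vec<u8>,
}

impl GrowableWriter {
    /// `capacity` is just a hint; writes past it reallocate instead of failing.
    fn new(capacity: usize) -> Self {
        Self { buffer: Vec::with_capacity(capacity) }
    }

    /// Infallible: the Vec reallocates as needed, so no capacity
    /// check (and no Result) is required on the hot path.
    fn put_bytes(&mut self, bytes: &[u8]) {
        self.buffer.extend_from_slice(bytes);
    }

    fn consume(self) -> Vec<u8> {
        self.buffer
    }
}
```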

Are there any user-facing changes?

No, both the encoding module and the util module are marked experimental

@github-actions bot added the parquet label (Changes to the parquet crate) on Jul 29, 2022
@@ -484,15 +452,6 @@ mod tests {
test_internal_roundtrip_underflow(Encoding::RLE, &levels, max_level, true);
}

#[test]
@tustvold (author) commented:
This test no longer makes sense, as the Vec can just grow

self.buffered_values = 0;
self.byte_offset = self.start;
@tustvold (author) commented:

This could perhaps be considered a behaviour change: previously a start parameter was passed in the constructor, and resetting the writer would reset to this offset. In practice this functionality was never used, and I found it very confusing, so I just removed it.
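
A sketch of what the simplified reset presumably looks like (assumed shape; the struct and field names are illustrative, not copied from the diff):

```rust
// With a growable Vec there is no saved `start` offset to rewind to:
// resetting just clears the accumulated state.
struct BitWriter {
    buffer: Vec<u8>,
    buffered_values: u64,
    bit_offset: u32,
}

impl BitWriter {
    fn clear(&mut self) {
        self.buffer.clear();
        self.buffered_values = 0;
        self.bit_offset = 0;
    }
}
```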

}

impl BitWriter {
pub fn new(max_bytes: usize) -> Self {
Self {
buffer: vec![0; max_bytes],
max_bytes,
buffer: Vec::with_capacity(max_bytes),
@tustvold (author) commented:

An obvious benefit of this is that we no longer zero-initialize this data, which may improve performance (will run benchmarks tomorrow).

@alamb (Contributor) replied:

Also, as I reviewed the rest of this PR, it may mean that the hot path has at least one fewer error check.
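
For context, the standard-library behaviour being relied on here (a demonstration, not code from this PR):

```rust
fn main() {
    // vec![0; n] allocates and zero-initializes n bytes up front
    // (an O(n) memset before any real data is written).
    let zeroed: Vec<u8> = vec![0; 1024];
    assert_eq!(zeroed.len(), 1024);

    // Vec::with_capacity(n) only reserves memory; len starts at 0
    // and bytes are written exactly once, as values are appended.
    let reserved: Vec<u8> = Vec::with_capacity(1024);
    assert_eq!(reserved.len(), 0);
    assert!(reserved.capacity() >= 1024);
}
```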

@@ -138,7 +137,7 @@ where
/// This function should be removed after
/// [`int_roundings`](https://github.com/rust-lang/rust/issues/88581) is stable.
#[inline]
pub fn ceil(value: i64, divisor: i64) -> i64 {
pub fn ceil<T: num::Integer>(value: T, divisor: T) -> T {
@tustvold (author) commented:

Drive-by cleanup to eliminate some unnecessary casting.
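
A sketch of what the generic body can look like, assuming the num crate's Integer trait (the PR's actual implementation may differ):

```rust
/// Ceiling of value / divisor for any integer type.
/// Sketch only: relies on num::Integer::div_ceil, which rounds
/// the quotient toward positive infinity.
#[inline]
pub fn ceil<T: num::Integer>(value: T, divisor: T) -> T {
    value.div_ceil(&divisor)
}

fn main() {
    assert_eq!(ceil(7i64, 8), 1);  // e.g. bits-to-bytes rounding
    assert_eq!(ceil(16i32, 8), 2); // works for i32 too, no casting
}
```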

@@ -658,20 +658,8 @@ pub(crate) mod private {
_: &mut W,
bit_writer: &mut BitWriter,
) -> Result<()> {
if bit_writer.bytes_written() + values.len() / 8 >= bit_writer.capacity() {
@tustvold (author) commented:

This is a little bit funky: bool::encode would actually increase the BitWriter capacity instead of erroring 😅 This is now done automatically everywhere.

@codecov-commenter commented Jul 30, 2022

Codecov Report

Merging #2231 (773ed67) into master (393f006) will decrease coverage by 0.23%.
The diff coverage is 96.92%.

@@            Coverage Diff             @@
##           master    #2231      +/-   ##
==========================================
- Coverage   82.52%   82.29%   -0.24%     
==========================================
  Files         240      241       +1     
  Lines       62267    62354      +87     
==========================================
- Hits        51387    51312      -75     
- Misses      10880    11042     +162     
Impacted Files Coverage Δ
parquet/src/encodings/rle.rs 91.91% <95.65%> (-0.52%) ⬇️
parquet/src/encodings/levels.rs 94.26% <95.83%> (+0.85%) ⬆️
parquet/src/util/bit_util.rs 93.92% <96.82%> (-0.28%) ⬇️
parquet/src/column/writer/mod.rs 92.99% <100.00%> (-0.02%) ⬇️
parquet/src/data_type.rs 74.40% <100.00%> (+0.14%) ⬆️
parquet/src/encodings/encoding/dict_encoder.rs 91.22% <100.00%> (-0.16%) ⬇️
parquet/src/encodings/encoding/mod.rs 94.13% <100.00%> (+0.41%) ⬆️
parquet/src/util/test_common/page_util.rs 86.59% <100.00%> (-0.14%) ⬇️
... and 15 more


@tustvold changed the title from "Automatically grow parquet BitWriter (#2226)" to "Automatically grow parquet BitWriter (#2226) (~10% faster)" on Jul 30, 2022
@tustvold (author) commented:

Benchmarks

write_batch primitive/4096 values primitive                                                                             
                        time:   [1.4416 ms 1.4419 ms 1.4422 ms]
                        thrpt:  [122.33 MiB/s 122.35 MiB/s 122.37 MiB/s]
                 change:
                        time:   [-22.390% -22.341% -22.291%] (p = 0.00 < 0.05)
                        thrpt:  [+28.684% +28.767% +28.849%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Benchmarking write_batch primitive/4096 values primitive non-null: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.1s, enable flat sampling, or reduce sample count to 50.
write_batch primitive/4096 values primitive non-null                                                                             
                        time:   [1.4112 ms 1.4119 ms 1.4127 ms]
                        thrpt:  [122.46 MiB/s 122.53 MiB/s 122.59 MiB/s]
                 change:
                        time:   [-9.7045% -9.6526% -9.6050%] (p = 0.00 < 0.05)
                        thrpt:  [+10.626% +10.684% +10.748%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  7 (7.00%) low mild
  6 (6.00%) high mild
  2 (2.00%) high severe
write_batch primitive/4096 values bool                                                                            
                        time:   [101.05 us 101.09 us 101.14 us]
                        thrpt:  [11.240 MiB/s 11.245 MiB/s 11.250 MiB/s]
                 change:
                        time:   [-3.9386% -3.8687% -3.7961%] (p = 0.00 < 0.05)
                        thrpt:  [+3.9459% +4.0244% +4.1001%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild
  4 (4.00%) high severe
write_batch primitive/4096 values bool non-null                                                                             
                        time:   [43.685 us 43.707 us 43.728 us]
                        thrpt:  [14.830 MiB/s 14.837 MiB/s 14.845 MiB/s]
                 change:
                        time:   [+3.4561% +3.5779% +3.7335%] (p = 0.00 < 0.05)
                        thrpt:  [-3.5992% -3.4543% -3.3406%]
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe
write_batch primitive/4096 values string                                                                            
                        time:   [844.97 us 845.63 us 846.37 us]
                        thrpt:  [94.037 MiB/s 94.119 MiB/s 94.192 MiB/s]
                 change:
                        time:   [-14.692% -14.560% -14.353%] (p = 0.00 < 0.05)
                        thrpt:  [+16.758% +17.041% +17.222%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
write_batch primitive/4096 values string non-null                                                                            
                        time:   [961.22 us 961.63 us 962.07 us]
                        thrpt:  [81.713 MiB/s 81.750 MiB/s 81.785 MiB/s]
                 change:
                        time:   [-6.1525% -6.0546% -5.9641%] (p = 0.00 < 0.05)
                        thrpt:  [+6.3424% +6.4448% +6.5558%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

Benchmarking write_batch nested/4096 values primitive list: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.1s, enable flat sampling, or reduce sample count to 50.
write_batch nested/4096 values primitive list                                                                             
                        time:   [1.7915 ms 1.7917 ms 1.7919 ms]
                        thrpt:  [91.399 MiB/s 91.411 MiB/s 91.423 MiB/s]
                 change:
                        time:   [-6.7029% -6.6786% -6.6536%] (p = 0.00 < 0.05)
                        thrpt:  [+7.1279% +7.1566% +7.1845%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low severe
  6 (6.00%) high mild
  5 (5.00%) high severe
write_batch nested/4096 values primitive list non-null                                                                             
                        time:   [2.1664 ms 2.1672 ms 2.1680 ms]
                        thrpt:  [87.847 MiB/s 87.878 MiB/s 87.910 MiB/s]
                 change:
                        time:   [-9.0613% -9.0201% -8.9817%] (p = 0.00 < 0.05)
                        thrpt:  [+9.8680% +9.9144% +9.9642%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  15 (15.00%) low mild

So, roughly a 10% performance improvement, likely from no longer needing to zero-initialize buffers.

@alamb (Contributor) left a review comment:

I went through this PR carefully and I think it makes the code easier to understand (total bonus points that it makes it faster too) -- I agree that using Vec's accounting rather than a custom type of accounting is a win all around.

Thanks @tustvold

cc @nevi-me and @sunchao

@@ -127,12 +126,11 @@ impl<T: DataType> DictEncoder<T> {
/// the result.
pub fn write_indices(&mut self) -> Result<ByteBufferPtr> {
let buffer_len = self.estimated_data_encoded_size();
let mut buffer = vec![0; buffer_len];
buffer[0] = self.bit_width() as u8;
let mut buffer = Vec::with_capacity(buffer_len);
@alamb (Contributor) commented:
👍
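
Presumably the new shape pushes the bit width instead of indexing into pre-zeroed memory; a hedged sketch (simplified and standalone, names illustrative, not the crate's code):

```rust
// Hedged sketch of the write_indices change, not the actual method.
fn write_indices(bit_width: u8, estimated_len: usize) -> Vec<u8> {
    // Reserve capacity without zero-filling; the first byte is the
    // bit width, pushed rather than assigned via buffer[0].
    let mut buffer = Vec::with_capacity(estimated_len);
    buffer.push(bit_width);
    // ... RLE-encoded dictionary indices are appended here ...
    buffer
}
```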

self.encoder = Some(RleEncoder::new(1, DEFAULT_RLE_BUFFER_LEN));
}
let rle_encoder = self.encoder.as_mut().unwrap();
let rle_encoder = self.encoder.get_or_insert_with(|| {
let len = (buf.len() as i32).to_le();
let len_bytes = len.as_bytes();
let mut encoded_data = vec![];
encoded_data.extend_from_slice(len_bytes);
@alamb (Contributor) commented:

Is this part of the speed increase -- avoiding copying encoded bytes, and instead reserving space for the length up front and updating it at the end?

@tustvold (author) replied on Jul 31, 2022:

Potentially, I suspect it might have contributed

Edit: This is actually only used for encoding booleans, which I don't think the benchmarks actually cover
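
The pattern under discussion, sketched with an assumed 4-byte little-endian i32 length prefix (matching the surrounding code, but not copied from it):

```rust
fn main() {
    let payload: &[u8] = &[0x03, 0xFF]; // stand-in for RLE-encoded bytes
    let mut buffer: Vec<u8> = Vec::new();

    // Reserve 4 bytes for the i32 length prefix up front...
    let len_pos = buffer.len();
    buffer.extend_from_slice(&[0u8; 4]);

    // ...encode directly into the same buffer...
    buffer.extend_from_slice(payload);

    // ...then patch the prefix in place. This avoids building the
    // payload separately and copying it into a second Vec.
    let len = (buffer.len() - len_pos - 4) as i32;
    buffer[len_pos..len_pos + 4].copy_from_slice(&len.to_le_bytes());

    assert_eq!(buffer, vec![2, 0, 0, 0, 0x03, 0xFF]);
}
```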

@@ -873,7 +869,7 @@ mod tests {
let mut values = vec![];
values.extend_from_slice(&[true; 16]);
values.extend_from_slice(&[false; 16]);
run_test::<BoolType>(Encoding::RLE, -1, &values, 0, 2, 0);
run_test::<BoolType>(Encoding::RLE, -1, &values, 0, 6, 0);
@alamb (Contributor) commented:

why did this change from 2 to 6?

@tustvold (author) replied:

Because we reserve space for the length up front now: the 4-byte i32 length prefix plus the 2 bytes of RLE data gives 6.

mem::size_of::<i32>(),
)),
Encoding::RLE => {
buffer.extend_from_slice(&[0; 8]);
@alamb (Contributor) commented:

Suggested change (add a comment above the line):

    // reserve space for length
    buffer.extend_from_slice(&[0; 8]);

Is that what this initial 0 is for?

@tustvold (author) replied:

Yes, although it occurs to me that this allocated 8 bytes instead of 4... Something isn't right here 🤔

@tustvold (author) added:

So this was working because the extra 4 zero bytes would be interpreted as empty RLE runs. I will fix this and add a test that would have caught it.

&mut self.buffer[self.byte_offset..],
);
let num_bytes = ceil(self.bit_offset, 8);
let slice = &self.buffered_values.to_ne_bytes()[..num_bytes as usize];
@alamb (Contributor) commented:

I double-checked that to_ne_bytes() does the same as memcpy_value in terms of byte order. I think it does.

self.buffered_values = 0;
if let Some(remaining) = self.bit_offset.checked_sub(64) {
self.buffer
.extend_from_slice(&self.buffered_values.to_le_bytes());
@alamb (Contributor) commented:

Why does this path use to_le_bytes and the path above use to_ne_bytes? Maybe they should all be to_le_bytes for consistency (and portability?)

@tustvold (author) replied:

I agree, they should all use to_le_bytes
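
A quick demonstration of the difference (standard-library behaviour): to_ne_bytes only matches to_le_bytes on little-endian hosts, so to_le_bytes is the portable choice for an on-disk format like Parquet:

```rust
fn main() {
    let v: u64 = 0x0102_0304_0506_0708;

    // Little-endian is host-independent: least significant byte first.
    assert_eq!(v.to_le_bytes()[0], 0x08);

    // Native-endian agrees with little-endian only on LE targets
    // (x86, aarch64); on a big-endian target the bytes are reversed.
    #[cfg(target_endian = "little")]
    assert_eq!(v.to_ne_bytes(), v.to_le_bytes());
}
```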

@@ -846,16 +760,16 @@ mod tests {
fn test_bit_reader_skip() {
let buffer = vec![255, 0];
let mut bit_reader = BitReader::from(buffer);
let skipped = bit_reader.skip(1,1);
let skipped = bit_reader.skip(1, 1);
@alamb (Contributor) commented:

it is strange that cargo fmt seems to have started caring about this file 🤷

@tustvold (author) replied on Jul 31, 2022:

Yeah, I have no idea either. Perhaps my IDE runs it with some more aggressive setting 🤔

writer.put_aligned(42, 4);
writer.put_aligned_offset(0x10, 1, old_offset);
let result = writer.consume();
assert_eq!(result.as_ref(), [0x10, 42, 0, 0, 0]);

writer = BitWriter::new(4);
let result = writer.skip(5);
assert!(result.is_err());
assert_eq!(result, 0);
@alamb (Contributor) commented:

this is ok now because the buffer automatically grows, right?

@tustvold merged commit 99ad915 into apache:master on Jul 31, 2022
@ursabot commented Jul 31, 2022

Benchmark runs are scheduled for baseline = 6c3f9a2 and contender = 99ad915. 99ad915 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels: parquet (Changes to the parquet crate)
Projects: None yet
Development: Successfully merging this pull request may close these issues: Automatically Grow Parquet BitWriter Buffer
4 participants