
Automatically grow parquet BitWriter (#2226) (~10% faster) #2231

Merged: 2 commits merged into apache:master on Jul 31, 2022

Conversation

@tustvold (Contributor, author) commented Jul 29, 2022

Which issue does this PR close?

Closes #2226

Rationale for this change

This makes the BitWriter significantly easier to use, simplifies the implementation, and will allow removing a lot of fallibility from the encode path

What changes are included in this PR?

Alters BitWriter to use a growable Vec instead of a fixed-size Vec (a minimal sketch of the idea follows below)
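
For illustration, a minimal sketch of the growth-on-write idea; the type and method names here are illustrative assumptions, not the crate's actual implementation:

```rust
/// Minimal sketch of a writer whose buffer grows on demand.
/// The real BitWriter also packs partial bytes of bits; this only
/// shows why growth removes fallibility from the write path.
struct GrowableWriter {
    buffer: Vec<u8>,
}

impl GrowableWriter {
    /// `capacity` is just a hint; writes past it reallocate instead of failing.
    fn new(capacity: usize) -> Self {
        Self { buffer: Vec::with_capacity(capacity) }
    }

    /// Infallible: the Vec reallocates as needed, so no capacity
    /// check (and no Result) is required on the hot path.
    fn put_bytes(&mut self, bytes: &[u8]) {
        self.buffer.extend_from_slice(bytes);
    }

    fn consume(self) -> Vec<u8> {
        self.buffer
    }
}
```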

Are there any user-facing changes?

No, both the encoding module and the util module are marked experimental

@github-actions bot added the parquet label (Changes to the parquet crate) on Jul 29, 2022
@@ -484,15 +452,6 @@ mod tests {
test_internal_roundtrip_underflow(Encoding::RLE, &levels, max_level, true);
}

#[test]
@tustvold (author) commented:
This test no longer makes sense, as the Vec can just grow

self.buffered_values = 0;
self.byte_offset = self.start;
@tustvold (author) commented:

This could perhaps be considered a behaviour change: previously a start parameter was passed in the constructor, and resetting the writer would reset to this offset. In practice this functionality was never used, and I found it very confusing, so I just removed it.
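
A sketch of what the simplified reset presumably looks like (assumed shape; the struct and field names are illustrative, not copied from the diff):

```rust
// With a growable Vec there is no saved `start` offset to rewind to:
// resetting just clears the accumulated state.
struct BitWriter {
    buffer: Vec<u8>,
    buffered_values: u64,
    bit_offset: u32,
}

impl BitWriter {
    fn clear(&mut self) {
        self.buffer.clear();
        self.buffered_values = 0;
        self.bit_offset = 0;
    }
}
```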

}

impl BitWriter {
pub fn new(max_bytes: usize) -> Self {
Self {
buffer: vec![0; max_bytes],
max_bytes,
buffer: Vec::with_capacity(max_bytes),
@tustvold (author) commented:

An obvious benefit of this is that we no longer zero-initialize this data, which may improve performance (will run benchmarks tomorrow).

@alamb (Contributor) replied:

Also, as I reviewed the rest of this PR, it may mean that the hot path has at least one fewer error check.
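
For context, the standard-library behaviour being relied on here (a demonstration, not code from this PR):

```rust
fn main() {
    // vec![0; n] allocates and zero-initializes n bytes up front
    // (an O(n) memset before any real data is written).
    let zeroed: Vec<u8> = vec![0; 1024];
    assert_eq!(zeroed.len(), 1024);

    // Vec::with_capacity(n) only reserves memory; len starts at 0
    // and bytes are written exactly once, as values are appended.
    let reserved: Vec<u8> = Vec::with_capacity(1024);
    assert_eq!(reserved.len(), 0);
    assert!(reserved.capacity() >= 1024);
}
```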

@@ -138,7 +137,7 @@ where
/// This function should be removed after
/// [`int_roundings`](https://github.com/rust-lang/rust/issues/88581) is stable.
#[inline]
pub fn ceil(value: i64, divisor: i64) -> i64 {
pub fn ceil<T: num::Integer>(value: T, divisor: T) -> T {
@tustvold (author) commented:

Drive-by cleanup to eliminate some unnecessary casting.
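
A sketch of what the generic body can look like, assuming the num crate's Integer trait (the PR's actual implementation may differ):

```rust
/// Ceiling of value / divisor for any integer type.
/// Sketch only: relies on num::Integer::div_ceil, which rounds
/// the quotient toward positive infinity.
#[inline]
pub fn ceil<T: num::Integer>(value: T, divisor: T) -> T {
    value.div_ceil(&divisor)
}

fn main() {
    assert_eq!(ceil(7i64, 8), 1);  // e.g. bits-to-bytes rounding
    assert_eq!(ceil(16i32, 8), 2); // works for i32 too, no casting
}
```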

@@ -658,20 +658,8 @@ pub(crate) mod private {
_: &mut W,
bit_writer: &mut BitWriter,
) -> Result<()> {
if bit_writer.bytes_written() + values.len() / 8 >= bit_writer.capacity() {
@tustvold (author) commented:

This is a little bit funky: bool::encode would actually increase the BitWriter capacity instead of erroring 😅 This is now done automatically everywhere.

@codecov-commenter commented Jul 30, 2022

Codecov Report

Merging #2231 (773ed67) into master (393f006) will decrease coverage by 0.23%.
The diff coverage is 96.92%.

@@            Coverage Diff             @@
##           master    #2231      +/-   ##
==========================================
- Coverage   82.52%   82.29%   -0.24%     
==========================================
  Files         240      241       +1     
  Lines       62267    62354      +87     
==========================================
- Hits        51387    51312      -75     
- Misses      10880    11042     +162     
Impacted Files Coverage Δ
parquet/src/encodings/rle.rs 91.91% <95.65%> (-0.52%) ⬇️
parquet/src/encodings/levels.rs 94.26% <95.83%> (+0.85%) ⬆️
parquet/src/util/bit_util.rs 93.92% <96.82%> (-0.28%) ⬇️
parquet/src/column/writer/mod.rs 92.99% <100.00%> (-0.02%) ⬇️
parquet/src/data_type.rs 74.40% <100.00%> (+0.14%) ⬆️
parquet/src/encodings/encoding/dict_encoder.rs 91.22% <100.00%> (-0.16%) ⬇️
parquet/src/encodings/encoding/mod.rs 94.13% <100.00%> (+0.41%) ⬆️
parquet/src/util/test_common/page_util.rs 86.59% <100.00%> (-0.14%) ⬇️
... and 15 more


@tustvold changed the title from "Automatically grow parquet BitWriter (#2226)" to "Automatically grow parquet BitWriter (#2226) (~10% faster)" on Jul 30, 2022
@tustvold (author) commented:

Benchmarks

write_batch primitive/4096 values primitive                                                                             
                        time:   [1.4416 ms 1.4419 ms 1.4422 ms]
                        thrpt:  [122.33 MiB/s 122.35 MiB/s 122.37 MiB/s]
                 change:
                        time:   [-22.390% -22.341% -22.291%] (p = 0.00 < 0.05)
                        thrpt:  [+28.684% +28.767% +28.849%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Benchmarking write_batch primitive/4096 values primitive non-null: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.1s, enable flat sampling, or reduce sample count to 50.
write_batch primitive/4096 values primitive non-null                                                                             
                        time:   [1.4112 ms 1.4119 ms 1.4127 ms]
                        thrpt:  [122.46 MiB/s 122.53 MiB/s 122.59 MiB/s]
                 change:
                        time:   [-9.7045% -9.6526% -9.6050%] (p = 0.00 < 0.05)
                        thrpt:  [+10.626% +10.684% +10.748%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  7 (7.00%) low mild
  6 (6.00%) high mild
  2 (2.00%) high severe
write_batch primitive/4096 values bool                                                                            
                        time:   [101.05 us 101.09 us 101.14 us]
                        thrpt:  [11.240 MiB/s 11.245 MiB/s 11.250 MiB/s]
                 change:
                        time:   [-3.9386% -3.8687% -3.7961%] (p = 0.00 < 0.05)
                        thrpt:  [+3.9459% +4.0244% +4.1001%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild
  4 (4.00%) high severe
write_batch primitive/4096 values bool non-null                                                                             
                        time:   [43.685 us 43.707 us 43.728 us]
                        thrpt:  [14.830 MiB/s 14.837 MiB/s 14.845 MiB/s]
                 change:
                        time:   [+3.4561% +3.5779% +3.7335%] (p = 0.00 < 0.05)
                        thrpt:  [-3.5992% -3.4543% -3.3406%]
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe
write_batch primitive/4096 values string                                                                            
                        time:   [844.97 us 845.63 us 846.37 us]
                        thrpt:  [94.037 MiB/s 94.119 MiB/s 94.192 MiB/s]
                 change:
                        time:   [-14.692% -14.560% -14.353%] (p = 0.00 < 0.05)
                        thrpt:  [+16.758% +17.041% +17.222%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
write_batch primitive/4096 values string non-null                                                                            
                        time:   [961.22 us 961.63 us 962.07 us]
                        thrpt:  [81.713 MiB/s 81.750 MiB/s 81.785 MiB/s]
                 change:
                        time:   [-6.1525% -6.0546% -5.9641%] (p = 0.00 < 0.05)
                        thrpt:  [+6.3424% +6.4448% +6.5558%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

Benchmarking write_batch nested/4096 values primitive list: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.1s, enable flat sampling, or reduce sample count to 50.
write_batch nested/4096 values primitive list                                                                             
                        time:   [1.7915 ms 1.7917 ms 1.7919 ms]
                        thrpt:  [91.399 MiB/s 91.411 MiB/s 91.423 MiB/s]
                 change:
                        time:   [-6.7029% -6.6786% -6.6536%] (p = 0.00 < 0.05)
                        thrpt:  [+7.1279% +7.1566% +7.1845%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low severe
  6 (6.00%) high mild
  5 (5.00%) high severe
write_batch nested/4096 values primitive list non-null                                                                             
                        time:   [2.1664 ms 2.1672 ms 2.1680 ms]
                        thrpt:  [87.847 MiB/s 87.878 MiB/s 87.910 MiB/s]
                 change:
                        time:   [-9.0613% -9.0201% -8.9817%] (p = 0.00 < 0.05)
                        thrpt:  [+9.8680% +9.9144% +9.9642%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  15 (15.00%) low mild

So, roughly a 10% performance improvement, likely from no longer needing to zero-initialize buffers.

@alamb (Contributor) left a review comment:

I went through this PR carefully and I think it makes the code easier to understand (total bonus points that it makes it faster too) -- I agree that using Vec's accounting rather than a custom type of accounting is a win all around.

Thanks @tustvold

cc @nevi-me and @sunchao

@@ -127,12 +126,11 @@ impl<T: DataType> DictEncoder<T> {
/// the result.
pub fn write_indices(&mut self) -> Result<ByteBufferPtr> {
let buffer_len = self.estimated_data_encoded_size();
let mut buffer = vec![0; buffer_len];
buffer[0] = self.bit_width() as u8;
let mut buffer = Vec::with_capacity(buffer_len);
@alamb (Contributor) commented:
👍
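
Presumably the new shape pushes the bit width instead of indexing into pre-zeroed memory; a hedged sketch (simplified and standalone, names illustrative, not the crate's code):

```rust
// Hedged sketch of the write_indices change, not the actual method.
fn write_indices(bit_width: u8, estimated_len: usize) -> Vec<u8> {
    // Reserve capacity without zero-filling; the first byte is the
    // bit width, pushed rather than assigned via buffer[0].
    let mut buffer = Vec::with_capacity(estimated_len);
    buffer.push(bit_width);
    // ... RLE-encoded dictionary indices are appended here ...
    buffer
}
```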

self.encoder = Some(RleEncoder::new(1, DEFAULT_RLE_BUFFER_LEN));
}
let rle_encoder = self.encoder.as_mut().unwrap();
let rle_encoder = self.encoder.get_or_insert_with(|| {
let len = (buf.len() as i32).to_le();
let len_bytes = len.as_bytes();
let mut encoded_data = vec![];
encoded_data.extend_from_slice(len_bytes);
@alamb (Contributor) commented:

Is this part of the speed increase -- avoiding copying encoded bytes, and instead reserving space for the length up front and updating it at the end?

@tustvold (author) replied on Jul 31, 2022:

Potentially, I suspect it might have contributed

Edit: This is actually only used for encoding booleans, which I don't think the benchmarks actually cover
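
The pattern under discussion, sketched with an assumed 4-byte little-endian i32 length prefix (matching the surrounding code, but not copied from it):

```rust
fn main() {
    let payload: &[u8] = &[0x03, 0xFF]; // stand-in for RLE-encoded bytes
    let mut buffer: Vec<u8> = Vec::new();

    // Reserve 4 bytes for the i32 length prefix up front...
    let len_pos = buffer.len();
    buffer.extend_from_slice(&[0u8; 4]);

    // ...encode directly into the same buffer...
    buffer.extend_from_slice(payload);

    // ...then patch the prefix in place. This avoids building the
    // payload separately and copying it into a second Vec.
    let len = (buffer.len() - len_pos - 4) as i32;
    buffer[len_pos..len_pos + 4].copy_from_slice(&len.to_le_bytes());

    assert_eq!(buffer, vec![2, 0, 0, 0, 0x03, 0xFF]);
}
```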

@@ -873,7 +869,7 @@ mod tests {
let mut values = vec![];
values.extend_from_slice(&[true; 16]);
values.extend_from_slice(&[false; 16]);
run_test::<BoolType>(Encoding::RLE, -1, &values, 0, 2, 0);
run_test::<BoolType>(Encoding::RLE, -1, &values, 0, 6, 0);
@alamb (Contributor) commented:

why did this change from 2 to 6?

@tustvold (author) replied:

Because we reserve space for the length up front now: the 4-byte i32 length prefix plus the 2 bytes of RLE data gives 6.

mem::size_of::<i32>(),
)),
Encoding::RLE => {
buffer.extend_from_slice(&[0; 8]);
@alamb (Contributor) commented:

Suggested change (add a comment above the line):

    // reserve space for length
    buffer.extend_from_slice(&[0; 8]);

Is that what this initial 0 is for?

@tustvold (author) replied:

Yes, although it occurs to me that this allocated 8 bytes instead of 4... Something isn't right here 🤔

@tustvold (author) added:

So this was working because the extra 4 zero bytes would be interpreted as empty RLE runs. I will fix this and add a test that would have caught it.

&mut self.buffer[self.byte_offset..],
);
let num_bytes = ceil(self.bit_offset, 8);
let slice = &self.buffered_values.to_ne_bytes()[..num_bytes as usize];
@alamb (Contributor) commented:

I double-checked that to_ne_bytes() does the same as memcpy_value in terms of byte order. I think it does.

self.buffered_values = 0;
if let Some(remaining) = self.bit_offset.checked_sub(64) {
self.buffer
.extend_from_slice(&self.buffered_values.to_le_bytes());
@alamb (Contributor) commented:

Why does this path use to_le_bytes and the path above use to_ne_bytes? Maybe they should all be to_le_bytes for consistency (and portability?)

@tustvold (author) replied:

I agree, they should all use to_le_bytes
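
A quick demonstration of the difference (standard-library behaviour): to_ne_bytes only matches to_le_bytes on little-endian hosts, so to_le_bytes is the portable choice for an on-disk format like Parquet:

```rust
fn main() {
    let v: u64 = 0x0102_0304_0506_0708;

    // Little-endian is host-independent: least significant byte first.
    assert_eq!(v.to_le_bytes()[0], 0x08);

    // Native-endian agrees with little-endian only on LE targets
    // (x86, aarch64); on a big-endian target the bytes are reversed.
    #[cfg(target_endian = "little")]
    assert_eq!(v.to_ne_bytes(), v.to_le_bytes());
}
```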

@@ -846,16 +760,16 @@ mod tests {
fn test_bit_reader_skip() {
let buffer = vec![255, 0];
let mut bit_reader = BitReader::from(buffer);
let skipped = bit_reader.skip(1,1);
let skipped = bit_reader.skip(1, 1);
@alamb (Contributor) commented:

it is strange that cargo fmt seems to have started caring about this file 🤷

@tustvold (author) replied on Jul 31, 2022:

Yeah, I have no idea either. Perhaps my IDE runs it with some more aggressive setting 🤔

writer.put_aligned(42, 4);
writer.put_aligned_offset(0x10, 1, old_offset);
let result = writer.consume();
assert_eq!(result.as_ref(), [0x10, 42, 0, 0, 0]);

writer = BitWriter::new(4);
let result = writer.skip(5);
assert!(result.is_err());
assert_eq!(result, 0);
@alamb (Contributor) commented:

this is ok now because the buffer automatically grows, right?

@tustvold merged commit 99ad915 into apache:master on Jul 31, 2022
@ursabot commented Jul 31, 2022

Benchmark runs are scheduled for baseline = 6c3f9a2 and contender = 99ad915. 99ad915 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels: parquet (Changes to the parquet crate)
Projects: None yet
Development: Successfully merging this pull request may close these issues: Automatically Grow Parquet BitWriter Buffer
4 participants