Push gather down to Parquet Encoder #2109

Closed
tustvold wants to merge 3 commits

Conversation

tustvold

Which issue does this PR close?

Part of #1764

Rationale for this change

Data is unnecessarily copied prior to writing it out

What changes are included in this PR?

Previously, the take operation needed to handle nulls, lists, etc. was performed before writing the data. This PR pushes it down into the encoder, avoiding an unnecessary copy.
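
To illustrate the idea, here is a minimal sketch with toy types (not the parquet crate's actual `ColumnValueEncoder` API): instead of materialising a gathered copy of the non-null values and handing that buffer to the encoder, the caller hands the encoder the original values plus the indices to gather.

```rust
// Illustrative sketch only: `ToyEncoder` is a stand-in for the real encoder.
struct ToyEncoder {
    encoded: Vec<i32>,
}

impl ToyEncoder {
    fn new() -> Self {
        Self { encoded: Vec::new() }
    }

    /// Old flow: the caller has already gathered the values into a contiguous buffer.
    fn write(&mut self, values: &[i32]) {
        self.encoded.extend_from_slice(values);
    }

    /// New flow: the encoder gathers by index itself, so the caller never
    /// allocates an intermediate buffer of the non-null values.
    fn write_gather(&mut self, values: &[i32], indices: &[usize]) {
        self.encoded.extend(indices.iter().map(|&i| values[i]));
    }
}

fn main() {
    let values = [10, 20, 30, 40, 50];
    let valid = [0usize, 2, 4]; // slots that are non-null

    // Before: gather into a temporary Vec, then write it.
    let mut before = ToyEncoder::new();
    let gathered: Vec<i32> = valid.iter().map(|&i| values[i]).collect();
    before.write(&gathered);

    // After: push the gather down into the encoder.
    let mut after = ToyEncoder::new();
    after.write_gather(&values, &valid);

    assert_eq!(before.encoded, after.encoded);
}
```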

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet (Changes to the parquet crate) label on Jul 19, 2022
let array = arrow::compute::cast(column, &ArrowDataType::Date32)?;
arrow::compute::cast(&array, &ArrowDataType::Int32)?
} else {
arrow::compute::cast(column, &ArrowDataType::Int32)?
This if statement was somewhat redundant; I suspect it dates from a refactor at some point.
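
For a column that is already Date32, both branches yield the same Int32 values, which is why they can be collapsed. A small standalone check of that equivalence (the array contents are made up for illustration, and the exact `cast` signature varies slightly across arrow versions):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Date32Array};
use arrow::datatypes::DataType;

fn main() -> arrow::error::Result<()> {
    // Days since the UNIX epoch; the values are arbitrary.
    let column: ArrayRef = Arc::new(Date32Array::from(vec![Some(18_000), None, Some(18_500)]));

    // The `if` branch: cast to Date32 (a no-op here) and then to Int32.
    let via_date = arrow::compute::cast(&column, &DataType::Date32)?;
    let two_step = arrow::compute::cast(&via_date, &DataType::Int32)?;

    // The `else` branch: cast straight to Int32.
    let direct = arrow::compute::cast(&column, &DataType::Int32)?;

    // Both reinterpret the same underlying i32 day values.
    println!("{:?}", two_step);
    println!("{:?}", direct);
    Ok(())
}
```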

@@ -77,6 +81,9 @@ pub trait ColumnValueEncoder {
/// Write the corresponding values to this [`ColumnValueEncoder`]
fn write(&mut self, values: &Self::Values, offset: usize, len: usize) -> Result<()>;

/// Write the values at the given indices to this [`ColumnValueEncoder`]
fn write_gather(&mut self, values: &Self::Values, indices: &[usize]) -> Result<()>;
I'm not totally sold on this name; suggestions welcome.
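
Whatever the final name, the intent is roughly the following. This is a simplified stand-in for the trait and its caller, not the crate's actual definitions: when every slot in the batch is valid, the contiguous write path is used; otherwise only the valid indices are passed down.

```rust
// Simplified stand-in; the real trait lives in parquet/src/column/writer/encoder.rs.
type Result<T> = std::result::Result<T, String>;

trait ColumnValueEncoder {
    type Values: ?Sized;

    /// Write a contiguous run of values.
    fn write(&mut self, values: &Self::Values, offset: usize, len: usize) -> Result<()>;

    /// Write only the values at the given indices (e.g. the non-null slots).
    fn write_gather(&mut self, values: &Self::Values, indices: &[usize]) -> Result<()>;
}

/// Sketch of the caller: dispatch on whether a gather is actually required.
fn write_values<E: ColumnValueEncoder>(
    encoder: &mut E,
    values: &E::Values,
    len: usize,
    valid_indices: Option<&[usize]>,
) -> Result<()> {
    match valid_indices {
        // No nulls (and no nesting): write the contiguous run directly.
        None => encoder.write(values, 0, len),
        // Otherwise let the encoder gather the valid slots itself.
        Some(indices) => encoder.write_gather(values, indices),
    }
}
```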

@tustvold

I intend to run the benchmarks shortly and report back

@codecov-commenter

Codecov Report

Merging #2109 (33a9c2d) into master (efd3152) will decrease coverage by 0.02%.
The diff coverage is 73.86%.

@@            Coverage Diff             @@
##           master    #2109      +/-   ##
==========================================
- Coverage   83.73%   83.70%   -0.03%     
==========================================
  Files         225      225              
  Lines       59412    59442      +30     
==========================================
+ Hits        49748    49758      +10     
- Misses       9664     9684      +20     
Impacted Files                            Coverage Δ
parquet/src/encodings/encoding.rs         90.38% <20.83%> (-3.25%) ⬇️
parquet/src/column/writer/mod.rs          92.39% <72.72%> (-0.15%) ⬇️
parquet/src/column/writer/encoder.rs      90.90% <96.29%> (+2.44%) ⬆️
parquet/src/arrow/arrow_writer/mod.rs     97.62% <100.00%> (+0.08%) ⬆️
arrow/src/datatypes/datatype.rs           64.05% <0.00%> (-0.36%) ⬇️

@tustvold

tustvold commented Jul 19, 2022

Interestingly, the performance gain is extremely minor, which might suggest the bottleneck is elsewhere 🤔

write_batch primitive/4096 values primitive                                                                             
                        time:   [1.8651 ms 1.8660 ms 1.8670 ms]
                        thrpt:  [94.492 MiB/s 94.544 MiB/s 94.587 MiB/s]
                 change:
                        time:   [-3.4958% -3.4494% -3.4001%] (p = 0.00 < 0.05)
                        thrpt:  [+3.5198% +3.5727% +3.6224%]
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe
Benchmarking write_batch primitive/4096 values primitive non-null: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.1s, enable flat sampling, or reduce sample count to 50.
write_batch primitive/4096 values primitive non-null                                                                             
                        time:   [1.6058 ms 1.6061 ms 1.6064 ms]
                        thrpt:  [107.69 MiB/s 107.71 MiB/s 107.73 MiB/s]
                 change:
                        time:   [-4.8854% -4.8376% -4.7946%] (p = 0.00 < 0.05)
                        thrpt:  [+5.0361% +5.0836% +5.1363%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
write_batch primitive/4096 values bool                                                                            
                        time:   [108.44 us 108.48 us 108.52 us]
                        thrpt:  [10.476 MiB/s 10.479 MiB/s 10.483 MiB/s]
                 change:
                        time:   [-3.5762% -3.4930% -3.3425%] (p = 0.00 < 0.05)
                        thrpt:  [+3.4581% +3.6195% +3.7089%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

Edit: See #2123

@tustvold tustvold marked this pull request as draft July 19, 2022 21:40
@tustvold tustvold marked this pull request as ready for review July 21, 2022 21:56
@tustvold

This is a necessary precondition of #1764, so marking ready for review. Will fix up merge conflicts once #2124 is merged.

@tustvold tustvold marked this pull request as draft July 23, 2022 18:12
@tustvold

Marking as a draft whilst I think a bit more on this

@tustvold

Going to roll this into the optimized byte array PR

@tustvold tustvold closed this Jul 29, 2022