
Faster parquet DictEncoder (~20%) #2123

Merged
merged 5 commits into from Jul 29, 2022

Conversation

tustvold (Contributor) commented Jul 21, 2022

Which issue does this PR close?

Part of #1764

Rationale for this change

The existing implementation is complex and slower.

What changes are included in this PR?

Gives the encoder the same treatment as #1861, switching to using ahash and hashbrown.

Are there any user-facing changes?

No
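For context on what a dictionary encoder does, here is a minimal stdlib sketch of the core idea: a map from each distinct value to its dictionary index. This is illustrative only; the PR itself uses hashbrown's raw-table API with ahash so the key is not stored twice, and `dict_encode` is a hypothetical name, not the crate's API.

```rust
use std::collections::HashMap;

/// Minimal dictionary-encoding sketch: returns the dictionary
/// (distinct values in first-seen order) and one index per input value.
fn dict_encode(values: &[&str]) -> (Vec<String>, Vec<u64>) {
    let mut lookup: HashMap<&str, u64> = HashMap::new();
    let mut dictionary = Vec::new();
    let indices = values
        .iter()
        .map(|&v| {
            // First sighting of a value appends it to the dictionary;
            // repeats reuse the stored index.
            *lookup.entry(v).or_insert_with(|| {
                dictionary.push(v.to_string());
                (dictionary.len() - 1) as u64
            })
        })
        .collect();
    (dictionary, indices)
}

fn main() {
    let (dict, indices) = dict_encode(&["a", "b", "a", "c", "b"]);
    assert_eq!(dict, vec!["a", "b", "c"]);
    assert_eq!(indices, vec![0, 1, 0, 2, 1]);
}
```

The hash lookup per value is the hot path, which is why the choice of hasher and hash-table implementation shows up directly in the benchmarks below.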

tustvold (Contributor, Author) commented Jul 21, 2022

Running benchmarks with just the change to ahash shows no significant performance change. This is not entirely surprising, as the current implementation uses crc32, which is very cheap to compute (although not DoS resistant).

The change to hashbrown nets a non-trivial improvement where value encoding is the major bottleneck; this diminishes as additional overheads from nulls, lists, etc. take effect.

write_batch primitive/4096 values primitive                                                                             
                        time:   [1.5325 ms 1.5331 ms 1.5338 ms]
                        thrpt:  [115.02 MiB/s 115.07 MiB/s 115.12 MiB/s]
                 change:
                        time:   [-20.677% -20.632% -20.590%] (p = 0.00 < 0.05)
                        thrpt:  [+25.929% +25.995% +26.068%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
Benchmarking write_batch primitive/4096 values primitive non-null: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.5s, enable flat sampling, or reduce sample count to 50.
write_batch primitive/4096 values primitive non-null                                                                             
                        time:   [1.4838 ms 1.4847 ms 1.4857 ms]
                        thrpt:  [116.44 MiB/s 116.52 MiB/s 116.59 MiB/s]
                 change:
                        time:   [-12.080% -12.017% -11.954%] (p = 0.00 < 0.05)
                        thrpt:  [+13.577% +13.659% +13.739%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
write_batch primitive/4096 values bool                                                                            
                        time:   [111.01 us 111.09 us 111.19 us]
                        thrpt:  [10.224 MiB/s 10.233 MiB/s 10.240 MiB/s]
                 change:
                        time:   [-0.8794% -0.6831% -0.4488%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4508% +0.6878% +0.8872%]
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
write_batch primitive/4096 values bool non-null                                                                            
                        time:   [52.931 us 53.012 us 53.094 us]
                        thrpt:  [21.411 MiB/s 21.444 MiB/s 21.477 MiB/s]
                 change:
                        time:   [-2.2177% -2.1085% -1.9913%] (p = 0.00 < 0.05)
                        thrpt:  [+2.0318% +2.1539% +2.2680%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) high mild
  10 (10.00%) high severe
write_batch primitive/4096 values string                                                                            
                        time:   [891.20 us 891.52 us 891.88 us]
                        thrpt:  [89.239 MiB/s 89.275 MiB/s 89.306 MiB/s]
                 change:
                        time:   [-8.4838% -8.4391% -8.3955%] (p = 0.00 < 0.05)
                        thrpt:  [+9.1650% +9.2170% +9.2703%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
Benchmarking write_batch primitive/4096 values string non-null: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.2s, enable flat sampling, or reduce sample count to 60.
write_batch primitive/4096 values string non-null                                                                             
                        time:   [1.0208 ms 1.0213 ms 1.0218 ms]
                        thrpt:  [77.889 MiB/s 77.931 MiB/s 77.970 MiB/s]
                 change:
                        time:   [+0.0730% +0.1746% +0.2545%] (p = 0.00 < 0.05)
                        thrpt:  [-0.2538% -0.1743% -0.0730%]
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Benchmarking write_batch nested/4096 values primitive list: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.8s, enable flat sampling, or reduce sample count to 50.
write_batch nested/4096 values primitive list                                                                             
                        time:   [1.9798 ms 2.0064 ms 2.0368 ms]
                        thrpt:  [80.409 MiB/s 81.627 MiB/s 82.725 MiB/s]
                 change:
                        time:   [+0.9435% +1.8832% +3.0013%] (p = 0.00 < 0.05)
                        thrpt:  [-2.9139% -1.8484% -0.9347%]
                        Change within noise threshold.
Found 19 outliers among 100 measurements (19.00%)
  1 (1.00%) high mild
  18 (18.00%) high severe
write_batch nested/4096 values primitive list non-null                                                                             
                        time:   [2.4385 ms 2.4696 ms 2.5038 ms]
                        thrpt:  [76.896 MiB/s 77.959 MiB/s 78.952 MiB/s]
                 change:
                        time:   [-0.1096% +1.1302% +2.5102%] (p = 0.10 > 0.05)
                        thrpt:  [-2.4488% -1.1176% +0.1097%]
                        No change in performance detected.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 21, 2022
@tustvold tustvold changed the title Faster parquet DictEncoder Faster parquet DictEncoder (~20%) Jul 21, 2022
codecov-commenter commented Jul 21, 2022

Codecov Report

Merging #2123 (8be38af) into master (5e3facf) will decrease coverage by 1.15%.
The diff coverage is 90.24%.

@@            Coverage Diff             @@
##           master    #2123      +/-   ##
==========================================
- Coverage   83.71%   82.55%   -1.16%     
==========================================
  Files         225      240      +15     
  Lines       59567    62199    +2632     
==========================================
+ Hits        49865    51349    +1484     
- Misses       9702    10850    +1148     
Impacted Files Coverage Δ
parquet/src/encodings/encoding/mod.rs 93.72% <50.00%> (ø)
parquet/src/util/interner.rs 90.90% <90.90%> (ø)
parquet/src/encodings/encoding/dict_encoder.rs 91.37% <91.37%> (ø)
parquet/src/column/page.rs 83.33% <0.00%> (-15.36%) ⬇️
arrow/src/array/iterator.rs 86.45% <0.00%> (-9.66%) ⬇️
arrow/src/array/array_string.rs 92.05% <0.00%> (-5.71%) ⬇️
arrow/src/util/decimal.rs 86.92% <0.00%> (-4.59%) ⬇️
arrow/src/array/array.rs 87.75% <0.00%> (-4.06%) ⬇️
arrow/src/datatypes/schema.rs 70.41% <0.00%> (-3.05%) ⬇️
arrow/src/array/builder/generic_list_builder.rs 92.59% <0.00%> (-2.47%) ⬇️
... and 58 more


@@ -49,6 +50,7 @@ serde_json = { version = "1.0", default-features = false, features = ["std"], op
rand = { version = "0.8", default-features = false, features = ["std", "std_rng"] }
futures = { version = "0.3", default-features = false, features = ["std"], optional = true }
tokio = { version = "1.0", optional = true, default-features = false, features = ["macros", "fs", "rt", "io-util"] }
hashbrown = { version = "0.12", default-features = false }
Contributor:

There is a feature "inline-more" which is enabled by default in hashbrown and sometimes gives a bit better performance.

tustvold (Contributor, Author):

By disabling this here, we can delegate that decision downstream
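Because Cargo features are additive, a downstream crate that wants the extra inlining can opt back in itself; hypothetically, in its own Cargo.toml:

```toml
[dependencies]
hashbrown = { version = "0.12", features = ["inline-more"] }
```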


impl<T: DataType> Encoder<T> for DictEncoder<T> {
    fn put(&mut self, values: &[T::T]) -> Result<()> {
        for i in values {
Contributor:

Not sure if it's a bottleneck, it might be faster to compute the hashes for values in one go (i.e. vectorized)?
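The idea of hashing all values up front, separately from the table probes, can be sketched with the stdlib hasher; this is an illustrative pattern (names are hypothetical, and the crate uses ahash rather than std's SipHash):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

/// Compute all hashes in one pass so the lookup loop doesn't
/// interleave hashing with table probes (hypothetical sketch).
fn hash_batch<T: Hash>(state: &RandomState, values: &[T]) -> Vec<u64> {
    values
        .iter()
        .map(|v| {
            let mut hasher = state.build_hasher();
            v.hash(&mut hasher);
            hasher.finish()
        })
        .collect()
}

fn main() {
    let state = RandomState::new();
    let hashes = hash_batch(&state, &["a", "b", "a"]);
    // Identical values hash identically under the same RandomState.
    assert_eq!(hashes[0], hashes[2]);
    assert_eq!(hashes.len(), 3);
}
```

Whether this helps in practice depends on how much of the cost is hashing versus probing; profiling would have to confirm it.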

alamb (Contributor) left a comment:

The code looks good to me but I am concerned about the new dependencies as I believe some people use parquet after compiling to WASM or on embedded devices.

I am curious what other maintainers think too

cc @sunchao @nevi-me @viirya @HaoYang670

@@ -30,6 +30,7 @@ edition = "2021"
rust-version = "1.57"

[dependencies]
ahash = "0.7"
Contributor:

These seem to be new dependencies (if optional features are not enabled)


state: ahash::RandomState,

/// Used to provide a lookup from value to unique value
Contributor:

Given the replication of this pattern (maybe now in three places?) perhaps we can factor it into its own structure, mostly for readability as the use of HashMap to implement a HashSet takes some thought to totally grok

tustvold (Contributor, Author):

I did consider this, but I was unsure where to put it. It can't live in arrow, as parquet needs to compile without arrow, but aside from creating a new crate I wasn't really sure where to put it...

tustvold (Contributor, Author) commented:
I'm going to get this in as I need it for #1764, we have time until the next release to address any issues.

@tustvold tustvold merged commit 6ce4c4e into apache:master Jul 29, 2022
ursabot commented Jul 29, 2022

Benchmark runs are scheduled for baseline = 985760f and contender = 6ce4c4e. 6ce4c4e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

use std::hash::Hash;

/// Storage trait for [`Interner`]
pub trait Storage {
Contributor:
👍
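For readers puzzling over the "HashMap used as a HashSet" pattern mentioned above, here is a minimal sketch of a generic interner over a storage trait, built on std's HashMap. It is illustrative only: the trait shape and names are assumptions, and the crate's actual `Interner` uses hashbrown's raw-table API with ahash to avoid storing each key twice.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Backing store for interned values (illustrative, not the crate's exact API).
trait Storage {
    type Key: Copy;
    type Value: Clone + Eq + Hash;
    /// Append a newly seen value, returning its key.
    fn push(&mut self, value: Self::Value) -> Self::Key;
}

/// Dedupes values, handing back a compact key per distinct value.
struct Interner<S: Storage> {
    lookup: HashMap<S::Value, S::Key>,
    storage: S,
}

impl<S: Storage> Interner<S> {
    fn new(storage: S) -> Self {
        Self { lookup: HashMap::new(), storage }
    }

    fn intern(&mut self, value: S::Value) -> S::Key {
        // Disjoint field borrows: the map is probed while storage is appended to.
        let storage = &mut self.storage;
        *self
            .lookup
            .entry(value)
            .or_insert_with_key(|v| storage.push(v.clone()))
    }
}

/// Simple Vec-backed storage: the key is the value's index.
struct VecStorage(Vec<String>);

impl Storage for VecStorage {
    type Key = usize;
    type Value = String;
    fn push(&mut self, value: String) -> usize {
        self.0.push(value);
        self.0.len() - 1
    }
}

fn main() {
    let mut interner = Interner::new(VecStorage(Vec::new()));
    assert_eq!(interner.intern("a".into()), 0);
    assert_eq!(interner.intern("b".into()), 1);
    assert_eq!(interner.intern("a".into()), 0); // deduped
    assert_eq!(interner.storage.0, vec!["a", "b"]);
}
```

Factoring the pattern behind a `Storage` trait lets the dictionary encoder own page-building concerns while the interner owns deduplication, which is the readability win the review comment asks for.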
