Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster StringDictionaryBuilder (~60% faster) (#1851) #1861

Merged
merged 1 commit into from Jun 29, 2022

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Jun 13, 2022

Which issue does this PR close?

Closes #1851
Relates to #1843

Rationale for this change

StringDictionaryBuilder can be made significantly faster

What changes are included in this PR?

There are two major changes in this PR

  • Switch to ahash
  • Avoid caching string keys in HashMap

The first is ~40% uplift regardless of data shape, the latter adds a further performance improvement ranging from ~10-20% depending on the dictionary size.

string_dictionary_builder/(dict_size:20, len:1000, key_len: 5)                                                                             
                        time:   [15.148 us 15.179 us 15.213 us]
                        change: [-49.937% -49.607% -49.183%] (p = 0.00 < 0.05)
                        Performance has improved.
string_dictionary_builder/(dict_size:100, len:1000, key_len: 5)                                                                             
                        time:   [15.334 us 15.372 us 15.408 us]
                        change: [-60.780% -60.676% -60.577%] (p = 0.00 < 0.05)
                        Performance has improved.
string_dictionary_builder/(dict_size:100, len:1000, key_len: 10)                                                                             
                        time:   [14.638 us 14.653 us 14.668 us]
                        change: [-66.763% -66.716% -66.673%] (p = 0.00 < 0.05)
                        Performance has improved.
string_dictionary_builder/(dict_size:100, len:10000, key_len: 10)                                                                            
                        time:   [131.08 us 131.15 us 131.23 us]
                        change: [-61.008% -60.966% -60.922%] (p = 0.00 < 0.05)
                        Performance has improved.
string_dictionary_builder/(dict_size:100, len:10000, key_len: 100)                                                                            
                        time:   [379.73 us 379.89 us 380.06 us]
                        change: [-61.999% -61.946% -61.887%] (p = 0.00 < 0.05)
                        Performance has improved.

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 13, 2022
@tustvold tustvold force-pushed the faster-string-dictionary-builder branch from 7aed1fe to 14bbd49 Compare June 13, 2022 11:42
arrow/Cargo.toml Outdated
serde = { version = "1.0" }
serde_derive = "1.0"
serde_json = { version = "1.0", features = ["preserve_order"] }
hashbrown = "0.12"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is "technically" a new dependency as indexmap depends on hashbrown = 0.11, I wasn't sure whether to use an out of data dependency or not here...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it is time to submit a PR to update index map?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There has been an open issue since February - indexmap-rs/indexmap#217

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update here -- I spent a few minutes looking into the use of indexmap -- I think with a small amount of effort we could remove that dependency. I'll write up a ticket and maybe a PR for it shortly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arrow/Cargo.toml Outdated
@@ -38,9 +38,11 @@ path = "src/lib.rs"
bench = false

[dependencies]
ahash = "0.7"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new dependency, although it is the default hash function for hashbrown so...

@tustvold tustvold force-pushed the faster-string-dictionary-builder branch from 14bbd49 to c054466 Compare June 13, 2022 11:46
@tustvold tustvold force-pushed the faster-string-dictionary-builder branch from c054466 to 0bffee1 Compare June 24, 2022 11:19
@github-actions github-actions bot added the arrow-flight Changes to the arrow-flight crate label Jun 24, 2022
@tustvold tustvold force-pushed the faster-string-dictionary-builder branch from 0bffee1 to 8032029 Compare June 24, 2022 11:19
@tustvold tustvold marked this pull request as ready for review June 24, 2022 11:20
@tustvold tustvold force-pushed the faster-string-dictionary-builder branch from 8032029 to 0e893cf Compare June 24, 2022 11:20
@@ -107,6 +107,11 @@ where
&mut self.values_builder
}

/// Returns the child array builder as an immutable reference
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These APIs are effectively part of #1860

@tustvold tustvold force-pushed the faster-string-dictionary-builder branch from 0e893cf to 75f4945 Compare June 24, 2022 11:36
@tustvold tustvold removed the arrow-flight Changes to the arrow-flight crate label Jun 24, 2022
@@ -38,13 +38,15 @@ path = "src/lib.rs"
bench = false

[dependencies]
ahash = { version = "0.7", default-features = false }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding these new dependencies are my only hesitation for this PR.

Any other thoughts from maintainers or users?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hashbrown isn't a new dependency as it is used by indexmap (and the standard library). It also by default brings in ahash so that I think isn't new either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean that hashbrown is used by the standard library?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hashmap implementation in the standard library is hashbrown since Rust 1.36 - https://blog.rust-lang.org/2019/07/04/Rust-1.36.0.html#a-new-hashmapk-v-implementation

Copy link
Contributor

@Dandandan Dandandan Jun 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The standard library includes its own version though (so it won't pull in the same version / dependency as adding the dependency in a project does)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was being a bit disingenuous. I more meant from the perspective, "can I rely on this to be well-supported going forward", etc... it is pretty safe 😆

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will find time to review this PR in detail in a day or two. I think the dependency issue has been resolved (as in the new dependencies are ok, as they were transitive dependencies anyways)

@tustvold
Copy link
Contributor Author

@alamb if it helps this is a direct port of the string interner I added to IOx in https://github.com/influxdata/influxdb_iox/pull/1273

Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tustvold tustvold merged commit 903b24a into apache:master Jun 29, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice 👍

for (idx, maybe_value) in dictionary_values.iter().enumerate() {
match maybe_value {
Some(value) => {
let hash = compute_hash(&state, value.as_bytes());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own curiosity, I stared at this for a while

I would describie this as using HashMap as a HashSet of pointers (indexes) to canonical string values in StringBuilder. It then uses the low level hash brown APIs to checks if the same string has previously been inserted.

Very nice

I wondered a little why we needed to store any value for key 🤔 at all, but then I realized it is needed so we the code can find the corresponding string value

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make StringDictionaryBuilder faster
3 participants