Make StringDictionaryBuilder faster #1851

Closed
tustvold opened this issue Jun 11, 2022 · 2 comments · Fixed by #1861

Labels: arrow (Changes to the arrow crate), enhancement (Any new improvement worthy of an entry in the changelog)

Comments

@tustvold
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

A while back I implemented an optimized string dictionary builder for IOx. This contains two major tricks to provide better performance:

  • Use ahash instead of SipHash - this alone provides a 40% speedup
  • Use hashbrown's raw_entry_mut to avoid duplicating string values in the hashmap
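As a rough illustration of the first trick, here is where a non-default hasher plugs into a map used for dictionary encoding. This is only a sketch: ahash is not part of the standard library, so a hand-written FNV-1a hasher stands in for it, and `Fnv1a`, `FastMap` and `encode` are hypothetical names, not anything from arrow or IOx. It makes no claim about the speedups quoted above, and it still copies each string into the map, which is what the second trick avoids.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Hypothetical stand-in for a fast non-cryptographic hasher (64-bit FNV-1a).
/// The issue uses ahash; this only shows where such a hasher plugs in.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// A std HashMap that uses the custom hasher instead of the default SipHash.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

/// Assigns each distinct string a dictionary key in first-seen order and
/// returns (keys per input value, distinct values).
fn encode(values: &[&str]) -> (Vec<usize>, Vec<String>) {
    let mut dedup: FastMap<String, usize> = FastMap::default();
    let mut dictionary = Vec::new();
    let mut keys = Vec::with_capacity(values.len());
    for &v in values {
        let key = match dedup.get(v) {
            Some(&k) => k,
            None => {
                let k = dictionary.len();
                dictionary.push(v.to_string());
                // Note: the string is stored a second time as the map key.
                dedup.insert(v.to_string(), k);
                k
            }
        };
        keys.push(key);
    }
    (keys, dictionary)
}
```

Calling `encode(&["a", "b", "a"])` yields keys `[0, 1, 0]` and the dictionary `["a", "b"]`; only the `FastMap` alias would need to change to try a different hasher.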

I have an implementation of this for arrow that needs a bit more polish, but gives a 60% speedup over the current implementation in arrow. Unfortunately, it depends on #1850, as it needs to be able to read the string data from an in-progress StringBuilder.

Describe the solution you'd like

Implement #1850 and then add this functionality

Describe alternatives you've considered

We could choose not to do this.

@tustvold added the enhancement label Jun 11, 2022
@tustvold self-assigned this Jun 11, 2022
@jhorstmann
Contributor

I also have an implementation that I could contribute; it would need some changes for generic key types and maybe for null values. It uses hashbrown directly and stores a (start_index, end_index) tuple in the hashmap, pointing into a backing MutableBuffer, while also keeping track of the offsets in a MutableBuffer. It is probably very similar to your impl, but uses mutable buffers directly.
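For readers following along, the span-based layout described here could look roughly like the sketch below. It is not either contributor's actual code: it uses only the standard library, with `Vec<u8>` and `Vec<usize>` standing in for MutableBuffer and a hash-to-bucket `HashMap` standing in for direct use of hashbrown. `SpanDictionary`, `append`, `value` and `hash_bytes` are hypothetical names, keys are plain `usize`, and nulls are not handled.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical builder: each distinct value is stored once in `values`, and
/// the lookup table holds only spans into that buffer, never string copies.
struct SpanDictionary {
    values: Vec<u8>,     // stand-in for a MutableBuffer of string bytes
    offsets: Vec<usize>, // value k occupies offsets[k]..offsets[k + 1]
    table: HashMap<u64, Vec<(usize, usize, usize)>>, // hash -> (start, end, key)
    keys: Vec<usize>,    // one dictionary key per appended value
}

/// Hashes raw bytes with the std default hasher (a faster hasher could be swapped in).
fn hash_bytes(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

impl SpanDictionary {
    fn new() -> Self {
        SpanDictionary {
            values: Vec::new(),
            offsets: vec![0],
            table: HashMap::new(),
            keys: Vec::new(),
        }
    }

    /// Appends a value and returns its dictionary key, reusing the existing
    /// key when the same bytes are already present in `values`.
    fn append(&mut self, value: &str) -> usize {
        let bytes = value.as_bytes();
        let hash = hash_bytes(bytes);
        let values = &self.values;
        let bucket = self.table.entry(hash).or_default();
        // Compare against the backing buffer; the bucket only holds spans.
        let existing = bucket
            .iter()
            .find(|&&(start, end, _)| &values[start..end] == bytes)
            .map(|&(_, _, key)| key);
        let key = match existing {
            Some(key) => key,
            None => {
                let start = self.values.len();
                self.values.extend_from_slice(bytes);
                let end = self.values.len();
                self.offsets.push(end);
                let key = self.offsets.len() - 2;
                bucket.push((start, end, key));
                key
            }
        };
        self.keys.push(key);
        key
    }

    /// Reads a dictionary value back out of the backing buffer.
    fn value(&self, key: usize) -> &str {
        let span = self.offsets[key]..self.offsets[key + 1];
        std::str::from_utf8(&self.values[span]).expect("only valid UTF-8 is appended")
    }
}
```

Appending "foo", "bar", "foo" leaves `values` holding the bytes of "foobar", `offsets` as `[0, 3, 6]`, and `keys` as `[0, 1, 0]`. The bucket-of-spans map only emulates what a raw-entry style API gives directly, namely hashing and comparing against the buffer without owning a second copy of each string.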

@tustvold
Contributor Author

Exciting, let's see how they compare 😄

tustvold added a commit to tustvold/arrow-rs that referenced this issue Jun 13, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Jun 24, 2022
@alamb changed the title from "Faster StringDictionaryBuilder" to "Make StringDictionaryBuilder faster" Jul 7, 2022
@alamb added the arrow label Jul 7, 2022
3 participants