Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster multi column sort. #5443

Closed
ghuls opened this issue Nov 7, 2022 · 4 comments
Closed

Faster multi column sort. #5443

ghuls opened this issue Nov 7, 2022 · 4 comments
Labels
enhancement New feature or an improvement of an existing feature performance Performance issues or improvements

Comments

@ghuls
Copy link
Collaborator

ghuls commented Nov 7, 2022

Problem description

Faster multi column sorting by sorting encoded forms of each column.
https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-2/

https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/

apache/arrow-rs#2929

@ghuls ghuls added the enhancement New feature or an improvement of an existing feature label Nov 7, 2022
@ritchie46
Copy link
Member

Yeah, I read the duckdb post a while ago already. I think we first need to do some comparisons. How are we doing against duckdb and arrow on this?

@ghuls
Copy link
Collaborator Author

ghuls commented Nov 8, 2022

Not sure how fast we are compared with duckdb and arrow-rs, but it seems that compared with their old implementation, they get a big speedup.

Using this format for lexicographic sorting is more than 3x faster than the comparator based approach, with the benefits especially pronounced for strings, dictionaries and sorts with large numbers of columns.

We have also already used it to more than double the performance of sort preserving merge in the DataFusion project, and expect similar or greater performance uplift as we apply it to sort, grouping, join, and window function operators as well.

@ritchie46
Copy link
Member

I think we still first need to check before we can conclude that ours would speed up from this. We have a different sort implementation.

@zundertj zundertj added the performance Performance issues or improvements label Nov 12, 2022
@ritchie46
Copy link
Member

We have proper row encoding now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature performance Performance issues or improvements
Projects
None yet
Development

No branches or pull requests

3 participants