feat: Sort kernel for `RunArray` #3695

askoa · 2023-02-11T20:15:26Z

Which issue does this PR close?

Part of #3520

Rationale for this change

See issue description

What changes are included in this PR?

Built on top of the yet-to-be-merged PR #3681

sort_run_to_indices for RunArray
sort_run for RunArray
Include sort run and sort run to indices in sort_kernel benchmark

Sorting a run array to indices is very slow if the goal is an output run array: the sorted indices are logical indices, which then have to be encoded back into a run array. The function sort_run rearranges the runs based on the sorted values, so it produces the output run array faster.
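To illustrate the difference between the two approaches, here is a minimal sketch that models a run array as two plain slices, `run_ends` and `values`. The function names `sort_runs` and `sort_runs_to_logical_indices` are hypothetical and exist only for this illustration; this is not the code in arrow-ord/src/sort.rs, and it ignores nulls, sort options, and sliced arrays.

```rust
/// Simplified stand-in for a run-end encoded array: `run_ends[i]` is the
/// exclusive logical end of run `i`, and `values[i]` is that run's value.
/// Hypothetical helper for illustration; not the arrow-rs implementation.
fn sort_runs(run_ends: &[usize], values: &[i32]) -> (Vec<usize>, Vec<i32>) {
    assert_eq!(run_ends.len(), values.len());
    // Physical indices of the runs, ordered by run value (stable sort).
    let mut order: Vec<usize> = (0..values.len()).collect();
    order.sort_by_key(|&i| values[i]);

    let mut new_ends = Vec::with_capacity(order.len());
    let mut new_values = Vec::with_capacity(order.len());
    let mut end = 0usize;
    for i in order {
        // Length of run `i` is the distance from the previous run end.
        let start = if i == 0 { 0 } else { run_ends[i - 1] };
        end += run_ends[i] - start;
        new_ends.push(end);
        new_values.push(values[i]);
    }
    (new_ends, new_values)
}

/// Sorted logical indices: one entry per logical element, so the output
/// grows with the logical length rather than with the number of runs.
fn sort_runs_to_logical_indices(run_ends: &[usize], values: &[i32]) -> Vec<usize> {
    assert_eq!(run_ends.len(), values.len());
    let mut order: Vec<usize> = (0..values.len()).collect();
    order.sort_by_key(|&i| values[i]);

    let mut out = Vec::with_capacity(run_ends.last().copied().unwrap_or(0));
    for i in order {
        let start = if i == 0 { 0 } else { run_ends[i - 1] };
        out.extend(start..run_ends[i]);
    }
    out
}
```

In this sketch, the run-to-run version touches one entry per run, while the indices version emits one entry per logical element, which would still need to be re-encoded to get a run array back.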

How much faster is `sort_run` compared to `sort_run_to_indices`?

The benchmark results below show that sort_run produces the output run array in about the same time that sort_run_to_indices takes to produce the indices.

sort primitive run to indices 2^12
                        time:   [11.023 µs 11.054 µs 11.096 µs]
Found 14 outliers among 100 measurements (14.00%)
  2 (2.00%) high mild
  12 (12.00%) high severe

sort primitive run to run 2^12
                        time:   [10.165 µs 10.267 µs 10.410 µs]
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

What's the catch?

For efficiency, sort_run only rearranges the runs and does not re-encode them. For example, an input RunArray { run_ends = [2,4,6,8], values = [1,2,1,2] } results in the output RunArray { run_ends = [2,4,6,8], values = [1,1,2,2] } and not RunArray { run_ends = [4,8], values = [1,2] }. The output of sort_run_to_indices can be used to re-encode the RunArray.
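As a rough illustration of what re-encoding means here, the sketch below merges adjacent runs that carry the same value. Note that this is a different technique from the one the description mentions (re-encoding from the output of sort_run_to_indices); the function name `merge_adjacent_runs` and the slice-based representation are assumptions made for this example and are not part of this PR.

```rust
/// Merge neighbouring runs with equal values into a single run.
/// Hypothetical helper for illustration; not part of arrow-rs.
fn merge_adjacent_runs(run_ends: &[usize], values: &[i32]) -> (Vec<usize>, Vec<i32>) {
    assert_eq!(run_ends.len(), values.len());
    let mut merged_ends: Vec<usize> = Vec::new();
    let mut merged_values: Vec<i32> = Vec::new();
    for (&end, &value) in run_ends.iter().zip(values) {
        if merged_values.last() == Some(&value) {
            // Same value as the previous run: extend that run's end.
            *merged_ends.last_mut().unwrap() = end;
        } else {
            // New value: start a new run.
            merged_ends.push(end);
            merged_values.push(value);
        }
    }
    (merged_ends, merged_values)
}
```

Applied to the sorted output from the example above (run_ends = [2,4,6,8], values = [1,1,2,2]), this sketch yields run_ends = [4,8], values = [1,2].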

Are there any user-facing changes?

Users will get a new sort function for RunArray

tustvold

I intend to review this fully tomorrow morning

arrow-array/src/run_iterator.rs

arrow-ord/src/sort.rs

askoa · 2023-02-15T16:54:10Z

I just noticed a key issue in this code. So changing this to draft.

tustvold

Looks good to me, sorry for taking so long to review. Left some minor comments

arrow-ord/src/sort.rs

ursabot · 2023-02-23T09:56:17Z

Benchmark runs are scheduled for baseline = e753dea and contender = ebe6f53. ebe6f53 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ask added 2 commits February 11, 2023 14:54

Handle sliced array in run array iterator

18d8882

sort_to_indices for RunArray

cdf4e65

github-actions bot added the arrow Changes to the arrow crate label Feb 11, 2023

ask added 5 commits February 11, 2023 16:47

better loop

07236ca

sort for run array

92efab9

improve docs

7fe2851

Merge branch 'master' of https://github.com/apache/arrow-rs into run-…

5aea274

…array-sort

some minor tweaks

309c9eb

askoa changed the title ~~WIP: feat: Sort kernel for RunArray~~ feat: Sort kernel for RunArray Feb 13, 2023

doc fix

f04e70c

askoa marked this pull request as ready for review February 13, 2023 16:28

format fix

20cfaee

tustvold reviewed Feb 14, 2023

arrow-array/src/run_iterator.rs (outdated)

arrow-ord/src/sort.rs

askoa marked this pull request as draft February 15, 2023 16:54

ask added 2 commits February 15, 2023 12:12

fix sort run to return all logical indices

8b26cdf

Merge branch 'master' of https://github.com/apache/arrow-rs into run-…

0b13898

…array-sort

askoa marked this pull request as ready for review February 15, 2023 17:18

pr comment

fc645d0

tustvold approved these changes Feb 21, 2023

arrow-ord/src/sort.rs (outdated)

arrow-ord/src/sort.rs (outdated)

arrow-ord/src/sort.rs

askoa marked this pull request as draft February 21, 2023 15:32

ask added 2 commits February 21, 2023 21:00

Merge branch 'master' of https://github.com/apache/arrow-rs into run-…

381e4ef

…array-sort

rename test function, pull sort run logic into a separate function

e0cf496

askoa marked this pull request as ready for review February 22, 2023 13:22

tustvold merged commit ebe6f53 into apache:master Feb 23, 2023
