Split out arrow-string (#2594) #3295

tustvold · 2022-12-08T14:19:31Z

Which issue does this PR close?

Part of #2594

Rationale for this change

Splits out the string kernels into a crate called arrow-string. Whilst these are primarily concerned with UTF-8 strings, some also handle binary arrays. I therefore went with arrow-string instead of arrow-str, as str is very specifically just UTF-8 in Rust, whereas the general concept of a string extends to binary strings. Or something like that...
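To illustrate the naming rationale, here is a minimal sketch of the distinction between Rust's `str` (always valid UTF-8) and a general byte string. The helper names `byte_len` and `char_count` are hypothetical and are not part of arrow-string; they only show that a length-style operation makes sense for arbitrary bytes, while a character-level view requires valid UTF-8.

```rust
/// Length in bytes of any byte string, whether or not it is valid UTF-8.
/// (Hypothetical helper, for illustration only.)
pub fn byte_len(value: &[u8]) -> usize {
    value.len()
}

/// Number of Unicode scalar values, or `None` when the bytes are not
/// valid UTF-8 and therefore cannot be viewed as a `str`.
/// (Hypothetical helper, for illustration only.)
pub fn char_count(value: &[u8]) -> Option<usize> {
    std::str::from_utf8(value).ok().map(|s| s.chars().count())
}
```

A byte-oriented operation works on both kinds of input, whereas the `str` view is only available for the UTF-8 case.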

What changes are included in this PR?

Are there any user-facing changes?

tustvold · 2022-12-08T14:20:32Z

arrow-string/src/regexp.rs

 use std::sync::Arc;

-use regex::Regex;
+/// Perform SQL `array ~ regex_array` operation on [`StringArray`] / [`LargeStringArray`].


These functions are moved from comparison.rs

tustvold · 2022-12-08T14:21:47Z

arrow/src/compute/kernels/comparison.rs

@@ -23,1227 +23,75 @@
 //! [here](https://doc.rust-lang.org/stable/core/arch/) for more information.
 //!

-use crate::array::*;


The like and regex kernels are moved into arrow-string. The remaining kernels will be moved into an arrow-ord crate in a follow-up PR.

What will be left in arrow-compute 🤔

Nothing 🎉

Edit: well, nothing once I also split out the arithmetic kernels. The end goal is for the top-level arrow crate to be just a re-export of other crates.
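As a rough sketch of what "just a re-export" means, the following uses modules as stand-ins for separate crates so that it fits in one file. The `length` function here is a simplified, hypothetical signature, and the exact re-export paths in the real crates may differ.

```rust
// Stand-in for the `arrow-string` crate, written as a module so the
// sketch fits in a single file.
mod arrow_string {
    /// Simplified, hypothetical `length` kernel: byte length of each value.
    pub fn length(values: &[&str]) -> Vec<usize> {
        values.iter().map(|v| v.len()).collect()
    }
}

// Stand-in for the top-level `arrow` crate: it contains no kernel code of
// its own, only a re-export of the implementation that lives elsewhere.
mod arrow {
    pub mod compute {
        pub use super::super::arrow_string::length;
    }
}
```

Callers that keep using the `arrow::compute` path reach the same function as callers that depend on the smaller crate directly.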

alamb

Looks like a fine improvement to me. Thank you @tustvold

I recommend running benchmarks prior to merging this, just to make sure there aren't any cross-crate inlining issues.
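For context on the inlining concern, a minimal sketch follows. It assumes the usual rustc behaviour: generic functions are instantiated in the crate that calls them, so they remain inlining candidates there, while small non-generic functions may need an `#[inline]` hint (or LTO) to be inlined across a crate boundary. The function names are hypothetical and not taken from arrow-string.

```rust
/// Small non-generic helper. The `#[inline]` attribute is a hint that makes
/// the body available to downstream crates as an inlining candidate.
/// (Hypothetical helper, for illustration only.)
#[inline]
pub fn has_prefix(haystack: &str, needle: &str) -> bool {
    haystack.as_bytes().starts_with(needle.as_bytes())
}

/// Generic over the predicate, so it is instantiated in the calling crate
/// and the closure body can be inlined there without an attribute.
pub fn count_matches<F: Fn(&str) -> bool>(values: &[&str], predicate: F) -> usize {
    values.iter().filter(|value| predicate(**value)).count()
}
```

Whether a given move between crates actually changes performance depends on the compiler version and build settings, which is why benchmarking before and after is the reliable check.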

alamb · 2022-12-08T14:34:39Z

arrow-string/src/length.rs

@@ -245,11 +245,10 @@ mod tests {
    macro_rules! length_binary_helper {
        ($offset_ty: ty, $result_ty: ty, $kernel: ident, $value: expr, $expected: expr) => {{
            let array = GenericBinaryArray::<$offset_ty>::from($value);
-            let result = $kernel(&array)?;
+            let result = $kernel(&array).unwrap();


Is this a drive-by cleanup to use unwrap rather than returning an error in the tests?

Yes, it means you get an actual backtrace as opposed to some random error from somewhere 😆
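A minimal sketch of the two styles being compared, using a hypothetical `length_kernel` and `KernelError` rather than the real arrow-string types. With `?`, a failure is returned from the check as an `Err` value; with `unwrap`, the failure panics at the call site, so the panic message (and a backtrace, when `RUST_BACKTRACE=1` is set) points at the failing line.

```rust
/// Hypothetical error type standing in for a kernel error.
#[derive(Debug, PartialEq)]
pub struct KernelError(pub String);

/// Hypothetical kernel: byte length of each value, failing on empty input.
pub fn length_kernel(values: &[&str]) -> Result<Vec<usize>, KernelError> {
    if values.is_empty() {
        return Err(KernelError("empty input".to_string()));
    }
    Ok(values.iter().map(|v| v.len()).collect())
}

/// `?` style: an error propagates out as an `Err`, with no record of
/// which line inside this function produced it.
pub fn check_with_question_mark() -> Result<(), KernelError> {
    let result = length_kernel(&["hello", "hi"])?;
    assert_eq!(result, vec![5, 2]);
    Ok(())
}

/// `unwrap` style: an error panics right here, so the reported location
/// is the line of the failing call.
pub fn check_with_unwrap() {
    let result = length_kernel(&["hello", "hi"]).unwrap();
    assert_eq!(result, vec![5, 2]);
}
```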

alamb · 2022-12-08T14:36:55Z

dev/release/README.md

@@ -258,6 +258,7 @@ Rust Arrow Crates:
 (cd arrow-array && cargo publish)
 (cd arrow-select && cargo publish)
 (cd arrow-cast && cargo publish)
+(cd arrow-string && cargo publish)


thank you for this

tustvold · 2022-12-08T15:07:16Z

length                  time:   [595.92 ns 596.63 ns 597.38 ns]
                        change: [-1.3976% -1.2331% -0.9947%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 21 outliers among 100 measurements (21.00%)
  4 (4.00%) low severe
  9 (9.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe
bit_length              time:   [617.99 ns 618.41 ns 618.90 ns]
                        change: [-0.0074% +0.2794% +0.5081%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  12 (12.00%) high mild
like_utf8 scalar equals time:   [274.03 µs 274.09 µs 274.16 µs]
                        change: [-7.1092% -5.8177% -4.5008%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

like_utf8 scalar contains
                        time:   [2.1757 ms 2.1766 ms 2.1776 ms]
                        change: [+0.3428% +0.4064% +0.4694%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

like_utf8 scalar ends with
                        time:   [265.21 µs 265.29 µs 265.38 µs]
                        change: [-6.4885% -6.2843% -6.1589%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

like_utf8 scalar starts with
                        time:   [282.40 µs 282.46 µs 282.54 µs]
                        change: [-3.4855% -3.2805% -3.1530%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

Benchmarking like_utf8 scalar complex: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60.
like_utf8 scalar complex
                        time:   [1.2038 ms 1.2053 ms 1.2067 ms]
                        change: [-4.5351% -4.2369% -3.9728%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

nlike_utf8 scalar equals
                        time:   [267.63 µs 267.77 µs 267.92 µs]
                        change: [-3.1867% -3.0555% -2.8708%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

nlike_utf8 scalar contains
                        time:   [2.1864 ms 2.1874 ms 2.1883 ms]
                        change: [+0.9321% +0.9941% +1.0521%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

nlike_utf8 scalar ends with
                        time:   [270.93 µs 270.98 µs 271.05 µs]
                        change: [-0.5265% -0.3000% -0.1667%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

nlike_utf8 scalar starts with
                        time:   [286.47 µs 286.89 µs 287.47 µs]
                        change: [-0.3103% +0.0892% +0.5751%] (p = 0.71 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  13 (13.00%) high severe

Benchmarking nlike_utf8 scalar complex: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.0s, enable flat sampling, or reduce sample count to 60.
nlike_utf8 scalar complex
                        time:   [1.1895 ms 1.1911 ms 1.1927 ms]
                        change: [-3.5829% -3.3030% -3.0941%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

ilike_utf8 scalar equals
                        time:   [2.3545 ms 2.3553 ms 2.3561 ms]
                        change: [+0.4019% +0.4396% +0.4757%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

ilike_utf8 scalar contains
                        time:   [4.3433 ms 4.3449 ms 4.3465 ms]
                        change: [+3.5738% +3.6201% +3.6710%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

ilike_utf8 scalar ends with
                        time:   [2.4242 ms 2.4251 ms 2.4260 ms]
                        change: [+3.2608% +3.3098% +3.3571%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

ilike_utf8 scalar starts with
                        time:   [2.3726 ms 2.3733 ms 2.3741 ms]
                        change: [+1.0692% +1.1071% +1.1443%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

Benchmarking ilike_utf8 scalar complex: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.4s, enable flat sampling, or reduce sample count to 50.
ilike_utf8 scalar complex
                        time:   [1.8580 ms 1.8588 ms 1.8599 ms]
                        change: [-1.0504% -0.9019% -0.7959%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

nilike_utf8 scalar equals
                        time:   [2.4189 ms 2.4201 ms 2.4219 ms]
                        change: [+0.2827% +0.3427% +0.4143%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

nilike_utf8 scalar contains
                        time:   [4.3295 ms 4.3311 ms 4.3328 ms]
                        change: [+3.1880% +3.2443% +3.2990%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

nilike_utf8 scalar ends with
                        time:   [2.4293 ms 2.4299 ms 2.4307 ms]
                        change: [+1.6192% +1.6579% +1.6989%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

nilike_utf8 scalar starts with
                        time:   [2.3875 ms 2.3880 ms 2.3885 ms]
                        change: [-1.0378% -1.0087% -0.9765%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

Benchmarking nilike_utf8 scalar complex: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.3s, enable flat sampling, or reduce sample count to 50.
nilike_utf8 scalar complex
                        time:   [1.8380 ms 1.8390 ms 1.8401 ms]
                        change: [-2.5813% -2.3894% -2.2066%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

Benchmarking regexp_matches_utf8 scalar starts with: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, enable flat sampling, or reduce sample count to 60.
regexp_matches_utf8 scalar starts with
                        time:   [1.2614 ms 1.2618 ms 1.2622 ms]
                        change: [+1.2463% +1.4495% +1.6022%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

Benchmarking regexp_matches_utf8 scalar ends with: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.2s, enable flat sampling, or reduce sample count to 60.
regexp_matches_utf8 scalar ends with
                        time:   [1.2152 ms 1.2161 ms 1.2173 ms]
                        change: [-0.3264% -0.0446% +0.2490%] (p = 0.80 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe

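So no significant changes: Criterion flags a few small improvements and regressions, but every point estimate is within about 7% of the baseline.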

Split out arrow-string (apache#2594)

1b0ce96

github-actions bot added the "arrow" (Changes to the arrow crate) label on Dec 8, 2022

tustvold added 2 commits December 8, 2022 14:24

Doc

2ea47db

Clippy

0777e92

crepererum approved these changes Dec 8, 2022

alamb reviewed Dec 8, 2022

alamb approved these changes Dec 8, 2022

tustvold merged commit 96c7c9d into apache:master Dec 8, 2022

This was referenced Dec 8, 2022

Split out arrow-ord (#2594) #3299

Merged

Reduce Duplication in Like Kernels #3296

Closed
