Split out arrow-string (#2594) #3295

Merged
merged 3 commits into from Dec 8, 2022

Contributor

@tustvold tustvold commented Dec 8, 2022

Which issue does this PR close?

Part of #2594

Rationale for this change

Splits out the string kernels into a crate called arrow-string. Whilst these are primarily concerned with UTF-8 strings, some also handle binary arrays. I therefore went with arrow-string instead of arrow-str, as str is very specifically just UTF-8 in Rust, whereas the general concept of a string extends to binary strings. Or something like that...
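
The naming argument can be made concrete with a small sketch. It assumes nothing about the arrow-string API: `lengths` is a hypothetical helper, written in plain Rust, that shows one length routine serving both `str` (always valid UTF-8) and raw byte strings (which need not be).

```rust
/// Hypothetical helper, not part of arrow-string: the byte length of each
/// optional value, for anything that can be viewed as a byte slice.
/// Both `str` and `[u8]` implement `AsRef<[u8]>`, so one routine covers
/// UTF-8 strings and binary strings alike.
pub fn lengths<T: AsRef<[u8]> + ?Sized>(values: &[Option<&T>]) -> Vec<Option<usize>> {
    values
        .iter()
        .map(|v| v.map(|x| x.as_ref().len()))
        .collect()
}

/// Returns true when the bytes are valid UTF-8 and could be viewed as a `str`.
pub fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

Called with `Option<&str>` values, the helper measures UTF-8 strings in bytes; called with `Option<&[u8]>` values, it also accepts byte sequences such as `[0xff, 0xfe]` that could never be a `str`.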

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 8, 2022
use std::sync::Arc;

use regex::Regex;
/// Perform SQL `array ~ regex_array` operation on [`StringArray`] / [`LargeStringArray`].

Contributor Author

These functions are moved from comparison.rs

@@ -23,1227 +23,75 @@
//! [here](https://doc.rust-lang.org/stable/core/arch/) for more information.
//!

use crate::array::*;

Contributor Author

The like and regex kernels are moved into arrow-string. The remaining kernels will be moved into an arrow-ord crate in a follow-up PR.

Contributor

What will be left in arrow-compute 🤔

Contributor Author

@tustvold tustvold Dec 8, 2022

Nothing 🎉

Edit: well, nothing once I also split out the arithmetic kernels. The end goal is for the top-level arrow crate to be just a re-export of the other crates.
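
A minimal sketch of what such a re-export facade looks like, assuming a single-file layout in which modules stand in for separate crates. The module and function names are hypothetical and are not the actual arrow crate layout.

```rust
// Stand-in for a leaf crate such as arrow-string. It is modelled as a module
// so the sketch compiles as one file; the function is hypothetical.
pub mod string_kernels {
    /// Byte length of a string value.
    pub fn length(value: &str) -> usize {
        value.len()
    }
}

// Stand-in for the top-level facade crate: it defines no kernels of its own
// and only re-exports what the leaf crate provides.
pub mod facade {
    pub mod compute {
        pub use super::super::string_kernels::length;
    }
}
```

Callers that import `facade::compute::length` keep working however the implementation is split up behind the facade, which is what lets the kernels move between crates without breaking downstream paths.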

Contributor

@alamb alamb left a comment

Looks like a fine improvement to me. Thank you @tustvold

I recommend running benchmarks prior to merging this, just to make sure there aren't any cross-crate inlining issues.
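
As background on this concern: a non-generic function defined in one crate is generally only a candidate for inlining into another crate when it is marked `#[inline]` (or when LTO is enabled), whereas generic functions are instantiated in the calling crate. The following is a minimal sketch of that distinction with hypothetical function names; it is not taken from the kernels in this PR.

```rust
/// Small non-generic helper of the kind a kernel crate might expose.
/// `#[inline]` makes its body available to downstream crates, so the
/// optimizer can inline it at call sites outside the defining crate.
#[inline]
pub fn is_ascii_lower(byte: u8) -> bool {
    byte.is_ascii_lowercase()
}

/// Generic over the predicate, so it is monomorphized in whichever crate
/// calls it and can be optimized together with the caller's closure.
pub fn count_matching<F: Fn(u8) -> bool>(bytes: &[u8], pred: F) -> usize {
    bytes.iter().filter(|b| pred(**b)).count()
}
```

Moving a hot, non-generic helper into a new crate without `#[inline]` is the kind of change that can show up as a regression, which is why benchmarking before and after the split is a reasonable check.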

@@ -245,11 +245,10 @@ mod tests {
macro_rules! length_binary_helper {
($offset_ty: ty, $result_ty: ty, $kernel: ident, $value: expr, $expected: expr) => {{
let array = GenericBinaryArray::<$offset_ty>::from($value);
-let result = $kernel(&array)?;
+let result = $kernel(&array).unwrap();

Contributor

Is this a drive-by cleanup to use unwrap rather than returning an Error in the tests?

Contributor Author

@tustvold tustvold Dec 8, 2022

Yes, it means you get an actual backtrace as opposed to some random error from somewhere 😆
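
The difference between the two test styles can be sketched as follows. `parse_len` is a hypothetical stand-in for a fallible kernel, not code from this PR. A test that returns `Err` fails with only the error value printed, while `unwrap` panics at the call site, so the failure message carries the file and line (and a backtrace when `RUST_BACKTRACE=1` is set).

```rust
use std::num::ParseIntError;

/// Hypothetical stand-in for a fallible kernel.
pub fn parse_len(s: &str) -> Result<usize, ParseIntError> {
    s.trim().parse::<usize>()
}

#[cfg(test)]
mod tests {
    use super::*;

    // Style 1: propagate with `?`. On failure the harness prints the error
    // value, but not where in the test it originated.
    #[test]
    fn with_question_mark() -> Result<(), ParseIntError> {
        let n = parse_len("42")?;
        assert_eq!(n, 42);
        Ok(())
    }

    // Style 2: `unwrap`. On failure the panic message names this call site.
    #[test]
    fn with_unwrap() {
        let n = parse_len("42").unwrap();
        assert_eq!(n, 42);
    }
}
```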

@@ -258,6 +258,7 @@ Rust Arrow Crates:
(cd arrow-array && cargo publish)
(cd arrow-select && cargo publish)
(cd arrow-cast && cargo publish)
+(cd arrow-string && cargo publish)

Contributor

thank you for this

Contributor Author

tustvold commented Dec 8, 2022

length                  time:   [595.92 ns 596.63 ns 597.38 ns]
                        change: [-1.3976% -1.2331% -0.9947%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 21 outliers among 100 measurements (21.00%)
  4 (4.00%) low severe
  9 (9.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe
bit_length              time:   [617.99 ns 618.41 ns 618.90 ns]
                        change: [-0.0074% +0.2794% +0.5081%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  12 (12.00%) high mild
like_utf8 scalar equals time:   [274.03 µs 274.09 µs 274.16 µs]
                        change: [-7.1092% -5.8177% -4.5008%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

like_utf8 scalar contains
                        time:   [2.1757 ms 2.1766 ms 2.1776 ms]
                        change: [+0.3428% +0.4064% +0.4694%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

like_utf8 scalar ends with
                        time:   [265.21 µs 265.29 µs 265.38 µs]
                        change: [-6.4885% -6.2843% -6.1589%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

like_utf8 scalar starts with
                        time:   [282.40 µs 282.46 µs 282.54 µs]
                        change: [-3.4855% -3.2805% -3.1530%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

Benchmarking like_utf8 scalar complex: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60.
like_utf8 scalar complex
                        time:   [1.2038 ms 1.2053 ms 1.2067 ms]
                        change: [-4.5351% -4.2369% -3.9728%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

nlike_utf8 scalar equals
                        time:   [267.63 µs 267.77 µs 267.92 µs]
                        change: [-3.1867% -3.0555% -2.8708%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

nlike_utf8 scalar contains
                        time:   [2.1864 ms 2.1874 ms 2.1883 ms]
                        change: [+0.9321% +0.9941% +1.0521%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

nlike_utf8 scalar ends with
                        time:   [270.93 µs 270.98 µs 271.05 µs]
                        change: [-0.5265% -0.3000% -0.1667%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

nlike_utf8 scalar starts with
                        time:   [286.47 µs 286.89 µs 287.47 µs]
                        change: [-0.3103% +0.0892% +0.5751%] (p = 0.71 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  13 (13.00%) high severe

Benchmarking nlike_utf8 scalar complex: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.0s, enable flat sampling, or reduce sample count to 60.
nlike_utf8 scalar complex
                        time:   [1.1895 ms 1.1911 ms 1.1927 ms]
                        change: [-3.5829% -3.3030% -3.0941%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

ilike_utf8 scalar equals
                        time:   [2.3545 ms 2.3553 ms 2.3561 ms]
                        change: [+0.4019% +0.4396% +0.4757%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

ilike_utf8 scalar contains
                        time:   [4.3433 ms 4.3449 ms 4.3465 ms]
                        change: [+3.5738% +3.6201% +3.6710%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

ilike_utf8 scalar ends with
                        time:   [2.4242 ms 2.4251 ms 2.4260 ms]
                        change: [+3.2608% +3.3098% +3.3571%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

ilike_utf8 scalar starts with
                        time:   [2.3726 ms 2.3733 ms 2.3741 ms]
                        change: [+1.0692% +1.1071% +1.1443%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

Benchmarking ilike_utf8 scalar complex: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.4s, enable flat sampling, or reduce sample count to 50.
ilike_utf8 scalar complex
                        time:   [1.8580 ms 1.8588 ms 1.8599 ms]
                        change: [-1.0504% -0.9019% -0.7959%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

nilike_utf8 scalar equals
                        time:   [2.4189 ms 2.4201 ms 2.4219 ms]
                        change: [+0.2827% +0.3427% +0.4143%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

nilike_utf8 scalar contains
                        time:   [4.3295 ms 4.3311 ms 4.3328 ms]
                        change: [+3.1880% +3.2443% +3.2990%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

nilike_utf8 scalar ends with
                        time:   [2.4293 ms 2.4299 ms 2.4307 ms]
                        change: [+1.6192% +1.6579% +1.6989%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

nilike_utf8 scalar starts with
                        time:   [2.3875 ms 2.3880 ms 2.3885 ms]
                        change: [-1.0378% -1.0087% -0.9765%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

Benchmarking nilike_utf8 scalar complex: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.3s, enable flat sampling, or reduce sample count to 50.
nilike_utf8 scalar complex
                        time:   [1.8380 ms 1.8390 ms 1.8401 ms]
                        change: [-2.5813% -2.3894% -2.2066%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

Benchmarking regexp_matches_utf8 scalar starts with: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, enable flat sampling, or reduce sample count to 60.
regexp_matches_utf8 scalar starts with
                        time:   [1.2614 ms 1.2618 ms 1.2622 ms]
                        change: [+1.2463% +1.4495% +1.6022%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

Benchmarking regexp_matches_utf8 scalar ends with: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.2s, enable flat sampling, or reduce sample count to 60.
regexp_matches_utf8 scalar ends with
                        time:   [1.2152 ms 1.2161 ms 1.2173 ms]
                        change: [-0.3264% -0.0446% +0.2490%] (p = 0.80 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe

So no changes outside the noise threshold
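
For reading the output above, the four verdict messages are consistent with a simple rule: a change is reported only when the p-value is below the significance level, and it counts as an improvement or regression only when the whole confidence interval lies beyond the noise threshold. The sketch below reproduces that rule under the assumption that the benchmarks ran with Criterion's defaults (significance level 0.05, noise threshold 1%); the type and function names are hypothetical, not Criterion's API.

```rust
/// Verdicts matching the four messages in the benchmark output.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    NoChange,
    WithinNoise,
    Improved,
    Regressed,
}

/// Classify a relative change from the bounds of its confidence interval
/// (as fractions, so -1.2% is -0.012) and the p-value of the comparison.
pub fn classify(lower: f64, upper: f64, p_value: f64) -> Verdict {
    const SIGNIFICANCE_LEVEL: f64 = 0.05; // assumed default
    const NOISE_THRESHOLD: f64 = 0.01; // assumed default (1%)

    if p_value >= SIGNIFICANCE_LEVEL {
        return Verdict::NoChange;
    }
    if upper < -NOISE_THRESHOLD {
        Verdict::Improved
    } else if lower > NOISE_THRESHOLD {
        Verdict::Regressed
    } else {
        Verdict::WithinNoise
    }
}
```

For example, the `length` interval of [-1.3976%, -0.9947%] straddles -1% and is therefore reported as within the noise threshold, while the `ilike_utf8 scalar contains` interval of [+3.5738%, +3.6710%] lies entirely above +1% and is reported as a regression.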

@tustvold tustvold merged commit 96c7c9d into apache:master Dec 8, 2022