Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add nilike support in comparison #1846

Merged
merged 1 commit into from Jun 15, 2022

Conversation

MazterQyou
Copy link
Contributor

@MazterQyou MazterQyou commented Jun 11, 2022

Which issue does this PR close?

Closes #1845.

Rationale for this change

comparison kernel has functions like and nlike, which perform LIKE and NOT LIKE operations. There was also an addition of ilike back in late 2021, which allows to performe ILIKE (case-insensitive LIKE) operations. However, nilike, which is what nlike to like is, respectively (NOT ILIKE), has been missing. It is a useful addition to perform similar NOT operation with already supported ILIKE.

What changes are included in this PR?

This PR adds nilike functions to comparison kernel that perform SQL left NOT ILIKE right operation on arrays and scalars, as well as related tests and benches.

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 11, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jun 11, 2022

Codecov Report

Merging #1846 (45b524e) into master (c08e532) will increase coverage by 0.01%.
The diff coverage is 95.83%.

@@            Coverage Diff             @@
##           master    #1846      +/-   ##
==========================================
+ Coverage   83.49%   83.50%   +0.01%     
==========================================
  Files         201      201              
  Lines       56903    56951      +48     
==========================================
+ Hits        47511    47559      +48     
  Misses       9392     9392              
Impacted Files Coverage Δ
arrow/src/compute/kernels/comparison.rs 92.55% <95.83%> (+0.09%) ⬆️
parquet_derive/src/parquet_field.rs 66.21% <0.00%> (+0.45%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c08e532...45b524e. Read the comment docs.

@MazterQyou MazterQyou force-pushed the upstream-patch/comparison-nilike branch from 49b5eae to 45b524e Compare June 11, 2022 11:22
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good step forward, although I left some questions.

result.append(
!left
.value(i)
.to_uppercase()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this allocates a new string, I wonder how much faster this fast path actually is? Do you have some benchmark results?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have simply re-used this code from the already existing ilike comparator function, so I have no idea regarding that. I agree, it does make sense to do some benchmarks. Although I feel ends_with and starts_with sound simple enough to be much faster than a regexp match implicitly.

@@ -3984,6 +4067,60 @@ mod tests {
vec![false, true, false, false]
);

test_utf8!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a test of non-ASCII characters, e.g. Â â or something

/// [`LargeStringArray`] and a scalar.
///
/// See the documentation on [`like_utf8`] for more details.
pub fn nilike_utf8_scalar<OffsetSize: OffsetSizeTrait>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurs to me that we now have four methods ilike_utf8_scalar, nlike_utf8_scalar, like_utf8_scalar, nilike_utf8_scalar where the only difference appears to be an optional post-conversion on left.value(i). Perhaps we could refactor out the common logic into a function taking F: Fn(&str) -> Cow<'_, str> or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had this in mind as well but decided to open the PR nevertheless considering there are already three similar functions (which means whoever accepted them in was okay with that). I'll refactor those and see how that goes.

@tustvold
Copy link
Contributor

tustvold commented Jun 15, 2022

Would you like me to get this PR in and we can maybe follow up with feedback in another PR? I think as is this is a good step forward?

Edit: merging master should fix the MIRI failures

@MazterQyou
Copy link
Contributor Author

I am up for making the changes you suggested, although if you think it's good to have this faster and improve on later, I can rebase now.

@alamb
Copy link
Contributor

alamb commented Jun 15, 2022

Merging this one in to keep the queue down -- Thank you @MazterQyou and @tustvold . I would love to see a refactor PR that reduced the code duplication

@alamb alamb merged commit 9860aa7 into apache:master Jun 15, 2022
@MazterQyou MazterQyou deleted the upstream-patch/comparison-nilike branch June 16, 2022 07:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add nilike support in comparison
4 participants