Use BitChunks in equal_bits #2194

Merged
merged 7 commits into from Jul 28, 2022

tustvold commented Jul 27, 2022

Which issue does this PR close?

Closes #2186
Part of #2188

Rationale for this change

equal_512               time:   [33.149 ns 33.153 ns 33.157 ns]                       
                        change: [-2.0324% -1.9669% -1.9098%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

equal_nulls_512         time:   [948.52 ns 948.88 ns 949.33 ns]                             
                        change: [-45.807% -45.682% -45.573%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) high mild
  12 (12.00%) high severe

equal_string_512        time:   [46.418 ns 46.430 ns 46.442 ns]                              
                        change: [-1.5368% -1.4796% -1.4286%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild

equal_string_nulls_512  time:   [4.1180 us 4.1189 us 4.1199 us]                                    
                        change: [-19.389% -19.355% -19.320%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

equal_bool_512          time:   [15.277 ns 15.286 ns 15.295 ns]                            
                        change: [-23.554% -23.428% -23.320%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild

equal_bool_513          time:   [23.229 ns 23.249 ns 23.268 ns]                            
                        change: [+4.4352% +4.6082% +4.7715%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild
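
To put the relative changes above in absolute terms, here is a small sketch that divides each reported time by (1 + change) to recover the baseline. The figures are the middle (point-estimate) values copied from the output; `absolute_delta_ns` is a hypothetical helper name, not part of Criterion.

```rust
/// Recover the baseline time and the absolute difference from a reported
/// time and its relative change (both taken from the benchmark output).
/// Hypothetical helper for illustration only.
fn absolute_delta_ns(new_ns: f64, change_pct: f64) -> (f64, f64) {
    let old_ns = new_ns / (1.0 + change_pct / 100.0);
    (old_ns, new_ns - old_ns)
}

fn main() {
    // (benchmark, new time in ns, change in %), point estimates from the output
    let rows = [
        ("equal_nulls_512", 948.88, -45.682),
        ("equal_string_nulls_512", 4118.9, -19.355),
        ("equal_bool_513", 23.249, 4.6082),
    ];
    for &(name, new_ns, change_pct) in rows.iter() {
        let (old_ns, delta) = absolute_delta_ns(new_ns, change_pct);
        println!("{}: {:.1} ns -> {:.1} ns ({:+.1} ns)", name, old_ns, new_ns, delta);
    }
}
```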

Technically equal_bool_513 has regressed, but only by about a nanosecond, so I think it is unlikely to be meaningful in practice, and the significant improvements elsewhere (roughly 0.8 us for equal_nulls_512 and 1 us for equal_string_nulls_512) justify this change.
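
For context on where the null-heavy speedups plausibly come from: the title refers to comparing bitmap ranges in chunks instead of one bit at a time. Below is a minimal standalone sketch of that idea, assuming LSB-first bit order as in Arrow validity bitmaps. It is not the code in this PR and does not use arrow's `BitChunks` API; `load_u64` and `equal_bits_chunked` are hypothetical names.

```rust
/// Load 64 bits starting at `bit_pos` (LSB-first within each byte).
/// Bits past the end of `bytes` read as zero. Hypothetical helper.
fn load_u64(bytes: &[u8], bit_pos: usize) -> u64 {
    let byte = (bit_pos / 8).min(bytes.len());
    let shift = bit_pos % 8;
    // Copy up to 9 bytes: 8 for the word plus 1 to fill the bits shifted in.
    let avail = (bytes.len() - byte).min(9);
    let mut buf = [0u8; 9];
    buf[..avail].copy_from_slice(&bytes[byte..byte + avail]);
    let mut low = [0u8; 8];
    low.copy_from_slice(&buf[..8]);
    let word = u64::from_le_bytes(low);
    if shift == 0 {
        word
    } else {
        (word >> shift) | ((buf[8] as u64) << (64 - shift))
    }
}

/// Compare `len` bits of `lhs` (from `lhs_start`) with `len` bits of `rhs`
/// (from `rhs_start`), 64 bits at a time. All offsets and lengths are in bits.
fn equal_bits_chunked(
    lhs: &[u8],
    rhs: &[u8],
    lhs_start: usize,
    rhs_start: usize,
    len: usize,
) -> bool {
    let full_chunks = len / 64;
    let remainder = len % 64;
    for c in 0..full_chunks {
        if load_u64(lhs, lhs_start + c * 64) != load_u64(rhs, rhs_start + c * 64) {
            return false;
        }
    }
    if remainder == 0 {
        return true;
    }
    // Only the low `remainder` bits of the final chunk are in range.
    let mask = (1u64 << remainder) - 1;
    let l = load_u64(lhs, lhs_start + full_chunks * 64) & mask;
    let r = load_u64(rhs, rhs_start + full_chunks * 64) & mask;
    l == r
}

fn main() {
    let a = vec![0xA5u8; 17];
    // 0xA5 repeats every 8 bits, so offsets 3 and 11 see the same pattern.
    println!("{}", equal_bits_chunked(&a, &a, 3, 11, 120));
}
```

A 512-bit range needs eight word comparisons this way, versus 512 single-bit comparisons in a per-bit loop.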

What changes are included in this PR?

Are there any user-facing changes?

github-actions bot added the arrow (Changes to the arrow crate) label Jul 27, 2022

alamb commented Jul 27, 2022

I took the liberty of fixing the fmt lint diff and merging the branch from master

alamb left a comment

I am not sure about the change to boolean_equal -- everything else looks good to me

@@ -37,6 +38,21 @@ use std::sync::Arc;

use super::equal::equal;

#[inline]
pub(crate) fn contains_nulls(

Contributor

Given the never-ending confusion about "is this a size in bits or bytes" I recommend making it explicit -- either add a docstring that says offset and len are in bits, or perhaps name them bit_offset and bit_len.

Contributor

Given the potential usefulness (and non-obviousness) of using BitSliceIterator to check for nulls, I wonder what you think about making this a function on BitSliceIterator, such as BitSliceIterator::contains_only_unset_bits or something?

Contributor Author

I could definitely see this being promoted to a method on BitMap, once that supports slicing #1802
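
To make the "non-obvious" part of this thread concrete, here is a standalone sketch of the technique: a range is null-free exactly when the first run of set bits covers the whole range. It assumes LSB-first bitmaps and uses the `bit_offset`/`bit_len` naming suggested above. `first_set_run` is a hypothetical, deliberately naive stand-in for what a slice iterator would yield first; this is not the function added in the PR.

```rust
/// First run of consecutive set bits inside the range, as `(start, end)`
/// relative to `bit_offset`, with `end` exclusive. Hypothetical stand-in.
fn first_set_run(bitmap: &[u8], bit_offset: usize, bit_len: usize) -> Option<(usize, usize)> {
    let get = |i: usize| {
        let pos = bit_offset + i;
        (bitmap[pos / 8] >> (pos % 8)) & 1 == 1
    };
    let start = (0..bit_len).find(|&i| get(i))?;
    let end = (start..bit_len).find(|&i| !get(i)).unwrap_or(bit_len);
    Some((start, end))
}

/// `bit_offset` and `bit_len` are in bits, not bytes.
fn contains_nulls(validity: Option<&[u8]>, bit_offset: usize, bit_len: usize) -> bool {
    match validity {
        // No validity bitmap: every slot is valid.
        None => false,
        Some(bitmap) => match first_set_run(bitmap, bit_offset, bit_len) {
            // Null-free only if the first valid run spans the entire range.
            Some((start, end)) => start != 0 || end != bit_len,
            // No set bit at all: everything is null unless the range is empty.
            None => bit_len != 0,
        },
    }
}

fn main() {
    // Bit 3 of the first byte is unset, i.e. slot 3 is null.
    let bitmap = [0b1111_0111u8, 0xFF];
    println!("{}", contains_nulls(Some(&bitmap[..]), 0, 16));
    println!("{}", contains_nulls(Some(&bitmap[..]), 4, 12));
}
```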


#[test]
fn test_contains_nulls() {
let buffer: Buffer =

Contributor

I wonder if using a buffer with more than 64 bits is important (or is that already well enough covered in BitSliceIterator tests)?

Contributor Author

I'm happy they are well covered by the BitSliceIterator tests


let lhs_start = lhs.offset() + lhs_start;
let rhs_start = rhs.offset() + rhs_start;

- (0..len).all(|i| {
+ BitIndexIterator::new(lhs_null_bytes, lhs_start, len).all(|i| {

Contributor

Is this correct? It seems like it would not find positions where rhs was null but lhs was not. Maybe I misunderstand something.

It seems like we need to iterate over the positions where either is set -- like lhs_null_bytes | rhs_null_bytes?

Maybe there is a lack of test coverage 🤔

Contributor

Or maybe you can just call equal_bits here?

Contributor Author

These methods are purely checking value equality; at this point the null masks have already been checked for equality. Otherwise the previous logic would have been ill-formed.

Contributor

Ah, I think I missed that this was checking BooleanArray (and hence this get_bit is getting the value as you say). 👍
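
The point settled in this thread can be sketched as follows: once the two null masks are known to be equal over the range, it is enough to visit the positions that are valid in lhs and compare the value bits there. This is a standalone illustration assuming LSB-first bitmaps. The `filter` over `0..len` is a naive stand-in for iterating set-bit indices, and `boolean_values_equal` and `lhs_validity` are hypothetical names, not the PR's code.

```rust
fn get_bit(bytes: &[u8], i: usize) -> bool {
    (bytes[i / 8] >> (i % 8)) & 1 == 1
}

/// Compare boolean values only at positions that are valid in `lhs`.
/// Precondition (assumed, not checked here): the caller has already verified
/// that the lhs and rhs validity bitmaps are equal over this range.
/// All offsets and lengths are in bits.
fn boolean_values_equal(
    lhs_values: &[u8],
    rhs_values: &[u8],
    lhs_validity: &[u8],
    lhs_start: usize,
    rhs_start: usize,
    len: usize,
) -> bool {
    (0..len)
        // Skip null slots: their value bits are unspecified and may differ.
        .filter(|&i| get_bit(lhs_validity, lhs_start + i))
        .all(|i| get_bit(lhs_values, lhs_start + i) == get_bit(rhs_values, rhs_start + i))
}

fn main() {
    let lhs_values = [0b0000_1111u8];
    let rhs_values = [0b0000_0111u8]; // differs from lhs only at bit 3
    let validity = [0b1111_0111u8]; // bit 3 is null in both arrays
    println!(
        "{}",
        boolean_values_equal(&lhs_values, &rhs_values, &validity, 0, 0, 8)
    );
}
```

Without the precondition, a slot that is null in rhs but valid in lhs would be compared by value, which is the case the review question was probing.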


tustvold merged commit 7199b1b into apache:master Jul 28, 2022

ursabot commented Jul 28, 2022

Benchmark runs are scheduled for baseline = 6c77cd5 and contender = 7199b1b. 7199b1b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
