Change nullif to support arbitrary arrays #521

Closed

Conversation

bjchambers
Contributor

Which issue does this PR close?

Closes #510.

Rationale for this change

There is no reason the nullif kernel should apply only to primitive arrays. I have found it useful for nulling out a field array to respect the nullness of a struct, for instance, and supporting only primitive arrays prevents this.

What changes are included in this PR?

A change to the implementation of nullif to support arbitrary ArrayRef.

Are there any user-facing changes?

The signature of nullif has changed.
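For context, the semantics being generalized can be modelled without the arrow crate at all. The helper below is a hypothetical, simplified sketch over plain `Option` values, not the PR's actual API (the real kernel operates on arrays and their validity bitmaps): an element of `left` becomes null wherever the boolean condition is a non-null `true`.

```rust
/// Hypothetical model of `nullif` semantics (illustration only, not the
/// arrow API): null out `left[i]` wherever `right[i]` is `Some(true)`.
fn nullif_model<T: Clone>(left: &[Option<T>], right: &[Option<bool>]) -> Vec<Option<T>> {
    assert_eq!(left.len(), right.len(), "inputs must have equal length");
    left.iter()
        .zip(right)
        .map(|(l, r)| match r {
            Some(true) => None, // condition is true: null out the value
            _ => l.clone(),     // condition is false or null: keep the value
        })
        .collect()
}
```

Because the condition array can itself contain nulls, a correct implementation has to combine the condition's values with its validity bitmap, which is where the offset handling discussed below comes in.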

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 3, 2021
@bjchambers
Contributor Author

I believe since the API changed this should have the "breaking change" label, but I don't seem to be able to add it.

@Dandandan Dandandan added the api-change Changes to the arrow API label Jul 4, 2021
.count_set_bits_offset(right.offset(), right.len())
== 0
{
return Ok(left.clone());
Contributor Author

This is one of the cases where it seems like taking an ArrayRef is useful.

// Check again -- if all of the falses in the right corresponded to nulls, we
// can still pass the left unmodified.
if right_combo_buffer.count_set_bits_offset(right.offset(), right.len()) == 0 {
return Ok(left.clone());
Contributor Author

This is the other case where it seems like taking an ArrayRef is useful.

Contributor

@alamb alamb left a comment

Thank you @bjchambers -- this looks helpful indeed. @jorgecarleitao or @nevi-me are you ok with this interface? If so I can review this PR carefully.

@codecov-commenter

Codecov Report

Merging #521 (2cbe13c) into master (f873d77) will increase coverage by 0.09%.
The diff coverage is 97.51%.


@@            Coverage Diff             @@
##           master     #521      +/-   ##
==========================================
+ Coverage   82.47%   82.57%   +0.09%     
==========================================
  Files         167      167              
  Lines       46144    46399     +255     
==========================================
+ Hits        38059    38315     +256     
+ Misses       8085     8084       -1     
Impacted Files Coverage Δ
arrow/src/compute/kernels/boolean.rs 97.53% <97.51%> (+0.76%) ⬆️
arrow/src/buffer/immutable.rs 98.92% <0.00%> (+1.07%) ⬆️
arrow/src/array/transform/boolean.rs 84.61% <0.00%> (+7.69%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jhorstmann
Contributor

Padding the validity buffer is an interesting approach and avoids many edge cases for handling buffers of different data types. The only downside I see is that the performance now depends a little on the offset instead of only on the length of the slice.

There is a possible alternative solution that would slice the buffers (depending on the datatype). I have such an implementation for most data types, but since there is separate logic for each type the potential for errors is much higher. The not yet implemented types are Struct, Union and FixedSizeLists. If there is interest I can post the code or open an alternative PR, but I'm not sure it would be a clear improvement.
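The offset sensitivity mentioned above can be made concrete with a toy, arrow-free model of the padding idea (the helper name and the bool-per-element representation are illustrative only; the real code works on packed bitmaps): the combined null mask is shifted out to the target array's existing bit offset, so the allocated and scanned length grows with `offset + len` rather than just `len`.

```rust
/// Toy model of the "pad the validity buffer" approach (hypothetical
/// helper): shift a zero-based mask right by `offset` positions so it
/// lines up with an array whose buffers start at that offset. The output
/// length is `offset + mask.len()`, which is why cost depends on offset.
fn pad_mask(mask: &[bool], offset: usize) -> Vec<bool> {
    let mut padded = vec![false; offset + mask.len()];
    padded[offset..].copy_from_slice(mask); // prefix bits stay unset
    padded
}
```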

@bjchambers
Contributor Author

> Padding the validity buffer is an interesting approach and avoids many edge cases for handling buffers of different data types. The only downside I see is that the performance now depends a little on the offset instead of only on the length of the slice.
>
> There is a possible alternative solution that would slice the buffers (depending on the datatype). I have such an implementation for most data types, but since there is separate logic for each type the potential for errors is much higher. The not yet implemented types are Struct, Union and FixedSizeLists. If there is interest I can post the code or open an alternative PR, but I'm not sure it would be a clear improvement.

I'd be happy with either. How does the slicing depend on datatype? It seems like supporting the composite types is important to make this work, and the errors are a potential concern. On the other hand -- how much do you think the performance would depend on the offset? It seems like it may be a little sensitive, but shouldn't be significant? If so, it may be better to start with something that is less error prone, and then change if performance is a concern?

@jhorstmann
Contributor

For some background, the query engine I'm working with keeps data in memory in big arrow arrays. Processing of the data happens in batches that are created as zero-copy slices of those arrays. So the first batch would start with arrays at offset 0, the second at, for example, offset 4096, and so on. This leads to us usually being the first to notice any offset-related issues :) For this kernel it would mean that calculating nullif on some large input becomes quadratic instead of linear, although probably with a relatively small constant factor.

To create the result array with an offset of 0, the buffers or child arrays would have to be sliced. This depends on the datatype, for example:

Boolean => buffer[0].bit_slice(offset, len) // potentially copies data
Primitive => buffer[0].slice(offset * size_of::<T>()) // buffer.slice uses byte offsets
Struct => ... // needs to slice all child arrays
List => ... // should slice the offset buffer, but creation of ListArray currently validates that the first offset is 0, so actually needs to calculate a new offset buffer

The implementation I have works with the subset of datatypes I'm using; I think an implementation inside arrow should support all datatypes even if it has a small performance penalty. Longer term, I think moving the offset down into the buffers would be the better general solution and would simplify a lot of kernels. I think arrow2 is using that approach successfully.
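To illustrate why the Boolean case above "potentially copies data": slicing a packed bitmap at a non-byte-aligned bit offset cannot be zero-copy, because the bits have to be repacked to start at bit 0. Below is a minimal, arrow-free model of that repacking (a hypothetical helper; arrow's `Buffer::bit_slice` is the real implementation, using LSB-first bit order as arrow does).

```rust
/// Hypothetical model of bitmap slicing: copy `len` bits starting at bit
/// `offset` (LSB-first within each byte) into a fresh buffer whose first
/// bit is bit 0. The per-bit copy is why a misaligned slice is O(len).
fn bit_slice(bytes: &[u8], offset: usize, len: usize) -> Vec<u8> {
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        let src = offset + i;
        if bytes[src / 8] & (1 << (src % 8)) != 0 {
            out[i / 8] |= 1 << (i % 8); // repack bit `src` as bit `i`
        }
    }
    out
}
```

For example, `bit_slice(&[0b1010_1010], 1, 4)` repacks bits 1..5 into a fresh single-byte buffer holding `0b0000_0101`.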

@bjchambers
Contributor Author

bjchambers commented Sep 22, 2021

FWIW, I also believe the implementation in this PR has bugs in some offset cases. For instance, I think bit_slice(0, len) doesn't incorporate the len in the result, and using bit_and on it later then panics because the lengths don't line up. So I'm totally happy for there to be alternative / less buggy approaches.

It may even be worth a proptest for this to make sure it handles all the cases?
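The shape such a test could take, sketched here without the proptest crate: exhaustively loop over every (offset, len) pair and compare an "optimized" chunked bit count against a naive per-bit reference. Both helpers are hypothetical stand-ins for arrow's buffer bit counting, but the unaligned prefix/suffix handling is exactly where offset bugs of the kind discussed here tend to hide.

```rust
/// Naive per-bit reference: count set bits in bytes[offset..offset+len]
/// (bit positions are LSB-first within each byte).
fn count_bits_naive(bytes: &[u8], offset: usize, len: usize) -> usize {
    (offset..offset + len)
        .filter(|&i| bytes[i / 8] & (1 << (i % 8)) != 0)
        .count()
}

/// "Optimized" variant: byte-wide popcounts for the aligned middle,
/// bit-by-bit handling for the unaligned prefix and suffix.
fn count_bits_chunked(bytes: &[u8], offset: usize, len: usize) -> usize {
    let (mut count, mut i, end) = (0, offset, offset + len);
    while i < end && i % 8 != 0 {
        count += ((bytes[i / 8] >> (i % 8)) & 1) as usize; // unaligned prefix
        i += 1;
    }
    while i + 8 <= end {
        count += bytes[i / 8].count_ones() as usize; // whole bytes
        i += 8;
    }
    while i < end {
        count += ((bytes[i / 8] >> (i % 8)) & 1) as usize; // unaligned suffix
        i += 1;
    }
    count
}
```

A proptest would draw `offset` and `len` randomly; for a small fixed buffer the exhaustive loop gives the same coverage deterministically.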

@bjchambers
Contributor Author

> For some background, the query engine I'm working with keeps data in memory in big arrow arrays. Processing of the data happens in batches that are created as zero-copy slices of those arrays. So the first batch would start with arrays at offset 0, the second at, for example, offset 4096, and so on. This leads to us usually being the first to notice any offset-related issues :) For this kernel it would mean that calculating nullif on some large input becomes quadratic instead of linear, although probably with a relatively small constant factor.
>
> To create the result array with an offset of 0, the buffers or child arrays would have to be sliced. This depends on the datatype, for example:
>
> Boolean => buffer[0].bit_slice(offset, len) // potentially copies data
> Primitive => buffer[0].slice(offset * size_of::<T>()) // buffer.slice uses byte offsets
> Struct => ... // needs to slice all child arrays
> List => ... // should slice the offset buffer, but creation of ListArray currently validates that the first offset is 0, so actually needs to calculate a new offset buffer
>
> The implementation I have works with the subset of datatypes I'm using; I think an implementation inside arrow should support all datatypes even if it has a small performance penalty. Longer term, I think moving the offset down into the buffers would be the better general solution and would simplify a lot of kernels. I think arrow2 is using that approach successfully.

Note that I'm seeing problems with slices. For instance, the following issue causes this implementation to panic depending on the slice offset: #807. If your implementation is less subject to those problems, it may be a good option moving forward. But it looks like you may hit the same problem, since your boolean case uses bit_slice.

@alamb
Contributor

alamb commented Nov 2, 2021

I am going through old PRs and this one seems stalled. I am wondering what we would like to do with it. Is it ok to merge? Are we doing an alternate implementation? Do we have something else in mind?

@alamb
Contributor

alamb commented Dec 20, 2021

Marking PRs that are over a month old as stale -- please let us know if there is additional work planned or if we should close them.

@alamb alamb added the stale-pr label Dec 20, 2021
@bjchambers
Contributor Author

I've lost the thread on this one. I have a version of this checked in and used in some internal code, so I don't need this to go in. At the same time, it seems like supporting nullif on arbitrary arrays would be beneficial. I'm more than happy to close this and wait for a "better" implementation... but it may be acceptable to check in something that improves on the current state (nullif limited to primitive arrays) and then improve the implementation if/when performance is a problem and/or a better implementation is available.

I'm happy to go either way with this -- if we'd like to move forward I have some additional proptests and such that found a bug or two in the implementation that I'll add, just to make this complete.

@alamb
Contributor

alamb commented Dec 30, 2021

> At the same time, it seems like supporting nullif on arbitrary arrays would be beneficial.

I agree

> if we'd like to move forward I have some additional proptests and such that found a bug or two in the implementation that I'll add, just to make this complete.

Let's do that -- and I will find time to review this code properly

@alamb
Contributor

alamb commented Jan 24, 2022

Closing this one down to keep the review list clean. Please reopen if that is a mistake.

@alamb alamb closed this Jan 24, 2022
@bjchambers
Contributor Author

@jhorstmann any chance you could create a PR or share your implementation of nullif? Since the original approach didn't make it in, it would be good to get something moving, since this would be a useful kernel to have within Arrow.

@alamb
Contributor

alamb commented Mar 28, 2022

I agree it would be useful. Thank you for pushing it @bjchambers

@alamb
Contributor

alamb commented Oct 26, 2022

New proposed PR: #2940

Labels
api-change (Changes to the arrow API), arrow (Changes to the arrow crate), stale-pr

Successfully merging this pull request may close these issues.

nullif operating on ArrayRef
5 participants