Change nullif to support arbitrary arrays #521

Closed

Conversation

bjchambers
Contributor

Which issue does this PR close?

Closes #510.

Rationale for this change

There is no reason the nullif kernel should apply only to primitive arrays. I have found it useful for nulling out a field array to respect the nullness of a struct, for instance, and supporting only primitive arrays prevents this.

What changes are included in this PR?

A change to the implementation of nullif to support arbitrary ArrayRef.

Are there any user-facing changes?

The signature of nullif has changed.
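For context, the semantics being generalized can be modelled without the arrow crate at all. The helper below is a hypothetical, simplified sketch over plain `Option` values, not the PR's actual API (the real kernel operates on arrays and their validity bitmaps): an element of `left` becomes null wherever the boolean condition is a non-null `true`.

```rust
/// Hypothetical model of `nullif` semantics (illustration only, not the
/// arrow API): null out `left[i]` wherever `right[i]` is `Some(true)`.
fn nullif_model<T: Clone>(left: &[Option<T>], right: &[Option<bool>]) -> Vec<Option<T>> {
    assert_eq!(left.len(), right.len(), "inputs must have equal length");
    left.iter()
        .zip(right)
        .map(|(l, r)| match r {
            Some(true) => None, // condition is true: null out the value
            _ => l.clone(),     // condition is false or null: keep the value
        })
        .collect()
}
```

Because the condition array can itself contain nulls, a correct implementation has to combine the condition's values with its validity bitmap, which is where the offset handling discussed below comes in.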

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 3, 2021
@bjchambers
Contributor Author

I believe since the API changed this should have the "breaking change" label, but I don't seem to be able to add it.

@Dandandan Dandandan added the api-change Changes to the arrow API label Jul 4, 2021
.count_set_bits_offset(right.offset(), right.len())
== 0
{
return Ok(left.clone());
Contributor Author

This is one of the cases where it seems like taking an ArrayRef is useful.

// Check again -- if all of the falses in the right corresponded to nulls, we
// can still pass the left unmodified.
if right_combo_buffer.count_set_bits_offset(right.offset(), right.len()) == 0 {
return Ok(left.clone());
Contributor Author

This is the other case where it seems like taking an ArrayRef is useful.

Contributor

@alamb alamb left a comment

Thank you @bjchambers -- this looks helpful indeed. @jorgecarleitao or @nevi-me are you ok with this interface? If so I can review this PR carefully.

@codecov-commenter

Codecov Report

Merging #521 (2cbe13c) into master (f873d77) will increase coverage by 0.09%.
The diff coverage is 97.51%.


@@            Coverage Diff             @@
##           master     #521      +/-   ##
==========================================
+ Coverage   82.47%   82.57%   +0.09%     
==========================================
  Files         167      167              
  Lines       46144    46399     +255     
==========================================
+ Hits        38059    38315     +256     
+ Misses       8085     8084       -1     
Impacted Files Coverage Δ
arrow/src/compute/kernels/boolean.rs 97.53% <97.51%> (+0.76%) ⬆️
arrow/src/buffer/immutable.rs 98.92% <0.00%> (+1.07%) ⬆️
arrow/src/array/transform/boolean.rs 84.61% <0.00%> (+7.69%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jhorstmann
Contributor

Padding the validity buffer is an interesting approach and avoids many edge cases for handling buffers of different data types. The only downside I see is that the performance now depends a little on the offset instead of only on the length of the slice.

There is a possible alternative solution that would slice the buffers (depending on the datatype). I have such an implementation for most data types, but since there is separate logic for each type the potential for errors is much higher. The not yet implemented types are Struct, Union and FixedSizeLists. If there is interest I can post the code or open an alternative PR, but I'm not sure it would be a clear improvement.
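The offset sensitivity mentioned above can be made concrete with a toy, arrow-free model of the padding idea (the helper name and the bool-per-element representation are illustrative only; the real code works on packed bitmaps): the combined null mask is shifted out to the target array's existing bit offset, so the allocated and scanned length grows with `offset + len` rather than just `len`.

```rust
/// Toy model of the "pad the validity buffer" approach (hypothetical
/// helper): shift a zero-based mask right by `offset` positions so it
/// lines up with an array whose buffers start at that offset. The output
/// length is `offset + mask.len()`, which is why cost depends on offset.
fn pad_mask(mask: &[bool], offset: usize) -> Vec<bool> {
    let mut padded = vec![false; offset + mask.len()];
    padded[offset..].copy_from_slice(mask); // prefix bits stay unset
    padded
}
```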

@bjchambers
Contributor Author

> Padding the validity buffer is an interesting approach and avoids many edge cases for handling buffers of different data types. The only downside I see is that the performance now depends a little on the offset instead of only on the length of the slice.
>
> There is a possible alternative solution that would slice the buffers (depending on the datatype). I have such an implementation for most data types, but since there is separate logic for each type the potential for errors is much higher. The not yet implemented types are Struct, Union and FixedSizeLists. If there is interest I can post the code or open an alternative PR, but I'm not sure it would be a clear improvement.

I'd be happy with either. How does the slicing depend on datatype? It seems like supporting the composite types is important to make this work, and the errors are a potential concern. On the other hand -- how much do you think the performance would depend on the offset? It seems like it may be a little sensitive, but shouldn't be significant? If so, it may be better to start with something that is less error prone, and then change if performance is a concern?

@jhorstmann
Contributor

For some background, the query engine I'm working with keeps data in memory in big arrow arrays. Processing of the data happens in batches that are created as zero-copy slices of those arrays. So the first batch would start with arrays at offset 0, the second at, for example, offset 4096, and so on. This leads to us usually being the first to notice any offset-related issues :) For this kernel it would mean that calculating nullif on some large input becomes quadratic instead of linear, although probably with a relatively small constant factor.

To create the result array with an offset of 0, the buffers or child arrays would have to be sliced. This depends on the datatype, for example:

Boolean => buffer[0].bit_slice(offset, len) // potentially copies data
Primitive => buffer[0].slice(offset * size_of::<T>()) // buffer.slice uses byte offsets
Struct => ... // needs to slice all child arrays
List => ... // should slice the offset buffer, but creation of ListArray currently validates that the first offset is 0, so actually needs to calculate a new offset buffer

The implementation I have works with the subset of datatypes I'm using; I think an implementation inside arrow should support all datatypes even if it has a small performance penalty. Longer term, I think moving the offset down into the buffers would be the better general solution and would simplify a lot of kernels. I think arrow2 is using that approach successfully.
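To illustrate why the Boolean case above "potentially copies data": slicing a packed bitmap at a non-byte-aligned bit offset cannot be zero-copy, because the bits have to be repacked to start at bit 0. Below is a minimal, arrow-free model of that repacking (a hypothetical helper; arrow's `Buffer::bit_slice` is the real implementation, using LSB-first bit order as arrow does).

```rust
/// Hypothetical model of bitmap slicing: copy `len` bits starting at bit
/// `offset` (LSB-first within each byte) into a fresh buffer whose first
/// bit is bit 0. The per-bit copy is why a misaligned slice is O(len).
fn bit_slice(bytes: &[u8], offset: usize, len: usize) -> Vec<u8> {
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        let src = offset + i;
        if bytes[src / 8] & (1 << (src % 8)) != 0 {
            out[i / 8] |= 1 << (i % 8); // repack bit `src` as bit `i`
        }
    }
    out
}
```

For example, `bit_slice(&[0b1010_1010], 1, 4)` repacks bits 1..5 into a fresh single-byte buffer holding `0b0000_0101`.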

@bjchambers
Contributor Author

bjchambers commented Sep 22, 2021

FWIW, I also believe the implementation in this PR has bugs in some offset cases. For instance, I think bit_slice(0, len) doesn't incorporate the len in the result, and using bit_and on it later then panics because the lengths don't line up. So I'm totally happy for there to be alternative / less buggy approaches.

It may even be worth a proptest for this to make sure it handles all the cases?
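The shape such a test could take, sketched here without the proptest crate: exhaustively loop over every (offset, len) pair and compare an "optimized" chunked bit count against a naive per-bit reference. Both helpers are hypothetical stand-ins for arrow's buffer bit counting, but the unaligned prefix/suffix handling is exactly where offset bugs of the kind discussed here tend to hide.

```rust
/// Naive per-bit reference: count set bits in bytes[offset..offset+len]
/// (bit positions are LSB-first within each byte).
fn count_bits_naive(bytes: &[u8], offset: usize, len: usize) -> usize {
    (offset..offset + len)
        .filter(|&i| bytes[i / 8] & (1 << (i % 8)) != 0)
        .count()
}

/// "Optimized" variant: byte-wide popcounts for the aligned middle,
/// bit-by-bit handling for the unaligned prefix and suffix.
fn count_bits_chunked(bytes: &[u8], offset: usize, len: usize) -> usize {
    let (mut count, mut i, end) = (0, offset, offset + len);
    while i < end && i % 8 != 0 {
        count += ((bytes[i / 8] >> (i % 8)) & 1) as usize; // unaligned prefix
        i += 1;
    }
    while i + 8 <= end {
        count += bytes[i / 8].count_ones() as usize; // whole bytes
        i += 8;
    }
    while i < end {
        count += ((bytes[i / 8] >> (i % 8)) & 1) as usize; // unaligned suffix
        i += 1;
    }
    count
}
```

A proptest would draw `offset` and `len` randomly; for a small fixed buffer the exhaustive loop gives the same coverage deterministically.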

@bjchambers
Contributor Author

> For some background, the query engine I'm working with keeps data in memory in big arrow arrays. Processing of the data happens in batches that are created as zero-copy slices of those arrays. So the first batch would start with arrays at offset 0, the second at, for example, offset 4096, and so on. This leads to us usually being the first to notice any offset-related issues :) For this kernel it would mean that calculating nullif on some large input becomes quadratic instead of linear, although probably with a relatively small constant factor.
>
> To create the result array with an offset of 0, the buffers or child arrays would have to be sliced. This depends on the datatype, for example:
>
> Boolean => buffer[0].bit_slice(offset, len) // potentially copies data
> Primitive => buffer[0].slice(offset * size_of::<T>()) // buffer.slice uses byte offsets
> Struct => ... // needs to slice all child arrays
> List => ... // should slice the offset buffer, but creation of ListArray currently validates that the first offset is 0, so actually needs to calculate a new offset buffer
>
> The implementation I have works with the subset of datatypes I'm using; I think an implementation inside arrow should support all datatypes even if it has a small performance penalty. Longer term, I think moving the offset down into the buffers would be the better general solution and would simplify a lot of kernels. I think arrow2 is using that approach successfully.

Note that I'm seeing problems with slices. For instance, the following issue causes this implementation to panic depending on the slice offset: #807. If your implementation is less subject to those problems, it may be a good option moving forward. But it looks like you may hit the same problem, since your boolean case uses bit_slice.

@alamb
Contributor

alamb commented Nov 2, 2021

I am going through old PRs and this one seems stalled. I am wondering what we would like to do with it. Is it ok to merge? Are we doing an alternate implementation? Do we have something else in mind?

@alamb
Contributor

alamb commented Dec 20, 2021

Marking PRs that are over a month old as stale -- please let us know if there is additional work planned or if we should close them.

@alamb alamb added the stale-pr label Dec 20, 2021
@bjchambers
Contributor Author

I've lost the thread on this one. I have a version of this checked in and used in some internal code, so I don't need this to go in. At the same time, it seems like supporting nullif on arbitrary arrays would be beneficial. I'm more than happy to close this and wait for a "better" implementation... but it may be acceptable to check in something that improves on the current state (nullif limited to primitive arrays) and then improve the implementation if/when performance is a problem and/or a better implementation is available.

I'm happy to go either way with this -- if we'd like to move forward I have some additional proptests and such that found a bug or two in the implementation that I'll add, just to make this complete.

@alamb
Contributor

alamb commented Dec 30, 2021

> At the same time, it seems like supporting nullif on arbitrary arrays would be beneficial.

I agree

> if we'd like to move forward I have some additional proptests and such that found a bug or two in the implementation that I'll add, just to make this complete.

Let's do that -- and I will find time to review this code properly

@alamb
Contributor

alamb commented Jan 24, 2022

Closing this one down to keep the review list clean. Please reopen if that is a mistake.

@alamb alamb closed this Jan 24, 2022
@bjchambers
Contributor Author

@jhorstmann any chance you could create a PR or share your implementation of nullif? Since the original approach didn't make it in, it would be good to get something moving, since this would be a useful kernel to have within Arrow.

@alamb
Contributor

alamb commented Mar 28, 2022

I agree it would be useful. Thank you for pushing it @bjchambers

@alamb
Contributor

alamb commented Oct 26, 2022

New proposed PR: #2940

Labels
api-change (Changes to the arrow API), arrow (Changes to the arrow crate), stale-pr

Successfully merging this pull request may close these issues.

nullif operating on ArrayRef
5 participants