Change nullif to support arbitrary arrays #521
I believe since the API changed this should have the "breaking change" label, but I don't seem to be able to add it.
    .count_set_bits_offset(right.offset(), right.len())
    == 0
{
    return Ok(left.clone());
This is one of the cases where it seems like taking an ArrayRef is useful.
// Check again -- if all of the falses in the right corresponded to nulls, we
// can still pass the left unmodified.
if right_combo_buffer.count_set_bits_offset(right.offset(), right.len()) == 0 {
    return Ok(left.clone());
This is the other case where it seems like taking an ArrayRef is useful.
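A minimal sketch of why a shared reference helps in these two early returns, assuming the point of the comments is that the input can be handed back without copying. `ColumnRef` and `nullif_fast_path` are hypothetical stand-ins built on `std::sync::Arc`, not the arrow API:

```rust
use std::sync::Arc;

/// Hypothetical stand-in for `ArrayRef`: a shared, immutable column.
type ColumnRef<T> = Arc<Vec<Option<T>>>;

/// Returns the input unchanged (sharing the allocation) when no slot of
/// `right` is both valid and true; otherwise returns `None` to signal that
/// the slow path has to build a new array.
fn nullif_fast_path<T>(left: &ColumnRef<T>, right: &[Option<bool>]) -> Option<ColumnRef<T>> {
    let valid_trues = right.iter().filter(|r| **r == Some(true)).count();
    if valid_trues == 0 {
        // Cloning an Arc only bumps a reference count; no values are copied.
        Some(Arc::clone(left))
    } else {
        None
    }
}
```

With a borrowed, concrete array type the same early return would need to rebuild or deep-copy the input, which is presumably what the review comments are getting at.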
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @bjchambers -- this looks helpful indeed. @jorgecarleitao or @nevi-me are you ok with this interface? If so I can review this PR carefully.
This closes apache#510.
41a793b to 4a5020d
Codecov Report
@@ Coverage Diff @@
## master #521 +/- ##
==========================================
+ Coverage 82.47% 82.57% +0.09%
==========================================
Files 167 167
Lines 46144 46399 +255
==========================================
+ Hits 38059 38315 +256
+ Misses 8085 8084 -1
Padding the validity buffer is an interesting approach and avoids many edge cases for handling buffers of different data types. The only downside I see is that the performance now depends a little on the offset instead of only on the length of the slice. There is a possible alternative solution that would slice the buffers (depending on the datatype). I have such an implementation for most data types, but since there is separate logic for each type the potential for errors is much higher. The not yet implemented types are Struct, Union and FixedSizeLists. If there is interest I can post the code or open an alternative PR, but I'm not sure it would be a clear improvement.
I'd be happy with either. How does the slicing depend on datatype? It seems like supporting the composite types is important to make this work, and the errors are a potential concern. On the other hand -- how much do you think the performance would depend on the offset? It seems like it may be a little sensitive, but shouldn't be significant? If so, it may be better to start with something that is less error prone, and then change if performance is a concern?
For some background, the query engine I'm working with keeps data in memory in big arrow arrays. Processing of the data happens in batches that are created as zero-copy slices of those arrays. So the first batch would start with arrays with offset 0, the second with, for example, offset 4096, and so on. This leads to us usually being the first to notice any offset related issues :)
For this kernel this would mean that calculating nullif on some large input would become quadratic instead of linear, although probably with a relatively small constant factor.
To create the result array with an offset of 0, the buffers or child arrays would have to be sliced. This depends on the datatype, for example:
    Boolean => buffer[0].bit_slice(offset, len) // potentially copies data
The implementation I have works with the subset of datatypes I'm using. I think an implementation inside arrow should better support all datatypes even if it has a small performance penalty.
Longer-term I think moving the offset down into the buffers would be the better general solution that would simplify a lot of kernels. I think arrow2 is using that approach successfully.
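To make the trade-off concrete, here is a rough sketch of the two strategies for the result validity bitmap. It uses plain byte slices with LSB-first bit order (as in the Arrow format) rather than arrow's buffer types. The function names and the `nullify` mask (true where the right-hand side is valid and true) are assumptions for illustration only:

```rust
/// Reads bit `i` of an LSB-first packed bitmap.
fn get_bit(bits: &[u8], i: usize) -> bool {
    bits[i / 8] & (1u8 << (i % 8)) != 0
}

/// Sets bit `i` of an LSB-first packed bitmap.
fn set_bit(bits: &mut [u8], i: usize) {
    bits[i / 8] |= 1u8 << (i % 8);
}

/// "Padded" strategy: the result keeps the left array's offset, so the left
/// value buffers can be reused as they are. The allocation (and its
/// zero-fill) is proportional to `offset + len`.
fn padded_validity(left_valid: &[u8], offset: usize, nullify: &[bool]) -> Vec<u8> {
    let len = nullify.len();
    let mut out = vec![0u8; (offset + len + 7) / 8];
    for i in 0..len {
        if get_bit(left_valid, offset + i) && !nullify[i] {
            set_bit(&mut out, offset + i);
        }
    }
    out
}

/// "Sliced" strategy: the result starts at bit 0, so the cost depends only
/// on `len`, but every value buffer must then be sliced per data type too.
fn sliced_validity(left_valid: &[u8], offset: usize, nullify: &[bool]) -> Vec<u8> {
    let len = nullify.len();
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        if get_bit(left_valid, offset + i) && !nullify[i] {
            set_bit(&mut out, i);
        }
    }
    out
}
```

With batches of size b taken at offsets 0, b, 2b, ... over n rows, the padded variant touches roughly b + 2b + ... + n bits in total, which is where the quadratic behaviour comes from.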
FWIW, I also believe there are bugs in the implementation in this MR in some cases of offsets. For instance, I think It may even be worth a proptest for this to make sure it handles all the cases?
Note that I'm seeing problems with slices. For instance, the following issue is causing this implementation to panic depending on the slice offset -- #807. If your implementation is less subject to those problems, that may be a good option moving forward. But, it looks like you may hit the same problem since your boolean case uses
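One way such a test could look, sketched without the proptest crate: compare a kernel against a naive model on every (offset, len) window of a fixed input. `nullif_model` and `check_all_slices` are hypothetical names, and the model encodes the assumption that a slot becomes null only where the right-hand side is valid and true:

```rust
/// Naive reference model: null out `left[i]` where `right[i]` is valid and true.
fn nullif_model<T: Clone>(left: &[Option<T>], right: &[Option<bool>]) -> Vec<Option<T>> {
    left.iter()
        .zip(right)
        .map(|(v, m)| if *m == Some(true) { None } else { v.clone() })
        .collect()
}

/// Runs `kernel` on every (offset, len) window of the inputs and compares the
/// result with the matching window of the model output. Returns the first
/// failing (offset, len) pair, if any.
fn check_all_slices<T, F>(
    left: &[Option<T>],
    right: &[Option<bool>],
    kernel: F,
) -> Result<(), (usize, usize)>
where
    T: Clone + PartialEq,
    F: Fn(&[Option<T>], &[Option<bool>]) -> Vec<Option<T>>,
{
    let expected = nullif_model(left, right);
    for offset in 0..=left.len() {
        for len in 0..=(left.len() - offset) {
            let end = offset + len;
            let got = kernel(&left[offset..end], &right[offset..end]);
            if got.as_slice() != &expected[offset..end] {
                return Err((offset, len));
            }
        }
    }
    Ok(())
}
```

A property-based version would generate `left`, `right`, and the window randomly instead of enumerating windows of one input, but the comparison against the model would stay the same.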
I am going through old PRs and this one seems stalled. I am wondering what we would like to do with this one? Is it ok to merge? Are we doing an alternate implementation? Do we have something else in mind?
Marking PRs that are over a month old as stale -- please let us know if there is additional work planned or if we should close them.
I've lost the thread on this one. I have a version of this checked in and used in some internal code, so I don't need this to go in. At the same time, it seems like supporting I'm happy to go either way with this -- if we'd like to move forward I have some additional proptests and such that found a bug or two in the implementation that I'll add, just to make this complete.
I agree
Let's do that -- and I will find time to review this code properly
Closing this one down to keep the review list clean. Please reopen if that is a mistake
@jhorstmann any chance you could create a PR or share your implementation of
I agree it would be useful. Thank you for pushing it @bjchambers
New proposed PR: #2940
Which issue does this PR close?
Closes #510.
Rationale for this change
There is no reason for the nullif kernel to apply only to primitive arrays. I have found it useful for nulling out a field array to respect the null-ness of a struct, for instance, and supporting only primitive arrays limits the ability to do this.
What changes are included in this PR?
A change to the implementation of nullif to support arbitrary ArrayRef.
Are there any user-facing changes?
The signature of nullif has changed.
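As an illustration of the struct use case from the rationale, here is a small sketch over plain vectors rather than arrow arrays. `nullif_any` and `mask_field_by_struct` are hypothetical helpers, and the element type is left generic to mirror the idea of accepting any array, not only primitive ones:

```rust
/// Generic nullif over any cloneable element type: a slot becomes null
/// where the condition is valid and true, and is passed through otherwise.
fn nullif_any<T: Clone>(values: &[Option<T>], cond: &[Option<bool>]) -> Vec<Option<T>> {
    values
        .iter()
        .zip(cond)
        .map(|(v, c)| if *c == Some(true) { None } else { v.clone() })
        .collect()
}

/// Nulls out a field wherever its parent struct slot is null, so the field
/// respects the null-ness of the struct.
fn mask_field_by_struct<T: Clone>(field: &[Option<T>], struct_valid: &[bool]) -> Vec<Option<T>> {
    // The nullif condition is "the struct slot is null".
    let cond: Vec<Option<bool>> = struct_valid.iter().map(|v| Some(!*v)).collect();
    nullif_any(field, &cond)
}
```

Here the field holds strings, which a primitive-only kernel could not handle.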