Why do we check for null in TypedDictionaryArray value function #2564
Comments
Dictionaries are a special case: because the value at a null index is undefined, we can't use it as an index into the values array. We therefore need this additional check. The bounds check is performed by calling value on the keys array.
Hmm 🤔, why are the processes somewhat different: However, the results are the same: You get an
There's a subtlety here: you get an unspecified value, not an undefined value. It is not UB to read a null value from a ListArray, StringArray, etc. It isn't defined what you will get, but you won't get uninitialised memory or anything that would trigger undefined behaviour. TypedDictionaryArray is a special case because it is actually doing two array lookups. First it does a lookup into the indices array; if this index is null, it gets back something unspecified. This is fine. However, this arbitrary value may not be within the bounds of the values array, in which case we would access memory out of bounds. We could also validate the bounds for the value access instead of checking the NULL mask, but I thought the NULL assertion more clearly indicated why the check is there.
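To make the two lookups concrete, here is a minimal sketch using a simplified stand-in type. `ToyDictionary`, its fields, and `key_in_bounds` are hypothetical names for illustration only; this is not the arrow-rs implementation, and it uses safe indexing throughout, so it can only model the out-of-bounds hazard rather than trigger it.

```rust
/// Simplified stand-in for a dictionary array (hypothetical, not the arrow-rs types).
struct ToyDictionary {
    /// Index into `values` for each slot; arbitrary where `valid` is false.
    keys: Vec<usize>,
    /// Null mask for `keys`: `false` means the slot is null.
    valid: Vec<bool>,
    /// The dictionary values that the keys point into.
    values: Vec<&'static str>,
}

impl ToyDictionary {
    /// Two lookups: first into `keys`, then into `values`.
    fn value(&self, index: usize) -> &'static str {
        // Checking the null mask first means the unspecified key stored in a
        // null slot is never used as an index into `values`.
        assert!(self.valid[index], "null slot at index {}", index);
        let key = self.keys[index]; // lookup 1: bounds-checked against keys
        self.values[key] // lookup 2: key of a valid slot is assumed in bounds
    }

    /// Reports whether the key stored at `index` would be a valid index into
    /// `values` if the null mask were ignored.
    fn key_in_bounds(&self, index: usize) -> bool {
        self.keys[index] < self.values.len()
    }
}

fn main() {
    let dict = ToyDictionary {
        keys: vec![0, 1, 42], // slot 2 is null, so its key 42 is arbitrary
        valid: vec![true, true, false],
        values: vec!["a", "b"],
    };

    assert_eq!(dict.value(0), "a");
    assert_eq!(dict.value(1), "b");

    // Slot 2 passes `2 < keys.len()`, yet its key is not a valid index into
    // `values`, which is why a length check on the keys alone is not enough.
    assert!(!dict.key_in_bounds(2));
}
```

In this sketch, asserting only that the index is less than `keys.len()` would let slot 2 through, and an unchecked second lookup would then read past the end of `values`.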
Makes sense, thank you @tustvold.
Thank you @tustvold for the explanation. I learnt a lot from it. Just one last thing: based on your explanation, doesn't this mean that we have actually implemented the function cmp_dict_primitive (https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/comparison.rs#L2263 ) wrongly, and subsequently all the other kernels which use compare_op with a dictionary as one of its parameters? As you can see, in that function we are using compare_op to compare the elements of a dictionary with another array. And in compare_op's implementation (https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/comparison.rs#L65) we are using value_unchecked to get the elements of the dictionary without checking for null. So based on the above explanation we shouldn't be doing that, right, since the unspecified value can lead to an out-of-bounds access?
Aah yeah, that's a bug. Fortunately it hasn't been released yet; it was changed in #2533. We should fix it before the next release. I do wonder if there is a way to avoid needing to check the null index for dictionary arrays. Perhaps we could enforce nulls to be 0, and then special case an empty dictionary or something... 🤔
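One way to picture the "enforce nulls to be 0" idea is the rough sketch below. Both function names are hypothetical, plain slices stand in for the real array types, and returning an empty string for an empty dictionary is just one possible special case; none of this is taken from arrow-rs.

```rust
/// Hypothetical helper: rewrite the keys so that every null slot stores 0.
fn normalize_null_keys(keys: &mut [usize], valid: &[bool]) {
    for (key, is_valid) in keys.iter_mut().zip(valid) {
        if !*is_valid {
            *key = 0;
        }
    }
}

/// Hypothetical lookup that never consults the null mask.
/// Assumes `keys` has been normalized and every valid key is in bounds.
fn value_ignoring_nulls<'a>(keys: &[usize], values: &[&'a str], index: usize) -> &'a str {
    // With normalized keys, a null slot points at values[0]. The one case
    // where that is still out of bounds is an empty dictionary.
    if values.is_empty() {
        return "";
    }
    values[keys[index]]
}

fn main() {
    let mut keys = vec![1, 99, 0]; // slot 1 is null and holds an arbitrary key
    let valid = vec![true, false, true];
    normalize_null_keys(&mut keys, &valid);
    assert_eq!(keys, vec![1, 0, 0]);

    let values = vec!["a", "b"];
    assert_eq!(value_ignoring_nulls(&keys, &values, 0), "b");
    // The null slot now resolves to values[0] instead of reading out of bounds.
    assert_eq!(value_ignoring_nulls(&keys, &values, 1), "a");
}
```

The value returned for a null slot is still unspecified from the caller's point of view; the only change is that it is always safe to read.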
The only somewhat reasonable case would be an empty dictionary, where
#2564 hopefully fixes this confusion, and fixes the OOB behaviour. PTAL
Which part is this question about
Code base: https://github.com/apache/arrow-rs/blob/master/arrow/src/array/array_dictionary.rs#L498.
Describe your question
In the TypedDictionaryArray value function, why do we check for null? Couldn't we instead assert that the index is less than keys.len()?
Additional context
The trait documentation states the below, which is why I felt it was bizarre that we are checking for null.
Proposal
We can instead rewrite it as below