ARROW-15244: [Format] Clarify that offsets are monotonic for binary like arrays

# Rationale
The question of "what are the values of the offsets for non-valid entries in arrays" came up in arrow-rs: apache/arrow-rs#1071 and the existing [docs](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) seem to be somewhat vague on this issue.

I looked at three implementations of Arrow, and they all seem to assume / validate that the offsets are monotonic:
* The C++ implementation (I think) also ensures the offsets are monotonic without first checking the validity array: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/validate.cc#L568-L592
* arrow-rs, after apache/arrow-rs#921 (based on the C++), will refuse to create arrays whose offsets are non-monotonic
* arrow2 also ensures that offsets are always monotonic.
https://github.com/jorgecarleitao/arrow2/blob/37a9c758826a92d98dc91e992b2a49ce9724095d/src/array/specification.rs#L102-L119
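
For illustration, the kind of check these implementations perform can be sketched roughly as follows. This is a minimal Python sketch and not code from any of the three libraries; `validate_offsets` and its parameters are hypothetical names. The point is that the validity bitmap is not an input at all, so null slots are checked exactly like valid ones.

```python
from typing import Sequence


def validate_offsets(offsets: Sequence[int], values_len: int) -> None:
    """Raise ValueError unless offsets are non-negative, non-decreasing, in bounds.

    Hypothetical helper: the validity bitmap is deliberately not a parameter,
    so null slots get no special treatment.
    """
    if len(offsets) == 0:
        # Assumption of this sketch: an array of `length` slots always
        # carries `length + 1` offsets, so at least one entry is expected.
        raise ValueError("offsets buffer must hold at least one entry")
    if offsets[0] < 0:
        raise ValueError("first offset is negative")
    for j in range(len(offsets) - 1):
        if offsets[j + 1] < offsets[j]:
            raise ValueError(f"offsets decrease at slot {j}")
    if offsets[-1] > values_len:
        raise ValueError("last offset exceeds the values buffer length")


# ["ab", null, "cd"]: the null slot is empty, which is allowed.
validate_offsets([0, 2, 2, 4], values_len=4)
```

With this check, `[0, 3, 2, 5]` is rejected even if slot 1 or slot 2 is marked null.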

# Changes
Thus I propose updating the format docs to make the monotonic offsets explicit.

# Background
I think @jorgecarleitao's description in apache/arrow-rs#1071 (comment) explains why having monotonic offsets is a good idea:

> I think that in general the property we seek is: discarding the validity cannot result in UB when accessing the values. This justifies the values buffer of a primitive array is always initialized, and the offsets being valid and in-bounds even in null cases.
>
> The rational for this is that sometimes it is faster to skip validity accesses and only iterate over the values (and clone the validity). I do not recall the benchmark result, but this may explain why string comparison ignores validity and & the bitmaps instead.
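
As a rough illustration of the pattern described in the quote (not the actual arrow-rs or arrow2 kernel), an equality comparison could read every slot without looking at validity and combine the validity separately. All names below are hypothetical, and plain lists of booleans stand in for validity bitmaps. Python slicing is forgiving, so the sketch cannot itself crash; in C++ or Rust the same slice with a decreasing or out-of-bounds offset would be an invalid access, which is why the offsets must be well-formed even for null slots.

```python
from typing import List, Tuple


def slot(offsets: List[int], data: bytes, i: int) -> bytes:
    # Reads slot i without consulting validity. This is only well defined
    # because offsets are assumed non-decreasing and in bounds for all slots.
    return data[offsets[i]:offsets[i + 1]]


def eq_ignoring_validity(
    a_off: List[int], a_data: bytes, a_valid: List[bool],
    b_off: List[int], b_data: bytes, b_valid: List[bool],
) -> Tuple[List[bool], List[bool]]:
    n = len(a_valid)
    # Compare every slot, null or not; results in null slots are meaningless.
    values = [slot(a_off, a_data, i) == slot(b_off, b_data, i) for i in range(n)]
    # The result is valid only where both inputs are valid ("& the bitmaps").
    validity = [x and y for x, y in zip(a_valid, b_valid)]
    return values, validity


# a = ["ab", null, "cd"], b = ["ab", "x", "ce"]
values, validity = eq_ignoring_validity(
    [0, 2, 2, 4], b"abcd", [True, False, True],
    [0, 2, 3, 5], b"abxce", [True, True, True],
)
```

Here `validity` marks slot 1 as null, so whatever `values[1]` holds is never observed by a consumer.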

Closes #12019 from alamb/alamb/clarify_offsets

Lead-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
alamb and pitrou committed Jan 4, 2022
1 parent 31a07be commit e7dc8f5
Showing 1 changed file with 5 additions and 1 deletion.
6 changes: 5 additions & 1 deletion docs/source/format/Columnar.rst
@@ -309,7 +309,11 @@ That is, a null value may occupy a **non-empty** memory space in the data
 buffer. When this is true, the content of the corresponding memory space
 is undefined.
 
-Generally the first value in the offsets array is 0, and the last slot
+Offsets must be monotonically increasing, that is ``offsets[j+1] >= offsets[j]``
+for ``0 <= j < length``, even for null slots. This property ensures the
+location for all values is valid and well defined.
+
+Generally the first slot in the offsets array is 0, and the last slot
 is the length of the values array. When serializing this layout, we
 recommend normalizing the offsets to start at 0.
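
To make the normalization recommendation concrete, here is a small sketch under stated assumptions: `normalize_offsets` is a hypothetical helper, not an Arrow API, and the input is assumed to already satisfy the monotonic offsets rule. It rebases the offsets so the first one is 0 and trims the values to the range the offsets actually reference.

```python
from typing import List, Tuple


def normalize_offsets(offsets: List[int], data: bytes) -> Tuple[List[int], bytes]:
    """Rebase offsets to start at 0 and keep only the referenced bytes.

    Assumes offsets are non-decreasing and within the bounds of `data`.
    """
    start, end = offsets[0], offsets[-1]
    return [o - start for o in offsets], data[start:end]


# ["hi", null, "bye"] stored as data b"hibye" with offsets [0, 2, 2, 5].
# Slicing off the first slot leaves offsets [2, 2, 5] over the same data.
new_offsets, new_data = normalize_offsets([2, 2, 5], b"hibye")
```

After normalization the first offset is 0 and the last offset equals the length of the trimmed values, matching the "generally" case the text describes.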
