
better document when we need LargeUtf8 instead of Utf8 #3228

Closed
msalib opened this issue Nov 29, 2022 · 0 comments · Fixed by #3243
Labels
arrow Changes to the arrow crate documentation Improvements or additions to documentation

Comments


msalib commented Nov 29, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When I filed #3123, I was surprised to discover that concatenating many Utf8 elements is supposed to panic when the total size exceeds 2 GB, even though the individual sizes are much smaller. That constraint was really unexpected! It makes sense once you understand the storage model, but I didn't, so it caught me completely off guard.

Describe the solution you'd like

I'm not sure how to surface this knowledge better. When I first skimmed the data type docs, I walked away thinking that LargeUtf8 is for cases where an individual element is large (I wasn't even clear that large meant > 2 GB) and that I should use Utf8 for everything else. But I should've understood the constraint as "use LargeUtf8 everywhere except places where you can guarantee that you'll never have an array with more than 2 GB of text total".

Maybe we just need a big statement in the Physical Memory Layout guide and the DataType doc string explaining that you can never build an array whose total text size exceeds 2 GB if you use Utf8.
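To illustrate the storage model behind the limit: a Utf8 array keeps all values in one contiguous byte buffer plus one i32 offset per value, and the final offset equals the total byte length, so it can never exceed i32::MAX (about 2 GB). LargeUtf8 is the same layout with i64 offsets. This is a simplified sketch of the offset bookkeeping, not the actual arrow-rs buffer code:

```rust
/// Simplified model of Utf8 offset construction (not the real arrow-rs
/// internals): returns the i32 offsets buffer for a list of values, or
/// None once the cumulative byte length overflows i32 -- the situation
/// that makes LargeUtf8 (i64 offsets) necessary.
fn utf8_offsets(values: &[&str]) -> Option<Vec<i32>> {
    let mut offsets = vec![0i32];
    for v in values {
        // Accumulate in i64 so the overflow check itself cannot overflow.
        let next = (*offsets.last().unwrap() as i64) + v.len() as i64;
        offsets.push(i32::try_from(next).ok()?);
    }
    Some(offsets)
}

fn main() {
    // "hi" is 2 bytes, "arrow" is 5, so the offsets are [0, 2, 7].
    assert_eq!(utf8_offsets(&["hi", "arrow"]), Some(vec![0, 2, 7]));
    println!("offsets model ok");
}
```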

Describe alternatives you've considered

This feels like a landmine, and I wish Arrow could transparently convert between these types as needed. Ideally there would just be a single Utf8 type that internally chooses which integer type it uses to manage offsets.

Alternatively, I wish the concat kernel could return a more explicit failure message by explicitly checking for this sort of overflow, something like "I've been asked to concat 2 Utf8 arrays into an array that will be over 2 GB and I cannot do that: these arrays need to be LargeUtf8 instead". After all, when you're doing the concatenation, you can check the lengths ahead of time.
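The ahead-of-time check described above could look something like this. This is a hypothetical helper, not an arrow-rs API; it only sketches the arithmetic a concat kernel could run before building the output array:

```rust
/// Hypothetical pre-flight check (not part of arrow-rs): given the byte
/// lengths of every value that would land in a single Utf8 array, report
/// whether the i32 offsets can represent the concatenated total.
fn utf8_concat_fits(value_byte_lens: &[usize]) -> bool {
    let mut total: i64 = 0;
    for &len in value_byte_lens {
        total += len as i64;
        if total > i32::MAX as i64 {
            // Past this point the kernel could bail out with the explicit
            // "use LargeUtf8 instead" error suggested above.
            return false;
        }
    }
    true
}

fn main() {
    let gib: usize = 1 << 30; // 1 GiB
    // A single 1 GiB value fits within i32 offsets...
    assert!(utf8_concat_fits(&[gib]));
    // ...but two of them total 2^31 bytes, one past i32::MAX, so a Utf8
    // array cannot hold the concatenation even though each value fits.
    assert!(!utf8_concat_fits(&[gib, gib]));
    println!("overflow check ok");
}
```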

@msalib msalib added the enhancement Any new improvement worthy of a entry in the changelog label Nov 29, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Nov 30, 2022
@alamb alamb added arrow Changes to the arrow crate documentation Improvements or additions to documentation and removed enhancement Any new improvement worthy of a entry in the changelog labels Dec 1, 2022