
better document when we need LargeUtf8 instead of Utf8 #3228

Closed
msalib opened this issue Nov 29, 2022 · 0 comments · Fixed by #3243
Labels
arrow Changes to the arrow crate documentation Improvements or additions to documentation

Comments


msalib commented Nov 29, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When I filed #3123, I was surprised to discover that concatenating many Utf8 elements is supposed to panic when the total size exceeds 2 GB, even though the individual sizes are much smaller. That constraint was really unexpected! It makes sense once you understand the storage model, but I didn't, so it caught me completely off guard.

Describe the solution you'd like

I'm not sure how to surface this knowledge better. When I first skimmed the data type docs, I walked away thinking that LargeUtf8 is for cases where an individual element is large (I wasn't even clear that large meant > 2 GB) and that I should use Utf8 for everything else. But I should've understood the constraint as "use LargeUtf8 everywhere except places where you can guarantee that you'll never have an array with more than 2 GB of text total".

Maybe we just need a big statement in the Physical Memory Layout guide and the DataType doc string explaining that you can never build an array whose total text size exceeds 2 GB if you use Utf8.
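To illustrate the storage model behind the limit: a Utf8 array keeps all values in one contiguous byte buffer plus one i32 offset per value, and the final offset equals the total byte length, so it can never exceed i32::MAX (about 2 GB). LargeUtf8 is the same layout with i64 offsets. This is a simplified sketch of the offset bookkeeping, not the actual arrow-rs buffer code:

```rust
/// Simplified model of Utf8 offset construction (not the real arrow-rs
/// internals): returns the i32 offsets buffer for a list of values, or
/// None once the cumulative byte length overflows i32 -- the situation
/// that makes LargeUtf8 (i64 offsets) necessary.
fn utf8_offsets(values: &[&str]) -> Option<Vec<i32>> {
    let mut offsets = vec![0i32];
    for v in values {
        // Accumulate in i64 so the overflow check itself cannot overflow.
        let next = (*offsets.last().unwrap() as i64) + v.len() as i64;
        offsets.push(i32::try_from(next).ok()?);
    }
    Some(offsets)
}

fn main() {
    // "hi" is 2 bytes, "arrow" is 5, so the offsets are [0, 2, 7].
    assert_eq!(utf8_offsets(&["hi", "arrow"]), Some(vec![0, 2, 7]));
    println!("offsets model ok");
}
```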

Describe alternatives you've considered

This feels like a landmine, and I wish Arrow could transparently convert between these types as needed. Ideally there would just be a single Utf8 type that internally chooses which integer type it uses to manage offsets.

Alternatively, I wish the concat kernel could return a more explicit failure message by explicitly checking for this sort of overflow, something like "I've been asked to concat 2 Utf8 arrays into an array that will be over 2 GB and I cannot do that: these arrays need to be LargeUtf8 instead". After all, when you're doing the concatenation, you can check the lengths ahead of time.
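The ahead-of-time check described above could look something like this. This is a hypothetical helper, not an arrow-rs API; it only sketches the arithmetic a concat kernel could run before building the output array:

```rust
/// Hypothetical pre-flight check (not part of arrow-rs): given the byte
/// lengths of every value that would land in a single Utf8 array, report
/// whether the i32 offsets can represent the concatenated total.
fn utf8_concat_fits(value_byte_lens: &[usize]) -> bool {
    let mut total: i64 = 0;
    for &len in value_byte_lens {
        total += len as i64;
        if total > i32::MAX as i64 {
            // Past this point the kernel could bail out with the explicit
            // "use LargeUtf8 instead" error suggested above.
            return false;
        }
    }
    true
}

fn main() {
    let gib: usize = 1 << 30; // 1 GiB
    // A single 1 GiB value fits within i32 offsets...
    assert!(utf8_concat_fits(&[gib]));
    // ...but two of them total 2^31 bytes, one past i32::MAX, so a Utf8
    // array cannot hold the concatenation even though each value fits.
    assert!(!utf8_concat_fits(&[gib, gib]));
    println!("overflow check ok");
}
```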

@msalib msalib added the enhancement Any new improvement worthy of a entry in the changelog label Nov 29, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Nov 30, 2022
@alamb alamb added arrow Changes to the arrow crate documentation Improvements or additions to documentation and removed enhancement Any new improvement worthy of a entry in the changelog labels Dec 1, 2022