Arrow Row Format #2677

tustvold · 2022-09-07T14:46:22Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I think this crate has pretty good stories for operating on individual columns, either by downcasting to a concrete type, or invoking a dyn kernel.

The stories for multi-column operations are substantially weaker, with patchy support for common multi-column operations such as sorts, groupings, aggregations, reassembly, etc... We have some pieces such as MutableArrayData, DynComparator, but they're not especially performant, making extensive use of dynamic dispatch at the row-level, nor easy to use.

Describe the solution you'd like

Having a first-class row representation will not only allow us to implement more performant versions of existing kernels such as lexsort, but also provide a pretty compelling primitive to downstreams with which to implement more advanced operations such as streaming merges, joins, aggregates, etc... There is also precedent, with the C++ arrow library providing its own row format.

Goals

Each row should be encoded as a single sequence of bytes
Comparison of the byte arrays should be sufficient to establish ordering of the rows
It should be possible to convert a selection of rows back to arrays

Non-Goals

Support introspection or mutation of the row values
Provide a stable encoding for FFI, IO, etc...
Provide "optimal" encoding, rather a reasonable out-of-the-box baseline for common use-cases

Describe alternatives you've considered

We could extend the row format in DataFusion, however, this would limit its benefits to DataFusion. I think a row-oriented representation is such a fundamental primitive that it makes sense for inclusion in arrow-rs, so that it can be both used in its kernels and by downstreams that don't make use of DataFusion.

Additional context

The text was updated successfully, but these errors were encountered:

v1gnesh · 2022-09-11T02:41:01Z

Hello,

Would this help with processing row/record-based data, where the number of columns (that is, the structure of the row/record) vary depending on the type of row/record in the file. Example: An ndjson file where rows can be of n different variants.

tustvold · 2022-09-11T07:07:30Z

This will still not allow operations on data with multiple schema, same as with RecordBatch, if that is what you're asking for?

That being said, in the case of rows with different variants, nulls will be inserted by the JSON reader for the columns found in other variants but not present in the current record. The data is effectively unified to a single schema.

This trades off memory efficiency for the ability to efficiently process data in a columnar fashion, without per-value dynamic dispatch.

This is likely acceptable, however, partitioning the data so that different schema aren't interleaved will lead to better performance and memory efficiency. The row format could help with implementing this, but does not alter the nature of arrow schema

v1gnesh · 2022-09-11T12:28:03Z

Thanks for clarifying. As you say, I'll have to make a separate stream for each row-type.

* Add string_dictionary benches for row format (#2677) * Fix copy-pasta

* Convert rows to arrays (#2677) * Review feedback * Clippy

tustvold added the enhancement Any new improvement worthy of a entry in the changelog label Sep 7, 2022

This was referenced Sep 7, 2022

Comparable Row Format #2593

Merged

DynComparator/LexicographicalComparator Incorrectly Handles Dictionary Nulls #2687

Closed

tustvold mentioned this issue Sep 30, 2022

Simplify FixedLengthEncoding #2812

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 3, 2022

Add OrderPreservingInterner::lookup (apache#2677)

8469cfb

tustvold mentioned this issue Oct 3, 2022

Add OrderPreservingInterner::lookup (#2677) #2815

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 3, 2022

Use fixed size slots in OrderPreservingInterner (apache#2677)

598294a

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 3, 2022

Add string_dictionary benches for row format (apache#2677)

8046e07

tustvold mentioned this issue Oct 3, 2022

Add string_dictionary benches for row format (#2677) #2816

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 3, 2022

Add string_dictionary benches for row format (apache#2677)

63360fa

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 3, 2022

Use fixed size slots in OrderPreservingInterner (apache#2677)

dda381c

tustvold added a commit that referenced this issue Oct 3, 2022

Add OrderPreservingInterner::lookup (#2677) (#2815)

70054fd

tustvold added a commit that referenced this issue Oct 3, 2022

Add string_dictionary benches for row format (#2677) (#2816)

76da624

* Add string_dictionary benches for row format (#2677) * Fix copy-pasta

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 4, 2022

Convert rows to arrays (apache#2677)

6866690

tustvold mentioned this issue Oct 4, 2022

Convert rows to arrays (#2677) #2826

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 5, 2022

Simplify OrderPreservingInterner allocation strategy (apache#2677)

4c4b993

tustvold mentioned this issue Oct 5, 2022

Simplify OrderPreservingInterner allocation strategy ~97% faster (#2677) #2827

Merged

tustvold closed this as completed in #2826 Oct 6, 2022

tustvold added a commit that referenced this issue Oct 6, 2022

Convert rows to arrays (#2677) (#2826)

f8c4037

* Convert rows to arrays (#2677) * Review feedback * Clippy

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 6, 2022

Simplify OrderPreservingInterner allocation strategy (apache#2677)

210a35d

tustvold added a commit that referenced this issue Oct 13, 2022

Simplify OrderPreservingInterner allocation strategy (#2677) (#2827)

65d5576

alamb added the arrow Changes to the arrow crate label Oct 14, 2022

tustvold mentioned this issue Oct 17, 2022

Improve row format docs #2888

Merged

jun0315 mentioned this issue May 15, 2023

Provide "optimal" encoding #4218

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Arrow Row Format #2677

Arrow Row Format #2677

tustvold commented Sep 7, 2022

v1gnesh commented Sep 11, 2022

tustvold commented Sep 11, 2022 •

edited

v1gnesh commented Sep 11, 2022

Arrow Row Format #2677

Arrow Row Format #2677

Comments

tustvold commented Sep 7, 2022

v1gnesh commented Sep 11, 2022

tustvold commented Sep 11, 2022 • edited

v1gnesh commented Sep 11, 2022

tustvold commented Sep 11, 2022 •

edited