Implement Run Length Encoding (RLE) / Run End Encoding (REE) support (Epic) #3520

alamb · 2023-01-12T21:52:46Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Arrow has added REE support apache/arrow#14176, similar to dictionary arrays that allow repeated values to be encoded in a space efficient manner that also allows fast processing.

Describe the solution you'd like
Implement REE in arrow-rs. Some likely candidate:

Describe alternatives you've considered

Additional context

tustvold · 2023-01-13T10:20:13Z

One thing to perhaps give thought to is what kernels we need to facilitate this use-case. For example, a core pattern for dictionaries is to evaluate against the child values array and then take the results based on the keys. This avoids generating code for each combination of key and value type. We possibly need something similar for the REE use-case, i.e. some sort of take_runs kernel.

askoa · 2023-01-15T18:13:25Z

I would like to start this by writing REE Array.

alippai · 2023-01-17T08:24:54Z

Would be adding parquet support in the scope? I’m not sure if parquet has RLE directly, but highly compressed (zstd,snappy,zlib) values can be converted to REE arrays instead of inflating.

tustvold · 2023-01-17T09:57:58Z

Dictionaries are RLE encoded in parquet, so we could theoretically preserve RLE encoding for dictionary arrays, on top of the existing logic to preserve the dictionary encoding. However, the cost and complexity of translating between the two representations is likely to negate much of the potential gains.

zlib

Do you have a link for this, I'm not sure how you would reliably use the codec's internal RLE coding as it isn't guaranteed to be at a meaningful granularity for the encoded data?

alippai · 2023-01-18T03:29:29Z

@tustvold I agree that the reading performance might not improve but the downstream operations like filter/join/group by (if optimized for REE) would definitely make it worth. Absolutely not a day 0 feature.

The zlib/zstd idea is far fetched, but:
1.) zlib sometimes decides to store data as Z_RLE, might worth checking if this happens with Parquet too: https://optipng.sourceforge.net/pngtech/z_rle.html
2.) I was wondering if LZ77 -> REE is possible without fully decompressing the data, something like posted here (for a different algorithm): https://www.researchgate.net/publication/261072335_From_Run_Length_Encoding_to_LZ78_and_Back_Again

Both of these are super theoretical and needs access to the compression building blocks, the low level api (as you don’t want to ask the lib to decompress the data, but you want to go compressed -> REE to save big)

askoa · 2023-01-20T09:37:30Z

I am starting to write Iterator / ArrayAccessor for RunEndEncodedArray. I could see two scenarios for accessing the values

The caller know the physical index and just wants value in that physical index.
- This method is useful when implementing iterators. The iterator can decide if the physical index has to be increased in O(1) time. After the decision is made, the iterator needs to access the value in the physical index.
The caller does not know the physical index and trying to access a value in logical index. A binary search has to be implemented to determine physical index based on logical index. This will take O(log N) time.
- This method is useful when the caller wants to access only few random indices (may be in take kernels?)

My questions are below

is it okay to implement two ArrayAccessor for RunEndEncodedArray. One based on logical index and one based on physical index?
ArrayIter cannot be used as is to implement an Iterator. REE iterator needs to store two values and need a different increment logic. Is it okay to implement RunEndEncodedArrayIter and not use ArrayIter?

cc: @tustvold @viirya @alamb

tustvold · 2023-01-20T09:49:52Z

I'm not sure you can implement ArrayAccessor for REEArray as it doesn't know its value type? This is fine imo, we don't implement ArrayAccessor for DictionaryArray for much the same reason.

Providing an iterator abstraction, similar to TypedDictionaryArray, that downcasts the values and uses ArrayAccessor to "decode" the runs makes sense to me to help ergonomics

However, most kernels I imagine will need custom logic to handle RunEncodedArrays efficiently, e.g. take will need to parse the runs array and compute a new set of runs along with the take indices to apply to the values array. Filter will need to do something similar.

One thing to be extremely careful of, is to avoid generic code typed on both the key type and the value type in our kernels - this explodes codegen and has caused a lot of pain with dictionaries

askoa · 2023-01-20T11:21:37Z

Providing an iterator abstraction, similar to TypedDictionaryArray, that downcasts the values and uses ArrayAccessor to "decode" the runs makes sense to me to help ergonomics

Yes, I am planning to model ArrayAccessor after TypedDictionaryArray.

However, most kernels I imagine will need custom logic to handle RunEncodedArrays efficiently, e.g. take will need to parse the runs array and compute a new set of runs along with the take indices to apply to the values array. Filter will need to do something similar.

Does it make sense to add a function to RunEndEncodedArray that'll convert from logical index to physical index? I think it can be used by functions that are trying to decode arbitrary indices.

One thing to be extremely careful of, is to avoid generic code typed on both the key type and the value type in our kernels - this explodes codegen and has caused a lot of pain with dictionaries

I am planning to model after TypedDictionaryArray which has generics for both key and value. Would that be a problem?

tustvold · 2023-01-20T11:57:39Z

Would that be a problem?

Yes, wherever possible we should handle the indices and values separately, similarly to how we currently handle dictionaries. Otherwise you catastrophically explode code gen and build times, see #2596 and apache/datafusion#4999 for some more context. Non-scalar binary kernels are the only ones where I foresee this not being possible, we should gate those behind feature flags like we do for dictionaries.

TypedDictionaryArray is useful for statically typed codepaths, but using it in kernels that must be generated for every combination of types ends up being problematic

fabianmurariu · 2024-03-30T13:55:10Z

Hi, I'm working on raphtory trying to get a query engine off the ground with datafusion. One of the key ingredients would be REE array support, because hopping around graphs can be expressed efficiently with REE.

Can I start looking into adding support for filtering or is there ongoing work that's required first?

alamb · 2024-03-30T20:40:55Z

Hi, I'm working on raphtory trying to get a query engine off the ground with datafusion. One of the key ingredients would be REE array support, because hopping around graphs can be expressed efficiently with REE.

That sounds like an awesome project 🙏

Can I start looking into adding support for filtering or is there ongoing work that's required first?

That would be great.

Some key features for arrays are (I am not sure offhand what REE has already, you would have to to some research)

Cast support (to convert back/forth to other types
Filter support (as you mentioned)
IPC support (for arrow flight, for example)

Something else that might be worth looking at is the ability to read REE arrays directly from parquet (and avoid decoding) if you are reading for arrow

You might take a look at the list of items we made for StringView to get some inspiration: #5374

alamb added the enhancement Any new improvement worthy of a entry in the changelog label Jan 12, 2023

viirya changed the title ~~Implement Run Length Encoding (RLE) / Rule End Encoding (REE) support (Epic)~~ Implement Run Length Encoding (RLE) / Run End Encoding (REE) support (Epic) Jan 16, 2023

viirya mentioned this issue Jan 16, 2023

Add Run End Encoding DataType #3534

Closed

askoa mentioned this issue Jan 18, 2023

feat: Add RunEndEncodedArray #3553

Merged

This was referenced Jan 22, 2023

Add ArrayAccessor, Iterator, Extend and benches for string run builder #3583

Closed

Add ArrayAccessor, Iterator, Extend and benchmarks for RunArray #3603

Merged

feat: take kernel for RunArray #3622

Merged

askoa mentioned this issue Feb 4, 2023

feat + fix: IPC support for run encoded array. #3662

Merged

This was referenced Feb 11, 2023

WIP: feat: Sort kernel for RunArray #3694

Closed

feat: Sort kernel for RunArray #3695

Merged

perf: take_run improvements #3705

Merged

talawahtech mentioned this issue Feb 21, 2023

Feature: Add support for Arrow's new run-end encoding jorgecarleitao/arrow2#1413

Open

askoa mentioned this issue Feb 22, 2023

Implement logical comparison for run encoded array #3747

Open

wjones127 mentioned this issue Apr 28, 2023

Support arbitrary user defined partition column in ListingTable (rather than assuming they are always Dictionary encoded) apache/datafusion#5545

Merged

spencerwilson mentioned this issue Feb 28, 2024

Support Arrow columnar format v1.4 #5326

Closed

3 tasks

fabianmurariu mentioned this issue Mar 30, 2024

filter for run end array #5573

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement Run Length Encoding (RLE) / Run End Encoding (REE) support (Epic) #3520

Implement Run Length Encoding (RLE) / Run End Encoding (REE) support (Epic) #3520

alamb commented Jan 12, 2023 •

edited

tustvold commented Jan 13, 2023 •

edited

askoa commented Jan 15, 2023

alippai commented Jan 17, 2023

tustvold commented Jan 17, 2023 •

edited

alippai commented Jan 18, 2023

askoa commented Jan 20, 2023

tustvold commented Jan 20, 2023 •

edited

askoa commented Jan 20, 2023

tustvold commented Jan 20, 2023 •

edited

fabianmurariu commented Mar 30, 2024

alamb commented Mar 30, 2024

Implement Run Length Encoding (RLE) / Run End Encoding (REE) support (Epic) #3520

Implement Run Length Encoding (RLE) / Run End Encoding (REE) support (Epic) #3520

Comments

alamb commented Jan 12, 2023 • edited

tustvold commented Jan 13, 2023 • edited

askoa commented Jan 15, 2023

alippai commented Jan 17, 2023

tustvold commented Jan 17, 2023 • edited

alippai commented Jan 18, 2023

askoa commented Jan 20, 2023

tustvold commented Jan 20, 2023 • edited

askoa commented Jan 20, 2023

tustvold commented Jan 20, 2023 • edited

fabianmurariu commented Mar 30, 2024

alamb commented Mar 30, 2024

alamb commented Jan 12, 2023 •

edited

tustvold commented Jan 13, 2023 •

edited

tustvold commented Jan 17, 2023 •

edited

tustvold commented Jan 20, 2023 •

edited

tustvold commented Jan 20, 2023 •

edited