Provide "optimal" encoding #4218

Closed
jun0315 opened this issue May 15, 2023 · 18 comments
Closed

Provide "optimal" encoding #4218

jun0315 opened this issue May 15, 2023 · 18 comments
Labels
development-process (Related to development process of arrow-rs), question (Further information is requested)

Comments

@jun0315 commented May 15, 2023

Parquet supports many encodings. It would be good to provide an "optimal" encoding by default, where the most suitable encoding is selected based on the characteristics of the data rather than left to the user to choose. Currently the default encoding is plain, which is not ideal; requiring users to choose an encoding based on their data's characteristics demands a lot of expertise from them.

Originally posted by @tustvold under "Non-Goals" in another issue:

Provide "optimal" encoding, rather a reasonable out-of-the-box baseline for common use-cases

@jun0315 jun0315 added the enhancement (Any new improvement worthy of an entry in the changelog) label May 15, 2023
@mapleFU (Member) commented May 15, 2023

Do you have a method to decide which encoding is the "optimal" one? It would likely require both sampling the data and some heuristic or other technique. Do you have an idea or formula in mind?

@tustvold (Contributor) commented May 15, 2023

the default encoding is plain

This isn't quite correct: for a V1 writer the default encoding is RLE_DICTIONARY, falling back to PLAIN on exceeding the dictionary page size. There are no other non-deprecated encodings supported by the V1 spec. For a V2 writer the defaults are similar, but byte array types fall back to DELTA_BYTE_ARRAY instead of PLAIN.

Perhaps you could give an example where the encoding is not as you would expect?
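For context, a minimal sketch, assuming the `parquet` crate's `WriterProperties` API, of selecting the writer version that governs these defaults:

```rust
use parquet::file::properties::{WriterProperties, WriterVersion};

fn main() {
    // V1 (the default): RLE_DICTIONARY falling back to PLAIN.
    let v1 = WriterProperties::builder()
        .set_writer_version(WriterVersion::PARQUET_1_0)
        .build();

    // V2: dictionary by default as well, but with delta-based fallbacks
    // for byte array types.
    let v2 = WriterProperties::builder()
        .set_writer_version(WriterVersion::PARQUET_2_0)
        .build();

    println!("{:?} / {:?}", v1.writer_version(), v2.writer_version());
}
```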

@jun0315 (Author) commented May 15, 2023

the default encoding is plain

https://github.com/apache/arrow-rs/blob/master/parquet/src/basic.rs#L222-L230 Sorry, my mistake: I had seen those lines before and thought everything was plain-encoded.

Perhaps you could give an example where the encoding is not as you would expect?

If our data is 1 1 1 1 2 2 2 2 3 3, maybe RLE hybrid encoding is better. In that case, by default, we wouldn't be using RLE encoding, right?

@tustvold (Contributor):

If our data is 1 1 1 1 2 2 2 2 3 3, maybe RLE hybrid encoding is better,

RLE Hybrid is used to encode level data and dictionary indices. The default settings will therefore PLAIN encode 1, 2, 3 to the dictionary page, and then RLE encode 0, 0, 0, 0, 1, 1, 1, 1, 2, 2 to the data page. I think this should be optimal.
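To make the split concrete, a small self-contained sketch (illustrative only, not the arrow-rs internals) that reproduces this dictionary/index decomposition for the example data:

```rust
// How a dictionary encoder splits 1 1 1 1 2 2 2 2 3 3 into a dictionary
// page plus RLE-encodable indices.
fn dictionary_split(values: &[i32]) -> (Vec<i32>, Vec<u32>) {
    let mut dict = Vec::new();
    let mut indices = Vec::new();
    for &v in values {
        let idx = match dict.iter().position(|&d| d == v) {
            Some(i) => i,
            None => {
                dict.push(v);
                dict.len() - 1
            }
        };
        indices.push(idx as u32);
    }
    (dict, indices)
}

fn main() {
    let (dict, indices) = dictionary_split(&[1, 1, 1, 1, 2, 2, 2, 2, 3, 3]);
    assert_eq!(dict, vec![1, 2, 3]); // PLAIN encoded to the dictionary page
    assert_eq!(indices, vec![0, 0, 0, 0, 1, 1, 1, 1, 2, 2]); // RLE encoded
    println!("dict={dict:?} indices={indices:?}");
}
```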

For v2 writers there is a form of delta encoding; however, amusingly, the linked paper says precisely not to do what the parquet specification then goes on to do 😆. This translates into pretty terrible decode performance, and I would not recommend using it for most workloads.

@mapleFU (Member) commented May 15, 2023

@jun0315 RLE hybrid is only used for dictionary indices and level data.

@tustvold By the way, maybe adding FastPFor as an encoding to standard parquet would help? Though it might take a lot of time to implement and prove out, I guess it could give better performance.

@jun0315 (Author) commented May 15, 2023

For v2 writers there is a form of delta encoding,

So in the case of v2 writers, the default encoding chosen is delta instead of plain? Is this choice made internally?

Sorry, my example may not have been a good one. If the data is 100 100 100 100 10000 10000 10000 1000, would RLE be better? In v2, what encoding will be chosen by default?

amusingly the linked paper says precisely not to do what the parquet specification then goes on to do

I am very interested in this paper. Can you tell me its title? I'll go study it :D

@mapleFU (Member) commented May 15, 2023

https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5

This encoding is adapted from the Binary packing described in "Decoding billions of integers per second through vectorization" by D. Lemire and L. Boytsov.

@tustvold (Contributor) commented May 15, 2023

default encoding chosen is delta instead of plain

For v2 the dictionary fallback is https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6 for byte arrays, and PLAIN for everything else.

https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5 is never used by default.

Ultimately, DICTIONARY falling back to PLAIN is very fast and well supported, and its space efficiency is good enough for most workloads; alternatives face a hard task in driving broad ecosystem adoption. You can always do better than parquet, but people use parquet because it is good enough and well supported.
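A minimal sketch, assuming the `parquet` and `arrow` crates, showing that these defaults apply without any per-column configuration:

```rust
use std::{fs::File, sync::Arc};

use arrow::array::{ArrayRef, Int32Array};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let batch = RecordBatch::try_from_iter([(
        "v",
        Arc::new(Int32Array::from(vec![1, 1, 1, 1, 2, 2, 2, 2, 3, 3])) as ArrayRef,
    )])?;
    let file = File::create("defaults.parquet")?;
    // `None` uses the default WriterProperties: V1 writer, dictionary
    // encoding enabled, PLAIN fallback.
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```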

@jun0315 (Author) commented May 15, 2023

So in summary, if we want to write arrow's in-memory data to a parquet file, we generally do not need to specify an encoding; parquet will automatically choose a suitable encoding for us?

@tustvold (Contributor):

Will parquet automatically help us choose a more suitable encoding

Correct, the defaults should be appropriate for most workloads. Some workloads may benefit from tweaking based on empirical data (e.g. smaller row groups), but I would advise against premature optimisation here.
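A minimal sketch, assuming the `parquet` crate, of one such tweak (smaller row groups):

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    let props = WriterProperties::builder()
        // The default is 1024 * 1024 rows per row group; lower it only if
        // measurements on your own data justify it.
        .set_max_row_group_size(64 * 1024)
        .build();
    println!("max_row_group_size = {}", props.max_row_group_size());
}
```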

@jun0315 (Author) commented May 15, 2023

Thanks a lot! @tustvold @mapleFU

@jun0315 jun0315 closed this as completed May 15, 2023
@tustvold tustvold added the question (Further information is requested) and development-process (Related to development process of arrow-rs) labels and removed the enhancement (Any new improvement worthy of an entry in the changelog) label May 15, 2023
@jun0315 (Author) commented May 22, 2023

Hi @tustvold. Previously we used arrow2's plain encoding, but we have now switched to arrow-rs's default encoding. We can see that the written buffers have changed, but all of the changed buffers have become larger. Is this expected?

https://github.com/datafuselabs/databend/actions/runs/5043938824/jobs/9046595560?pr=11473#step:4:243

@jun0315 jun0315 reopened this May 22, 2023
@tustvold (Contributor):

but all the changed buffers have become larger. Is this expected?

Yes, it's a heuristic; there is no guaranteed way to know ahead of time the most efficient way to encode a given block of data. Consider the case of no repeated values: dictionary encoding will be larger. The writer will fall back to PLAIN encoding once the dictionary page is full (1 MB), but for very small columns with low repetition it is highly probable the encoding will be larger.
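If a column is known up front to have low repetition, dictionary encoding can be disabled for just that column. A minimal sketch, assuming the `parquet` crate, with a hypothetical column name:

```rust
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;

fn main() {
    // "unique_ids" is a hypothetical column with few repeated values, so
    // it is better off PLAIN encoded from the start.
    let path = ColumnPath::from("unique_ids");
    let props = WriterProperties::builder()
        .set_column_dictionary_enabled(path.clone(), false)
        .build();
    println!("dictionary for unique_ids: {}", props.dictionary_enabled(&path));
}
```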

@jun0315 (Author) commented May 22, 2023

it's a heuristic

May I ask where the code for this logic is located?

@tustvold (Contributor):

@jun0315 (Author) commented May 22, 2023

If I want to choose an encoding, such as delta, based on data characteristics in the upper-layer application, are there any previous studies that could be used for reference?

@tustvold (Contributor) commented May 22, 2023

I'm not aware of any, but I would be interested should you find such information; we just follow the example of the other parquet writers like parquet-mr. I suspect that if you have a cardinality estimate of the input you can make a fairly good guess as to whether dictionary encoding is valuable. If your application is really sensitive to storage size, you could consider lowering the max dictionary page size so that fallback triggers earlier, or possibly explore the block compression options.
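A hypothetical sketch of that idea (not an existing arrow-rs feature): the application supplies its own distinct/total counts from sampling, and the `tune_column` helper and its threshold below are invented for illustration:

```rust
use parquet::basic::Encoding;
use parquet::file::properties::{WriterProperties, WriterPropertiesBuilder};
use parquet::schema::types::ColumnPath;

/// Hypothetical helper: `distinct` and `total` come from the caller's own
/// sampling of the column; the 10x threshold is arbitrary.
fn tune_column(
    builder: WriterPropertiesBuilder,
    column: &str,
    distinct: usize,
    total: usize,
) -> WriterPropertiesBuilder {
    let path = ColumnPath::from(column);
    if distinct * 10 < total {
        // Highly repetitive: the dictionary default is a good bet.
        builder.set_column_dictionary_enabled(path, true)
    } else {
        // Mostly unique integers: skip the dictionary, try delta instead.
        builder
            .set_column_dictionary_enabled(path.clone(), false)
            .set_column_encoding(path, Encoding::DELTA_BINARY_PACKED)
    }
}

fn main() {
    let props = tune_column(WriterProperties::builder(), "id", 9_900, 10_000).build();
    println!("{:?}", props.encoding(&ColumnPath::from("id")));
}
```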

Alternatively, if you wanted to contribute a PR that would optionally re-encode on fallback, instead of preserving what has already been dictionary encoded, I would be willing to review it.

@jun0315
Copy link
Author

jun0315 commented May 22, 2023

Thank you. If I find some useful information, I will share it.

@jun0315 jun0315 closed this as not planned May 22, 2023