Provide "optimal" encoding #4218

Closed
jun0315 opened this issue May 15, 2023 · 18 comments
Closed

Provide "optimal" encoding #4218

jun0315 opened this issue May 15, 2023 · 18 comments
Labels
development-process (Related to development process of arrow-rs), question (Further information is requested)

Comments

@jun0315 commented May 15, 2023

Parquet supports many encodings. It would be good to provide an "optimal" encoding by default, where the most suitable encoding is selected based on the characteristics of the data rather than left to the user to choose. Currently the default encoding is plain, which is not ideal; requiring users to choose an encoding based on their data's characteristics demands a lot of expertise from them.

Originally posted by @tustvold under "Non-Goals" in another issue:

Provide "optimal" encoding, rather a reasonable out-of-the-box baseline for common use-cases

@jun0315 jun0315 added the enhancement (Any new improvement worthy of an entry in the changelog) label May 15, 2023
@mapleFU (Member) commented May 15, 2023

Do you have a method to decide which encoding is the "optimal" one? It would likely require both sampling the data and some heuristic or other technique. Do you have an idea or formula in mind?

@tustvold (Contributor) commented May 15, 2023

the default encoding is plain

This isn't quite correct: for a V1 writer the default encoding is RLE_DICTIONARY, falling back to PLAIN on exceeding the dictionary page size. There are no other non-deprecated encodings supported by the V1 spec. For a V2 writer the defaults are similar, but byte array types fall back to DELTA_BYTE_ARRAY instead of PLAIN.

Perhaps you could give an example where the encoding is not as you would expect?
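For context, a minimal sketch, assuming the `parquet` crate's `WriterProperties` API, of selecting the writer version that governs these defaults:

```rust
use parquet::file::properties::{WriterProperties, WriterVersion};

fn main() {
    // V1 (the default): RLE_DICTIONARY falling back to PLAIN.
    let v1 = WriterProperties::builder()
        .set_writer_version(WriterVersion::PARQUET_1_0)
        .build();

    // V2: dictionary by default as well, but with delta-based fallbacks
    // for byte array types.
    let v2 = WriterProperties::builder()
        .set_writer_version(WriterVersion::PARQUET_2_0)
        .build();

    println!("{:?} / {:?}", v1.writer_version(), v2.writer_version());
}
```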

@jun0315 (Author) commented May 15, 2023

the default encoding is plain

https://github.com/apache/arrow-rs/blob/master/parquet/src/basic.rs#L222-L230 Sorry, my mistake: I had seen those lines before and thought everything was plain-encoded.

Perhaps you could give an example where the encoding is not as you would expect?

If our data is 1 1 1 1 2 2 2 2 3 3, maybe RLE hybrid encoding is better. In that case, by default, we wouldn't be using RLE encoding, right?

@tustvold (Contributor):

If our data is 1 1 1 1 2 2 2 2 3 3, maybe RLE hybrid encoding is better,

RLE Hybrid is used to encode level data and dictionary indices. The default settings will therefore PLAIN encode 1, 2, 3 to the dictionary page, and then RLE encode 0, 0, 0, 0, 1, 1, 1, 1, 2, 2 to the data page. I think this should be optimal.
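To make the split concrete, a small self-contained sketch (illustrative only, not the arrow-rs internals) that reproduces this dictionary/index decomposition for the example data:

```rust
// How a dictionary encoder splits 1 1 1 1 2 2 2 2 3 3 into a dictionary
// page plus RLE-encodable indices.
fn dictionary_split(values: &[i32]) -> (Vec<i32>, Vec<u32>) {
    let mut dict = Vec::new();
    let mut indices = Vec::new();
    for &v in values {
        let idx = match dict.iter().position(|&d| d == v) {
            Some(i) => i,
            None => {
                dict.push(v);
                dict.len() - 1
            }
        };
        indices.push(idx as u32);
    }
    (dict, indices)
}

fn main() {
    let (dict, indices) = dictionary_split(&[1, 1, 1, 1, 2, 2, 2, 2, 3, 3]);
    assert_eq!(dict, vec![1, 2, 3]); // PLAIN encoded to the dictionary page
    assert_eq!(indices, vec![0, 0, 0, 0, 1, 1, 1, 1, 2, 2]); // RLE encoded
    println!("dict={dict:?} indices={indices:?}");
}
```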

For v2 writers there is a form of delta encoding; however, amusingly, the linked paper says precisely not to do what the parquet specification then goes on to do 😆. This translates into pretty terrible decode performance, and I would not recommend using it for most workloads.

@mapleFU (Member) commented May 15, 2023

@jun0315 RLE hybrid is only used for dictionary indices and level data.

@tustvold By the way, maybe adding FastPFor as an encoding to standard parquet would help? Though it might take a lot of time to implement and prove out, I guess it could give better performance.

@jun0315 (Author) commented May 15, 2023

For v2 writers there is a form of delta encoding,

So in the case of v2 writers, the default encoding chosen is delta instead of plain? Is this choice made internally?

Sorry, my example may not have been a good one. If the data is 100 100 100 100 10000 10000 10000 1000, would RLE be better? In v2, what encoding will be chosen by default?

amusingly the linked paper says precisely not to do what the parquet specification then goes on to do

I am very interested in this paper. Can you tell me its title? I'll go study it :D

@mapleFU (Member) commented May 15, 2023

https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5

This encoding is adapted from the Binary packing described in "Decoding billions of integers per second through vectorization" by D. Lemire and L. Boytsov.

@tustvold (Contributor) commented May 15, 2023

default encoding chosen is delta instead of plain

For v2 the dictionary fallback is https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6 for byte arrays, and PLAIN for everything else.

https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5 is never used by default.

Ultimately, DICTIONARY falling back to PLAIN is very fast and well supported, and its space efficiency is good enough for most workloads; alternatives face a hard task in driving broad ecosystem adoption. You can always do better than parquet, but people use parquet because it is good enough and well supported.
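A minimal sketch, assuming the `parquet` and `arrow` crates, showing that these defaults apply without any per-column configuration:

```rust
use std::{fs::File, sync::Arc};

use arrow::array::{ArrayRef, Int32Array};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let batch = RecordBatch::try_from_iter([(
        "v",
        Arc::new(Int32Array::from(vec![1, 1, 1, 1, 2, 2, 2, 2, 3, 3])) as ArrayRef,
    )])?;
    let file = File::create("defaults.parquet")?;
    // `None` uses the default WriterProperties: V1 writer, dictionary
    // encoding enabled, PLAIN fallback.
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```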

@jun0315 (Author) commented May 15, 2023

So in summary, if we want to write arrow's in-memory data to a parquet file, we generally do not need to specify an encoding; parquet will automatically choose a suitable encoding for us?

@tustvold (Contributor):

Will parquet automatically help us choose a more suitable encoding

Correct, the defaults should be appropriate for most workloads. Some workloads may benefit from tweaking based on empirical data (e.g. smaller row groups), but I would advise against premature optimisation here.
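A minimal sketch, assuming the `parquet` crate, of one such tweak (smaller row groups):

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    let props = WriterProperties::builder()
        // The default is 1024 * 1024 rows per row group; lower it only if
        // measurements on your own data justify it.
        .set_max_row_group_size(64 * 1024)
        .build();
    println!("max_row_group_size = {}", props.max_row_group_size());
}
```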

@jun0315 (Author) commented May 15, 2023

Thanks a lot! @tustvold @mapleFU

@jun0315 jun0315 closed this as completed May 15, 2023
@tustvold tustvold added the question (Further information is requested) and development-process (Related to development process of arrow-rs) labels and removed the enhancement (Any new improvement worthy of an entry in the changelog) label May 15, 2023
@jun0315 (Author) commented May 22, 2023

Hi @tustvold. Previously we used arrow2's plain encoding, but we have now switched to arrow-rs's default encoding. We can see that the written buffers have changed, but all of the changed buffers have become larger. Is this expected?

https://github.com/datafuselabs/databend/actions/runs/5043938824/jobs/9046595560?pr=11473#step:4:243

@jun0315 jun0315 reopened this May 22, 2023
@tustvold (Contributor):

but all the changed buffers have become larger. Is this expected?

Yes, it's a heuristic; there is no guaranteed way to know ahead of time the most efficient way to encode a given block of data. Consider the case of no repeated values: dictionary encoding will be larger. The writer will fall back to PLAIN encoding once the dictionary page is full (1 MB), but for very small columns with low repetition it is highly probable the encoding will be larger.
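If a column is known up front to have low repetition, dictionary encoding can be disabled for just that column. A minimal sketch, assuming the `parquet` crate, with a hypothetical column name:

```rust
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;

fn main() {
    // "unique_ids" is a hypothetical column with few repeated values, so
    // it is better off PLAIN encoded from the start.
    let path = ColumnPath::from("unique_ids");
    let props = WriterProperties::builder()
        .set_column_dictionary_enabled(path.clone(), false)
        .build();
    println!("dictionary for unique_ids: {}", props.dictionary_enabled(&path));
}
```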

@jun0315 (Author) commented May 22, 2023

it's a heuristic

May I ask where the code for this logic is located?

@tustvold (Contributor):

@jun0315 (Author) commented May 22, 2023

If I want to choose an encoding, such as delta, based on data characteristics in the upper-layer application, are there any previous studies that could be used for reference?

@tustvold (Contributor) commented May 22, 2023

I'm not aware of any, but I would be interested should you find such information; we just follow the example of the other parquet writers like parquet-mr. I suspect that if you have a cardinality estimate of the input you can make a fairly good guess as to whether dictionary encoding is valuable. If your application is really sensitive to storage size, you could consider lowering the max dictionary page size so that fallback triggers earlier, or possibly explore the block compression options.
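A hypothetical sketch of that idea (not an existing arrow-rs feature): the application supplies its own distinct/total counts from sampling, and the `tune_column` helper and its threshold below are invented for illustration:

```rust
use parquet::basic::Encoding;
use parquet::file::properties::{WriterProperties, WriterPropertiesBuilder};
use parquet::schema::types::ColumnPath;

/// Hypothetical helper: `distinct` and `total` come from the caller's own
/// sampling of the column; the 10x threshold is arbitrary.
fn tune_column(
    builder: WriterPropertiesBuilder,
    column: &str,
    distinct: usize,
    total: usize,
) -> WriterPropertiesBuilder {
    let path = ColumnPath::from(column);
    if distinct * 10 < total {
        // Highly repetitive: the dictionary default is a good bet.
        builder.set_column_dictionary_enabled(path, true)
    } else {
        // Mostly unique integers: skip the dictionary, try delta instead.
        builder
            .set_column_dictionary_enabled(path.clone(), false)
            .set_column_encoding(path, Encoding::DELTA_BINARY_PACKED)
    }
}

fn main() {
    let props = tune_column(WriterProperties::builder(), "id", 9_900, 10_000).build();
    println!("{:?}", props.encoding(&ColumnPath::from("id")));
}
```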

Alternatively, if you wanted to contribute a PR that would optionally re-encode on fallback, instead of preserving what has already been dictionary encoded, I would be willing to review it.

@jun0315
Copy link
Author

jun0315 commented May 22, 2023

Thank you. If I find some useful information, I will share it.

@jun0315 jun0315 closed this as not planned May 22, 2023