CompressionCodec LZ4 incompatible with C++ implementation #2988
Comments
My only question with option 2 is whether there is a chance of silent failures, where data successfully decodes with the wrong codec and yields gibberish? Otherwise option 2 sounds like a good approach, thank you for the detailed investigation.
Short answer: yes, it is probably possible for data to silently decode with the wrong codec, but it is unlikely. So another option would be to only allow `Lz4HadoopRawCodec` and fail on anything else.

Long answer: after this change we should only generate `Lz4HadoopRawCodec` files, so the files, arranged by how likely this library is to encounter them, will be:

1. `Lz4HadoopRawCodec`: most of the time we expect the files to be of this type. We can find an implementation in Rust in the pola-rs crate, in its compression.rs file. For this algorithm we can easily detect an error when decoding, because it is unlikely that the input and output lengths declared in the block headers match for every block if the data is actually in another format (a framing check along these lines is sketched below).
2. `Lz4Codec` (Frame): the second in likelihood, produced by older versions of this library. For data that is not in this format to be decoded by this codec, it would have to look like a valid LZ4 frame (for example, start with the frame magic number), which is unlikely. So now we fall to the last in probability.
3. `Lz4RawCodec`: even though it is not likely that foreign data decompresses with this algorithm, it is the one with the most relaxed rules. However, we know the uncompressed size, so the output should fit that exact length, which is an additional integrity check. NOTE: Parquet C++ has this fallback to `Lz4RawCodec`.

This comment has the same concern you have.
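To make the detection argument in point 1 concrete, here is a minimal sketch (not the arrow-rs implementation) of a framing check for the Hadoop LZ4 layout: each block carries a 4-byte big-endian uncompressed length and a 4-byte big-endian compressed length, so data in another format is very unlikely to satisfy all the length equations. The function names and error handling are assumptions made for the example.

```rust
/// Read a 4-byte big-endian length from `buf` at `at` (bounds checked by the caller).
fn read_u32_be(buf: &[u8], at: usize) -> usize {
    u32::from_be_bytes([buf[at], buf[at + 1], buf[at + 2], buf[at + 3]]) as usize
}

/// Sketch: is `input` plausibly in Hadoop LZ4 block format?
/// The layout is a sequence of [u32 BE uncompressed_len][u32 BE compressed_len][block];
/// all declared lengths must add up to sizes we already know from the page header,
/// which random data in another format is very unlikely to satisfy.
fn looks_like_hadoop_lz4(input: &[u8], expected_uncompressed_len: usize) -> bool {
    let mut offset = 0;
    let mut total_uncompressed = 0usize;

    while offset < input.len() {
        // Each block starts with two 4-byte big-endian length prefixes.
        if input.len() - offset < 8 {
            return false;
        }
        let uncompressed_len = read_u32_be(input, offset);
        let compressed_len = read_u32_be(input, offset + 4);
        offset += 8;

        // The compressed block must fit in the remaining input.
        if input.len() - offset < compressed_len {
            return false;
        }
        offset += compressed_len;
        total_uncompressed += uncompressed_len;
    }

    // All input consumed exactly, and the declared uncompressed size matches
    // what the Parquet page header told us to expect.
    offset == input.len() && total_uncompressed == expected_uncompressed_len
}

fn main() {
    // A tiny hand-built "Hadoop LZ4"-style buffer: one block that claims
    // 5 uncompressed bytes and carries 3 bytes of compressed payload.
    let mut buf = Vec::new();
    buf.extend_from_slice(&5u32.to_be_bytes());
    buf.extend_from_slice(&3u32.to_be_bytes());
    buf.extend_from_slice(&[0xAA, 0xBB, 0xCC]);

    assert!(looks_like_hadoop_lz4(&buf, 5));
    assert!(!looks_like_hadoop_lz4(&buf, 42));
}
```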
You've convinced me, let's go with option 2 then 😄
Great! 😄 Should we include the feature?
I think we could add an option for this.
I committed to my local fork the changes required to pass the option through the API. I also tried to make the new API methods as extensible as possible, by creating an Options and an OptionsStruct for each of the new API methods. Here is the commit to check the changes in the API: marioloko@a94b248, so that the changes in the API can be verified in parallel while I add the compression methods; the important changes are in that commit.
Looks good to me, I think we may be able to keep some of that crate-private, especially as the compression codecs themselves are an experimental API, but the general gist looks good - nice work 👍
I chose to create the new options type as part of the public API. However, it is possible to hide the class and keep it crate-private. So in my opinion both options are valid, and it is just a design choice.
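To illustrate the design choice (public versus crate-private options), here is a hypothetical sketch; `CodecOptions`, `CodecOptionsBuilder` and `read_lz4_backward_compatible` are made-up names for the example, not the actual API committed in the fork:

```rust
/// Hypothetical public options type (Option A): users build it and pass it in.
#[derive(Debug, Clone, Default)]
pub struct CodecOptions {
    /// Whether decoding LZ4 should fall back to the old (frame) behaviour.
    pub read_lz4_backward_compatible: bool,
}

/// Hypothetical builder, so new options can be added without breaking callers.
#[derive(Default)]
pub struct CodecOptionsBuilder {
    read_lz4_backward_compatible: bool,
}

impl CodecOptionsBuilder {
    pub fn set_read_lz4_backward_compatible(mut self, value: bool) -> Self {
        self.read_lz4_backward_compatible = value;
        self
    }

    pub fn build(self) -> CodecOptions {
        CodecOptions {
            read_lz4_backward_compatible: self.read_lz4_backward_compatible,
        }
    }
}

// Option B would declare the same items `pub(crate)`, keeping the knob out of
// the public API while the compression codecs remain experimental.

fn main() {
    let opts = CodecOptionsBuilder::default()
        .set_read_lz4_backward_compatible(true)
        .build();
    println!("{opts:?}");
}
```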
Hmm... thinking about this a bit more, what do you think of just plumbing the option through? I'm also totally happy for you to raise a PR when you have something ready, and we can refine the implementation from there?
This seems good to me! I think that this solves both problems mentioned in the comment above 😁 Right now I have the implementation of the new algorithm.
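For context on what plumbing the option through the write path could look like, here is a small sketch. `WriterProperties::builder()` and `set_compression(Compression::LZ4)` are existing parquet-crate APIs, while the commented-out `set_codec_options`/`CodecOptions` call is purely hypothetical:

```rust
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

fn main() {
    // Existing API: select the LZ4 codec for the whole file.
    let props = WriterProperties::builder()
        .set_compression(Compression::LZ4)
        .build();

    // Hypothetical extension discussed above: plumb a codec-options value
    // through the same builder so the LZ4 flavour can be chosen explicitly.
    // `set_codec_options` and `CodecOptions` do not exist in the crate as-is;
    // they only illustrate the "plumb the option through" idea.
    //
    // let props = WriterProperties::builder()
    //     .set_compression(Compression::LZ4)
    //     .set_codec_options(CodecOptions::default())
    //     .build();

    let _ = props;
}
```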
The algorithm used to read and write parquet files with CompressionCodec LZ4 is different in the C++ and Rust implementations. The C++ implementation uses the `LZ4Hadoop` algorithm, while Rust uses `LZ4Frame` for the same CompressionCodec. When trying to read a parquet file generated with the C++ arrow library and LZ4 compression, I get a panic due to an LZ4 decoding error.
To Reproduce
I uploaded to my arrow-rs fork, on the `lz4_hadoop_test` branch, the failing test which produces the error above. I did not merge the test into this repository because the test would fail. To test it, just clone my git fork and execute the test.
Expected behavior
The expected behavior is to be able to read the file and show the contents inside. The test in the previous section should succeed in reading the data.
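For illustration only, a reader along these lines is what the expected behavior describes: it should be able to open the C++-generated file and read its rows. The file path is a placeholder, not the actual test from the `lz4_hadoop_test` branch.

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path: an LZ4-compressed Parquet file written by Arrow C++.
    let file = File::open("data/hadoop_lz4_compressed.parquet")?;

    // Reading the footer works; only page decompression is affected.
    let reader = SerializedFileReader::new(file)?;
    println!("rows in metadata: {}", reader.metadata().file_metadata().num_rows());

    // Decoding the pages is where the incompatibility shows up today, because
    // the Rust codec tries to read the LZ4Hadoop-framed data as an LZ4 frame.
    // Once the codecs are compatible this iteration should succeed.
    let rows = reader.get_row_iter(None)?.count();
    println!("rows read: {rows}");

    Ok(())
}
```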
Additional context
After some digging into the problem, comparing the C++ and Rust libraries, I found that they use different algorithms for the same CompressionCodec.
C++:
Check the C++ source code here:
CompressionCodec: https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift#L482
Compression: https://github.com/apache/arrow/blob/master/cpp/src/parquet/thrift_internal.h#L79
Codec: https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression.cc#L179
Rust:
Check the Rust source code here:
CompressionCodec: https://github.com/apache/arrow-rs/blob/master/parquet/src/format.rs#L443
Compression: https://github.com/apache/arrow-rs/blob/master/parquet/src/basic.rs#L819
Codec: https://github.com/apache/arrow-rs/blob/master/parquet/src/compression.rs#L77
As we can observe, the two libraries use different algorithms for the LZ4 CompressionCodec, for both compression and decompression. This makes the libraries incompatible, so files generated with codec LZ4 in one of the libraries cannot be read with the other.
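A small sketch of why the two layouts reject each other: an LZ4 frame stream always starts with the frame magic number, while the Hadoop framing starts with two big-endian length fields and has no magic number at all. The buffers below are made-up examples; only the magic number value comes from the LZ4 frame specification.

```rust
fn main() {
    // LZ4 frame format: every stream starts with the magic number 0x184D2204,
    // stored little-endian on disk, followed by a frame descriptor.
    let lz4_frame_prefix = 0x184D2204u32.to_le_bytes();

    // Hadoop LZ4 framing (what Arrow C++ writes for CompressionCodec LZ4):
    // a 4-byte big-endian uncompressed length and a 4-byte big-endian
    // compressed length, then a raw LZ4 block -- no magic number at all.
    let mut hadoop_prefix = Vec::new();
    hadoop_prefix.extend_from_slice(&1000u32.to_be_bytes()); // uncompressed len
    hadoop_prefix.extend_from_slice(&512u32.to_be_bytes()); // compressed len

    // An LZ4 frame decoder looks for the magic number and rejects the Hadoop
    // prefix; a Hadoop-style reader would interpret the frame magic as an
    // implausible pair of lengths and fail its size checks instead.
    assert_ne!(&hadoop_prefix[..4], &lz4_frame_prefix[..]);
    println!("frame prefix:  {:02X?}", lz4_frame_prefix);
    println!("hadoop prefix: {:02X?}", hadoop_prefix);
}
```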
Moreover, changing the algorithm in this library from `Lz4Codec` (Frame) to `Lz4HadoopRawCodec` makes files generated by older versions of this library incompatible with the new version, which is not desirable for people using this library in production. So I think there are two different solutions:
1. Change the algorithm in this library from `Lz4Codec` (Frame) to `Lz4HadoopRawCodec`, showing a panic error pointing to this thread. I am not very happy with this solution because users of this library may experience problems after updating, and they would be forced to regenerate parquet files with the new version.

2. Try the codecs in order (a sketch of this fallback chain follows below):

   i. On `CompressionCodec = LZ4`, try to use `Lz4HadoopRawCodec`.

   ii. On error, try to use `Lz4Codec` (Frame).

   iii. On error, try to use `Lz4RawCodec` (this is because the C++ library does the fallback to this codec).

The problem with option 2 is a bit of overhead due to the try-and-fail procedure, but it will be compatible with both the C++ library and older versions of this library.
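A rough sketch of what option 2 could look like on the read path, including the exact-length integrity check discussed above. The helper functions are stand-ins written against the lz4_flex crate purely for illustration; the real implementation may use a different LZ4 binding and different signatures.

```rust
// Sketch dependency (illustrative): lz4_flex = { version = "0.11", features = ["frame"] }
use std::io::Read;

type DecodeResult = Result<Vec<u8>, String>;

fn read_u32_be(buf: &[u8], at: usize) -> usize {
    u32::from_be_bytes([buf[at], buf[at + 1], buf[at + 2], buf[at + 3]]) as usize
}

/// 1) Hadoop framing: [u32 BE uncompressed][u32 BE compressed][raw LZ4 block], repeated.
fn try_lz4_hadoop(input: &[u8], uncompressed_size: usize) -> DecodeResult {
    let mut offset = 0;
    let mut out = Vec::with_capacity(uncompressed_size);
    while offset < input.len() {
        if input.len() - offset < 8 {
            return Err("truncated Hadoop block header".into());
        }
        let expected = read_u32_be(input, offset);
        let compressed = read_u32_be(input, offset + 4);
        offset += 8;
        if input.len() - offset < compressed {
            return Err("Hadoop block overruns input".into());
        }
        let block = lz4_flex::block::decompress(&input[offset..offset + compressed], expected)
            .map_err(|e| e.to_string())?;
        if block.len() != expected {
            return Err("Hadoop block length mismatch".into());
        }
        out.extend_from_slice(&block);
        offset += compressed;
    }
    Ok(out)
}

/// 2) LZ4 frame format (what older versions of this library wrote).
fn try_lz4_frame(input: &[u8]) -> DecodeResult {
    let mut out = Vec::new();
    lz4_flex::frame::FrameDecoder::new(input)
        .read_to_end(&mut out)
        .map_err(|e| e.to_string())?;
    Ok(out)
}

/// 3) Raw LZ4 block (the codec Parquet C++ falls back to).
fn try_lz4_raw(input: &[u8], uncompressed_size: usize) -> DecodeResult {
    lz4_flex::block::decompress(input, uncompressed_size).map_err(|e| e.to_string())
}

/// Option 2: try the codecs in order of likelihood, and only accept a result
/// whose length matches the uncompressed size known from the page header.
fn decompress_lz4_with_fallback(input: &[u8], uncompressed_size: usize) -> DecodeResult {
    let accept = |r: DecodeResult| r.ok().filter(|out| out.len() == uncompressed_size);

    if let Some(out) = accept(try_lz4_hadoop(input, uncompressed_size)) {
        return Ok(out);
    }
    if let Some(out) = accept(try_lz4_frame(input)) {
        return Ok(out);
    }
    if let Some(out) = accept(try_lz4_raw(input, uncompressed_size)) {
        return Ok(out);
    }
    Err("data did not decode as LZ4_HADOOP, LZ4 frame or raw LZ4".into())
}

fn main() {
    // Round-trip a Hadoop-framed buffer to show the fallback resolves it first.
    let payload = b"compressed with the hadoop framing".to_vec();
    let block = lz4_flex::block::compress(&payload);

    let mut framed = Vec::new();
    framed.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    framed.extend_from_slice(&(block.len() as u32).to_be_bytes());
    framed.extend_from_slice(&block);

    let decoded = decompress_lz4_with_fallback(&framed, payload.len()).unwrap();
    assert_eq!(decoded, payload);
}
```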