
CompressionCodec LZ4 incompatible with C++ implementation #2988

Closed
marioloko opened this issue Oct 31, 2022 · 11 comments · Fixed by #3013
Labels: bug, parquet (Changes to the parquet crate)


@marioloko (Contributor) commented Oct 31, 2022

The algorithm used to read and write Parquet files with CompressionCodec LZ4 differs between the C++ and Rust implementations. The C++ implementation uses the LZ4Hadoop algorithm, while Rust uses LZ4Frame for the same CompressionCodec.

When trying to read a Parquet file generated with the C++ Arrow library and LZ4 compression, I get a panic due to the following error:

thread 'arrow::arrow_reader::tests::test_read_lz4_hadoop_compressed' panicked at 'called `Result::unwrap()` on an `Err` value: ParquetError("Parquet error: underlying IO error: LZ4 error: ERROR_frameType_unknown")', parquet/src/arrow/arrow_reader/mod.rs:2434:14
stack backtrace:
   0:        0x108e96042 - std::backtrace_rs::backtrace::libunwind::trace::hc51e76fb8889c4c7
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   1:        0x108e96042 - std::backtrace_rs::backtrace::trace_unsynchronized::h6636ebb4dbdfddda
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:        0x108e96042 - std::sys_common::backtrace::_print_fmt::h1fa8f79b68aa8a0a
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys_common/backtrace.rs:66:5
   3:        0x108e96042 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hed16e04ffb615208
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys_common/backtrace.rs:45:22
   4:        0x108eb596a - core::fmt::write::haeb35e341082f6bd
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/fmt/mod.rs:1209:17
   5:        0x108e9278c - std::io::Write::write_fmt::h369182a997830c5c
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/io/mod.rs:1679:15
   6:        0x108e9780b - std::sys_common::backtrace::_print::h02fcfc3ff2a6bde7
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys_common/backtrace.rs:48:5
   7:        0x108e9780b - std::sys_common::backtrace::print::h5bd14db72e4fe5de
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys_common/backtrace.rs:35:9
   8:        0x108e9780b - std::panicking::default_hook::{{closure}}::h0dcad13e9fa12765
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:267:22
   9:        0x108e974a6 - std::panicking::default_hook::h981e72b615f3b097
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:283:9
  10:        0x108e97e60 - std::panicking::rust_panic_with_hook::hc2eb58a19ca82803
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:669:13
  11:        0x108e97d73 - std::panicking::begin_panic_handler::{{closure}}::h54264dcd7b850967
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:560:13
  12:        0x108e964d8 - std::sys_common::backtrace::__rust_end_short_backtrace::hb6f4936878be10a3
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys_common/backtrace.rs:138:18
  13:        0x108e97a3d - rust_begin_unwind
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:556:5
  14:        0x108f1ca43 - core::panicking::panic_fmt::h2903bb0e76f10197
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/panicking.rs:142:14
  15:        0x108f1cba5 - core::result::unwrap_failed::h18af68091210073e
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/result.rs:1791:5
  16:        0x107cfbc89 - core::result::Result<T,E>::unwrap::hf00cf6a95fafacda
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/result.rs:1113:23
  17:        0x1077d9c49 - parquet::arrow::arrow_reader::tests::test_read_lz4_hadoop_compressed::habb822e6e3580a4a
                               at /Users/adriangc/arrow-rs/parquet/src/arrow/arrow_reader/mod.rs:2431:23
  18:        0x10826f9c9 - parquet::arrow::arrow_reader::tests::test_read_lz4_hadoop_compressed::{{closure}}::hcca0ca14496f9a08
                               at /Users/adriangc/arrow-rs/parquet/src/arrow/arrow_reader/mod.rs:2426:5
  19:        0x10815e7b8 - core::ops::function::FnOnce::call_once::h725a8a1a5e94f92d
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/ops/function.rs:251:5
  20:        0x108637382 - core::ops::function::FnOnce::call_once::h7dbeda91e6aead6b
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/ops/function.rs:251:5
  21:        0x108637382 - test::__rust_begin_short_backtrace::h2d7dab6490f59483
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/test/src/lib.rs:576:18
  22:        0x108607471 - test::run_test::{{closure}}::hdc386f68383c6886
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/test/src/lib.rs:567:30
  23:        0x108607471 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h6e841d27c8e848a9
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/ops/function.rs:251:5
  24:        0x1086360f5 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h6b3722b347c9df52
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/alloc/src/boxed.rs:1938:9
  25:        0x1086360f5 - <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::hce401ebfe14ef605
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/panic/unwind_safe.rs:271:9
  26:        0x1086360f5 - std::panicking::try::do_call::h43148f9663b6c691
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:464:40
  27:        0x1086360f5 - std::panicking::try::h66bef5e978c7d8ca
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:428:19
  28:        0x1086360f5 - std::panic::catch_unwind::h06d495074f8fe9c9
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panic.rs:137:14
  29:        0x1086360f5 - test::run_test_in_process::h77027c7da4dcf222
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/test/src/lib.rs:599:27
  30:        0x1086360f5 - test::run_test::run_test_inner::{{closure}}::hd39abb8dc2feae55
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/test/src/lib.rs:493:39
  31:        0x108601580 - test::run_test::run_test_inner::{{closure}}::h1c34355cd155b093
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/test/src/lib.rs:520:37
  32:        0x108601580 - std::sys_common::backtrace::__rust_begin_short_backtrace::he0b6405111093efc
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys_common/backtrace.rs:122:18
  33:        0x1086074ec - std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}::h2d528071181592f0
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/thread/mod.rs:514:17
  34:        0x1086074ec - <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::h5e0af44f5ef1962b
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/panic/unwind_safe.rs:271:9
  35:        0x1086074ec - std::panicking::try::do_call::h95320ad6aee10be9
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:464:40
  36:        0x1086074ec - std::panicking::try::hf5ca42fcabdfecee
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:428:19
  37:        0x1086074ec - std::panic::catch_unwind::hc7cc0dc7e028b195
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panic.rs:137:14
  38:        0x1086074ec - std::thread::Builder::spawn_unchecked_::{{closure}}::h1cf9ab97df7fb558
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/thread/mod.rs:513:30
  39:        0x1086074ec - core::ops::function::FnOnce::call_once{{vtable.shim}}::h7f89265263d96ccd
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/ops/function.rs:251:5
  40:        0x108e9c267 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h267a3b8c1ab382aa
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/alloc/src/boxed.rs:1938:9
  41:        0x108e9c267 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h6a449eed197d47c4
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/alloc/src/boxed.rs:1938:9
  42:        0x108e9c267 - std::sys::unix::thread::Thread::new::thread_start::h5e4754f0c4a6d6ad
                               at /rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys/unix/thread.rs:108:17
  43:     0x7ff81337f4e1 - __pthread_start

To Reproduce

I uploaded the failing test, which produces the error above, to the lz4_hadoop_test branch of my arrow-rs fork. I have not merged the test into this repository because it would fail.

To test it, just clone my fork and run the test:

git clone git@github.com:marioloko/arrow-rs.git
cd arrow-rs
git checkout lz4_hadoop_test
cargo test arrow::arrow_reader::tests::test_read_lz4_hadoop_compressed

Expected behavior

The expected behavior is to be able to read the file and return its contents; the test in the previous section should succeed in reading the data.

Additional context

After some digging into the problem and comparing the C++ and Rust libraries, I found that they use different algorithms for the same CompressionCodec.

C++:

CompressionCodec   Compression   Codec
LZ4 = 5            LZ4_HADOOP    Lz4HadoopRawCodec (Hadoop framing, with fallback to raw if it fails)
LZ4_RAW = 7        LZ4           Lz4Codec (raw)
                   LZ4_FRAME     Lz4FrameCodec

Check the C++ source code here:
CompressionCodec: https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift#L482
Compression: https://github.com/apache/arrow/blob/master/cpp/src/parquet/thrift_internal.h#L79
Codec: https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression.cc#L179

Rust:

CompressionCodec   Compression   Codec
LZ4 = 5            LZ4           Lz4Codec (frame)
LZ4_RAW = 7        LZ4_RAW       Lz4RawCodec (raw)

Check the Rust source code here:
CompressionCodec: https://github.com/apache/arrow-rs/blob/master/parquet/src/format.rs#L443
Compression: https://github.com/apache/arrow-rs/blob/master/parquet/src/basic.rs#L819
Codec: https://github.com/apache/arrow-rs/blob/master/parquet/src/compression.rs#L77

As we can observe, the two libraries use different algorithms for the LZ4 CompressionCodec, for both compression and decompression. This makes them incompatible: files generated with the LZ4 codec by one library cannot be read by the other.

Moreover, changing the algorithm in this library from Lz4Codec (Frame) to Lz4HadoopRawCodec makes files generated by older versions of this library incompatible with the new version, which is not desirable for people using this library in production.

So I think there are two possible solutions:

  1. Break backward compatibility with previous versions of this library by changing the algorithm from Lz4Codec (Frame) to Lz4HadoopRawCodec, showing a panic message that points to this thread. I am not very happy with this solution, because users of this library may experience problems after updating and would be forced to regenerate their Parquet files with the new version.
  2. Implement a fallback mechanism (sketched after this list), that is:
    i. For CompressionCodec = LZ4, first try Lz4HadoopRawCodec.
    ii. On error, try Lz4Codec (Frame).
    iii. On error, try Lz4RawCodec (because the C++ library falls back to this codec).
    The drawback of option 2 is a bit of overhead from the try-and-fail procedure, but it is compatible with both the C++ library and older versions of this library.
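A minimal sketch of that fallback order, assuming hypothetical helpers try_lz4_hadoop, try_lz4_frame and try_lz4_raw that each return an error when their format does not match (this is an illustration, not the actual implementation):

fn decompress_lz4(input: &[u8], uncompressed_size: usize) -> Result<Vec<u8>, String> {
    // 1. The format this library would write going forward.
    if let Ok(out) = try_lz4_hadoop(input, uncompressed_size) {
        return Ok(out);
    }
    // 2. Files written by older versions of this crate (LZ4 frame).
    if let Ok(out) = try_lz4_frame(input) {
        return Ok(out);
    }
    // 3. Mirror the C++ implementation's final fallback (raw LZ4 block).
    try_lz4_raw(input, uncompressed_size)
}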
marioloko added the bug label on Oct 31, 2022
@tustvold (Contributor)

My only question with option 2 is whether there is a chance of silent failures, where data successfully decodes with the wrong codec, yielding gibberish?

Otherwise option 2 sounds like a good approach; thank you for the detailed investigation.

@marioloko (Contributor, Author) commented Oct 31, 2022

Short answer: yes, it is probably possible for data to silently fail and be decoded by the wrong codec, but it is unlikely. So another option would be to allow only LZ4_HADOOP and permit the fallback only if a backward_compatible_lz4 feature is enabled.

Long answer: after this change we should only generate LZ4_HADOOP for the LZ4 CompressionCodec. So the correct implementation is tried first, and after this commit it will be the most likely match.

So the formats, arranged by likelihood of being encountered by this library, are:

  1. LZ4_HADOOP
  2. LZ4_FRAME (if files generated by older version)
  3. LZ4_RAW (if interoperating with older parquet C++ version)

Most of the time, then, we expect the files to be of type LZ4_HADOOP. These files have the following format:

  • 4 big-endian bytes representing the uncompressed block size
  • 4 big-endian bytes representing the compressed block size
  • The compressed bytes

We can find a Rust implementation in the pola-rs crate, in its compression.rs file.

For this algorithm we can easily detect a decoding error, because it is unlikely that the input and output lengths would match for every block if the data were in a different format (see the sketch below).
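A minimal sketch of that length check, covering the single-chunk case described above and matching the try_lz4_hadoop helper named in the earlier sketch; it assumes the lz4_flex crate for the raw block decode (an illustrative choice, not necessarily what this library would use):

use lz4_flex::block::decompress;

// Hadoop-framed LZ4 block, single-chunk case:
// [4-byte BE uncompressed len][4-byte BE compressed len][compressed bytes]
fn try_lz4_hadoop(input: &[u8], expected_len: usize) -> Result<Vec<u8>, String> {
    if input.len() < 8 {
        return Err("input too short for an LZ4_HADOOP header".into());
    }
    let uncompressed_len = u32::from_be_bytes(input[0..4].try_into().unwrap()) as usize;
    let compressed_len = u32::from_be_bytes(input[4..8].try_into().unwrap()) as usize;
    // If the header lengths are inconsistent with the block itself, this was
    // almost certainly not LZ4_HADOOP, and the caller can try the next codec.
    if uncompressed_len != expected_len || 8 + compressed_len != input.len() {
        return Err("LZ4_HADOOP header does not match block lengths".into());
    }
    let out = decompress(&input[8..], uncompressed_len).map_err(|e| e.to_string())?;
    if out.len() != uncompressed_len {
        return Err("decompressed length does not match header".into());
    }
    Ok(out)
}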

Second in likelihood is LZ4_FRAME, for files previously generated by this library. These files have the following format:

  • A magic number at the beginning of the frame.
  • A frame descriptor.
  • Data blocks with checksum.
  • End marker.

It is really, really unlikely that data not in LZ4_FRAME format would meet the requirements to be decoded by this codec, as it would have to match the magic number and the checksum of each block (the magic-number check is sketched below).
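The cheapest part of that check is the magic number: per the published LZ4 frame specification, every frame starts with the little-endian value 0x184D2204. A minimal sketch:

// True if the block at least starts like an LZ4 frame; the frame
// descriptor and block checksums are verified later, during decoding.
fn looks_like_lz4_frame(input: &[u8]) -> bool {
    input.len() >= 4
        && u32::from_le_bytes([input[0], input[1], input[2], input[3]]) == 0x184D_2204
}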

That leaves the least likely format, LZ4_RAW, and this one is probably the most problematic.
To ensure that the file is not corrupted, the decoder performs the following checks:

  1. The last sequence contains only literals. The block ends after the literals.
  2. The last five bytes of input are only literals. Therefore, the last sequence contains at least five bytes.
  3. The last match must start at least 12 bytes before the end of the block.

Even though data in another format is unlikely to decompress with this algorithm, it has the most relaxed rules. However, we know the uncompressed size, so the output should fit that exact length, which is an additional integrity check (sketched below).
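That extra integrity check could look like the sketch below, again assuming the lz4_flex crate for the raw block decode (an illustrative choice) and matching the try_lz4_raw helper named earlier:

use lz4_flex::block::decompress;

fn try_lz4_raw(input: &[u8], uncompressed_size: usize) -> Result<Vec<u8>, String> {
    let out = decompress(input, uncompressed_size).map_err(|e| e.to_string())?;
    // The page header tells us the exact uncompressed size, so any other
    // output length means the data was not LZ4_RAW (or was corrupted).
    if out.len() != uncompressed_size {
        return Err("LZ4_RAW output length mismatch".into());
    }
    Ok(out)
}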

NOTE: Parquet C++ has this fallback to LZ4_RAW, and they had a discussion about it. In that discussion they mention that LZ4_RAW tends to fail quickly, when the first sequence is read.

That comment raises the same concern you have.

@tustvold (Contributor)

You've convinced me, let's go with option 2 then 😄

@marioloko (Contributor, Author)

Great! 😄 Should we include the backward_compatible_lz4 feature, to let users decide whether they want safe matching (fail if not LZ4_HADOOP) or a backward-compatible one (with fallback)?

@tustvold (Contributor)

I think we could add an option to ReadOptions, but have it enabled by default. That way people can opt out if they know they don't want that functionality.

@marioloko (Contributor, Author)

I committed to my local fork the changes required to pass the option through ReadOptions. However, passing the configuration down from SerializedFileReader to create_codec requires changes to the signatures of several functions. I tried not to break the public API by keeping the names of the existing functions, but I had to add new methods to the public API, such as new_with_config.

I also tried to make the new API methods as extensible as possible, by creating an options struct and a corresponding builder for each of the new API methods.

Here is the commit with the API changes: marioloko@a94b248

That way, the API changes can be reviewed in parallel while I add the compression methods.

The important changes are:

  • Added a CodecOptions struct to hold codec configuration (sketched after this list).
  • Added a backward_compatible_lz4 option to CodecOptions.
  • Added CodecOptions to ReadOptions to be able to configure SerializedFileReader.
  • Added SerializedRowGroupReaderOptionsBuilder with CodecOptions to be able to configure SerializedRowGroupReader through an extensible interface.
  • Added SerializedPageReaderOptionsBuilder with CodecOptions to be able to configure SerializedPageReader through an extensible interface.
  • Added new_with_config to the SerializedPageReader API so it can be configured without breaking the existing API.
  • CodecOptions implements the Copy trait, as it is composed of Copy types. If in the future it contains a non-Copy type, it may be better to create CodecOptionsPtr = Arc<CodecOptions>.
  • CodecOptions is only added on the read path; the write path takes the default values, as the options currently only affect reading and have no effect on writing. If it is ever needed on the write path, it might be nice to add it to WriterProperties.
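As a self-contained sketch, the CodecOptions struct and its builder could look roughly like this; the names follow the description above, but the API that finally lands may differ:

#[derive(Clone, Copy, Debug)]
pub struct CodecOptions {
    /// Whether LZ4 decompression falls back to LZ4_FRAME and LZ4_RAW.
    pub backward_compatible_lz4: bool,
}

impl Default for CodecOptions {
    fn default() -> Self {
        // Fallback enabled by default, so existing files keep working.
        Self { backward_compatible_lz4: true }
    }
}

#[derive(Default)]
pub struct CodecOptionsBuilder {
    options: CodecOptions,
}

impl CodecOptionsBuilder {
    /// Opt out of the fallback: fail unless the data is LZ4_HADOOP.
    pub fn no_backward_compatible_lz4(mut self) -> Self {
        self.options.backward_compatible_lz4 = false;
        self
    }

    pub fn build(self) -> CodecOptions {
        self.options
    }
}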

@tustvold (Contributor) commented Nov 1, 2022

Looks good to me. I think we may be able to keep some of that crate-private, especially as the compression codecs themselves are an experimental API, but the general gist is right - nice work 👍

@marioloko (Contributor, Author)

I chose to create CodecOptions with future extensions of this configuration in mind. Right now the only configurable option in CodecOptions is backward_compatible_lz4, so its builder has a no_backward_compatible_lz4 method.

ReadOptions, SerializedRowGroupReaderOptionsBuilder and SerializedPageReaderOptionsBuilder each have a with_codec_options method which takes those options. Therefore, if we add a new field to CodecOptions in the future, the interfaces of these structs will remain the same; the only struct that changes is CodecOptions itself, which will gain a set_new_field method.

However, it is possible to hide CodecOptions and make it private to the crate. From a user's point of view this probably makes sense, as the user does not care whether the no_backward_compatible_lz4 option belongs to the codec or to any other part of the application. But from a developer's perspective, this approach implies that if we add a new option to CodecOptions in the future, then, as the struct is private to the crate, a set_new_field method will have to be implemented in all the structs that hide CodecOptions, that is, ReadOptions, SerializedRowGroupReaderOptionsBuilder and SerializedPageReaderOptionsBuilder.

So in my opinion both options are valid, and it is just a design choice:

  • Making CodecOptions public will probably make it less painful to add new options in the future, but it exposes the internals to users, which is usually not desirable. Anyway, only this struct from the compression module would be public; the others would remain experimental.
  • Making CodecOptions private hides the internals from the user, but makes it harder to add new options in the future. Then again, maybe no other option will ever be required, so that may be fine.

@tustvold (Contributor) commented Nov 3, 2022

Hmm... thinking about this a bit more, what do you think of just plumbing Arc<ReadOptions> down to SerializedRowGroupReader and SerializedPageReader? While this does mean they would theoretically accept options that don't impact them, it keeps things simple and matches what we do on the write side, where a single WriterPropertiesPtr is passed down. We could even keep this crate-local for now; ultimately very few users interact with SerializedRowGroupReader or SerializedPageReader directly.

I'm also totally happy for you to raise a PR when you have something ready, and we can refine the implementation from there?

@marioloko (Contributor, Author)

This seems good to me! I think this solves both problems mentioned in the comment above 😁

Right now I have the implementation of the new LZ4_HADOOP algorithm and its tests ready. I just need to make these interface changes and it will be ready to go. I will open a PR soon so we can work on it from there.

@alamb (Contributor) commented Nov 11, 2022

label_issue.py automatically added labels {'parquet'} from #3013

alamb added the parquet label on Nov 11, 2022