Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hadoop LZ4 Support for LZ4 Codec #3013

Merged
merged 8 commits into from Nov 6, 2022
Merged

Conversation

marioloko
Copy link
Contributor

Which issue does this PR close?

Closes #2988 .

Rationale for this change

What changes are included in this PR?

It adds support for LZ4_HADOOP compression for CompressionCodec::LZ4, with fallback to LZ4_FRAME and LZ4_RAW algorithm for backward compatibility with older arrow-rs versions and older parquet-cpp versions, respectively.

It also adds tests for LZ4_HADOOP files, and tests for the fallback to LZ4_RAW (a test including 2 files with same content, same CompressionCodec::LZ4, but different compression algorithm LZ4_HADOOP and LZ4_RAW).

From now on, CompressionCodec::LZ4 will compress files using LZ4_HADOOP algorithm, making LZ4 files generated by this library compatible with other languages arrow implementations.

The user interface is changed to allow the user to configure whether to fallback or not, by allowing the user to disable backward compatibility with no_backward_compatible_lz4 (WIP in this PR).

Are there any user-facing changes?

The user interface is changed to allow the user to configure whether to fallback or not, by allowing the user to disable backward compatibility with no_backward_compatible_lz4 (WIP in this PR).

New structs have been created:
SerializedRowGroupReaderOptionsand SerializedPageReaderOptions to allow the users to set the CodecOptions for SerializedRowGroupReader and SerializedPageReader. The builders for these structs are also added to public interface.
SerializedRowGroupReader::new_with_options and SerializedPageReader::new_with_options have been created to allow to configure these components without breaking public API.

NOTE: The API changes are WIP, however, it is not possible to follow the same approach that with WriterProperties and wrap ReadOptions in an Arc as its field predicates: Vec<ReadGroupPredicate> is not thread safe. Other option would be to wrap it in just an Rc so type ReadOptionsPtr = Rc<ReadOptions>.

Adrián Gallego Castellanos added 4 commits November 1, 2022 21:28
* Added `CodecOptions` struct to hold `Codec` configuration.
* Added `backward_compatible_lz4` option in `CodecOptions`.
* Added `CodecOptions` to `ReadOptions` to be able to configure `SerializedFileReader`.
* Added `SerializedRowGroupReaderOptionsBuilder` with `CodecOptions` to be able to configure `SerializedRowGroupReader`, with extensible interface.
* Added `SerializedPageReaderOptionsBuilder` with `CodecOptions` to be able to configure `SerializedPageReader`, with extensible interface.
* Added `new_with_config` to `SerializedPageReader` API to be able to configure `SerializedFileReader` without breaking `new` API.
* `CodecOptions` implements `CopyTrait` as it is composed by `Copy` types. If in the future it contains a non `Copy` type, maybe is better to create `CodecOptionsPtr = Arc<CodecOptions>`.
* `CodecOptions` is only added in the read path, in the write path the default values are taken, as the options currently only affect the read path and have no effect on write path. If required to add to write path maybe it will be nice to add into `WriteProperties`.
* Added compression and decompression for LZ4_HADOOP.
* Added a test for two parquet files with the same content, both with LZ4 CompressionCodec, but one using the LZ4_HADOOP (no-fallback) algorithm and the other LZ4_RAW algorithm (fallback to last level).
* Refactor `LZ4HadoopCodec::compress` function to make it easier to understand.
@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 3, 2022
@marioloko
Copy link
Contributor Author

marioloko commented Nov 3, 2022

Another thing I notice with using ReadOptionsPtr is that even if we share the ownership of the pointer we need to take ownership of some of its values here:

    pub fn new_with_options(chunk_reader: R, options: ReadOptions) -> Result<Self> {
        let metadata = footer::parse_metadata(&chunk_reader)?;
        let mut predicates = options.predicates;

Of course these options are not used anymore at this point, so instead of cloning the vector of predicates we can just take it.

    pub fn new_with_options(chunk_reader: R, options: ReadOptions) -> Result<Self> {
        let metadata = footer::parse_metadata(&chunk_reader)?;
        let mut predicates = std::men::take(&mut options.predicates);

@marioloko
Copy link
Contributor Author

If we want to hide the CodecOptions probably we have to create a ReadProperties struct. Which in this case is only a wrap around CodecOptions but different config could be included, the restrictions for fields in this struct is to be Send and Sync, and to be read only. In contrast with ReadOptions which consumes the fields and is not Send nor Sync.

This problem also implies implementing setters multiple times, but we can use the same ReadPropertiesPtr for SerializedPageReader and SerializedRowGroupReader.

@tustvold what do you think?

@tustvold
Copy link
Contributor

tustvold commented Nov 5, 2022

Sorry for the delay, I'm currently on holiday. Will try to find time to review today

@marioloko
Copy link
Contributor Author

Oh, do not worry, enjoy your holidays! We review this when you are back 😄

Adrián Gallego Castellanos and others added 2 commits November 5, 2022 23:47
This commits hides `CodecOptions` from the public API. The changes are the following:
- Added a new structs to public API `ReaderProperties`, `ReaderPropertiesBuilder` and `ReaderPropertiesPtr` to store inmutable reader config, as it is the case of `CodecOptions`.
- Removed `SerializedRowGroupReaderOptions`, `SerializedRowGroupReaderOptionsBuilder`, `SerializedPageReaderOptionsBuilder` and `SerializedPageReaderOptions`. They are not required anymore as `SerializedRowGroupReader` and `SerializedRowGroupReaderOptions` use `ReaderPropertiesPtr` for configuration.
- `SerializedRowGroupReader::new_with_options` renamed to `SerializedRowGroupReader::new_with_properties`.
- `SerializedPageReader::new_with_options` renamed to `SerializedPageReader::new_with_properties`.
- Test added for `ReaderPropertiesBuilder`.
@marioloko
Copy link
Contributor Author

With this last change we remove CodecOptions from public API by wrapping it into a more general ReaderProperties.

Commit description:

This commits hides CodecOptions from the public API. The changes are the following:

  • Added a new structs to public API ReaderProperties, ReaderPropertiesBuilder and ReaderPropertiesPtr to store inmutable reader config, as it is the case of CodecOptions.
  • Removed SerializedRowGroupReaderOptions, SerializedRowGroupReaderOptionsBuilder, SerializedPageReaderOptionsBuilder and SerializedPageReaderOptions. They are not required anymore as SerializedRowGroupReader and SerializedRowGroupReaderOptions use ReaderPropertiesPtr for configuration.
  • SerializedRowGroupReader::new_with_options renamed to SerializedRowGroupReader::new_with_properties.
  • SerializedPageReader::new_with_options renamed to SerializedPageReader::new_with_properties.
  • Test added for ReaderPropertiesBuilder.

@marioloko
Copy link
Contributor Author

I think that now it is everything is ready to merge. Take a look when you have time and we discuss if any change has to be done.

About the failing test above I think it is not caused by my commit, as I have not modify IPC. In fact, in my local Mac the test is passing.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will give this another once over prior to merge, but looking good. Thank you for your work on this 👍

@@ -2422,6 +2422,76 @@ mod tests {
assert_eq!(a.values(), &[42.000000, 7.700000, 42.125000, 7.700000]);
}

// This test is to ensure backward compatibility, it test 2 files containing the LZ4 CompressionCodec
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the tests

}

/// Try to decompress the buffer as if it was compressed with the Hadoop Lz4Codec.
/// Adapted from pola-rs [compression.rs:try_decompress_hadoop](https://pola-rs.github.io/polars/src/parquet2/compression.rs.html#225)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 nice to see attribution

@tustvold tustvold merged commit 4f525fe into apache:master Nov 6, 2022
@tustvold
Copy link
Contributor

tustvold commented Nov 6, 2022

Thanks again 😄

@ursabot
Copy link

ursabot commented Nov 6, 2022

Benchmark runs are scheduled for baseline = 488eff0 and contender = 4f525fe. 4f525fe is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@tustvold tustvold added the api-change Changes to the arrow API label Nov 11, 2022
@tustvold
Copy link
Contributor

Marking as api-change even though backwards compatible as does change the output, and so should be highlighted in changelog

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CompressionCodec LZ4 incompatible with C++ implementation
3 participants