Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

write ColumnMetadata after the column chunk data, not the ColumnChunk #1947

Merged
merged 1 commit into from Jun 28, 2022

Conversation

liukun4515
Copy link
Contributor

Which issue does this PR close?

Closes #1946

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 25, 2022
@codecov-commenter
Copy link

Codecov Report

Merging #1947 (b104d64) into master (9f7b600) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1947   +/-   ##
=======================================
  Coverage   83.48%   83.48%           
=======================================
  Files         221      221           
  Lines       57054    57068   +14     
=======================================
+ Hits        47629    47641   +12     
- Misses       9425     9427    +2     
Impacted Files Coverage Δ
parquet/src/file/metadata.rs 95.32% <100.00%> (+0.19%) ⬆️
parquet/src/file/writer.rs 92.92% <100.00%> (ø)
arrow/src/datatypes/datatype.rs 65.42% <0.00%> (-0.38%) ⬇️
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (-0.23%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f7b600...b104d64. Read the comment docs.

@liukun4515
Copy link
Contributor Author

@tustvold PTAL

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense, I left some minor nits. I think with a simple unit test, this can go in.

FWIW I did some digging and tbh I'm not actually sure if anything actually ever reads this data, parquet-mr doesn't even write it, arrow C++ writes it but doesn't appear to ever read it (apache/parquet-cpp#224). Hive and Impala allegedly makes use of it, but I can't find out where/how/why...

I can't help wondering if this was an oversight in the original parquet specification, not collocating column chunk metadata in the footer, that has since been papered over. All readers I can find simply read the ColumnChunkMetadata from the footer and ignore everything else.

@@ -611,6 +611,29 @@ impl ColumnChunkMetaData {
encrypted_column_metadata: None,
}
}

/// Method to convert to Thrift `ColumnMetaData`
pub fn to_column_metadata_thrift(&self) -> ColumnMetaData {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we change to_thrift above to use this method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

/// Returns Ok() if there are not errors serializing and writing data into the sink.
#[inline]
fn serialize_column_chunk(&mut self, chunk: parquet::ColumnChunk) -> Result<()> {
fn serialize_column_chunk(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just remove this method and move its content into write_metadata?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any preferences

@liukun4515
Copy link
Contributor Author

liukun4515 commented Jun 28, 2022

I can't help wondering if this was an oversight in the original parquet specification, not collocating column chunk metadata in the footer, that has since been papered over. All readers I can find simply read the ColumnChunkMetadata from the footer and ignore everything else.

I have the same confuse like you about this metadata.
I go through the parquet-mr(Java version) which did't append this metadata in end of each column, and read this metadata from the Filemetadata in the footer.

But from the definition of the format https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L790, we can know the file_offset is required field and the https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L796 ColumnMetaData is a optional field.

@liukun4515 liukun4515 force-pushed the parquet_column_append_metadata branch from b104d64 to 82c5534 Compare June 28, 2022 03:43
@liukun4515
Copy link
Contributor Author

Many system or reader just read the footer and get the metadata, I think we should just follow the parquet-format.
Maybe it's just historical issues or historical design

@liukun4515 liukun4515 requested a review from tustvold June 28, 2022 06:10
@tustvold tustvold merged commit 464e8d1 into apache:master Jun 28, 2022
@alamb alamb changed the title write columnmetadata to the behind of the column chunk data, not the ColumnChunk write ColumnMetadata after the column chunk data, not the ColumnChunk Jul 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect (but ignored) metadata written after ColumnChunk
3 participants