Incorrect (but ignored) metadata written after ColumnChunk #1946

Closed
liukun4515 opened this issue Jun 25, 2022 · 7 comments · Fixed by #1947
Labels
bug parquet Changes to the parquet crate

Comments

@liukun4515
Contributor

liukun4515 commented Jun 25, 2022

TLDR: we wrote the wrong thing after the column chunk, but no implementation actually reads this data because it is already present in the footer, so we never noticed.

Describe the bug
I'm working on #1935 and went through the write path of the Rust implementation.

I found that the Parquet writer writes the wrong data to the file: it writes a ColumnChunk. You can find the logic in

fn serialize_column_chunk(&mut self, chunk: parquet::ColumnChunk) -> Result<()> {

From my understanding of the Parquet format in https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L789 and https://github.com/apache/parquet-format#file-format, the data written after the column data should be ColumnMetaData, not ColumnChunk.
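To make the difference concrete, here is a minimal sketch of the two layouts. Everything in it is an assumption for illustration: the structs are heavily simplified stand-ins for the Thrift definitions in parquet.thrift, the byte encoding is a toy tag-plus-integers scheme rather than the Thrift compact protocol, and the function names `write_column_reported` and `write_column_expected` are hypothetical and do not exist in the parquet crate.

```rust
// Simplified, hypothetical stand-ins for the Thrift structs in parquet.thrift.
// The real structs carry many more fields.
#[derive(Debug, Clone, PartialEq)]
struct ColumnMetaData {
    num_values: i64,
    data_page_offset: i64,
}

#[derive(Debug, Clone, PartialEq)]
struct ColumnChunk {
    file_offset: i64,
    meta_data: Option<ColumnMetaData>,
}

// Toy encoding (NOT Thrift): a one-byte tag followed by little-endian i64 fields.
const TAG_COLUMN_META_DATA: u8 = 1;
const TAG_COLUMN_CHUNK: u8 = 2;

fn encode_meta(meta: &ColumnMetaData, out: &mut Vec<u8>) {
    out.push(TAG_COLUMN_META_DATA);
    out.extend_from_slice(&meta.num_values.to_le_bytes());
    out.extend_from_slice(&meta.data_page_offset.to_le_bytes());
}

fn encode_chunk(chunk: &ColumnChunk, out: &mut Vec<u8>) {
    out.push(TAG_COLUMN_CHUNK);
    out.extend_from_slice(&chunk.file_offset.to_le_bytes());
    if let Some(meta) = &chunk.meta_data {
        encode_meta(meta, out);
    }
}

/// Layout described in this issue: the whole ColumnChunk wrapper is
/// serialized after the page bytes.
fn write_column_reported(pages: &[u8], chunk: &ColumnChunk) -> Vec<u8> {
    let mut out = pages.to_vec();
    encode_chunk(chunk, &mut out);
    out
}

/// Layout the format description suggests: only the ColumnMetaData
/// follows the page bytes (nothing is written if it is absent).
fn write_column_expected(pages: &[u8], chunk: &ColumnChunk) -> Vec<u8> {
    let mut out = pages.to_vec();
    if let Some(meta) = &chunk.meta_data {
        encode_meta(meta, &mut out);
    }
    out
}
```

In this toy model, the first byte after the page data is the ColumnChunk tag in the reported layout and the ColumnMetaData tag in the expected one. A reader that takes its metadata from the footer never looks at these trailing bytes, so both files would read back identically.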


@liukun4515 liukun4515 added the bug label Jun 25, 2022
@liukun4515
Contributor Author

liukun4515 commented Jun 25, 2022

I will try to fix this.

@liukun4515
Contributor Author

@sunchao @alamb PTAL

@liukun4515
Contributor Author

@tustvold
I read the Java version of the Parquet writer at https://github.com/apache/parquet-mr/blob/e06384455567c56d5906fc3a152ab00fd8dfdf33/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L916, and it doesn't follow the format https://github.com/apache/parquet-mr/blob/e06384455567c56d5906fc3a152ab00fd8dfdf33/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L916 of writing the column metadata after each column chunk.
Am I missing something?
I'm confused by the difference in the write path between rust-parquet and java-parquet.

@liukun4515
Contributor Author

I created a ticket in Parquet-MR for the Java version:
https://issues.apache.org/jira/browse/PARQUET-2162

@alamb alamb changed the title Write error ColumnChunk to the Parquet File instead of ColumnMetaData ColumnMetadata is written to thew wrong location Jul 7, 2022
@alamb
Contributor

alamb commented Jul 7, 2022

@liukun4515 and @tustvold I tried to summarize the issue in the title of this ticket (which is included in the CHANGELOG) but I am not sure I totally understand -- can you please verify my title change is correct?

@alamb alamb changed the title ColumnMetadata is written to thew wrong location ColumnMetadata is written to the wrong location Jul 7, 2022
@tustvold tustvold changed the title ColumnMetadata is written to the wrong location Incorrect metadata written after ColumnChunk Jul 7, 2022
@tustvold
Contributor

tustvold commented Jul 7, 2022

The TLDR is that we wrote the wrong thing after the column chunk, but no implementation actually reads this data because it is already present in the footer, so we never noticed.

@alamb
Contributor

alamb commented Jul 8, 2022

Thank you -- added to the original description

@alamb alamb changed the title Incorrect metadata written after ColumnChunk Incorrect (but ignored) metadata written after ColumnChunk Jul 8, 2022
@alamb alamb added the parquet Changes to the parquet crate label Jul 8, 2022