Expose SortingColumn when reading and writing parquet metadata #3090

Closed
Tracked by #4177
alamb opened this issue Nov 11, 2022 · 3 comments · Fixed by #3103
Labels: enhancement, good first issue, parquet

Comments

@alamb
Contributor

alamb commented Nov 11, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Storing sorted data in parquet is often a key performance technique, as it "clusters" data in ways that can make predicate evaluation and other query techniques faster.

The parquet file format can encode the sortedness of the data stored there using a "SortingColumn" struct in the format:
https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L685-L698

A list of these is then stored in the RowGroup metadata:
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L829-L832
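For readers who don't want to open the Thrift file, here is a rough Rust mirror of the two definitions linked above. The three `SortingColumn` fields follow parquet.thrift; the `RowGroup` struct is a simplified, hypothetical stand-in that keeps only the fields relevant here, and `example_row_group` is an illustrative helper, not anything in the parquet crate.

```rust
/// Rust mirror of the Thrift `SortingColumn` struct from parquet.thrift.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortingColumn {
    /// Index of the column (within the row group) that the data is sorted by.
    pub column_idx: i32,
    /// True if the column is sorted in descending order.
    pub descending: bool,
    /// True if nulls sort before all non-null values.
    pub nulls_first: bool,
}

/// Simplified, hypothetical stand-in for the Thrift `RowGroup` struct.
/// The real struct has more fields (column chunks, byte sizes, ...).
#[derive(Debug, Clone, Default)]
pub struct RowGroup {
    pub num_rows: i64,
    /// Optional in the format: `None` means no ordering is declared.
    pub sorting_columns: Option<Vec<SortingColumn>>,
}

/// Illustrative helper: a row group declared as sorted by
/// column 0 ASC NULLS LAST, then column 2 DESC NULLS FIRST.
pub fn example_row_group() -> RowGroup {
    RowGroup {
        num_rows: 1_000,
        sorting_columns: Some(vec![
            SortingColumn { column_idx: 0, descending: false, nulls_first: false },
            SortingColumn { column_idx: 2, descending: true, nulls_first: true },
        ]),
    }
}
```

The order of entries in the list matters: the first entry is the primary sort key, and later entries break ties.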

However, I did not find any code in the parquet crate that reads or writes this metadata yet:
https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/apache/arrow-rs%24+SortingColumn&patternType=standard

Describe the solution you'd like

I would like some way to provide the parquet writer with the SortingColumn list when it creates the RowGroupMetaData.

Perhaps we could add something to the WriterProperties

https://docs.rs/parquet/26.0.0/parquet/file/properties/struct.WriterProperties.html

Likewise, I would like a way to get the relevant SortingColumn list from RowGroupMetadata:
https://docs.rs/parquet/26.0.0/parquet/file/metadata/struct.RowGroupMetaData.html
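To make the request concrete, here is one possible shape for such an API. This is only a sketch: `WriterPropertiesSketch`, its builder, `RowGroupMetaDataSketch`, and every method name below are hypothetical stand-ins written for this example, not the parquet crate's actual types or signatures.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortingColumn {
    pub column_idx: i32,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Hypothetical stand-in for `WriterProperties`.
#[derive(Debug, Clone, Default)]
pub struct WriterPropertiesSketch {
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl WriterPropertiesSketch {
    pub fn builder() -> WriterPropertiesSketchBuilder {
        WriterPropertiesSketchBuilder::default()
    }
}

/// Hypothetical builder, following the builder style of the writer properties.
#[derive(Debug, Default)]
pub struct WriterPropertiesSketchBuilder {
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl WriterPropertiesSketchBuilder {
    /// The caller declares the ordering; the writer does not verify it.
    pub fn set_sorting_columns(mut self, cols: Option<Vec<SortingColumn>>) -> Self {
        self.sorting_columns = cols;
        self
    }

    pub fn build(self) -> WriterPropertiesSketch {
        WriterPropertiesSketch { sorting_columns: self.sorting_columns }
    }
}

/// Hypothetical stand-in for `RowGroupMetaData`.
#[derive(Debug, Clone)]
pub struct RowGroupMetaDataSketch {
    num_rows: i64,
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl RowGroupMetaDataSketch {
    /// What a writer might do when closing a row group:
    /// copy the declared ordering from the properties into the metadata.
    pub fn from_props(num_rows: i64, props: &WriterPropertiesSketch) -> Self {
        Self { num_rows, sorting_columns: props.sorting_columns.clone() }
    }

    pub fn num_rows(&self) -> i64 {
        self.num_rows
    }

    /// Read side: expose the declared ordering, if any.
    pub fn sorting_columns(&self) -> Option<&Vec<SortingColumn>> {
        self.sorting_columns.as_ref()
    }
}
```

With this shape, a caller would write something like `WriterPropertiesSketch::builder().set_sorting_columns(Some(cols)).build()` and a reader would call `sorting_columns()` on each row group's metadata. Keeping the value optional lets files without declared ordering round-trip unchanged.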

Describe alternatives you've considered
It might be worth considering having the parquet writer determine automatically whether the data is sorted (maybe this would be better than requiring the caller to verify it). However, verifying in the writer would likely be a significant performance hit.
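As a rough illustration of what that verification would involve, here is a minimal sketch of a sortedness check for a single nullable `i64` column. The `compare` and `is_sorted` helpers are hypothetical and written for this example; they are not part of the parquet crate. Even this simple version needs one comparison per adjacent pair of rows, and a real writer would have to handle every physical type and multi-column orderings.

```rust
use std::cmp::Ordering;

pub struct SortingColumn {
    pub column_idx: i32,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Compare two nullable values under the ordering described by `spec`.
fn compare(a: Option<i64>, b: Option<i64>, spec: &SortingColumn) -> Ordering {
    match (a, b) {
        (None, None) => Ordering::Equal,
        // Null placement is controlled by `nulls_first`, independent of direction.
        (None, Some(_)) => {
            if spec.nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if spec.nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        // For non-null values, flip the comparison when descending.
        (Some(x), Some(y)) => {
            if spec.descending { y.cmp(&x) } else { x.cmp(&y) }
        }
    }
}

/// Returns true if `values` is consistent with the ordering in `spec`.
/// Costs one comparison per adjacent pair, i.e. O(n) per column.
pub fn is_sorted(values: &[Option<i64>], spec: &SortingColumn) -> bool {
    values
        .windows(2)
        .all(|pair| compare(pair[0], pair[1], spec) != Ordering::Greater)
}
```

Letting the caller declare the ordering avoids this extra pass, at the cost of trusting that the declaration is correct.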

Additional context
DataFusion is getting more sophisticated in its ability to track and use sortedness information (e.g. apache/arrow-datafusion#4122). If this metadata were included in the parquet file, DataFusion might be able to take more advantage of it: apache/arrow-datafusion#4177.

There is more discussion about this topic here: apache/arrow-datafusion#4169 (comment)

@askoa
Contributor

askoa commented Nov 13, 2022

I'll attempt this one.

@mingmwang

It looks like even the Java parquet implementation doesn't read the sorting columns info when it reads the footer and converts the parquet metadata.

https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java

@alamb
Contributor Author

alamb commented Nov 14, 2022

It seems to have been around for a while: apache/parquet-format@934da01

But I don't know the history.

Interesting
