Parquet: Read/write support for PageIndex (ColumnIndex / OffsetIndex) #1705

Closed
2 tasks done
alamb opened this issue May 16, 2022 · 5 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@alamb
Contributor

alamb commented May 16, 2022

Implementation tracker:

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Suggested by @ParadoxShmaradox on #1191

In analytics systems, parquet files are passed around between different systems, so the more metadata is written by the rust implementation of parquet, the better other systems may be able to handle it.

One such type of metadata is PageIndex https://github.com/apache/parquet-format/blob/master/PageIndex.md

Specifically, I believe this is called a ColumnIndex and OffsetIndex in parquet-format:

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L938-L971

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L926-L932

These structures are used to speed up queries in certain systems; for example, see https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
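To make the idea concrete, here is a minimal sketch of how a reader might use these structures to skip pages. The structs are simplified stand-ins that mirror the field names in parquet.thrift; they are not the parquet crate's API. As assumptions for illustration, min/max values are stored as `i64` (the real `ColumnIndex` stores encoded bytes), `boundary_order` and `null_counts` are omitted, and `pages_matching_range` is a hypothetical helper.

```rust
/// Simplified stand-in for the Thrift `PageLocation` struct.
#[derive(Debug, Clone, PartialEq)]
pub struct PageLocation {
    pub offset: i64,
    pub compressed_page_size: i32,
    pub first_row_index: i64,
}

/// Simplified stand-in for `OffsetIndex`: one location per data page.
#[derive(Debug, Clone)]
pub struct OffsetIndex {
    pub page_locations: Vec<PageLocation>,
}

/// Simplified stand-in for `ColumnIndex`, specialised to i64 values.
/// Assumption: the real struct stores min/max as encoded bytes.
#[derive(Debug, Clone)]
pub struct ColumnIndex {
    pub null_pages: Vec<bool>,
    pub min_values: Vec<i64>,
    pub max_values: Vec<i64>,
}

/// Hypothetical helper: returns the indices of pages whose [min, max]
/// interval overlaps `lo..=hi`. All-null pages are skipped.
pub fn pages_matching_range(index: &ColumnIndex, lo: i64, hi: i64) -> Vec<usize> {
    (0..index.null_pages.len())
        .filter(|&i| {
            !index.null_pages[i] && index.max_values[i] >= lo && index.min_values[i] <= hi
        })
        .collect()
}
```

A query engine could pass the returned page indices to the `OffsetIndex` to find which byte ranges it actually needs to read.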

Describe the solution you'd like

  1. Write PageIndex structures into files created by parquet-rs (specif
  2. Provide access to reading PageIndex from the parquet metadata structures

Note that if creating the PageIndex structures is too expensive, their creation could be optional and controlled via options.
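As a rough sketch of what "optional and controlled via options" could look like, the following collects per-page min/max/null-count statistics only when a flag is set. Everything here is hypothetical: `WriterOptions`, its `write_page_index` flag, `PageStats`, and `build_page_stats` are illustrative names and not part of parquet-rs, and pages are modelled as plain vectors of nullable `i64` values.

```rust
/// Hypothetical writer options; `write_page_index` is an illustrative
/// flag, not an existing parquet-rs setting.
#[derive(Debug, Clone, Copy)]
pub struct WriterOptions {
    pub write_page_index: bool,
}

/// Per-page statistics of the kind a ColumnIndex would record.
#[derive(Debug, Clone, PartialEq)]
pub struct PageStats {
    /// `None` when the page contains only nulls.
    pub min: Option<i64>,
    pub max: Option<i64>,
    pub null_count: usize,
}

/// Computes statistics for each page, or returns `None` without doing
/// any work when the index is disabled.
pub fn build_page_stats(
    pages: &[Vec<Option<i64>>],
    options: WriterOptions,
) -> Option<Vec<PageStats>> {
    if !options.write_page_index {
        return None;
    }
    let stats = pages
        .iter()
        .map(|page| PageStats {
            // `flatten` skips null entries.
            min: page.iter().flatten().copied().min(),
            max: page.iter().flatten().copied().max(),
            null_count: page.iter().filter(|v| v.is_none()).count(),
        })
        .collect();
    Some(stats)
}
```

Returning `None` when the flag is off keeps the cost of the feature at a single branch for writers that do not want the index.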

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Original request here #1191 (comment)

@alamb alamb added parquet Changes to the parquet crate enhancement Any new improvement worthy of a entry in the changelog labels May 16, 2022
@alamb alamb changed the title Parquet: Create PageIndex on write Parquet: Read/write support for PageIndex (ColumnIndex / OffsetIndex) May 16, 2022
@Ted-Jiang
Member

@alamb @tustvold
Our team is building a query engine that is sensitive to time-to-first-result latency (the bottleneck is scan and aggregation), so we want to support PageIndex in arrow-rs to avoid unnecessary IO. I will follow up on the earlier discussions and try to put together a design for this, based on parquet-mr. Any advice would be much appreciated ❤️.
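For illustration, the IO saving could come from turning a set of selected pages into the byte ranges that need to be fetched. The sketch below assumes the selected page indices are sorted ascending and that page locations are listed in file order; `PageLocation` is a simplified stand-in for the Thrift struct and `byte_ranges` is a hypothetical helper, not an arrow-rs function.

```rust
use std::ops::Range;

/// Simplified stand-in for the Thrift `PageLocation` struct.
pub struct PageLocation {
    pub offset: i64,
    pub compressed_page_size: i32,
}

/// Hypothetical helper: byte ranges to fetch for the selected pages.
/// Pages that sit back to back in the file are merged into one range,
/// so they can be read with a single request.
/// Assumption: `selected` is sorted ascending and in bounds.
pub fn byte_ranges(locations: &[PageLocation], selected: &[usize]) -> Vec<Range<i64>> {
    let mut ranges: Vec<Range<i64>> = Vec::new();
    for &i in selected {
        let start = locations[i].offset;
        let end = start + i64::from(locations[i].compressed_page_size);
        if let Some(last) = ranges.last_mut() {
            if last.end == start {
                // Adjacent to the previous range: extend it.
                last.end = end;
                continue;
            }
        }
        ranges.push(start..end);
    }
    ranges
}
```

Pages that are not selected never appear in the output, so their bytes are never requested from storage.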

@tustvold
Contributor

Hi @Ted-Jiang, I think the first step will be to support reading and writing the page index. Once that is done, we can look to include support in #1605. I'm currently actively working on object store IO, so if you wanted to handle the index encode/decode we can probably parallelize this effort 😃

@Ted-Jiang
Member

> Hi @Ted-Jiang, I think the first step will be to support reading and writing the page index. Once that is done, we can look to include support in #1605. I'm currently actively working on object store IO, so if you wanted to handle the index encode/decode we can probably parallelize this effort 😃

Sure! 💪 I would prefer to support reading first (following the Java version) and then check the effect on performance.

@alamb
Contributor Author

alamb commented May 24, 2022

I think parallelizing the implementation, as @tustvold suggests (with the encode/decode of the page index handled separately), would be great.

If creating the page index on write consumes too much time, perhaps we can make writing it optional (e.g. via https://docs.rs/parquet/14.0.0/parquet/file/properties/struct.WriterProperties.html). The same could be done for reading the page index if it is too expensive.

@tustvold
Contributor

I think this is now complete
