Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Reading PageIndex with ParquetRecordBatchStream #2430

Closed
tustvold opened this issue Aug 12, 2022 · 4 comments · Fixed by #2526
Closed

Support Reading PageIndex with ParquetRecordBatchStream #2430

tustvold opened this issue Aug 12, 2022 · 4 comments · Fixed by #2526
Assignees
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@tustvold
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

ParquetRecordBatchStream does not contain the necessary logic to read the offset index

Describe the solution you'd like

Add support for reading the column index to ParquetRecordBatchStream

Describe alternatives you've considered

Additional context

@tustvold tustvold added the enhancement Any new improvement worthy of a entry in the changelog label Aug 12, 2022
@thinkharderdev
Copy link
Contributor

I can take this one

@tustvold
Copy link
Contributor Author

I think this will involve reading the necessary data in ParquetRecordBatchStreamBuilder::new if the metadata returned by AsyncFileReader::get_metadata doesn't already have it loaded. I think it is important to handle the case where the AsyncFileReader is able to provide this information, e.g. if there is some metadata caching, the data is stored in a catalog, etc...

@thinkharderdev
Copy link
Contributor

Was tinkering around with this and we can go one of two ways:

  1. Fetch all the index metadata when constructing the stream.
  2. Fetch the index on demand as needed.

The latter approach, we could add something to AsyncFileReader like

fn get_page_locations(&mut self, metadata: &ParquetMetadata, mask: &ProjectionMask) -> BoxFuture<'_, Result<Vec<Vec<PageLocation>>>

to only fetch what we need for a given projection. But that may be overkill given the amount of data....

@tustvold
Copy link
Contributor Author

I think do the first option for now, the data should be small and relatively cheap to decode. We can always revisit if this simple approach proves insufficient

@alamb alamb added the parquet Changes to the parquet crate label Sep 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants