ParquetReader does not respect time range provided #3944

Closed
v0y4g3r opened this issue May 15, 2024 · 0 comments

v0y4g3r commented May 15, 2024

What type of enhancement is this?

Performance

What does the enhancement do?

ParquetReaderBuilder provides a time_range option to filter rows by timestamp, but the option is currently not used anywhere when reading parquet files. We need to respect this option, which will improve scan performance when an exact time range is provided.

time_range: Option<TimestampRange>,

Implementation challenges

There's a workaround: we can transform the time range into predicates and insert them into ParquetReaderBuilder::predicate. These predicates will then be applied to the batches read from parquet files.
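To make the workaround concrete, here is a minimal sketch of a time range turned into a row-level filter over an already decoded batch. The `TimeRange` type, the `i64` timestamps, the half-open `[start, end)` semantics, and the `time_range_mask` helper are assumptions for illustration; they are not the actual `TimestampRange` or predicate types used by the reader.

```rust
/// Hypothetical stand-in for a timestamp range, assumed half-open: `[start, end)`.
/// `None` means the range is unbounded on that side.
#[derive(Debug, Clone, Copy)]
pub struct TimeRange {
    pub start: Option<i64>,
    pub end: Option<i64>,
}

impl TimeRange {
    /// Returns true if `ts` falls inside the range.
    pub fn contains(&self, ts: i64) -> bool {
        self.start.map_or(true, |s| ts >= s) && self.end.map_or(true, |e| ts < e)
    }
}

/// Builds a selection mask for one decoded batch of timestamps.
/// Every row has already been read and decoded by the time this runs.
pub fn time_range_mask(timestamps: &[i64], range: &TimeRange) -> Vec<bool> {
    timestamps.iter().map(|&ts| range.contains(ts)).collect()
}
```

For a range of `[10, 20)` and timestamps `5, 10, 19, 20`, the mask keeps only the second and third rows. Note that the filter only sees rows that were already fetched from the file.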

But this workaround does not skip pages or row groups, so it still pays the IO cost of reading data outside the range. We need to use the time range to prune row groups and data pages according to the file metadata.
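Row group pruning from metadata could look roughly like the sketch below. It assumes each row group exposes inclusive min/max statistics for the time index column and that the time range is half-open. `TimeRange`, `RowGroupStats`, `may_overlap`, and `select_row_groups` are hypothetical names, not the reader's real types.

```rust
/// Hypothetical stand-in for a timestamp range, assumed half-open: `[start, end)`.
#[derive(Debug, Clone, Copy)]
pub struct TimeRange {
    pub start: Option<i64>,
    pub end: Option<i64>,
}

/// Hypothetical min/max statistics of the time index column in one row group.
/// Both bounds are inclusive.
#[derive(Debug, Clone, Copy)]
pub struct RowGroupStats {
    pub min_ts: i64,
    pub max_ts: i64,
}

/// Returns true if the row group may contain rows inside `range`.
pub fn may_overlap(stats: &RowGroupStats, range: &TimeRange) -> bool {
    let reaches_start = range.start.map_or(true, |s| stats.max_ts >= s);
    let begins_before_end = range.end.map_or(true, |e| stats.min_ts < e);
    reaches_start && begins_before_end
}

/// Returns the indices of the row groups that must still be read.
pub fn select_row_groups(groups: &[RowGroupStats], range: &TimeRange) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, stats)| may_overlap(stats, range))
        .map(|(idx, _)| idx)
        .collect()
}
```

The resulting indices would then be handed to whatever row group selection the underlying reader supports, for example `with_row_groups` on the `parquet` crate's Arrow reader builder if that is the reader in use. Page-level pruning would follow the same overlap test against page index statistics. Row groups without statistics must be kept, a case this sketch does not model.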
