Support bloom filter reading and writing for parquet #3023
Comments
The influxdb_iox project is very interested in this feature and we would love to collaborate with the community to make it happen -- I at least can offer code and design reviews, and blogging about it :)
very cool❤️
a note to myself for this comment cc @alamb
(in case other people have missed it, @jimexist has begun work on this feature ❤️ )
I think the parquet reading/writing support may be done -- the next phase will be to add support to query engines like DataFusion to take advantage of these filters. I plan to write up a ticket in DataFusion over the course of the coming week to do so.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
There are use cases where one wants to search a large amount of parquet data for a relatively small number of rows. For example, if you have distributed tracing data stored as parquet files and want to find the data for a particular trace.
In general, the pattern is "needle in a haystack type query" -- specifically a very selective predicate (passes on only a few rows) on high cardinality (many distinct values) columns.
The rust parquet crate has fairly advanced support for row group pruning, page level indexes, and filter pushdown. These techniques are quite effective when data is sorted and large contiguous ranges of rows can be skipped.
However, needle in a haystack queries still often require substantial amounts of CPU and IO.
One challenge is that typical high cardinality columns such as ids often (by design) span the entire range of values of the data type.
For example, even in the best case, when the data is "optimally sorted" by id within a row group, min/max statistics cannot help skip row groups or pages, because each row group's min/max range still covers nearly every possible value. Instead the entire column must be decoded to search for a particular value.
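To make the limitation concrete, here is a minimal sketch of min/max pruning. The names `MinMax`, `can_skip` and `count_skippable` are hypothetical and are not part of the parquet crate; the sketch only assumes that each row group records the smallest and largest id it contains.

```rust
/// Min/max statistics for one row group (hypothetical type, for illustration only).
#[derive(Debug, Clone, Copy)]
pub struct MinMax {
    pub min: u64,
    pub max: u64,
}

/// Returns true when the statistics prove `needle` cannot be in the row group,
/// i.e. when it lies outside the closed interval [min, max].
pub fn can_skip(stats: MinMax, needle: u64) -> bool {
    needle < stats.min || needle > stats.max
}

/// Counts how many row groups can be ruled out using statistics alone.
pub fn count_skippable(groups: &[MinMax], needle: u64) -> usize {
    groups.iter().filter(|s| can_skip(**s, needle)).count()
}
```

If every row group holds ids spread across almost the whole `u64` range, `count_skippable` returns 0 for nearly any needle, so every row group still has to be decoded. When row groups cover disjoint ranges (for example 0..=99 and 100..=199), the same check rules out all but one of them.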
Describe the solution you'd like
The parquet file format has support for bloom filters: https://github.com/apache/parquet-format/blob/master/BloomFilter.md
A bloom filter is a space efficient structure that allows quickly determining that a value is definitely not in a set (it may report false positives, but never false negatives). So for a parquet file with bloom filters for id in the metadata, the entire row group can be skipped if the id is not present.
I would like the parquet crate to
The format support is here
https://docs.rs/parquet/latest/parquet/format/struct.BloomFilterHeader.html?search=Bloom
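As an illustration of the row group skipping idea, here is a minimal sketch of a classic bloom filter using only the Rust standard library. It is not the split block layout or hash function that the Parquet format specifies, and `SimpleBloom` and its methods are hypothetical names rather than parquet crate APIs. The point is only the contract: `might_contain` returning false proves the value was never inserted.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A classic bloom filter over a bit vector (illustrative only,
/// not the Parquet split block bloom filter).
pub struct SimpleBloom {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

impl SimpleBloom {
    /// Creates an empty filter with at least 64 bits and at least one hash.
    pub fn new(num_bits: u64, num_hashes: u32) -> Self {
        let num_bits = num_bits.max(64);
        let words = ((num_bits + 63) / 64) as usize;
        Self {
            bits: vec![0; words],
            num_bits,
            num_hashes: num_hashes.max(1),
        }
    }

    /// Derives the bit position for `value` under the hash numbered `seed`.
    fn bit_index<T: Hash>(&self, value: &T, seed: u32) -> u64 {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        value.hash(&mut hasher);
        hasher.finish() % self.num_bits
    }

    /// Sets one bit per hash function for `value`.
    pub fn insert<T: Hash>(&mut self, value: &T) {
        for seed in 0..self.num_hashes {
            let i = self.bit_index(value, seed);
            self.bits[(i / 64) as usize] |= 1u64 << (i % 64);
        }
    }

    /// Returns false only if `value` was definitely never inserted;
    /// true means "possibly present" and may be a false positive.
    pub fn might_contain<T: Hash>(&self, value: &T) -> bool {
        (0..self.num_hashes).all(|seed| {
            let i = self.bit_index(value, seed);
            self.bits[(i / 64) as usize] & (1u64 << (i % 64)) != 0
        })
    }
}
```

A reader would build (or load) one such filter per row group for the id column and skip any row group where `might_contain(&id)` is false. A true result still requires decoding the column, since it can be a false positive.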
Describe alternatives you've considered
Additional context
There is some code for parquet bloom filters in https://github.com/jorgecarleitao/parquet2/tree/main/src/bloom_filter from @jorgecarleitao. I am not sure how mature it is, but perhaps we can use/repurpose some of that.