Support bloom filter reading and writing for parquet #3023

alamb · 2022-11-05T10:50:10Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

There are use cases where one wants to search a large amount of parquet data for a relatively small number of rows. For example, you might have distributed tracing data stored as parquet files and want to find the data for a particular trace.

In general, the pattern is "needle in a haystack type query" -- specifically a very selective predicate (passes on only a few rows) on high cardinality (many distinct values) columns.

The Rust parquet crate has fairly advanced support for row group pruning, page level indexes, and filter pushdown. These techniques are quite effective when data is sorted and large contiguous ranges of rows can be skipped.

However, needle in the haystack queries still often require substantial amounts of CPU and IO.

One challenge is that typical high cardinality columns, such as ids, often (by design) span the entire range of values of the data type.

For example, even in the best case, when the data is "optimally sorted" by id within a row group, min/max statistics cannot help skip row groups or pages. Instead, the entire column must be decoded to search for a particular value:

┌──────────────────────────┐                WHERE                 
│            id            │       ┌─────── id = 54322342343      
├──────────────────────────┤       │                              
│       00000000000        │       │                              
├──────────────────────────┤       │    Selective predicate on a  
│       00054542543        │       │    high cardinality column   
├──────────────────────────┤       │                              
│           ...            │       │                              
├──────────────────────────┤       │                              
│        ??????????        │◀──────┘                              
├──────────────────────────┤          Can not rule out ranges     
│           ...            │            using min/max values      
├──────────────────────────┤                                      
│       99545435432        │                                      
├──────────────────────────┤                                      
│       99999999999        │                                      
└──────────────────────────┘                                      
                                                                  
  High cardinality column:                                        
    many distinct values                                          
          (sorted)                                                
                                                                  
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐                                           
   min: 00000000000                                               
│  max: 99999999999   │                                           
                                                                  
│       Metadata      │                                           
 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
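
To make the limitation concrete, here is a minimal sketch of min/max pruning for an `id = <value>` predicate. The names `MinMax`, `can_skip`, and `row_groups_to_scan` are hypothetical and are not part of the parquet crate; the sketch only models the statistics check, not the actual reader.

```rust
/// Hypothetical per-row-group statistics for an integer `id` column.
/// (Illustrative only; not a type from the parquet crate.)
#[derive(Debug, Clone, Copy)]
pub struct MinMax {
    pub min: u64,
    pub max: u64,
}

/// Returns true when the statistics prove that no row can satisfy `id = needle`.
pub fn can_skip(stats: MinMax, needle: u64) -> bool {
    needle < stats.min || needle > stats.max
}

/// Returns the indexes of the row groups that still have to be decoded.
pub fn row_groups_to_scan(stats: &[MinMax], needle: u64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| !can_skip(**s, needle))
        .map(|(i, _)| i)
        .collect()
}
```

When every row group's ids span nearly the whole value range, as in the diagram above, `row_groups_to_scan` returns every index and nothing is pruned. Pruning only helps when the ranges are narrow and disjoint.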

Describe the solution you'd like
The parquet file format has support for bloom filters: https://github.com/apache/parquet-format/blob/master/BloomFilter.md

A bloom filter is a space efficient structure that can quickly determine that a value is definitely not in a set (it may report false positives, but never false negatives). So for a parquet file with bloom filters for id in the metadata, the entire row group can be skipped if the filter reports that the id is not present:

┌──────────────────────────┐                WHERE                
│            id            │      ─ ─ ─ ─ ─ id = 54322342343     
├──────────────────────────┤     │                               
│       00000000000        │           Can quickly check if      
├──────────────────────────┤     │    the value  54322342343     
│       00054542543        │             is not present by       
├──────────────────────────┤     │     consulting the Bloom      
│           ...            │                  Filter             
├──────────────────────────┤     │                               
│        ??????????        │                                     
├──────────────────────────┤     │                               
│           ...            │                                     
├──────────────────────────┤     │                               
│       99545435432        │                                     
├──────────────────────────┤     │                               
│       99999999999        │                                     
└──────────────────────────┘     │                               
  High cardinality column:                                       
    many distinct values         │                               
          (sorted)                                               
                                 │                               
                                                                 
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─      │                               
                           │                                     
│    bloom_filter: ....  ◀ ─ ─ ─ ┘                               
                           │                                     
│  min: 00000000000                                              
   max: 99999999999        │                                     
│                                                                
        Metadata           │                                     
└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
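
As a rough illustration of the idea, here is a toy bloom filter and the row group check it enables. This is a sketch under stated assumptions: it is a classic bit-array filter with double hashing on top of the standard library's `DefaultHasher`, not the split block bloom filter described in the Parquet spec, and `ToyBloomFilter` and `can_skip_row_group` are hypothetical names rather than parquet crate APIs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy bloom filter for illustration only.
/// NOT the split block bloom filter (SBBF) from the Parquet format spec.
pub struct ToyBloomFilter {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

impl ToyBloomFilter {
    /// Creates an empty filter with at least 64 bits and at least one hash.
    pub fn new(num_bits: u64, num_hashes: u32) -> Self {
        let num_bits = num_bits.max(64);
        let words = ((num_bits + 63) / 64) as usize;
        Self { bits: vec![0; words], num_bits, num_hashes: num_hashes.max(1) }
    }

    /// Derives two base hashes that are combined by double hashing.
    fn base_hashes<T: Hash>(value: &T) -> (u64, u64) {
        let mut first = DefaultHasher::new();
        value.hash(&mut first);
        let a = first.finish();
        let mut second = DefaultHasher::new();
        a.hash(&mut second);
        value.hash(&mut second);
        // Force the step to be odd so it is never zero.
        (a, second.finish() | 1)
    }

    /// Computes the bit positions probed for `value`.
    fn bit_positions<T: Hash>(&self, value: &T) -> Vec<u64> {
        let (a, b) = Self::base_hashes(value);
        (0..self.num_hashes as u64)
            .map(|i| a.wrapping_add(i.wrapping_mul(b)) % self.num_bits)
            .collect()
    }

    /// Records `value` in the filter.
    pub fn insert<T: Hash>(&mut self, value: &T) {
        for pos in self.bit_positions(value) {
            self.bits[(pos / 64) as usize] |= 1u64 << (pos % 64);
        }
    }

    /// `false` means definitely absent; `true` means possibly present.
    pub fn might_contain<T: Hash>(&self, value: &T) -> bool {
        self.bit_positions(value)
            .into_iter()
            .all(|pos| self.bits[(pos / 64) as usize] & (1u64 << (pos % 64)) != 0)
    }
}

/// A row group can be skipped for `id = needle` only when the filter
/// reports that the value is definitely absent.
pub fn can_skip_row_group(filter: &ToyBloomFilter, needle: u64) -> bool {
    !filter.might_contain(&needle)
}
```

A writer would call `insert` for every id in a row group and store the bits alongside the other metadata; a reader would call `can_skip_row_group` before decoding the column. A `true` from `might_contain` can be a false positive, so the row group still has to be scanned in that case.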

I would like the parquet crate to:

* support optionally writing Parquet bloom filters into the metadata
* support using Parquet bloom filters during read to make "needle in the haystack" type queries go quickly by skipping entire row groups if the item is not in the bloom filter

The format support is here
https://docs.rs/parquet/latest/parquet/format/struct.BloomFilterHeader.html?search=Bloom

Describe alternatives you've considered

Additional context
There is some code for parquet bloom filters in https://github.com/jorgecarleitao/parquet2/tree/main/src/bloom_filter from @jorgecarleitao. I am not sure how mature it is, but perhaps we can use or repurpose some of it.

alamb · 2022-11-05T10:54:49Z

The influxdb_iox project is very interested in this feature and we would love to collaborate with the community to make it happen -- I at least can offer code and design reviews, and blogging about it :)

aierui · 2022-11-05T16:43:06Z

very cool❤️

jimexist · 2022-11-13T12:49:46Z

a note to myself for this comment

cc @alamb

alamb · 2022-11-13T20:43:04Z

(in case other people have missed it, @jimexist has begun work on this feature ❤️ )

jimexist · 2022-11-24T05:45:38Z

~~@tustvold and @alamb i might not have the bandwidth to dig into how parquet integrates with arrow so i'd maybe defer this to you or anyone else to follow up in the final piece:~~

create an integration test set for parquet crate against pyspark for working with bloom filters #3167

scratch that, i have already done one part in #3176

alamb · 2022-11-27T12:12:21Z

I think the parquet reading/writing support may be done -- the next phase will be to add support to query engines like DataFusion to take advantage of these filters.

I plan to write up a ticket in DataFusion over the course of the coming week to do so

alamb added parquet enhancement help wanted labels Nov 5, 2022

alamb mentioned this issue Nov 5, 2022

feat: add bloom filter when write sst apache/horaedb#370

Merged

jimexist mentioned this issue Nov 9, 2022

add bloom filter implementation based on split block (sbbf) spec #3057

Merged

This was referenced Nov 23, 2022

bloom filter part IV: adjust writer properties, bloom filter properties, and incorporate into column encoder #3165

Merged

create an integration test set for parquet crate against pyspark for working with bloom filters #3167

Closed

tustvold added a commit to tustvold/arrow-rs that referenced this issue Nov 24, 2022

Bloom filter config tweaks (apache#3023)

8f9d0ac

tustvold mentioned this issue Nov 24, 2022

Bloom filter config tweaks (#3023) #3175

Merged

jimexist mentioned this issue Nov 24, 2022

bloom filter part V: add an integration with pytest against pyspark #3176

Merged

tustvold added a commit that referenced this issue Nov 24, 2022

Bloom filter config tweaks (#3023) (#3175)

eefbdce

* Bloom filter config tweaks (#3023)
* Further tweaks

tustvold closed this as completed in #3176 Nov 24, 2022

This was referenced Dec 5, 2022

Support Bloom Filter in parquet reader apache/datafusion#4512

Closed

Support writing BloomFilter in arrow_writer #3275

Closed

aierui mentioned this issue Nov 27, 2023

Support bloom filter when reading/writing parquet files GreptimeTeam/greptimedb#1830

Open