Support Bloom Filter in parquet reader #4512

alamb · 2022-12-05T12:31:38Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Bloom filter support was added to arrow-rs in 28.0.0 (as part of apache/arrow-rs#3023). Here is some of that background copy/pasted:

There are use cases where one wants to search a large amount of parquet data for a relatively small number of rows. For example, you might have distributed tracing data stored as parquet files and want to find the data for a particular trace.

In general, the pattern is a "needle in a haystack" query: a very selective predicate (one that passes only a few rows) on a high cardinality column (one with many distinct values).

DataFusion has fairly advanced support for:

row_group pruning
page index pruning
filter pushdown / late materialization

These techniques are quite effective when data is sorted and large contiguous ranges of rows can be skipped. However, needle in a haystack queries still often require substantial amounts of CPU and IO.

One challenge is that typical high cardinality columns such as ids often (by design) span the entire range of values of the data type.

For example, even in the best case, when the data is "optimally sorted" by id within a row group, min/max statistics cannot help skip row groups or pages. Instead, the entire column must be decoded to search for a particular value.

┌──────────────────────────┐                WHERE                 
│            id            │       ┌─────── id = 54322342343      
├──────────────────────────┤       │                              
│       00000000000        │       │                              
├──────────────────────────┤       │    Selective predicate on a  
│       00054542543        │       │    high cardinality column   
├──────────────────────────┤       │                              
│           ...            │       │                              
├──────────────────────────┤       │                              
│        ??????????        │◀──────┘                              
├──────────────────────────┤          Can not rule out ranges     
│           ...            │            using min/max values      
├──────────────────────────┤                                      
│       99545435432        │                                      
├──────────────────────────┤                                      
│       99999999999        │                                      
└──────────────────────────┘                                      
                                                                  
  High cardinality column:                                        
    many distinct values                                          
          (sorted)                                                
                                                                  
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐                                           
   min: 00000000000                                               
│  max: 99999999999   │                                           
                                                                  
│       Metadata      │                                           
 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
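
To make the limitation concrete, here is a minimal sketch of a min/max pruning check. `ColumnStats` and `can_skip_by_min_max` are hypothetical names made up for illustration, not DataFusion or arrow-rs APIs. With the statistics from the diagram, the check cannot rule out the row group for `id = 54322342343`, because the value falls inside the `[min, max]` range.

```rust
/// Min/max statistics for one column in one row group.
/// Hypothetical type for illustration only (not an arrow-rs type).
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    min: u64,
    max: u64,
}

/// Returns true only if the predicate `col = value` can be ruled out
/// from the min/max statistics alone.
fn can_skip_by_min_max(stats: &ColumnStats, value: u64) -> bool {
    value < stats.min || value > stats.max
}
```

When ids span the whole range of the data type, almost every lookup lands inside `[min, max]`, so this check almost never allows skipping.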

The parquet file format has support for bloom filters: https://github.com/apache/parquet-format/blob/master/BloomFilter.md

A bloom filter is a space efficient structure that allows quickly determining that a value is definitely not in a set (it may occasionally report that a value is present when it is not, but never the reverse). So for a parquet file with bloom filters for id in the metadata, the entire row group can be skipped if the id is not present:

┌──────────────────────────┐                WHERE                
│            id            │      ─ ─ ─ ─ ─ id = 54322342343     
├──────────────────────────┤     │                               
│       00000000000        │           Can quickly check if      
├──────────────────────────┤     │    the value  54322342343     
│       00054542543        │             is not present by       
├──────────────────────────┤     │     consulting the Bloom      
│           ...            │                  Filter             
├──────────────────────────┤     │                               
│        ??????????        │                                     
├──────────────────────────┤     │                               
│           ...            │                                     
├──────────────────────────┤     │                               
│       99545435432        │                                     
├──────────────────────────┤     │                               
│       99999999999        │                                     
└──────────────────────────┘     │                               
  High cardinality column:                                       
    many distinct values         │                               
          (sorted)                                               
                                 │                               
                                                                 
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─      │                               
                           │                                     
│    bloom_filter: ....  ◀ ─ ─ ─ ┘                               
                           │                                     
│  min: 00000000000                                              
   max: 99999999999        │                                     
│                                                                
        Metadata           │                                     
└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
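
The "definitely not present" property can be illustrated with a toy Bloom filter. This is only a sketch: the Parquet format specifies a split block Bloom filter using xxHash (see the BloomFilter.md link above), whereas this example uses a plain bit array and the standard library's `DefaultHasher`. `SimpleBloom` and all of its methods are names invented for this example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy Bloom filter for illustration; NOT the Parquet split block layout.
struct SimpleBloom {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u64,
}

impl SimpleBloom {
    fn new(num_bits: u64, num_hashes: u64) -> Self {
        assert!(num_bits > 0 && num_hashes > 0);
        let words = ((num_bits + 63) / 64) as usize;
        SimpleBloom { bits: vec![0; words], num_bits, num_hashes }
    }

    /// Derive two hash values; probe positions are a + i * b (double hashing).
    fn hash_pair<T: Hash>(value: &T) -> (u64, u64) {
        let mut first = DefaultHasher::new();
        value.hash(&mut first);
        let a = first.finish();
        let mut second = DefaultHasher::new();
        (a, 0x9E37_79B9_7F4A_7C15_u64).hash(&mut second);
        // Force the step to be odd so it is never zero.
        (a, second.finish() | 1)
    }

    fn bit_index(&self, a: u64, b: u64, i: u64) -> usize {
        (a.wrapping_add(i.wrapping_mul(b)) % self.num_bits) as usize
    }

    /// Set every probe bit for `value`.
    fn insert<T: Hash>(&mut self, value: &T) {
        let (a, b) = Self::hash_pair(value);
        for i in 0..self.num_hashes {
            let idx = self.bit_index(a, b, i);
            self.bits[idx / 64] |= 1u64 << (idx % 64);
        }
    }

    /// false => value was definitely never inserted (safe to skip).
    /// true  => value may be present (false positives are possible).
    fn might_contain<T: Hash>(&self, value: &T) -> bool {
        let (a, b) = Self::hash_pair(value);
        (0..self.num_hashes).all(|i| {
            let idx = self.bit_index(a, b, i);
            self.bits[idx / 64] & (1u64 << (idx % 64)) != 0
        })
    }
}
```

A reader would build (or load) one such filter per column per row group and skip the row group whenever `might_contain` returns false for the constant in the predicate.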

Describe the solution you'd like
I would like the ParquetReader in DataFusion to take advantage of Bloom filters when they are present.

This would be in addition to page_filter and row_filter.

Some high level steps are probably:

Add a config option like OPT_PARQUET_PUSHDOWN_FILTERS: https://github.com/apache/arrow-datafusion/blob/34d9bb5e64e01e1baca4f636c855082f4cadc270/datafusion/core/src/config.rs#L53
Identify predicates that can be applied to bloom filters (e.g. col = <constant>)
Add a module that can read bloom filters and apply the predicates to rule out row groups (e.g. test for <constant> in the bloom filter for that column) in https://github.com/apache/arrow-datafusion/blob/34d9bb5e64e01e1baca4f636c855082f4cadc270/datafusion/core/src/physical_plan/file_format/parquet.rs#L481-L486
Add unit tests
Add basic integration tests in https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/parquet_exec.rs
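
As a rough sketch of the second and third steps (identifying `col = <constant>` predicates and using them to rule out row groups), the following uses a heavily simplified model. `Predicate`, `RowGroup`, `equality_constants`, and `row_groups_to_scan` are all hypothetical names, not DataFusion types, and a `HashSet` stands in for the Bloom filter. A real filter answers "maybe present" rather than "present", which is why the code only ever uses a negative answer to skip.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified predicate model (hypothetical; not DataFusion's `Expr`).
enum Predicate {
    /// `column = value`: the shape a Bloom filter can answer.
    Eq { column: String, value: u64 },
    /// Conjunction: every branch must hold.
    And(Box<Predicate>, Box<Predicate>),
    /// Anything else (ranges, ORs, ...) is ignored by this sketch.
    Other,
}

/// Per-row-group filters keyed by column name. The `HashSet` is a
/// stand-in for a Bloom filter; row groups may lack a filter entirely.
struct RowGroup {
    filters: HashMap<String, HashSet<u64>>,
}

/// Collect the `column = constant` terms that must all hold.
fn equality_constants<'a>(p: &'a Predicate, out: &mut Vec<(&'a str, u64)>) {
    match p {
        Predicate::Eq { column, value } => out.push((column.as_str(), *value)),
        Predicate::And(left, right) => {
            equality_constants(left, out);
            equality_constants(right, out);
        }
        Predicate::Other => {}
    }
}

/// Return the indexes of row groups that still need to be read.
fn row_groups_to_scan(groups: &[RowGroup], predicate: &Predicate) -> Vec<usize> {
    let mut eqs = Vec::new();
    equality_constants(predicate, &mut eqs);
    groups
        .iter()
        .enumerate()
        .filter(|(_, group)| {
            // Skip only when a filter exists AND says the value is absent.
            !eqs.iter().any(|(col, value)| {
                group.filters.get(*col).map_or(false, |f| !f.contains(value))
            })
        })
        .map(|(index, _)| index)
        .collect()
}
```

Row groups without a filter for the column are always kept, so the pruning stays correct for files written without Bloom filters.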

Describe alternatives you've considered
Don't add support?

Additional context

Some additional support to properly write bloom filters: apache/arrow-rs#3275


ajayaa · 2022-12-11T18:47:21Z

@alamb - I'm interested in picking this up, unless you or someone else is already working on it.

alamb · 2022-12-12T18:20:06Z

Thanks @ajayaa -- that is great news. No one is actively working on this, though I have time set aside to help with implementation

People who might be interested and were involved with other parts of the implementation might be @tustvold @Jimexist @thinkharderdev and @Ted-Jiang

ajayaa · 2022-12-12T18:37:49Z

Thanks @alamb . Pretty new to rust-lang - please bear with me. I should have something in the next 4-5 days.

ajayaa · 2022-12-12T18:39:59Z

BTW - I started with this small PR - #4583. I would appreciate it if you could take a look, @alamb. I haven't ported all the tests. I just wanted to make sure I am on the right track before porting over everything.

Ted-Jiang · 2023-01-17T03:07:56Z

If no one has started yet, I will start this one 😄

alamb · 2023-01-17T20:39:34Z

Awesome -- thanks @Ted-Jiang . Another interesting project might be #4085 ;)

ozgrakkurt · 2023-08-04T21:34:43Z

Hey @alamb! it seems like this was postponed? Can I take this if @Ted-Jiang isn't working on it anymore?

alamb · 2023-08-05T11:19:56Z

Hi @ozgrakkurt -- it is fine with me ! I don't know of anyone else working on this at this time. Maybe @tustvold knows more but I suspect the community would be very appreciative of contributions in this area.

ozgrakkurt · 2023-08-07T17:41:08Z

> Hi @ozgrakkurt -- it is fine with me ! I don't know of anyone else working on this at this time. Maybe @tustvold knows more but I suspect the community would be very appreciative of contributions in this area.

Thanks! For now I have switched to an external indexing implementation in my project, but I will try to do this when I get free time.

hengfeiyang · 2023-09-04T15:57:11Z

@Ted-Jiang Are you still working on it?

hengfeiyang · 2023-09-05T13:10:05Z

@ozgrakkurt Do you have time to do this? This is an awesome feature.

Ted-Jiang · 2023-09-06T07:00:52Z

@ozgrakkurt Sure, please go ahead! I will be glad if this feature is supported 👍

Ted-Jiang · 2023-09-06T07:01:33Z

Maybe you should start with arrow-rs.

hengfeiyang · 2023-09-19T15:31:05Z

@Ted-Jiang I am looking into this issue. I looked at your draft PR and the latest DataFusion code. We can create a method in datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs, but we need a reader to read the bloom filter data, and the reader created here https://github.com/apache/arrow-datafusion/blob/31.0.0/datafusion/core/src/datasource/physical_plan/parquet.rs#L427 can't be used to load that data. Any suggestions for implementing this? Thanks.

Ted-Jiang · 2023-09-20T03:09:11Z

@hengfeiyang Have you seen the issue apache/arrow-rs#3851? I had decided to go that way, but something at my company stopped me from moving on.

hengfeiyang · 2023-09-20T03:11:05Z

@Ted-Jiang Thanks, let me check.

alamb · 2023-11-06T11:14:48Z

Completed by #7821

alamb added the enhancement New feature or request label Dec 5, 2022

alamb changed the title ~~Add Bloom Filter support to parquet reader~~ Support Bloom Filter in parquet reader Dec 5, 2022

alamb added the help wanted Extra attention is needed label Dec 5, 2022

alamb mentioned this issue Dec 5, 2022

Support writing BloomFilter in arrow_writer apache/arrow-rs#3275

Closed

2 tasks

Ted-Jiang self-assigned this Mar 10, 2023

This was referenced Mar 13, 2023

Support get_row_group in AsyncFileReader apache/arrow-rs#3851

Closed

Support using Bloom Filter in parquet reader #5569

Closed

hengfeiyang mentioned this issue Oct 14, 2023

feat: Use bloom filter when reading parquet to skip row groups #7821

Merged

4 tasks

alamb closed this as completed Nov 6, 2023

aierui mentioned this issue Nov 27, 2023

Support bloom filter when reading/writing parquet files GreptimeTeam/greptimedb#1830

Open
