feat: add method for async read bloom filter #4917

hengfeiyang · 2023-10-11T00:18:09Z

Which issue does this PR close?

We want to filter row groups in DataFusion, but there is no async API for reading bloom filters.

What changes are included in this PR?

Implemented a function get_row_group_column_bloom_filter on ParquetRecordBatchStreamBuilder to support reading bloom filters outside arrow.

Are there any user-facing changes?

Adds a function get_row_group_column_bloom_filter to ParquetRecordBatchStreamBuilder.

tustvold

This looks good, left some minor comments, but I think all this needs is a test

tustvold · 2023-10-11T06:55:37Z

parquet/src/arrow/async_reader/mod.rs

+        let buffer = self
+            .input
+            .0
+            .get_bytes(offset..offset + SBBF_HEADER_SIZE_ESTIMATE)


There is a new bloom_filter_length that may be present and would avoid needing to guess here

Thanks, I checked the bloom_filter module and then updated this part.
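As a rough illustration of the suggestion above, the first read can be sized from bloom_filter_length when the column metadata carries it, falling back to a guess otherwise. This is a minimal sketch, not the code merged in this PR: `first_read_range` and `HEADER_SIZE_ESTIMATE` are hypothetical names, and the value 20 is an assumption (the thread does not show the value of SBBF_HEADER_SIZE_ESTIMATE). It relies on bloom_filter_length covering the serialized header plus the bitset, as the Parquet format defines it.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the header-size guess used when no length is
/// recorded in the metadata. The value 20 is an assumption for illustration.
const HEADER_SIZE_ESTIMATE: usize = 20;

/// Returns the byte range for the first read, plus a flag telling the caller
/// whether that range is known to cover the whole filter (header and bitset).
fn first_read_range(offset: usize, bloom_filter_length: Option<usize>) -> (Range<usize>, bool) {
    match bloom_filter_length {
        // Length is recorded: one read covers the header and the bitset.
        Some(len) => (offset..offset + len, true),
        // Length is missing: read an estimated header and decide afterwards.
        None => (offset..offset + HEADER_SIZE_ESTIMATE, false),
    }
}
```

For example, `first_read_range(100, Some(2048))` yields the range `100..2148`, while `first_read_range(100, None)` yields only the estimated header range `100..120`.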

tustvold · 2023-10-11T06:57:08Z

parquet/src/arrow/async_reader/mod.rs

+        let bitset = self
+            .input
+            .0
+            .get_bytes(bitset_offset..bitset_offset + length)


I think it would be ideal if we could avoid this extra roundtrip in the common case, by fetching enough data in the first call

The first call is used to parse bloom_filter_length, and the second call is used to parse bloom_filter_data. We can save one call if we know bloom_filter_length. Thanks, I updated it. Can you help review again?
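The round-trip saving discussed in this thread can be sketched with a toy reader that counts its calls. Everything here is hypothetical and only models the control flow: the real bloom filter header is Thrift-encoded, whereas this sketch uses a 4-byte little-endian length prefix as a stand-in, and `CountingReader`, `toy_filter`, `read_bitset`, and `HEADER_ESTIMATE` are invented names.

```rust
use std::cell::Cell;
use std::ops::Range;

/// Toy header: a 4-byte little-endian bitset length (assumption, not the
/// real Thrift-encoded header).
const TOY_HEADER_LEN: usize = 4;
/// Hypothetical size of the first read when the filter length is unknown.
const HEADER_ESTIMATE: usize = 20;

/// In-memory reader that records how many range requests were made.
struct CountingReader {
    data: Vec<u8>,
    calls: Cell<usize>,
}

impl CountingReader {
    fn new(data: Vec<u8>) -> Self {
        CountingReader { data, calls: Cell::new(0) }
    }

    /// Returns the requested bytes, clamped to the end of the data.
    fn get_bytes(&self, range: Range<usize>) -> Vec<u8> {
        self.calls.set(self.calls.get() + 1);
        let end = range.end.min(self.data.len());
        let start = range.start.min(end);
        self.data[start..end].to_vec()
    }
}

/// Builds a toy filter: length prefix followed by the bitset bytes.
fn toy_filter(bitset: &[u8]) -> Vec<u8> {
    let mut out = (bitset.len() as u32).to_le_bytes().to_vec();
    out.extend_from_slice(bitset);
    out
}

/// Reads the bitset, issuing a second request only when the first read
/// did not already contain all of it. Assumes at least 4 bytes are readable.
fn read_bitset(reader: &CountingReader, offset: usize, known_len: Option<usize>) -> Vec<u8> {
    let first_len = known_len.unwrap_or(HEADER_ESTIMATE);
    let buf = reader.get_bytes(offset..offset + first_len);
    let n = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    if buf.len() >= TOY_HEADER_LEN + n {
        // Common case when the length is known: no extra round trip.
        return buf[TOY_HEADER_LEN..TOY_HEADER_LEN + n].to_vec();
    }
    // Fallback: fetch the bitset separately.
    let start = offset + TOY_HEADER_LEN;
    reader.get_bytes(start..start + n)
}
```

With a 64-byte bitset, passing the full filter length makes `read_bitset` finish after one `get_bytes` call, while passing `None` takes two.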

hengfeiyang · 2023-10-11T08:18:31Z

@tustvold Sure, I will try to add two test cases:

1. for a Parquet file that has the bloom_filter_length header
2. for a Parquet file that has no bloom_filter_length header

hengfeiyang · 2023-10-11T08:29:04Z

@tustvold Can I create two test Parquet files and commit them to https://github.com/apache/parquet-testing/ ?

tustvold · 2023-10-11T08:32:36Z

You could, but I don't have merge rights there so it may take some time.

A quicker option might be to use an existing file for 1., and to write a file to a Vec for 2.

hengfeiyang · 2023-10-11T08:50:19Z

@tustvold Thanks, I will use {testdata}/alltypes_plain.parquet as the base file and generate the other files from it.

mapleFU · 2023-10-11T08:54:29Z

Would you mind taking a look at data_index_bloom_encoding_stats.parquet under the parquet-testing repo? I think it contains a bloom filter for the first column.

tustvold

Looks good to me, thank you

feat: add method for async read bloomfilter

3aa3da4

github-actions bot added the parquet (Changes to the parquet crate) label Oct 11, 2023

tustvold reviewed Oct 11, 2023

fix: compatible for bloom filter length

de1b46e

test: add unit tests for read bloom filter

c6fcc5b

hengfeiyang requested a review from tustvold October 11, 2023 11:49

fix: format code for unit test

06f7369

tustvold approved these changes Oct 12, 2023

tustvold merged commit 6e49f31 into apache:master Oct 12, 2023
16 checks passed

hengfeiyang mentioned this pull request Oct 14, 2023

feat: Use bloom filter when reading parquet to skip row groups apache/datafusion#7821

Merged

4 tasks

tustvold mentioned this pull request Oct 23, 2023

Support get_row_group in AsyncFileReader #3851

Closed
