Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add method for async read bloom filter #4917

Merged
merged 4 commits into from
Oct 12, 2023

Conversation

hengfeiyang
Copy link
Contributor

Which issue does this PR close?

Impl #3851

We want to filter row_groups in Datafusion but there is no async API for reading bloom filter.

What changes are included in this PR?

Implemented a function get_row_group_column_bloom_filter for ParquetRecordBatchStreamBuilder to support reading bloom filter outside arrow.

Are there any user-facing changes?

Add an function get_row_group_column_bloom_filter in ParquetRecordBatchStreamBuilder

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 11, 2023
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, left some minor comments, but I think all this needs is a test

let buffer = self
.input
.0
.get_bytes(offset..offset + SBBF_HEADER_SIZE_ESTIMATE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a new bloom_filter_length that may be present and would avoid needing to guess here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, i checked the module bloom_filter and then updated this part.

let bitset = self
.input
.0
.get_bytes(bitset_offset..bitset_offset + length)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be ideal if we could avoid this extra roundtrip in the common case, by fetching enough data in the first call

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first call is used to parse bloom_filter_length, and the second call is used to parse bloom_filter_data, We can reduce one call if we know the bloom_filter_length, Thanks, I updated. Can you help review again?

@hengfeiyang
Copy link
Contributor Author

@tustvold Sure, I will try to add two test cases:

  1. for the parquet file has bloom_filter_length header
  2. for the parquet file has no bloom_filter_length header

@hengfeiyang
Copy link
Contributor Author

@tustvold Can i create two test parquet files and commit to https://github.com/apache/parquet-testing/ ?

@tustvold
Copy link
Contributor

You could, but I don't have merge rights there so it may take some time.

A quicker option might be to use an existing file for 1., and to write a file to a Vec for 2.

@hengfeiyang
Copy link
Contributor Author

hengfeiyang commented Oct 11, 2023

@tustvold Thanks, i will use {testdata}/alltypes_plain.parquet as base file and generate other files.

@mapleFU
Copy link
Member

mapleFU commented Oct 11, 2023

Would you mind take a look at data_index_bloom_encoding_stats.parquet under parquet-testing repo? I think it contains a bloom filter for the first column

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you

@tustvold tustvold merged commit 6e49f31 into apache:master Oct 12, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants