
add bloom filter implementation based on split block (sbbf) spec #3057

Merged: 14 commits into master from add-bloom-filter, Nov 13, 2022

Conversation

@Jimexist (Member) commented Nov 9, 2022

Which issue does this PR close?

Part I of:

Next up:

  • read block data from column chunk metadata
  • set up the reader to filter by this bloom filter

The test data in this implementation is partly based on https://github.com/jorgecarleitao/parquet2/tree/main/src/bloom_filter.

Rationale for this change

Add a bloom filter implementation based on the split block (SBBF) spec.

What changes are included in this PR?

SBBF implementation per spec
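
For orientation, here is a minimal sketch of the shape of the new module. Sbbf::new appears in the test excerpt further down; the Block layout matches the spec's 256-bit blocks, while the insert/check names and exact signatures are assumptions, not necessarily this PR's code:

```rust
/// One 256-bit block: eight 32-bit words, sized to fit a cache line.
type Block = [u32; 8];

/// A split block bloom filter (SBBF) over a vector of blocks.
pub struct Sbbf(Vec<Block>);

impl Sbbf {
    /// Build the filter from a raw little-endian bitset.
    pub fn new(bitset: &[u8]) -> Self {
        todo!("pack the bytes of `bitset` into a Vec<Block>")
    }

    /// Set the bits selected by `hash` in its block.
    pub fn insert(&mut self, hash: u64) {
        todo!()
    }

    /// Check whether `hash` may have been inserted (false positives are
    /// possible, false negatives are not).
    pub fn check(&self, hash: u64) -> bool {
        todo!()
    }
}
```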

Are there any user-facing changes?

@Jimexist requested a review from alamb November 9, 2022 02:17
@github-actions bot added the parquet (Changes to the parquet crate) label Nov 9, 2022
@Jimexist changed the title from "add bloom filter implementation based on split block spec" to "add bloom filter implementation based on split block (sbbf) spec" Nov 9, 2022
@Jimexist (Member Author):

It turns out I had to make the sbbf module public to keep clippy happy, but I am going to make use of it soon anyway, and can maybe change it to crate-private later.

One thing that has been bothering me: down the line, RowGroupReader, PageReader, etc. all make use of the ChunkReader trait, which requires both an offset and a length to do actual IO. Bloom filters are different: the metadata only carries an offset, and the length is read out of the BloomFilterHeader itself, so with only an offset there's no way to read via ChunkReader.

I'd probably need to set up a separate trait for this purpose, along the lines of the sketch below...
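
A hypothetical shape for such a trait; every name here is an assumption, not part of this PR:

```rust
use std::io::{Read, Result};

/// Hypothetical: hand out a stream starting at `offset`, with no length known
/// up front (ChunkReader, by contrast, requires both offset and length).
trait OffsetRead {
    type T: Read;
    fn read_from(&self, offset: u64) -> Result<Self::T>;
}
```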

@tustvold (Contributor):

> the metadata only carries an offset, and the length is read out of the BloomFilterHeader itself, so with only an offset there's no way to read via ChunkReader.

We may have to do something similar to what we do for the footer: guess what the length is, and then read additional data if necessary (rough sketch below).
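
A minimal sketch of that footer-style pattern, using Read + Seek as in this PR's read_from_column_chunk; the guess size, names, and fallback comment are assumptions:

```rust
use std::io::{Read, Result, Seek, SeekFrom};

/// Guess-then-extend: read a fixed-size chunk at `offset`, try to decode the
/// thrift BloomFilterHeader from it, and only issue a second read if the
/// bitset turns out to extend past the guess.
fn read_with_guess<R: Read + Seek>(reader: &mut R, offset: u64) -> Result<Vec<u8>> {
    const GUESS: usize = 1024; // made-up initial guess, akin to the footer read
    reader.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; GUESS];
    let n = reader.read(&mut buf)?;
    buf.truncate(n);
    // ...decode the BloomFilterHeader from `buf` here; if the header says the
    // bitset is longer than what remains in `buf`, read the rest from `reader`.
    Ok(buf)
}
```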

@Jimexist (Member Author):

> > the metadata only carries an offset, and the length is read out of the BloomFilterHeader itself, so with only an offset there's no way to read via ChunkReader.
>
> We may have to do something similar to what we do for the footer: guess what the length is, and then read additional data if necessary.

Thanks, do you happen to know the link to that example?

@Jimexist (Member Author):

I intend to merge this one first before moving on to parsing the bitset from the parquet file.

@alamb (Contributor) commented Nov 10, 2022

I will try and review this PR later today. Thank you @Jimexist

@alamb (Contributor) commented Nov 11, 2022

I started (but did not finish) reviewing this PR, thank you. I need to find some dedicated time to study the bloom filter spec in more detail.

So far, my analysis is that twox-hash seems reasonable: it is widely used (https://crates.io/crates/twox-hash/reverse_dependencies).

@alamb (Contributor) left a review:

Looks great @Jimexist -- thank you

I went over the code in detail while reading https://github.com/apache/parquet-format/blob/master/BloomFilter.md#technical-approach. The implementation seems to conform well to the spec.

My only substantial suggestions are about testing. I left a link to another test from @jorgecarleitao's parquet2 (and its comments) that would be good to add.

Also cc @zeevm and @shanisolomon, who previously showed some interest in this work when exposing the metadata:

https://github.com/apache/arrow-rs/pulls?q=is%3Apr+bloom+is%3Aclosed

All in all 🚀 very nice and THANK YOU

parquet/src/bloom_filter/mod.rs (resolved review threads)
fn mask(x: u32) -> Block {
let mut result = [0_u32; 8];
for i in 0..8 {
// wrapping instead of checking for overflow
Contributor:

I don't know the implications of using wrapping mul here.

Member Author:

Basically it is very likely to wrap, given that the salt values are numerically large, but the idea of salting is to make the distribution pseudo-random, so wrapping is fine here. See the sketch below.
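
For reference, the mask step as the spec describes it. The SALT constants are from parquet-format's BloomFilter.md; the rest is a sketch, not necessarily this PR's exact code:

```rust
type Block = [u32; 8];

// Per-word salt constants from the parquet-format SBBF spec.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

fn mask(x: u32) -> Block {
    let mut result = [0_u32; 8];
    for (i, salt) in SALT.iter().enumerate() {
        // The multiply is expected to overflow (hence wrapping_mul); only the
        // top 5 bits survive the shift, selecting one bit per 32-bit word.
        let y = x.wrapping_mul(*salt) >> 27;
        result[i] = 1 << y;
    }
    result
}
```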

parquet/src/bloom_filter/mod.rs (outdated; resolved)
Self(data)
}

pub fn read_from_column_chunk<R: Read + Seek>(
Contributor:

Is there any way to write a test for this function? Maybe we can do so eventually using the data in https://github.com/apache/parquet-testing/tree/master/data

Member Author:

Yes, eventually, but I think it can be done later, after this is used in the reader (likely RowGroupReader).

parquet/src/bloom_filter/mod.rs (resolved)
@@ -57,6 +57,7 @@ seq-macro = { version = "0.3", default-features = false }
futures = { version = "0.3", default-features = false, features = ["std"], optional = true }
tokio = { version = "1.0", optional = true, default-features = false, features = ["macros", "rt", "io-util"] }
hashbrown = { version = "0.13", default-features = false }
twox-hash = { version = "1.6", optional = true }
Contributor:

👍

Jimexist and others added 4 commits November 11, 2022 21:40
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@alamb (Contributor) left a review:

👨‍🍳 👌 Looks very good to me

200, 1, 80, 20, 64, 68, 8, 109, 6, 37, 4, 67, 144, 80, 96, 32, 8, 132, 43,
33, 0, 5, 99, 65, 2, 0, 224, 44, 64, 78, 96, 4,
];
let sbbf = Sbbf::new(bitset);
Contributor:

What do you think about adding the second test from https://github.com/jorgecarleitao/parquet2/blob/main/src/bloom_filter/mod.rs#L14-L69? We could definitely do it as a follow-on PR.

@Jimexist (Member Author) commented Nov 12, 2022:

I think it is essentially the same as the one already added; I'd prefer to implement the whole thing and then test it using real bloom filter test data files.

@alamb (Contributor) commented Nov 12, 2022

@viirya or @tustvold any concerns about merging this?

Comment on lines +162 to +166
pub fn hash_bytes<A: AsRef<[u8]>>(value: A) -> u64 {
let mut hasher = XxHash64::with_seed(SEED);
hasher.write(value.as_ref());
hasher.finish()
}
Member:

Is the hash parameter on the functions above generated by this function? Perhaps add a doc comment for this public function (see the sketch below).
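
For illustration, a documented version might look like the following; the doc wording and the insert/check method names it references are suggestions, while the seed value of 0 is what the parquet spec mandates for xxHash:

```rust
use std::hash::Hasher;
use twox_hash::XxHash64;

/// Seed for xxHash64; the parquet bloom filter spec fixes this at 0.
const SEED: u64 = 0;

/// Hashes `value` with xxHash64 (seed 0, per the parquet spec), producing the
/// 64-bit `hash` argument consumed by the insert/check functions above.
pub fn hash_bytes<A: AsRef<[u8]>>(value: A) -> u64 {
    let mut hasher = XxHash64::with_seed(SEED);
    hasher.write(value.as_ref());
    hasher.finish()
}
```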

Member Author:

TBH I only added pub to silence clippy; otherwise it warns about an unused function.

I plan to do more around this later, but getting this pull request merged is a first step.

The alternative, of course, is to just code the whole thing, but that would make code review harder.

Contributor:

Thank you for breaking it up into small pieces!

fn hash_to_block_index(&self, hash: u64) -> usize {
    // unchecked_mul is unstable; saturating_mul is safe here, and in practice
    // it never saturates: (hash >> 32) < 2^32 and the block count is small,
    // so the product always fits in a u64
    (((hash >> 32).saturating_mul(self.0.len() as u64)) >> 32) as usize
}
Member:

Is this guaranteed to be in the range of valid block indexes?

Member Author:

Yes, this is per spec:

The filter_insert operation first uses the most significant 32 bits of its argument to select a block to operate on. Call the argument "h", and recall the use of "z" to mean the number of blocks. Then a block number i between 0 and z-1 (inclusive) to operate on is chosen as follows:

unsigned int64 h_top_bits = h >> 32;
unsigned int64 z_as_64_bit = z;
unsigned int32 i = (h_top_bits * z_as_64_bit) >> 32;

The first line extracts the most significant 32 bits from h and assigns them to a 64-bit unsigned integer. The second line is simpler: it just sets an unsigned 64-bit value to the same value as the 32-bit unsigned value z. The purpose of having both h_top_bits and z_as_64_bit be 64-bit values is so that their product is a 64-bit value. That product is taken in the third line, and then the most significant 32 bits are extracted into the value i, which is the index of the block that will be operated on.

After this process to select i, filter_insert uses the least significant 32 bits of h as the argument to block_insert called on block i.

The technique for converting the most significant 32 bits to an integer between 0 and z-1 (inclusive) avoids using the modulo operation, which is often very slow. This trick can be found in Kenneth A. Ross's 2006 IBM research report, "Efficient Hash Probes on Modern Processors"
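
A worked example of that multiply-shift, with made-up values:

```rust
fn main() {
    // With z = 4 blocks and the top 32 bits of the hash all set, the
    // multiply-shift maps h to the last block; the result always lands in
    // 0..z without needing a (slow) modulo.
    let h: u64 = 0xFFFF_FFFF_0000_0000;
    let z: u64 = 4;
    let i = ((h >> 32) * z) >> 32;
    assert_eq!(i, 3); // z - 1
}
```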

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
@@ -84,6 +84,8 @@ pub mod arrow;
pub mod column;
experimental!(mod compression);
experimental!(mod encodings);
#[cfg(feature = "bloom")]
Contributor:

It occurs to me that you probably need to add bloom to the list of features in CI to get this tested:

- name: Test arrow with all features apart from simd
run: cargo test -p arrow --features=force_validate,prettyprint,ipc_compression,ffi,dyn_cmp_dict,dyn_arith_dict,chrono-tz

Also perhaps it is worth a mention (as experimental) in https://github.com/apache/arrow-rs/tree/master/arrow#feature-flags

Member Author:

I'll try to add that after the feature is completed?

Member Author:

On a second look, @alamb, I think the bloom filter is added in parquet, not in arrow, so I don't think your comment is applicable per se.

Contributor:

Sorry, I think the feature flag should be added to the parquet docs: https://github.com/apache/arrow-rs/tree/master/parquet#feature-flags

I double-checked the parquet CI and it looks like this feature will be enabled and thus covered:

run: cargo check -p parquet --all-features

@Jimexist (Member Author):

Let me try to merge this and address the remaining issues in a subsequent pull request.

@Jimexist merged commit b7af85c into master Nov 13, 2022
@Jimexist deleted the add-bloom-filter branch November 13, 2022 13:07
@ursabot commented Nov 13, 2022

Benchmark runs are scheduled for baseline = 3084ee2 and contender = b7af85c. b7af85c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
