
add bloom filter implementation based on split block (sbbf) spec #3057

Merged: 14 commits into master from add-bloom-filter, Nov 13, 2022

Conversation

@Jimexist (Member) commented Nov 9, 2022

Which issue does this PR close?

Part I of:

Next up:

  • read block data from column chunk metadata
  • set up the reader to filter by this bloom filter

The test data in this implementation is partly based on https://github.com/jorgecarleitao/parquet2/tree/main/src/bloom_filter.

Rationale for this change

Add a bloom filter implementation based on the split block (SBBF) spec.

What changes are included in this PR?

SBBF implementation per spec
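
For orientation, here is a minimal sketch of the shape of the new module. Sbbf::new appears in the test excerpt further down; the Block layout matches the spec's 256-bit blocks, while the insert/check names and exact signatures are assumptions, not necessarily this PR's code:

```rust
/// One 256-bit block: eight 32-bit words, sized to fit a cache line.
type Block = [u32; 8];

/// A split block bloom filter (SBBF) over a vector of blocks.
pub struct Sbbf(Vec<Block>);

impl Sbbf {
    /// Build the filter from a raw little-endian bitset.
    pub fn new(bitset: &[u8]) -> Self {
        todo!("pack the bytes of `bitset` into a Vec<Block>")
    }

    /// Set the bits selected by `hash` in its block.
    pub fn insert(&mut self, hash: u64) {
        todo!()
    }

    /// Check whether `hash` may have been inserted (false positives are
    /// possible, false negatives are not).
    pub fn check(&self, hash: u64) -> bool {
        todo!()
    }
}
```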

Are there any user-facing changes?

@Jimexist requested a review from alamb November 9, 2022 02:17
@github-actions bot added the parquet (Changes to the parquet crate) label Nov 9, 2022
@Jimexist changed the title from "add bloom filter implementation based on split block spec" to "add bloom filter implementation based on split block (sbbf) spec" Nov 9, 2022
@Jimexist (Member Author):

It turns out I had to make the sbbf module public to keep clippy happy, but I am going to make use of it soon anyway, and can maybe change it to crate-private later.

One thing that has been bothering me: down the line, RowGroupReader, PageReader, etc. all make use of the ChunkReader trait, which requires both an offset and a length to do actual IO. Bloom filters are different: the metadata only carries an offset, and the length is read out of the BloomFilterHeader itself, so with only an offset there's no way to read via ChunkReader.

I'd probably need to set up a separate trait for this purpose, along the lines of the sketch below...
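
A hypothetical shape for such a trait; every name here is an assumption, not part of this PR:

```rust
use std::io::{Read, Result};

/// Hypothetical: hand out a stream starting at `offset`, with no length known
/// up front (ChunkReader, by contrast, requires both offset and length).
trait OffsetRead {
    type T: Read;
    fn read_from(&self, offset: u64) -> Result<Self::T>;
}
```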

@tustvold (Contributor):

> the metadata only carries an offset, and the length is read out of the BloomFilterHeader itself, so with only an offset there's no way to read via ChunkReader.

We may have to do something similar to what we do for the footer: guess what the length is, and then read additional data if necessary (rough sketch below).
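
A minimal sketch of that footer-style pattern, using Read + Seek as in this PR's read_from_column_chunk; the guess size, names, and fallback comment are assumptions:

```rust
use std::io::{Read, Result, Seek, SeekFrom};

/// Guess-then-extend: read a fixed-size chunk at `offset`, try to decode the
/// thrift BloomFilterHeader from it, and only issue a second read if the
/// bitset turns out to extend past the guess.
fn read_with_guess<R: Read + Seek>(reader: &mut R, offset: u64) -> Result<Vec<u8>> {
    const GUESS: usize = 1024; // made-up initial guess, akin to the footer read
    reader.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; GUESS];
    let n = reader.read(&mut buf)?;
    buf.truncate(n);
    // ...decode the BloomFilterHeader from `buf` here; if the header says the
    // bitset is longer than what remains in `buf`, read the rest from `reader`.
    Ok(buf)
}
```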

@Jimexist (Member Author):

> > the metadata only carries an offset, and the length is read out of the BloomFilterHeader itself, so with only an offset there's no way to read via ChunkReader.
>
> We may have to do something similar to what we do for the footer: guess what the length is, and then read additional data if necessary.

Thanks, do you happen to know the link to that example?

@Jimexist (Member Author):

I intend to merge this one first before moving on to parsing the bitset from the parquet file.

@alamb (Contributor) commented Nov 10, 2022

I will try and review this PR later today. Thank you @Jimexist

@alamb (Contributor) commented Nov 11, 2022

I started (but did not finish) reviewing this PR, thank you. I need to find some dedicated time to study the bloom filter spec in more detail.

So far, my analysis is that twox-hash seems reasonable: it is widely used (https://crates.io/crates/twox-hash/reverse_dependencies).

@alamb (Contributor) left a review:

Looks great @Jimexist -- thank you

I went over the code in detail while reading https://github.com/apache/parquet-format/blob/master/BloomFilter.md#technical-approach. The implementation seems to conform well to the spec.

My only substantial suggestions are about testing. I left a link to another test from @jorgecarleitao's parquet2 (and its comments) that would be good to add.

Also cc @zeevm and @shanisolomon, who previously showed some interest in this work when exposing the metadata:

https://github.com/apache/arrow-rs/pulls?q=is%3Apr+bloom+is%3Aclosed

All in all 🚀 very nice and THANK YOU

parquet/src/bloom_filter/mod.rs (resolved review threads)
fn mask(x: u32) -> Block {
let mut result = [0_u32; 8];
for i in 0..8 {
// wrapping instead of checking for overflow
Contributor:

I don't know the implications of using wrapping mul here.

Member Author:

Basically it is very likely to wrap, given that the salt values are numerically large, but the idea of salting is to make the distribution pseudo-random, so wrapping is fine here. See the sketch below.
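
For reference, the mask step as the spec describes it. The SALT constants are from parquet-format's BloomFilter.md; the rest is a sketch, not necessarily this PR's exact code:

```rust
type Block = [u32; 8];

// Per-word salt constants from the parquet-format SBBF spec.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

fn mask(x: u32) -> Block {
    let mut result = [0_u32; 8];
    for (i, salt) in SALT.iter().enumerate() {
        // The multiply is expected to overflow (hence wrapping_mul); only the
        // top 5 bits survive the shift, selecting one bit per 32-bit word.
        let y = x.wrapping_mul(*salt) >> 27;
        result[i] = 1 << y;
    }
    result
}
```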

parquet/src/bloom_filter/mod.rs (outdated; resolved)
Self(data)
}

pub fn read_from_column_chunk<R: Read + Seek>(
Contributor:

Is there any way to write a test for this function? Maybe we can do so eventually using the data in https://github.com/apache/parquet-testing/tree/master/data

Member Author:

Yes, eventually, but I think it can be done later, after this is used in the reader (likely RowGroupReader).

parquet/src/bloom_filter/mod.rs (resolved)
@@ -57,6 +57,7 @@ seq-macro = { version = "0.3", default-features = false }
futures = { version = "0.3", default-features = false, features = ["std"], optional = true }
tokio = { version = "1.0", optional = true, default-features = false, features = ["macros", "rt", "io-util"] }
hashbrown = { version = "0.13", default-features = false }
twox-hash = { version = "1.6", optional = true }
Contributor:

👍

Jimexist and others added 4 commits November 11, 2022 21:40
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@alamb (Contributor) left a review:

👨‍🍳 👌 Looks very good to me

200, 1, 80, 20, 64, 68, 8, 109, 6, 37, 4, 67, 144, 80, 96, 32, 8, 132, 43,
33, 0, 5, 99, 65, 2, 0, 224, 44, 64, 78, 96, 4,
];
let sbbf = Sbbf::new(bitset);
Contributor:

What do you think about adding the second test from https://github.com/jorgecarleitao/parquet2/blob/main/src/bloom_filter/mod.rs#L14-L69? We could definitely do it as a follow-on PR.

@Jimexist (Member Author) commented Nov 12, 2022:

I think it is essentially the same as the one already added; I'd prefer to implement the whole thing and then test it using real bloom filter test data files.

@alamb (Contributor) commented Nov 12, 2022

@viirya or @tustvold any concerns about merging this?

Comment on lines +162 to +166
pub fn hash_bytes<A: AsRef<[u8]>>(value: A) -> u64 {
let mut hasher = XxHash64::with_seed(SEED);
hasher.write(value.as_ref());
hasher.finish()
}
Member:

Is the hash parameter on the functions above generated by this function? Perhaps add a doc comment for this public function (see the sketch below).
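
For illustration, a documented version might look like the following; the doc wording and the insert/check method names it references are suggestions, while the seed value of 0 is what the parquet spec mandates for xxHash:

```rust
use std::hash::Hasher;
use twox_hash::XxHash64;

/// Seed for xxHash64; the parquet bloom filter spec fixes this at 0.
const SEED: u64 = 0;

/// Hashes `value` with xxHash64 (seed 0, per the parquet spec), producing the
/// 64-bit `hash` argument consumed by the insert/check functions above.
pub fn hash_bytes<A: AsRef<[u8]>>(value: A) -> u64 {
    let mut hasher = XxHash64::with_seed(SEED);
    hasher.write(value.as_ref());
    hasher.finish()
}
```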

Member Author:

TBH I only added pub to silence clippy; otherwise it warns about an unused function.

I plan to do more around this later, but getting this pull request merged is a first step.

The alternative, of course, is to just code the whole thing, but that would make code review harder.

Contributor:

Thank you for breaking it up into small pieces!

fn hash_to_block_index(&self, hash: u64) -> usize {
    // unchecked_mul is unstable; saturating_mul is safe here, and in practice
    // it never saturates: (hash >> 32) < 2^32 and the block count is small,
    // so the product always fits in a u64
    (((hash >> 32).saturating_mul(self.0.len() as u64)) >> 32) as usize
}
Member:

Is this guaranteed to be in the range of valid block indexes?

Member Author:

Yes, this is per spec:

The filter_insert operation first uses the most significant 32 bits of its argument to select a block to operate on. Call the argument "h", and recall the use of "z" to mean the number of blocks. Then a block number i between 0 and z-1 (inclusive) to operate on is chosen as follows:

unsigned int64 h_top_bits = h >> 32;
unsigned int64 z_as_64_bit = z;
unsigned int32 i = (h_top_bits * z_as_64_bit) >> 32;

The first line extracts the most significant 32 bits from h and assigns them to a 64-bit unsigned integer. The second line is simpler: it just sets an unsigned 64-bit value to the same value as the 32-bit unsigned value z. The purpose of having both h_top_bits and z_as_64_bit be 64-bit values is so that their product is a 64-bit value. That product is taken in the third line, and then the most significant 32 bits are extracted into the value i, which is the index of the block that will be operated on.

After this process to select i, filter_insert uses the least significant 32 bits of h as the argument to block_insert called on block i.

The technique for converting the most significant 32 bits to an integer between 0 and z-1 (inclusive) avoids using the modulo operation, which is often very slow. This trick can be found in Kenneth A. Ross's 2006 IBM research report, "Efficient Hash Probes on Modern Processors"
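
A worked example of that multiply-shift, with made-up values:

```rust
fn main() {
    // With z = 4 blocks and the top 32 bits of the hash all set, the
    // multiply-shift maps h to the last block; the result always lands in
    // 0..z without needing a (slow) modulo.
    let h: u64 = 0xFFFF_FFFF_0000_0000;
    let z: u64 = 4;
    let i = ((h >> 32) * z) >> 32;
    assert_eq!(i, 3); // z - 1
}
```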

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
@@ -84,6 +84,8 @@ pub mod arrow;
pub mod column;
experimental!(mod compression);
experimental!(mod encodings);
#[cfg(feature = "bloom")]
Contributor:

It occurs to me that you probably need to add bloom to the list of features in CI to get this tested:

- name: Test arrow with all features apart from simd
run: cargo test -p arrow --features=force_validate,prettyprint,ipc_compression,ffi,dyn_cmp_dict,dyn_arith_dict,chrono-tz

Also perhaps it is worth a mention (as experimental) in https://github.com/apache/arrow-rs/tree/master/arrow#feature-flags

Member Author:

I'll try to add that after the feature is completed?

Member Author:

On a second look, @alamb, I think the bloom filter is added in parquet, not in arrow, so I don't think your comment is applicable per se.

Contributor:

Sorry, I think the feature flag should be added to the parquet docs: https://github.com/apache/arrow-rs/tree/master/parquet#feature-flags

I double-checked the parquet CI and it looks like this feature will be enabled and thus covered:

run: cargo check -p parquet --all-features

@Jimexist (Member Author):

Let me try to merge this and address the remaining issues in a subsequent pull request.

@Jimexist merged commit b7af85c into master Nov 13, 2022
@Jimexist deleted the add-bloom-filter branch November 13, 2022 13:07
@ursabot commented Nov 13, 2022

Benchmark runs are scheduled for baseline = 3084ee2 and contender = b7af85c. b7af85c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
