
parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo #3102

Merged: 17 commits into master from add-bloom-filter-2, Nov 16, 2022

Conversation

@Jimexist (Member) commented Nov 13, 2022

Which issue does this PR close?

Rationale for this change

parquet bloom filter part II:

  • read sbbf bitset from row group reader
  • update API
  • add cli demo

What changes are included in this PR?

data generation:

```python
import pyspark.sql

spark = pyspark.sql.SparkSession.builder.getOrCreate()
spark.conf.set("parquet.bloom.filter.max.bytes", 32)
spark.conf.set("parquet.bloom.filter.expected.ndv", 10)
spark.conf.set("parquet.bloom.filter.enabled", True)

data = [("a" + str(i % 10),) for i in range(100)]
df = spark.createDataFrame(data, ["id"]).repartition(1)
df.write.parquet("bla.parquet", mode="overwrite")
```
❯ cargo run --features=cli --bin parquet-show-bloom-filter -- --file-name bla.parquet/part-00000-3da010f4-d863-497e-a525-3519b58d00ae-c000.snappy.parquet --column id -v a0 -v a1 -v aa
    Finished dev [unoptimized + debuginfo] target(s) in 0.09s
     Running `target/debug/parquet-show-bloom-filter --file-name bla.parquet/part-00000-3da010f4-d863-497e-a525-3519b58d00ae-c000.snappy.parquet --column id -v a0 -v a1 -v aa`
Row group #0
================================================================================
Value a0 is present in bloom filter
Value a1 is present in bloom filter
Value aa is absent in bloom filter

Are there any user-facing changes?

@github-actions bot added the parquet label (Changes to the parquet crate) Nov 13, 2022
@Jimexist Jimexist changed the title WIP Add bloom filter part II parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo Nov 14, 2022
@Jimexist Jimexist requested review from alamb and tustvold and removed request for alamb November 14, 2022 12:22
@alamb (Contributor) commented Nov 14, 2022

Thank you @Jimexist -- I have this PR on my review list and hopefully will get to it today. The backlog in DataFusion is substantial, however, so it might not be until tomorrow.

@tustvold (Contributor) left a comment:

I've left some suggestions, but taking a step back I wonder if we could do the following.

The bloom filter specification states:

> The Bloom filter data can be stored before the page indexes after all row groups,
> or it can be stored between row groups.

Once we have read the file metadata we know the byte ranges of the column chunks and page indexes, as well as the offset of the bloom filter data for each column chunk. It should therefore be possible to get a fairly accurate overestimate of the length of each bloom filter, simply by process of elimination.

Not only would this remove the need for iterative reads, it would also give a clear path to supporting reads from object storage, where we need to know the byte ranges ahead of time. Effectively we could make Sbbf::read_from_column_chunk take an &[u8].
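A minimal sketch of this elimination idea (a hypothetical helper, not the crate's API): given the start offsets of all known file regions, each bloom filter's length can be overestimated as the gap to the next region.

```rust
/// Hypothetical helper: overestimate each bloom filter's length as the
/// distance from its start offset to the next known region in the file.
fn estimate_bloom_filter_lengths(
    bloom_offsets: &[u64], // SBBF start offsets, one per column chunk
    other_regions: &[u64], // starts of column chunks, page indexes, footer
    file_len: u64,         // end of file bounds the last filter
) -> Vec<u64> {
    let mut starts: Vec<u64> = bloom_offsets
        .iter()
        .chain(other_regions.iter())
        .copied()
        .collect();
    starts.push(file_len);
    starts.sort_unstable();
    bloom_offsets
        .iter()
        .map(|&off| {
            // the first region starting strictly after this filter bounds it
            let next = starts
                .iter()
                .copied()
                .find(|&s| s > off)
                .unwrap_or(file_len);
            next - off
        })
        .collect()
}
```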

What do you think?

//! cargo run --features=cli --bin parquet-show-bloom-filter -- --file-name XYZ.parquet --column id --values a
//! ```

extern crate parquet;
tustvold (Contributor) suggested change: remove the `extern crate parquet;` line.

let bitset_offset = offset + length - buffer.remaining();
return Ok((h, bitset_offset));
} else {
// continue to try by reading another batch
tustvold (Contributor):

I think corrupt data could cause this to iterate indefinitely, which would be bad...

// This size should not be too large, so as not to hit a short read too early
// (although that is unlikely), but also not too small, to ensure cache
// efficiency; it is essentially a "guess" at the header size. In the demo
// test the header size is 15 bytes.
const STEP_SIZE: usize = 16;
tustvold (Contributor):

Based on the thrift compact protocol encoding: https://raw.githubusercontent.com/apache/thrift/master/doc/specs/thrift-compact-protocol.md

The BloomFilterHeader consists of 1 int32 and 3 enums. Enumerations are encoded as int32, and each int32 is encoded as 1 to 5 bytes.

Therefore the maximum header size is 20 bytes. One possibility might be to read 20 bytes and bail on error?
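A minimal sketch of this bounded-read approach, assuming `BloomFilterHeader` is the thrift-generated parquet format struct (generated thrift types provide `read_from_in_protocol`); the function name and error handling are illustrative, not the crate's API:

```rust
use std::io::Read;

use thrift::protocol::TCompactInputProtocol;
// Assumption: the thrift-generated parquet format struct.
use parquet::format::BloomFilterHeader;

// Worst-case thrift compact size of the header, per the analysis above.
const SBBF_HEADER_SIZE_ESTIMATE: usize = 20;

/// One bounded decode attempt instead of an open-ended read loop.
/// Returns the decoded header and the number of bytes it occupied.
fn read_bloom_filter_header<R: Read>(
    reader: &mut R,
) -> Result<(BloomFilterHeader, usize), String> {
    let mut buf = vec![0u8; SBBF_HEADER_SIZE_ESTIMATE];
    // NB: a robust version would retry short reads.
    let n = reader.read(&mut buf).map_err(|e| e.to_string())?;
    let mut remaining: &[u8] = &buf[..n];
    let header = {
        let mut prot = TCompactInputProtocol::new(&mut remaining);
        // Corrupt or truncated data fails here rather than looping forever.
        BloomFilterHeader::read_from_in_protocol(&mut prot)
            .map_err(|e| format!("bloom filter header decode failed: {e}"))?
    };
    Ok((header, n - remaining.len()))
}
```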

Jimexist (Member, author):

That's a good idea; let me change it to a 20-byte read that bails on error, before switching to the more complicated "guessing" from the page index gaps.

@Jimexist (Member, author) commented:

> Once we have read the file metadata we know the byte ranges of the column chunks and page indexes, as well as the offsets of the bloom filter data for each column chunk. It should therefore be possible to get a fairly accurate overestimate of the length of each bloom filter, simply by process of elimination.

Thanks for the suggestion. I wonder whether that is future proof, though: if more data structures are added later besides the SBBF, page index, etc., would that be a problem? Thinking out loud, that would just balloon the overestimate and/or increase the likelihood of having to look at both locations before correctly finding the one used when the parquet file was written.

@tustvold (Contributor) replied:

I think it is unlikely that a subsequent addition would omit the information needed to identify its location in the file; but as you say, the result would simply be an overestimate of the length of the final bloom filter, which is relatively harmless.

tustvold previously approved these changes Nov 15, 2022
@tustvold dismissed their stale review November 15, 2022 19:50: "Changed mind"

@@ -143,6 +145,10 @@ pub trait RowGroupReader: Send + Sync {
Ok(col_reader)
}

#[cfg(feature = "bloom")]
tustvold (Contributor):

What do you think about handling this in the same eager way that we handle page indexes, namely add an option to ReadOptions to enable reading bloom filters, and read this data in SerializedFileReader?

Jimexist (Member, author) commented Nov 16, 2022:

@tustvold, are you also suggesting dropping the feature gate altogether and enabling it by default? I added the feature gate to reduce binary size, but if the feature is very likely to be used there is no need for the gate any more.

@tustvold (Contributor) commented Nov 16, 2022:

I'm suggesting that rather than providing a lazy API to read the bloom filter on demand, we provide an API that makes SerializedReader load bloom filters as part of ParquetMetadata if the corresponding feature and ReadOption are enabled, similar to how we handle the page index.

This is necessary to support object stores, and it is generally a good idea to avoid lots of small IO reads.
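A usage sketch of what that could look like: `with_page_index` and `new_with_options` exist in the crate, while `with_bloom_filter` is a hypothetical option standing in for the proposal, not a real API at the time of this PR.

```rust
use std::fs::File;

use parquet::errors::Result;
use parquet::file::serialized_reader::{ReadOptionsBuilder, SerializedFileReader};

// `with_bloom_filter` below is an assumed, illustrative option only.
fn open_with_eager_bloom_filters(file: File) -> Result<SerializedFileReader<File>> {
    let options = ReadOptionsBuilder::new()
        .with_page_index()    // existing: eagerly read the page index
        .with_bloom_filter()  // assumed: eagerly read bloom filters too
        .build();
    SerializedFileReader::new_with_options(file, options)
}
```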

Jimexist (Member, author) commented Nov 16, 2022:

Thanks also for the suggestion; I agree with this direction. However, I have a follow-up coming up, so I'd like to merge this as is and quickly follow up after it is merged. Do you think that works?

@alamb (Contributor) left a comment:

Thank you for this PR @Jimexist -- it is very cool to see this implemented.

When writing bloom filters, I wonder if we could have the rust writer write them all contiguously to avoid multiple potential random access reads.

//! cargo run --features=cli --bin parquet-show-bloom-filter -- --file-name XYZ.parquet --column id --values a
//! ```

use clap::Parser;
alamb (Contributor):

Maybe we could add the ability to dump bloom filters to https://github.com/apache/arrow-rs/blob/master/parquet/src/bin/parquet-schema.rs rather than adding a new executable.

I don't feel strongly, however.


@@ -125,11 +162,8 @@ impl Sbbf {
let length: usize = header.num_bytes.try_into().map_err(|_| {
alamb (Contributor):

It is unfortunate that we potentially need more than one read for a bloom filter (read the bloom filter header, then read the length it describes). I think @tustvold noted this as a limitation of the parquet format itself: the file metadata only has the bloom filter's starting offset, not its length.

Perhaps the reader abstraction can hide most/all of this nonsense from us

Jimexist (Member, author):

You can take a look at https://github.com/apache/arrow-rs/pull/3119/files#diff-3b307348aabe465890fa39973e9fda0243bd2344cb7cb9cdf02ac2d39521d7caR232-R236, which should show how it works; it is similar to writing column offsets and indices.

@tustvold (Contributor) left a comment:

I'm happy for this to be merged as is; there is plenty of time before the next release to get the APIs to a happy place, and the feature is still experimental anyway.

@tustvold tustvold merged commit 73d66d8 into master Nov 16, 2022
@ursabot commented Nov 16, 2022

Benchmark runs are scheduled for baseline = c95eb4c and contender = 73d66d8. 73d66d8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

@Jimexist Jimexist deleted the add-bloom-filter-2 branch November 16, 2022 03:54
Jimexist added a commit that referenced this pull request Nov 16, 2022
… update API, and add cli demo (#3102)

* add feature flag

* add api

* fix reading with chunk reader

* refactor

* add a binary to demo

* add bin

* remove unused

* fix clippy

* adjust byte size

* update read method

* parquet-show-bloom-filter with bloom feature required

* remove extern crate

* get rid of loop read

* refactor to test

* rework api

* remove unused trait

* update help
@alamb (Contributor) commented Nov 16, 2022

🚄 love to see this -- thank you @tustvold and @Jimexist
