parquet bloom filter part III: add sbbf writer, remove bloom default feature, add reader properties (#3119)
Conversation
@@ -57,7 +57,8 @@ seq-macro = { version = "0.3", default-features = false }
 futures = { version = "0.3", default-features = false, features = ["std"], optional = true }
 tokio = { version = "1.0", optional = true, default-features = false, features = ["macros", "rt", "io-util"] }
 hashbrown = { version = "0.13", default-features = false }
-twox-hash = { version = "1.6", optional = true }
+twox-hash = { version = "1.6", default-features = false }
it by default relies on `rand`, which breaks wasm32

I believe @tustvold is away for a few days. I plan to review this PR in more detail tomorrow.
Thank you @Jimexist -- this is very cool. I went through the code fairly thoroughly. I had some minor suggestions / comments for documentation and code structure but nothing that would block merging.
I think the biggest thing I would like to discuss is "what parameters to expose for the writer API". I was thinking, for example, will users of this feature be able to set "fpp" and "ndv" reasonably? I suppose having the number of distinct values before writing a parquet file is reasonable, but maybe not the expected number of distinct values for each row group.
I did some research on other implementations. Here are the Spark settings: https://spark.apache.org/docs/latest/configuration.html
| Property | Default | Meaning | Since |
|---|---|---|---|
| spark.sql.optimizer.runtime.bloomFilter.creationSideThreshold | 10MB | Size threshold of the bloom filter creation side plan. Estimated size needs to be under this value to try to inject bloom filter. | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.enabled | false | When true and if one side of a shuffle join has a selective predicate, we attempt to insert a bloom filter in the other side to reduce the amount of shuffle data. | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.expectedNumItems | 1000000 | The default number of expected items for the runtime bloom filter | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.maxNumBits | 67108864 | The max number of bits to use for the runtime bloom filter | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.maxNumItems | 4000000 | The max allowed number of expected items for the runtime bloom filter | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.numBits | 8388608 | The default number of bits to use for the runtime bloom filter | 3.3.0 |
the arrow parquet C++ writer seems to allow for the fpp setting
double bloom_filter_fpp = 0.05
The upper limit of the false-positive rate of the bloom filter, default 0.05.
Databricks seems to expose the fpp, max_fpp, and num distinct values:
https://docs.databricks.com/sql/language-manual/delta-create-bloomfilter-index.html
@@ -77,7 +78,7 @@ rand = { version = "0.8", default-features = false, features = ["std", "std_rng"
 all-features = true

 [features]
 default = ["arrow", "bloom", "snap", "brotli", "flate2", "lz4", "zstd", "base64"]
👍
parquet/src/bloom_filter/mod.rs (outdated)
@@ -128,6 +172,33 @@ impl Sbbf {
         Self(data)
     }

     /// Write the bitset in serialized form to the writer.
     pub fn write_bitset<W: Write>(&self, mut writer: W) -> Result<(), ParquetError> {
I think it would be good to write a test for round tripping the bloom filters (as in write a SBFF to a Vec and then read it back out and verify it is the same). Specifically it would be nice to verify the bytes are not scrambled and the lengths are correct and handle empty bitsets (if that is possible)
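A self-contained sketch of that round trip, using a plain `Vec<u32>` as a stand-in for the `Sbbf` bitset (the function names and little-endian layout here are illustrative, not taken from the crate):

```rust
use std::io::Write;

// Stand-in round trip for the suggested test: serialize a bitset of
// u32 words to bytes, read it back, and check nothing was scrambled.
fn write_bitset<W: Write>(bitset: &[u32], mut writer: W) -> std::io::Result<()> {
    for word in bitset {
        writer.write_all(&word.to_le_bytes())?;
    }
    Ok(())
}

fn read_bitset(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes(c.try_into().unwrap()))
        .collect()
}

fn main() {
    // non-trivial bitset: length and byte order must survive the trip
    let original: Vec<u32> = vec![0xDEAD_BEEF, 0, 42, u32::MAX];
    let mut buf = Vec::new();
    write_bitset(&original, &mut buf).unwrap();
    assert_eq!(buf.len(), original.len() * 4);
    assert_eq!(read_bitset(&buf), original);

    // empty bitset round-trips to an empty buffer
    let mut empty = Vec::new();
    write_bitset(&[], &mut empty).unwrap();
    assert!(empty.is_empty());
    println!("round trip ok");
}
```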
impl Sbbf {
    /// Create a new [Sbbf] with given number of distinct values and false positive probability.
    /// Will panic if `fpp` is greater than 1.0 or less than 0.0.
    pub fn new_with_ndv_fpp(ndv: u64, fpp: f64) -> Self {
Since this is a function meant for use outside the parquet crate I would prefer it return an error rather than panic with bad input.
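For illustration, a minimal sketch of what the fallible variant could look like, with a stand-in `Sbbf` struct and a hypothetical `try_new_with_ndv_fpp` name (the real sizing logic is elided; only the validate-and-return-`Result` pattern is the point):

```rust
// Stand-in Sbbf; the real struct wraps a block bitset.
#[derive(Debug)]
struct Sbbf(Vec<u8>);

impl Sbbf {
    // Hypothetical fallible variant of `new_with_ndv_fpp`: reject a bad
    // `fpp` with an error instead of panicking.
    fn try_new_with_ndv_fpp(ndv: u64, fpp: f64) -> Result<Self, String> {
        if fpp <= 0.0 || fpp >= 1.0 {
            return Err(format!("fpp must be in (0.0, 1.0), got {fpp}"));
        }
        // real sizing logic elided; allocate a placeholder bitset
        let num_bytes = (ndv / 8).max(4) as usize;
        Ok(Self(vec![0u8; num_bytes]))
    }
}

fn main() {
    assert!(Sbbf::try_new_with_ndv_fpp(1_000, 0.05).is_ok());
    assert!(Sbbf::try_new_with_ndv_fpp(1_000, 1.5).is_err());
    assert!(Sbbf::try_new_with_ndv_fpp(1_000, -0.1).is_err());
    println!("validation ok");
}
```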
will update once we decide on using fpp or ndv or both:
(0.001, 100000, 1460769),
(0.1, 1000000, 5772541),
(0.01, 1000000, 9681526),
(0.001, 1000000, 14607697),
Does this mean a 14MB bloom filter? It's ok as there is a limit in `optimal_num_of_bytes`, but when I saw this it was just 🤯
It might also be good to pass in some value larger than 2^32 to test there isn't an overflow problem lurking
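For context, these test constants appear to be bit counts from the standard split-block sizing formula (the same one parquet-mr uses); a sketch, with an illustrative function name that may not match the crate's internals:

```rust
// Split-block bloom filter sizing, per the cache-efficient bloom filter
// paper / parquet-mr BlockSplitBloomFilter:
//   num_bits = -8 * ndv / ln(1 - fpp^(1/8))
fn num_of_bits_from_ndv_fpp(ndv: u64, fpp: f64) -> u64 {
    (-8.0 * ndv as f64 / (1.0 - fpp.powf(1.0 / 8.0)).ln()) as u64
}

fn main() {
    // reproduces the (0.01, 1000000, 9681526) row above to within rounding
    let bits = num_of_bits_from_ndv_fpp(1_000_000, 0.01);
    assert!((9_681_000..9_682_000).contains(&bits));

    // an ndv well beyond 2^32 should still produce a sane (larger) size,
    // i.e. no overflow or sign problem lurking
    let huge = num_of_bits_from_ndv_fpp(1u64 << 33, 0.01);
    assert!(huge > 1u64 << 33);
    println!("sizing ok");
}
```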
@@ -236,7 +236,7 @@ pub struct RowGroupMetaData {
 }

 impl RowGroupMetaData {
-    /// Returns builer for row group metadata.
+    /// Returns builder for row group metadata.
👍
@@ -255,6 +279,11 @@ impl WriterProperties {
             .or_else(|| self.default_column_properties.max_statistics_size())
             .unwrap_or(DEFAULT_MAX_STATISTICS_SIZE)
     }

     def_col_property_getter!(bloom_filter_enabled, bool, DEFAULT_BLOOM_FILTER_ENABLED);
I think these properties need docstrings -- I am happy to help write them. In particular, I think it should mention that the ndv and fpp are knobs that allow for control over bloom filter accuracy. Also it should mention if these limits are for each column chunk or the entire file (e.g. the ndv value is not the number of distinct values in the entire column)
    num_bytes.next_power_of_two()
}

// see http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf
Here is the parquet-mr code: https://github.com/apache/parquet-mr/blob/d057b39d93014fe40f5067ee4a33621e65c91552/parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java#L277-L304
That looks very similar
thanks @alamb for the detailed comment. i wish to merge as is and then handle the subsequent steps:
do you think this is a good idea given that we are not releasing a new version very soon?
- add reader properties
- add writer properties
- remove `bloom` feature
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
I'm happy for this to go in as is, although I personally would prefer to reduce the amount of macros as they make the code quite hard to follow
@@ -272,6 +301,52 @@ pub struct WriterPropertiesBuilder {
     sorting_columns: Option<Vec<SortingColumn>>,
 }

 macro_rules! def_opt_field_setter {
Just an observation that these macros are potentially more verbose than the alternative. Perhaps I'm old-fashioned but I'm not a massive fan of macros aside from where absolutely necessary, as they complicate debugging and legibility
    ($field: ident, $type: ty, $min_value:expr, $max_value:expr) => {
        paste! {
            pub fn [<set_ $field>](&mut self, value: $type) -> &mut Self {
                if ($min_value..=$max_value).contains(&value) {
I would expect this to error or panic, not just ignore a value out of range?
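As a sketch of that alternative, a macro-free setter that surfaces out-of-range input as an error instead of silently ignoring it (the field name and bounds here are illustrative, not the PR's actual limits):

```rust
#[derive(Default, Debug)]
struct ColumnProperties {
    bloom_filter_max_bytes: Option<u32>,
}

impl ColumnProperties {
    // Hand-written range-checked setter: reject bad input loudly
    // instead of silently leaving the field unset.
    fn set_bloom_filter_max_bytes(&mut self, value: u32) -> Result<&mut Self, String> {
        const MIN: u32 = 32;
        const MAX: u32 = 128 * 1024 * 1024;
        if (MIN..=MAX).contains(&value) {
            self.bloom_filter_max_bytes = Some(value);
            Ok(self)
        } else {
            Err(format!("max_bytes must be in [{MIN}, {MAX}], got {value}"))
        }
    }
}

fn main() {
    let mut props = ColumnProperties::default();
    assert!(props.set_bloom_filter_max_bytes(1024).is_ok());
    assert!(props.set_bloom_filter_max_bytes(0).is_err());
    assert_eq!(props.bloom_filter_max_bytes, Some(1024));
    println!("setter ok");
}
```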
impl Sbbf {
    /// Create a new [Sbbf] with given number of distinct values and false positive probability.
    /// Will panic if `fpp` is greater than 1.0 or less than 0.0.
Suggested change:
-    /// Will panic if `fpp` is greater than 1.0 or less than 0.0.
+    /// Will panic if `fpp` is greater than or equal to 1.0 or less than 0.0.
            }
        }
    };
    ($field: ident, $type: ty, $min_value:expr, $max_value:expr) => {
This variant is only used in one place, so I wonder if it needs to be a macro.
if let Some(ndv) = props.bloom_filter_ndv(descr.path()) {
    let fpp = props.bloom_filter_fpp(descr.path());
    Some(Sbbf::new_with_ndv_fpp(ndv, fpp))
} else {
I think it is perhaps a little surprising that `bloom_filter_max_bytes` is ignored if `ndv` is set.
bloom_filter_ndv: Option<u64>,
/// bloom filter false positive probability
bloom_filter_fpp: Option<f64>,
/// bloom filter max number of bytes
bloom_filter_max_bytes: Option<u32>,
I wonder if it would be simpler to just ask users to specify the bloom filter size, and provide a free function to compute the size based on ndv and fpp?
The interaction of these three properties isn't immediately apparent?
That was the conclusion I may have come to as well -- see #3138 (comment)
}

/// Reader properties builder.
pub struct ReaderPropertiesBuilder {
    codec_options_builder: CodecOptionsBuilder,
    read_bloom_filter: Option<bool>,
Suggested change:
-    read_bloom_filter: Option<bool>,
+    read_bloom_filter: bool,
@@ -635,13 +731,17 @@ impl ReaderPropertiesBuilder {
     fn with_defaults() -> Self {
         Self {
             codec_options_builder: CodecOptionsBuilder::default(),
             read_bloom_filter: None,
Suggested change:
-    read_bloom_filter: None,
+    read_bloom_filter: DEFAULT_READ_BLOOM_FILTER,
thanks @tustvold, i plan to merge as is and then adjust in a subsequent pr:
because 2 relies on 1, and 3 relies on 2, i'd like to clean up all 3 of them together
Benchmark runs are scheduled for baseline = 004a151 and contender = e214ccc. e214ccc is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
nit: this reference is for the arrow ORC C++ writer; parquet C++ does not enable writing bloom filters yet.
Which issue does this PR close?
Rationale for this change
Remove the `bloom` feature.
What changes are included in this PR?
Are there any user-facing changes?
Now that the API is considered complete, the next step is to have an end-to-end test or cross-test with actual parquet file generation.