
parquet bloom filter part III: add sbbf writer, remove bloom default feature, add reader properties #3119

Merged
merged 12 commits into master from add-bloom-filter-3 on Nov 22, 2022

Conversation

Jimexist
Member

@Jimexist Jimexist commented Nov 15, 2022

Which issue does this PR close?

Rationale for this change

  • add SBBF writing during the metadata writing phase
  • add bloom filter reading support via reader properties
  • create a bloom filter per column chunk, per row group, as configured by writer properties
  • remove the `bloom` feature

What changes are included in this PR?

Are there any user-facing changes?

Now the API is considered complete; the next step is to have an end-to-end test or cross-implementation test with actual Parquet file generation.
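For illustration, a rough sketch of how the new writer-side knobs might be used. The setter names mirror the bloom filter properties added in this PR, and the `&mut self` builder style follows the `def_opt_field_setter!` macro shown later in this conversation, but the exact names and signatures are assumptions rather than the final API:

```rust
// Hedged sketch only: setter names follow the bloom filter properties in this PR,
// and the &mut builder style matches the macro shown in the diff, but the final
// API may differ.
use parquet::file::properties::WriterProperties;

fn writer_props_with_bloom_filters() -> WriterProperties {
    let mut builder = WriterProperties::builder();
    builder
        .set_bloom_filter_enabled(true)   // write an SBBF for each column chunk
        .set_bloom_filter_ndv(1_000_000)  // expected distinct values per column chunk
        .set_bloom_filter_fpp(0.01);      // target false positive probability
    builder.build()
}
```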

@Jimexist Jimexist marked this pull request as draft November 15, 2022 15:06
@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 15, 2022
@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 16, 2022
@Jimexist Jimexist force-pushed the add-bloom-filter-3 branch 2 times, most recently from 52bf18a to 09ee38c, on November 16, 2022 04:00
@github-actions github-actions bot removed the arrow Changes to the arrow crate label Nov 16, 2022
@Jimexist Jimexist marked this pull request as ready for review November 16, 2022 05:48
@Jimexist Jimexist changed the title parquet bloom filter part III: add sbbf writer parquet bloom filter part III: add sbbf writer, remove bloom feature, add reader properties Nov 16, 2022
@Jimexist Jimexist requested review from alamb and tustvold and removed request for alamb November 16, 2022 06:00
@@ -57,7 +57,8 @@ seq-macro = { version = "0.3", default-features = false }
futures = { version = "0.3", default-features = false, features = ["std"], optional = true }
tokio = { version = "1.0", optional = true, default-features = false, features = ["macros", "rt", "io-util"] }
hashbrown = { version = "0.13", default-features = false }
twox-hash = { version = "1.6", optional = true }
twox-hash = { version = "1.6", default-features = false }
Member Author

twox-hash by default relies on rand, which breaks wasm32

@alamb
Contributor

alamb commented Nov 17, 2022

I believe @tustvold is away for a few days. I plan to review this PR in more detail tomorrow

@alamb alamb changed the title parquet bloom filter part III: add sbbf writer, remove bloom feature, add reader properties parquet bloom filter part III: add sbbf writer, remove bloom default feature, add reader properties Nov 18, 2022
Contributor

@alamb alamb left a comment

Thank you @Jimexist -- this is very cool. I went through the code fairly thoroughly. I had some minor suggestions / comments for documentation and code structure but nothing that would block merging.

I think the biggest thing I would like to discuss is "what parameters to expose for the writer API". I was thinking, for example, will users of this feature be able to set "fpp" and "ndv" reasonably? I suppose having the number of distinct values before writing a parquet file is reasonable, but maybe not the expected number of distinct values for each row group.

I did some research on other implementations. Here are the Spark settings: https://spark.apache.org/docs/latest/configuration.html

| Property | Default | Description | Since |
|---|---|---|---|
| spark.sql.optimizer.runtime.bloomFilter.creationSideThreshold | 10MB | Size threshold of the bloom filter creation side plan. Estimated size needs to be under this value to try to inject a bloom filter. | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.enabled | false | When true and if one side of a shuffle join has a selective predicate, we attempt to insert a bloom filter in the other side to reduce the amount of shuffle data. | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.expectedNumItems | 1000000 | The default number of expected items for the runtime bloom filter. | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.maxNumBits | 67108864 | The max number of bits to use for the runtime bloom filter. | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.maxNumItems | 4000000 | The max allowed number of expected items for the runtime bloom filter. | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.numBits | 8388608 | The default number of bits to use for the runtime bloom filter. | 3.3.0 |

the arrow parquet C++ writer seems to allow for the fpp setting

https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N5arrow8adapters3orc12WriteOptions16bloom_filter_fppE

double bloom_filter_fpp = 0.05
The upper limit of the false-positive rate of the bloom filter, default 0.05.

Databricks seems to expose the fpp, max_fpp, and num distinct values:
https://docs.databricks.com/sql/language-manual/delta-create-bloomfilter-index.html

@@ -77,7 +78,7 @@ rand = { version = "0.8", default-features = false, features = ["std", "std_rng"
all-features = true

[features]
default = ["arrow", "bloom", "snap", "brotli", "flate2", "lz4", "zstd", "base64"]
Contributor
👍

@@ -128,6 +172,33 @@ impl Sbbf {
Self(data)
}

/// Write the bitset in serialized form to the writer.
pub fn write_bitset<W: Write>(&self, mut writer: W) -> Result<(), ParquetError> {
Contributor
I think it would be good to write a test for round-tripping the bloom filters (as in: write an Sbbf to a Vec<u8> and then read it back out and verify it is the same). Specifically, it would be nice to verify the bytes are not scrambled, the lengths are correct, and empty bitsets are handled (if that is possible).
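For example, a minimal round-trip test might look roughly like this; `Sbbf::from_bytes` is a hypothetical stand-in for whatever reading path the crate exposes, and the test compares re-serialized buffers to avoid assuming `PartialEq` on `Sbbf`:

```rust
// Sketch only: `from_bytes` is a hypothetical read-back helper, not the crate's API.
#[test]
fn test_sbbf_bitset_round_trip() {
    let sbbf = Sbbf::new_with_ndv_fpp(1_000, 0.01); // constructor added in this PR
    let mut buffer: Vec<u8> = Vec::new();
    sbbf.write_bitset(&mut buffer).unwrap();
    assert!(!buffer.is_empty(), "serialized bitset should not be empty");

    // Read the bytes back (hypothetical helper) and re-serialize; the two buffers
    // should match if nothing was scrambled or truncated.
    let read_back = Sbbf::from_bytes(&buffer);
    let mut buffer2: Vec<u8> = Vec::new();
    read_back.write_bitset(&mut buffer2).unwrap();
    assert_eq!(buffer, buffer2);
}
```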

impl Sbbf {
/// Create a new [Sbbf] with given number of distinct values and false positive probability.
/// Will panic if `fpp` is greater than 1.0 or less than 0.0.
pub fn new_with_ndv_fpp(ndv: u64, fpp: f64) -> Self {
Contributor
Since this is a function meant for use outside the parquet crate I would prefer it return an error rather than panic with bad input.
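A fallible variant along the lines being suggested might look like this inside `impl Sbbf` (a sketch, not the merged API):

```rust
// Sketch: validate `fpp` and return an error instead of panicking.
pub fn try_new_with_ndv_fpp(ndv: u64, fpp: f64) -> Result<Self, ParquetError> {
    if !(0.0..1.0).contains(&fpp) {
        return Err(ParquetError::General(format!(
            "bloom filter fpp must be in (0.0, 1.0), got {fpp}"
        )));
    }
    Ok(Self::new_with_ndv_fpp(ndv, fpp))
}
```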

Member Author
will update once we decide on using fpp or ndv or both:

(0.001, 100000, 1460769),
(0.1, 1000000, 5772541),
(0.01, 1000000, 9681526),
(0.001, 1000000, 14607697),
Contributor
does this mean a 14MB bloom filter? It's ok, as there is a limit in optimal_num_of_bytes, but when I saw this it was just 🤯

It might also be good to pass in some value larger than 2^32 to test there isn't an overflow problem lurking.
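For reference, these third values line up with the split-block bloom filter sizing formula from the parquet-format spec (with $k = 8$ hash functions per block),

$$ m = \frac{-k \cdot n}{\ln\left(1 - f^{1/k}\right)} $$

and for $n = 10^6$, $f = 0.001$ this gives $m \approx 1.46 \times 10^7$ bits (about 1.8 MB), matching the third column above, so they appear to be bit counts rather than byte counts (a hedged back-of-the-envelope check, not a statement about the code).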

@@ -236,7 +236,7 @@ pub struct RowGroupMetaData {
}

impl RowGroupMetaData {
/// Returns builer for row group metadata.
/// Returns builder for row group metadata.
Contributor
👍

@@ -255,6 +279,11 @@ impl WriterProperties {
.or_else(|| self.default_column_properties.max_statistics_size())
.unwrap_or(DEFAULT_MAX_STATISTICS_SIZE)
}

def_col_property_getter!(bloom_filter_enabled, bool, DEFAULT_BLOOM_FILTER_ENABLED);
Contributor
I think these properties need docstrings -- I am happy to help write them. In particular, I think they should mention that ndv and fpp are knobs that allow control over bloom filter accuracy. They should also mention whether these limits apply to each column chunk or to the entire file (e.g. the ndv value is not the number of distinct values in the entire column).
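For example, a docstring along those lines might read as follows; the wording is illustrative, and the getter signature is assumed from the other `ColumnPath`-based getters rather than taken from the merged code:

```rust
/// Returns the expected number of distinct values (`ndv`) configured for the
/// column `col`, if any.
///
/// Together with `fpp` (the target false positive probability), `ndv` controls
/// the accuracy, and therefore the size, of the generated bloom filter. The
/// value applies per column chunk (i.e. per column within each row group),
/// not to the entire file or the entire column.
pub fn bloom_filter_ndv(&self, col: &ColumnPath) -> Option<u64> { /* ... */ }
```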

num_bytes.next_power_of_two()
}

// see http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf
@Jimexist
Member Author

Thanks @alamb for the detailed comments. I wish to merge as-is and then, for subsequent steps:

  1. add an integration test, a round-trip test, or both
  2. adjust the overall configurable parameters, and then add docstrings after they are settled

Do you think this is a good idea, given that we are not releasing a new version very soon?

Jimexist and others added 2 commits November 19, 2022 13:37
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Contributor

@tustvold tustvold left a comment

I'm happy for this to go in as-is, although I personally would prefer to reduce the number of macros, as they make the code quite hard to follow.

@@ -272,6 +301,52 @@ pub struct WriterPropertiesBuilder {
sorting_columns: Option<Vec<SortingColumn>>,
}

macro_rules! def_opt_field_setter {
Contributor
Just an observation that these macros are potentially more verbose than the alternative. Perhaps I'm old-fashioned, but I'm not a massive fan of macros aside from where absolutely necessary, as they complicate debugging and legibility.
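For comparison, a rough sketch of what one hand-written setter might look like in place of one macro expansion; the struct below is a simplified stand-in for the real builder, not the crate's type:

```rust
// Simplified stand-in for the real WriterPropertiesBuilder, to show roughly what
// one expansion of `def_opt_field_setter!` would look like written by hand.
struct BuilderSketch {
    bloom_filter_ndv: Option<u64>,
}

impl BuilderSketch {
    /// Sets the expected number of distinct values used to size bloom filters.
    pub fn set_bloom_filter_ndv(&mut self, value: u64) -> &mut Self {
        self.bloom_filter_ndv = Some(value);
        self
    }
}
```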

($field: ident, $type: ty, $min_value:expr, $max_value:expr) => {
paste! {
pub fn [<set_ $field>](&mut self, value: $type) -> &mut Self {
if ($min_value..=$max_value).contains(&value) {
Contributor
I would expect this to error or panic, not just ignore a value out of range?
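A sketch of the behaviour being suggested, failing loudly on an out-of-range value instead of silently ignoring it (again on a simplified stand-in builder, not the merged code):

```rust
// Simplified stand-in builder; the point is the loud failure on bad input
// rather than the silent skip in the current macro.
struct BuilderSketch {
    bloom_filter_fpp: Option<f64>,
}

impl BuilderSketch {
    /// Sets the bloom filter false positive probability; panics on out-of-range input.
    pub fn set_bloom_filter_fpp(&mut self, value: f64) -> &mut Self {
        assert!(
            (0.0..1.0).contains(&value),
            "bloom filter fpp must be in (0.0, 1.0), got {value}"
        );
        self.bloom_filter_fpp = Some(value);
        self
    }
}
```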

impl Sbbf {
/// Create a new [Sbbf] with given number of distinct values and false positive probability.
/// Will panic if `fpp` is greater than 1.0 or less than 0.0.
Contributor
Suggested change
/// Will panic if `fpp` is greater than 1.0 or less than 0.0.
/// Will panic if `fpp` is greater than or equal to 1.0 or less than 0.0.

}
}
};
($field: ident, $type: ty, $min_value:expr, $max_value:expr) => {
Contributor
This variant is only used in one place, so I wonder if it needs to be a macro.

if let Some(ndv) = props.bloom_filter_ndv(descr.path()) {
let fpp = props.bloom_filter_fpp(descr.path());
Some(Sbbf::new_with_ndv_fpp(ndv, fpp))
} else {
Contributor
I think it is perhaps a little surprising that bloom_filter_max_bytes is ignored if ndv is set

Comment on lines +608 to +612
bloom_filter_ndv: Option<u64>,
/// bloom filter false positive probability
bloom_filter_fpp: Option<f64>,
/// bloom filter max number of bytes
bloom_filter_max_bytes: Option<u32>,
Contributor
I wonder if it would be simpler to just ask users to specify the bloom filter size, and provide a free function to compute the size based on ndv and fpp?

The interaction of these three properties isn't immediately apparent.
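A hedged sketch of the kind of free function being suggested, based on the split-block bloom filter sizing formula; the function name and the rounding behaviour are illustrative, not the crate's API:

```rust
/// Sketch: compute a bloom filter size (in bytes) from `ndv` and `fpp` using the
/// split-block bloom filter sizing formula with k = 8 hash functions per block.
pub fn bloom_filter_num_bytes(ndv: u64, fpp: f64) -> u64 {
    assert!((0.0..1.0).contains(&fpp), "fpp must be in (0.0, 1.0)");
    let k = 8.0_f64;
    let num_bits = -k * ndv as f64 / (1.0 - fpp.powf(1.0 / k)).ln();
    // Round up to whole bytes; the writer may additionally clamp this and round
    // it up to a power of two (see `optimal_num_of_bytes` discussed above).
    (num_bits / 8.0).ceil() as u64
}
```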

Contributor
That was the conclusion I may have come to as well -- see #3138 (comment)

}

/// Reader properties builder.
pub struct ReaderPropertiesBuilder {
codec_options_builder: CodecOptionsBuilder,
read_bloom_filter: Option<bool>,
Contributor
Suggested change
read_bloom_filter: Option<bool>,
read_bloom_filter: bool,

@@ -635,13 +731,17 @@ impl ReaderPropertiesBuilder {
fn with_defaults() -> Self {
Self {
codec_options_builder: CodecOptionsBuilder::default(),
read_bloom_filter: None,
Contributor
Suggested change
read_bloom_filter: None,
read_bloom_filter: DEFAULT_READ_BLOOM_FILTER,

@Jimexist
Member Author

Thanks @tustvold. I plan to merge as-is and then, in a subsequent PR, adjust:

  1. the knobs or params that users can tune
  2. the usage of macros
  3. the panic behavior

Because 2 relies on 1 and 3 relies on 2, I'd like to clean up all three of them together.

@tustvold tustvold merged commit e214ccc into master Nov 22, 2022
@Jimexist Jimexist deleted the add-bloom-filter-3 branch November 22, 2022 12:08
@ursabot

ursabot commented Nov 22, 2022

Benchmark runs are scheduled for baseline = 004a151 and contender = e214ccc. e214ccc is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2]
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm]
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x]
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q]
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@XinyuZeng
Contributor

> the arrow parquet C++ writer seems to allow for the fpp setting
>
> https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N5arrow8adapters3orc12WriteOptions16bloom_filter_fppE

nit: this reference is for the Arrow ORC C++ writer; the Parquet C++ writer does not enable writing bloom filters yet.

Labels
parquet Changes to the parquet crate

5 participants