parquet bloom filter part III: add sbbf writer, remove bloom default feature, add reader properties (#3119)
Conversation
@@ -57,7 +57,8 @@ seq-macro = { version = "0.3", default-features = false }
 futures = { version = "0.3", default-features = false, features = ["std"], optional = true }
 tokio = { version = "1.0", optional = true, default-features = false, features = ["macros", "rt", "io-util"] }
 hashbrown = { version = "0.13", default-features = false }
-twox-hash = { version = "1.6", optional = true }
+twox-hash = { version = "1.6", default-features = false }
it by default relies on `rand`, which breaks wasm32

I believe @tustvold is away for a few days. I plan to review this PR in more detail tomorrow.
Thank you @Jimexist -- this is very cool. I went through the code fairly thoroughly. I had some minor suggestions / comments for documentation and code structure but nothing that would block merging.
I think the biggest thing I would like to discuss is "what parameters to expose for the writer API". I was thinking, for example, will users of this feature be able to set "fpp" and "ndv" reasonably? I suppose having the number of distinct values before writing a parquet file is reasonable, but maybe not the expected number of distinct values for each row group.
I did some research on other implementations. Here are the Spark settings: https://spark.apache.org/docs/latest/configuration.html
| Property | Default | Meaning | Since |
|---|---|---|---|
| spark.sql.optimizer.runtime.bloomFilter.creationSideThreshold | 10MB | Size threshold of the bloom filter creation side plan. Estimated size needs to be under this value to try to inject bloom filter. | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.enabled | false | When true and if one side of a shuffle join has a selective predicate, we attempt to insert a bloom filter in the other side to reduce the amount of shuffle data. | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.expectedNumItems | 1000000 | The default number of expected items for the runtime bloom filter | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.maxNumBits | 67108864 | The max number of bits to use for the runtime bloom filter | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.maxNumItems | 4000000 | The max allowed number of expected items for the runtime bloom filter | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.numBits | 8388608 | The default number of bits to use for the runtime bloom filter | 3.3.0 |
the arrow parquet C++ writer seems to allow for the fpp setting
double bloom_filter_fpp = 0.05
The upper limit of the false-positive rate of the bloom filter, default 0.05.
Databricks seems to expose the fpp, max_fpp, and num distinct values:
https://docs.databricks.com/sql/language-manual/delta-create-bloomfilter-index.html
@@ -77,7 +78,7 @@ rand = { version = "0.8", default-features = false, features = ["std", "std_rng"
 all-features = true

 [features]
 default = ["arrow", "bloom", "snap", "brotli", "flate2", "lz4", "zstd", "base64"]
👍
parquet/src/bloom_filter/mod.rs (outdated)
@@ -128,6 +172,33 @@ impl Sbbf {
         Self(data)
     }

     /// Write the bitset in serialized form to the writer.
     pub fn write_bitset<W: Write>(&self, mut writer: W) -> Result<(), ParquetError> {
I think it would be good to write a test for round tripping the bloom filters (as in write a SBFF to a Vec and then read it back out and verify it is the same). Specifically it would be nice to verify the bytes are not scrambled and the lengths are correct and handle empty bitsets (if that is possible)
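A self-contained sketch of that round trip, using a plain `Vec<u32>` as a stand-in for the `Sbbf` bitset (the function names and little-endian layout here are illustrative, not taken from the crate):

```rust
use std::io::Write;

// Stand-in round trip for the suggested test: serialize a bitset of
// u32 words to bytes, read it back, and check nothing was scrambled.
fn write_bitset<W: Write>(bitset: &[u32], mut writer: W) -> std::io::Result<()> {
    for word in bitset {
        writer.write_all(&word.to_le_bytes())?;
    }
    Ok(())
}

fn read_bitset(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes(c.try_into().unwrap()))
        .collect()
}

fn main() {
    // non-trivial bitset: length and byte order must survive the trip
    let original: Vec<u32> = vec![0xDEAD_BEEF, 0, 42, u32::MAX];
    let mut buf = Vec::new();
    write_bitset(&original, &mut buf).unwrap();
    assert_eq!(buf.len(), original.len() * 4);
    assert_eq!(read_bitset(&buf), original);

    // empty bitset round-trips to an empty buffer
    let mut empty = Vec::new();
    write_bitset(&[], &mut empty).unwrap();
    assert!(empty.is_empty());
    println!("round trip ok");
}
```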
impl Sbbf {
    /// Create a new [Sbbf] with given number of distinct values and false positive probability.
    /// Will panic if `fpp` is greater than 1.0 or less than 0.0.
    pub fn new_with_ndv_fpp(ndv: u64, fpp: f64) -> Self {
Since this is a function meant for use outside the parquet crate I would prefer it return an error rather than panic with bad input.
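For illustration, a minimal sketch of what the fallible variant could look like, with a stand-in `Sbbf` struct and a hypothetical `try_new_with_ndv_fpp` name (the real sizing logic is elided; only the validate-and-return-`Result` pattern is the point):

```rust
// Stand-in Sbbf; the real struct wraps a block bitset.
#[derive(Debug)]
struct Sbbf(Vec<u8>);

impl Sbbf {
    // Hypothetical fallible variant of `new_with_ndv_fpp`: reject a bad
    // `fpp` with an error instead of panicking.
    fn try_new_with_ndv_fpp(ndv: u64, fpp: f64) -> Result<Self, String> {
        if fpp <= 0.0 || fpp >= 1.0 {
            return Err(format!("fpp must be in (0.0, 1.0), got {fpp}"));
        }
        // real sizing logic elided; allocate a placeholder bitset
        let num_bytes = (ndv / 8).max(4) as usize;
        Ok(Self(vec![0u8; num_bytes]))
    }
}

fn main() {
    assert!(Sbbf::try_new_with_ndv_fpp(1_000, 0.05).is_ok());
    assert!(Sbbf::try_new_with_ndv_fpp(1_000, 1.5).is_err());
    assert!(Sbbf::try_new_with_ndv_fpp(1_000, -0.1).is_err());
    println!("validation ok");
}
```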
will update once we decide on using fpp or ndv or both:
(0.001, 100000, 1460769),
(0.1, 1000000, 5772541),
(0.01, 1000000, 9681526),
(0.001, 1000000, 14607697),
Does this mean a 14MB bloom filter? It's ok as there is a limit in `optimal_num_of_bytes`, but when I saw this it was just 🤯
It might also be good to pass in some value larger than 2^32 to test there isn't an overflow problem lurking
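For context, these test constants appear to be bit counts from the standard split-block sizing formula (the same one parquet-mr uses); a sketch, with an illustrative function name that may not match the crate's internals:

```rust
// Split-block bloom filter sizing, per the cache-efficient bloom filter
// paper / parquet-mr BlockSplitBloomFilter:
//   num_bits = -8 * ndv / ln(1 - fpp^(1/8))
fn num_of_bits_from_ndv_fpp(ndv: u64, fpp: f64) -> u64 {
    (-8.0 * ndv as f64 / (1.0 - fpp.powf(1.0 / 8.0)).ln()) as u64
}

fn main() {
    // reproduces the (0.01, 1000000, 9681526) row above to within rounding
    let bits = num_of_bits_from_ndv_fpp(1_000_000, 0.01);
    assert!((9_681_000..9_682_000).contains(&bits));

    // an ndv well beyond 2^32 should still produce a sane (larger) size,
    // i.e. no overflow or sign problem lurking
    let huge = num_of_bits_from_ndv_fpp(1u64 << 33, 0.01);
    assert!(huge > 1u64 << 33);
    println!("sizing ok");
}
```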
@@ -236,7 +236,7 @@ pub struct RowGroupMetaData {
 }

 impl RowGroupMetaData {
-    /// Returns builer for row group metadata.
+    /// Returns builder for row group metadata.
👍
@@ -255,6 +279,11 @@ impl WriterProperties {
             .or_else(|| self.default_column_properties.max_statistics_size())
             .unwrap_or(DEFAULT_MAX_STATISTICS_SIZE)
     }

     def_col_property_getter!(bloom_filter_enabled, bool, DEFAULT_BLOOM_FILTER_ENABLED);
I think these properties need docstrings -- I am happy to help write them. In particular, I think it should mention that the ndv and fpp are knobs that allow for control over bloom filter accuracy. Also it should mention if these limits are for each column chunk or the entire file (e.g. the ndv value is not the number of distinct values in the entire column)
    num_bytes.next_power_of_two()
}

// see http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf
Here is the parquet-mr code: https://github.com/apache/parquet-mr/blob/d057b39d93014fe40f5067ee4a33621e65c91552/parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java#L277-L304
That looks very similar
thanks @alamb for the detailed comment. i wish to merge as is and then handle the subsequent steps:
do you think this is a good idea given that we are not releasing a new version very soon?
- add reader properties
- add writer properties
- remove `bloom` feature
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
I'm happy for this to go in as is, although I personally would prefer to reduce the amount of macros as they make the code quite hard to follow
@@ -272,6 +301,52 @@ pub struct WriterPropertiesBuilder {
     sorting_columns: Option<Vec<SortingColumn>>,
 }

 macro_rules! def_opt_field_setter {
Just an observation that these macros are potentially more verbose than the alternative. Perhaps I'm old-fashioned but I'm not a massive fan of macros aside from where absolutely necessary, as they complicate debugging and legibility
    ($field: ident, $type: ty, $min_value:expr, $max_value:expr) => {
        paste! {
            pub fn [<set_ $field>](&mut self, value: $type) -> &mut Self {
                if ($min_value..=$max_value).contains(&value) {
I would expect this to error or panic, not just ignore a value out of range?
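As a sketch of that alternative, a macro-free setter that surfaces out-of-range input as an error instead of silently ignoring it (the field name and bounds here are illustrative, not the PR's actual limits):

```rust
#[derive(Default, Debug)]
struct ColumnProperties {
    bloom_filter_max_bytes: Option<u32>,
}

impl ColumnProperties {
    // Hand-written range-checked setter: reject bad input loudly
    // instead of silently leaving the field unset.
    fn set_bloom_filter_max_bytes(&mut self, value: u32) -> Result<&mut Self, String> {
        const MIN: u32 = 32;
        const MAX: u32 = 128 * 1024 * 1024;
        if (MIN..=MAX).contains(&value) {
            self.bloom_filter_max_bytes = Some(value);
            Ok(self)
        } else {
            Err(format!("max_bytes must be in [{MIN}, {MAX}], got {value}"))
        }
    }
}

fn main() {
    let mut props = ColumnProperties::default();
    assert!(props.set_bloom_filter_max_bytes(1024).is_ok());
    assert!(props.set_bloom_filter_max_bytes(0).is_err());
    assert_eq!(props.bloom_filter_max_bytes, Some(1024));
    println!("setter ok");
}
```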
impl Sbbf {
    /// Create a new [Sbbf] with given number of distinct values and false positive probability.
    /// Will panic if `fpp` is greater than 1.0 or less than 0.0.
Suggested change:
-    /// Will panic if `fpp` is greater than 1.0 or less than 0.0.
+    /// Will panic if `fpp` is greater than or equal to 1.0 or less than 0.0.
            }
        }
    };
    ($field: ident, $type: ty, $min_value:expr, $max_value:expr) => {
This variant is only used in one place, so I wonder if it needs to be a macro.
if let Some(ndv) = props.bloom_filter_ndv(descr.path()) {
    let fpp = props.bloom_filter_fpp(descr.path());
    Some(Sbbf::new_with_ndv_fpp(ndv, fpp))
} else {
I think it is perhaps a little surprising that `bloom_filter_max_bytes` is ignored if `ndv` is set.
bloom_filter_ndv: Option<u64>,
/// bloom filter false positive probability
bloom_filter_fpp: Option<f64>,
/// bloom filter max number of bytes
bloom_filter_max_bytes: Option<u32>,
I wonder if it would be simpler to just ask users to specify the bloom filter size, and provide a free function to compute the size based on ndv and fpp?
The interaction of these three properties isn't immediately apparent?
That was the conclusion I may have come to as well -- see #3138 (comment)
}

/// Reader properties builder.
pub struct ReaderPropertiesBuilder {
    codec_options_builder: CodecOptionsBuilder,
    read_bloom_filter: Option<bool>,
Suggested change:
-    read_bloom_filter: Option<bool>,
+    read_bloom_filter: bool,
@@ -635,13 +731,17 @@ impl ReaderPropertiesBuilder {
     fn with_defaults() -> Self {
         Self {
             codec_options_builder: CodecOptionsBuilder::default(),
             read_bloom_filter: None,
Suggested change:
-    read_bloom_filter: None,
+    read_bloom_filter: DEFAULT_READ_BLOOM_FILTER,
thanks @tustvold, i plan to merge as is and then adjust in a subsequent pr:
because 2 relies on 1, and 3 relies on 2, i'd like to clean up all 3 of them together
Benchmark runs are scheduled for baseline = 004a151 and contender = e214ccc. e214ccc is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
nit: this reference is for the arrow ORC C++ writer; parquet C++ does not enable writing bloom filters yet.
Which issue does this PR close?
Rationale for this change
Remove the `bloom` feature.
What changes are included in this PR?
Are there any user-facing changes?
Now that the API is considered complete, the next step is to have an end-to-end test or cross-test with actual parquet file generation.