
Bloom filters for i8 and i16 always return false negatives #5550

Open
progval opened this issue Mar 24, 2024 · 7 comments · May be fixed by #5661
@progval
Contributor

progval commented Mar 24, 2024

Describe the bug

It is unclear to me whether this is an issue when building or when checking the Bloom filter; but either way, a Bloom filter built from i8 or i16 values (as opposed to i32 or i64) always returns false when checked.

To Reproduce

diff --git a/parquet/src/arrow/arrow_writer/mod.rs b/parquet/src/arrow/arrow_writer/mod.rs
index 18c8617e07..d6b14e2899 100644
--- a/parquet/src/arrow/arrow_writer/mod.rs
+++ b/parquet/src/arrow/arrow_writer/mod.rs
@@ -2039,6 +2039,36 @@ mod tests {
         values_required::<BinaryArray, _>(many_vecs_iter);
     }
 
+    #[test]
+    fn i8_column_bloom_filter() {
+        let array = Arc::new(Int8Array::from_iter(0..SMALL_SIZE as i8));
+        let mut options = RoundTripOptions::new(array, false);
+        options.bloom_filter = true;
+
+        let files = one_column_roundtrip_with_options(options);
+        check_bloom_filter(
+            files,
+            "col".to_string(),
+            (0..SMALL_SIZE as i8).collect(),
+            (SMALL_SIZE as i8 + 1..SMALL_SIZE as i8 + 10).collect(),
+        );
+    }
+
+    #[test]
+    fn i16_column_bloom_filter() {
+        let array = Arc::new(Int16Array::from_iter(0..SMALL_SIZE as i16));
+        let mut options = RoundTripOptions::new(array, false);
+        options.bloom_filter = true;
+
+        let files = one_column_roundtrip_with_options(options);
+        check_bloom_filter(
+            files,
+            "col".to_string(),
+            (0..SMALL_SIZE as i16).collect(),
+            (SMALL_SIZE as i16 + 1..SMALL_SIZE as i16 + 10).collect(),
+        );
+    }
+
     #[test]
     fn i32_column_bloom_filter() {
         let array = Arc::new(Int32Array::from_iter(0..SMALL_SIZE as i32));
@@ -2054,6 +2084,21 @@ mod tests {
         );
     }
 
+    #[test]
+    fn i64_column_bloom_filter() {
+        let array = Arc::new(Int64Array::from_iter(0..SMALL_SIZE as i64));
+        let mut options = RoundTripOptions::new(array, false);
+        options.bloom_filter = true;
+
+        let files = one_column_roundtrip_with_options(options);
+        check_bloom_filter(
+            files,
+            "col".to_string(),
+            (0..SMALL_SIZE as i64).collect(),
+            (SMALL_SIZE as i64 + 1..SMALL_SIZE as i64 + 10).collect(),
+        );
+    }
+
     #[test]
     fn binary_column_bloom_filter() {
         let one_vec: Vec<u8> = (0..SMALL_SIZE as u8).collect();

Running these tests returns:

failures:

---- arrow::arrow_writer::tests::i16_column_bloom_filter stdout ----
thread 'arrow::arrow_writer::tests::i16_column_bloom_filter' panicked at parquet/src/arrow/arrow_writer/mod.rs:1792:17:
Value [0, 0] should be in bloom filter

---- arrow::arrow_writer::tests::i8_column_bloom_filter stdout ----
thread 'arrow::arrow_writer::tests::i8_column_bloom_filter' panicked at parquet/src/arrow/arrow_writer/mod.rs:1792:17:
Value [0] should be in bloom filter


failures:
    arrow::arrow_writer::tests::i16_column_bloom_filter
    arrow::arrow_writer::tests::i8_column_bloom_filter

Expected behavior
These tests should pass

Additional context

I found this via DataFusion: apache/datafusion#9779

@progval progval added the bug label Mar 24, 2024
@tustvold
Contributor

I suspect this is probably an issue with sign extension, as parquet doesn't natively support these types and they must be widened to int32.

@mr-brobot
Contributor

I started looking into this. Hoping to have a PR open in the next few days. :)

@mr-brobot
Contributor

mr-brobot commented Mar 28, 2024

As @tustvold mentioned, this appears to be caused by widening Int8 and Int16 columns to Int32 when writing, then later checking the bloom filter with the unwidened type, resulting in different hashes.

I think a general solution would be to apply the same Arrow-to-Parquet type coercion that occurs on write prior to checking the bloom filter.
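
Concretely, the bytes that end up hashed differ. Here is a minimal sketch using parquet's AsBytes trait, which is what Sbbf hashes through; the byte order shown assumes a little-endian machine:

use parquet::data_type::AsBytes;

fn main() {
    // bytes hashed when probing with the original Arrow Int8 value:
    assert_eq!(5i8.as_bytes(), &[5u8]);
    // bytes hashed on write, after the writer widens the value to Int32:
    assert_eq!((5i8 as i32).as_bytes(), &[5u8, 0, 0, 0]);
    // xxhash64 of [5] differs from xxhash64 of [5, 0, 0, 0],
    // hence the guaranteed false negatives
}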

@mr-brobot
Contributor

Overview

The general issue is that Parquet types are a subset of Arrow types, so the Arrow writer must coerce to Parquet types. In some cases, this changes the physical representation. Thus, bloom filter checks can produce false negatives when passed the original Arrow representation. Correctness is only guaranteed when checking with the coerced Parquet value.

For the issue mentioned above, Parquet only supports Int32 and Int64 integer types:

16-bit ints are not explicitly supported in the storage format since they are covered by 32-bit ints with an efficient encoding

Int8 and Int16 Arrow types are widened to Int32 when writing. For correct bloom filter behavior, they must be widened prior to calling Sbbf::check. Widening before checking causes the tests mentioned above to pass.
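
For example, a hypothetical tweak to the i8 test above (assuming check_bloom_filter is generic over the probe type, as the other tests suggest): probing with the widened values makes the check pass:

    // probe with the Int32 physical representation the writer stored,
    // rather than the original i8 values
    check_bloom_filter(
        files,
        "col".to_string(),
        (0..SMALL_SIZE as i32).collect(),
        (SMALL_SIZE as i32 + 1..SMALL_SIZE as i32 + 10).collect(),
    );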

However, this issue is not limited to Int8 and Int16. For example, when writing Date64, values are coerced to Date32 and then to Int32. This changes the physical representation from Int64 "milliseconds since epoch" to Int32 "days since epoch".

#[test]
fn date64_column_bloom_filter() {
    use chrono::NaiveDate;

    // construct input data
    let epoch = NaiveDate::from_ymd_opt(1970, 1, 1).unwrap();
    let dates = vec![
        NaiveDate::from_ymd_opt(1915, 11, 25).unwrap(),
        NaiveDate::from_ymd_opt(1964, 4, 14).unwrap(),
        NaiveDate::from_ymd_opt(1992, 9, 2).unwrap(),
        NaiveDate::from_ymd_opt(2024, 1, 1).unwrap(),
    ];
    let values: Vec<_> = dates
        .iter()
        .map(|d| d.signed_duration_since(epoch).num_milliseconds())
        .collect();

    // write (asserts written values match input)
    let array = Arc::new(Date64Array::from(values.clone()));
    let mut options = RoundTripOptions::new(array, false);
    options.bloom_filter = true;
    let files = one_column_roundtrip_with_options(options);

    // checking the bloom filter with the original "milliseconds since epoch"
    // values fails with:
    // Value [0, 100, 252, 121, 114, 254, 255, 255] should be in bloom filter
    check_bloom_filter(
        files,
        "col".to_string(),
        values,
        (0..SMALL_SIZE as i64).collect(),
    );

    // convert to "days since epoch"
    let values: Vec<_> = dates
        .iter()
        .map(|d| d.signed_duration_since(epoch).num_days() as i32)
        .collect();

    // passes once probed with the coerced "days since epoch" representation
    check_bloom_filter(
        files,
        "col".to_string(),
        values,
        (0..SMALL_SIZE as i32).collect(),
    );
}

I think the same would apply to decimal and interval types.

Solution

This only affects scenarios where Parquet was written via ArrowWriter. The solution should not impact scenarios where the Parquet reader/writer APIs are used directly.

Since the SBBF is checked by the crate user, part of the solution involves documenting how the Arrow-Parquet type coercion affects the physical representation and, therefore, the bloom filter. However, I think we can make some changes to improve the safety/convenience of working with bloom filters. Some options:

  1. Introduce an arrow_value_to_parquet_value function for callers to use prior to calling Sbbf::check.
  2. Introduce a new function, perhaps named arrow_sbbf_check, that accepts an Arrow array and a bloom filter. This method would map the Arrow type to the correct Parquet type before checking the bloom filter, returning a BooleanArray (see the sketch after this list).
  3. Optionally use the bloom filter in the Arrow reader, which could perform the same type coercion that occurs in the Arrow writer. This way, Arrow users never need to use the Sbbf API directly. There is a similar conversation happening regarding the C++ implementation.
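
A rough sketch of option 2 follows. The helper name arrow_sbbf_check and the single Int8-to-Int32 case are hypothetical; a real implementation would mirror the writer's full coercion table:

use arrow::array::{Array, BooleanArray, Int32Array};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;
use parquet::bloom_filter::Sbbf;

/// Hypothetical helper: coerce probe values to the Parquet physical
/// representation the Arrow writer used, then check each one against the filter.
fn arrow_sbbf_check(array: &dyn Array, sbbf: &Sbbf) -> Result<BooleanArray, ArrowError> {
    // Only integer widening is sketched here; Date64, decimals, intervals,
    // etc. would need the same coercion the writer applies on write.
    let widened = cast(array, &DataType::Int32)?;
    let widened = widened
        .as_any()
        .downcast_ref::<Int32Array>()
        .expect("cast to Int32 yields an Int32Array");
    Ok(widened.iter().map(|v| v.map(|v| sbbf.check(&v))).collect())
}

This would keep the coercion logic in one place, next to the writer that defines it, instead of being re-implemented by every caller.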

The first two options are simple but still push some complexity to Arrow users. I think the last option is ideal but definitely a much larger effort. However, these options aren't mutually exclusive. Would love to hear other thoughts on this. :)

@alamb
Contributor

alamb commented Apr 1, 2024

FWIW it appears that the same issue applies to unsigned u8 and u16: apache/datafusion#9770

@alamb
Contributor

alamb commented Apr 1, 2024

Thank you for the analysis @mr-brobot. I think your suggestions 1 or 2 in #5550 (comment) sound reasonable to me (as that function would be needed even if we were to implement 3).

cc @Ted-Jiang

@mr-brobot
Contributor

Ok, I'll move forward with options 1 and 2. Will let folks decide if we keep both or just one in the PR discussion. 🙂

I'll also open a separate issue for option 3. I think that topic deserves a dedicated conversation thread.

I'm traveling this week but should have time to make progress this weekend. 🙂

@mr-brobot mr-brobot linked a pull request Apr 18, 2024 that will close this issue