
Bloom filters for i8 and i16 always return false negatives #5550

Open
progval opened this issue Mar 24, 2024 · 7 comments · May be fixed by #5661
@progval
Contributor

progval commented Mar 24, 2024

Describe the bug

It is unclear to me whether this is an issue when building or when checking the Bloom filter; but either way, a Bloom filter built from i8 or i16 values (as opposed to i32 or i64) always returns false when checked.

To Reproduce

diff --git a/parquet/src/arrow/arrow_writer/mod.rs b/parquet/src/arrow/arrow_writer/mod.rs
index 18c8617e07..d6b14e2899 100644
--- a/parquet/src/arrow/arrow_writer/mod.rs
+++ b/parquet/src/arrow/arrow_writer/mod.rs
@@ -2039,6 +2039,36 @@ mod tests {
         values_required::<BinaryArray, _>(many_vecs_iter);
     }
 
+    #[test]
+    fn i8_column_bloom_filter() {
+        let array = Arc::new(Int8Array::from_iter(0..SMALL_SIZE as i8));
+        let mut options = RoundTripOptions::new(array, false);
+        options.bloom_filter = true;
+
+        let files = one_column_roundtrip_with_options(options);
+        check_bloom_filter(
+            files,
+            "col".to_string(),
+            (0..SMALL_SIZE as i8).collect(),
+            (SMALL_SIZE as i8 + 1..SMALL_SIZE as i8 + 10).collect(),
+        );
+    }
+
+    #[test]
+    fn i16_column_bloom_filter() {
+        let array = Arc::new(Int16Array::from_iter(0..SMALL_SIZE as i16));
+        let mut options = RoundTripOptions::new(array, false);
+        options.bloom_filter = true;
+
+        let files = one_column_roundtrip_with_options(options);
+        check_bloom_filter(
+            files,
+            "col".to_string(),
+            (0..SMALL_SIZE as i16).collect(),
+            (SMALL_SIZE as i16 + 1..SMALL_SIZE as i16 + 10).collect(),
+        );
+    }
+
     #[test]
     fn i32_column_bloom_filter() {
         let array = Arc::new(Int32Array::from_iter(0..SMALL_SIZE as i32));
@@ -2054,6 +2084,21 @@ mod tests {
         );
     }
 
+    #[test]
+    fn i64_column_bloom_filter() {
+        let array = Arc::new(Int64Array::from_iter(0..SMALL_SIZE as i64));
+        let mut options = RoundTripOptions::new(array, false);
+        options.bloom_filter = true;
+
+        let files = one_column_roundtrip_with_options(options);
+        check_bloom_filter(
+            files,
+            "col".to_string(),
+            (0..SMALL_SIZE as i64).collect(),
+            (SMALL_SIZE as i64 + 1..SMALL_SIZE as i64 + 10).collect(),
+        );
+    }
+
     #[test]
     fn binary_column_bloom_filter() {
         let one_vec: Vec<u8> = (0..SMALL_SIZE as u8).collect();

Running these tests returns:

failures:

---- arrow::arrow_writer::tests::i16_column_bloom_filter stdout ----
thread 'arrow::arrow_writer::tests::i16_column_bloom_filter' panicked at parquet/src/arrow/arrow_writer/mod.rs:1792:17:
Value [0, 0] should be in bloom filter

---- arrow::arrow_writer::tests::i8_column_bloom_filter stdout ----
thread 'arrow::arrow_writer::tests::i8_column_bloom_filter' panicked at parquet/src/arrow/arrow_writer/mod.rs:1792:17:
Value [0] should be in bloom filter


failures:
    arrow::arrow_writer::tests::i16_column_bloom_filter
    arrow::arrow_writer::tests::i8_column_bloom_filter

Expected behavior
These tests should pass

Additional context

I found this via DataFusion: apache/datafusion#9779

@progval progval added the bug label Mar 24, 2024
@tustvold
Contributor

I suspect this is probably an issue with sign extension, as parquet doesn't natively support these types and they must be widened to int32.

@mr-brobot
Contributor

I started looking into this. Hoping to have a PR open in the next few days. :)

@mr-brobot
Contributor

mr-brobot commented Mar 28, 2024

As @tustvold mentioned, this appears to be caused by widening Int8 and Int16 columns to Int32 when writing, then later checking the bloom filter with the unwidened type, resulting in different hashes.

I think a general solution would be to apply the same Arrow-to-Parquet type coercion that occurs on write prior to checking the bloom filter.
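
Concretely, the bytes that end up hashed differ. Here is a minimal sketch using parquet's AsBytes trait, which is what Sbbf hashes through; the byte order shown assumes a little-endian machine:

use parquet::data_type::AsBytes;

fn main() {
    // bytes hashed when probing with the original Arrow Int8 value:
    assert_eq!(5i8.as_bytes(), &[5u8]);
    // bytes hashed on write, after the writer widens the value to Int32:
    assert_eq!((5i8 as i32).as_bytes(), &[5u8, 0, 0, 0]);
    // xxhash64 of [5] differs from xxhash64 of [5, 0, 0, 0],
    // hence the guaranteed false negatives
}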

@mr-brobot
Contributor

Overview

The general issue is that Parquet types are a subset of Arrow types, so the Arrow writer must coerce to Parquet types. In some cases, this changes the physical representation. Thus, bloom filter checks can produce false negatives when passed the original Arrow representation. Correctness is only guaranteed when checking with the coerced Parquet value.

For the issue mentioned above, Parquet only supports Int32 and Int64 integer types:

16-bit ints are not explicitly supported in the storage format since they are covered by 32-bit ints with an efficient encoding

Int8 and Int16 Arrow types are widened to Int32 when writing. For correct bloom filter behavior, they must be widened prior to calling Sbbf::check. Widening before checking causes the tests mentioned above to pass.
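
For example, a hypothetical tweak to the i8 test above (assuming check_bloom_filter is generic over the probe type, as the other tests suggest): probing with the widened values makes the check pass:

    // probe with the Int32 physical representation the writer stored,
    // rather than the original i8 values
    check_bloom_filter(
        files,
        "col".to_string(),
        (0..SMALL_SIZE as i32).collect(),
        (SMALL_SIZE as i32 + 1..SMALL_SIZE as i32 + 10).collect(),
    );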

However, this issue is not limited to Int8 and Int16. For example, when writing Date64, values are coerced to Date32 and then to Int32. This changes the physical representation from Int64 "milliseconds since epoch" to Int32 "days since epoch".

#[test]
fn date64_column_bloom_filter() {
    use chrono::NaiveDate;

    // construct input data
    let epoch = NaiveDate::from_ymd_opt(1970, 1, 1).unwrap();
    let dates = vec![
        NaiveDate::from_ymd_opt(1915, 11, 25).unwrap(),
        NaiveDate::from_ymd_opt(1964, 4, 14).unwrap(),
        NaiveDate::from_ymd_opt(1992, 9, 2).unwrap(),
        NaiveDate::from_ymd_opt(2024, 1, 1).unwrap(),
    ];
    let values: Vec<_> = dates
        .iter()
        .map(|d| d.signed_duration_since(epoch).num_milliseconds())
        .collect();

    // write (asserts written values match input)
    let array = Arc::new(Date64Array::from(values.clone()));
    let mut options = RoundTripOptions::new(array, false);
    options.bloom_filter = true;
    let files = one_column_roundtrip_with_options(options);

    // checking the bloom filter with the original "milliseconds since epoch"
    // values fails with:
    // Value [0, 100, 252, 121, 114, 254, 255, 255] should be in bloom filter
    check_bloom_filter(
        files,
        "col".to_string(),
        values,
        (0..SMALL_SIZE as i64).collect(),
    );

    // convert to "days since epoch"
    let values: Vec<_> = dates
        .iter()
        .map(|d| d.signed_duration_since(epoch).num_days() as i32)
        .collect();

    // passes once probed with the coerced "days since epoch" representation
    check_bloom_filter(
        files,
        "col".to_string(),
        values,
        (0..SMALL_SIZE as i32).collect(),
    );
}

I think the same would apply to decimal and interval types.

Solution

This only affects scenarios where Parquet was written via ArrowWriter. The solution should not impact scenarios where the Parquet reader/writer APIs are used directly.

Since the SBBF is checked by the crate user, part of the solution involves documenting how the Arrow-Parquet type coercion affects the physical representation and, therefore, the bloom filter. However, I think we can make some changes to improve the safety/convenience of working with bloom filters. Some options:

  1. Introduce an arrow_value_to_parquet_value function for callers to use prior to calling Sbbf::check.
  2. Introduce a new function, perhaps named arrow_sbbf_check, that accepts an Arrow array and a bloom filter. This method would map the Arrow type to the correct Parquet type before checking the bloom filter, returning a BooleanArray (see the sketch after this list).
  3. Optionally use the bloom filter in the Arrow reader, which could perform the same type coercion that occurs in the Arrow writer. This way, Arrow users never need to use the Sbbf API directly. There is a similar conversation happening regarding the C++ implementation.
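
A rough sketch of option 2 follows. The helper name arrow_sbbf_check and the single Int8-to-Int32 case are hypothetical; a real implementation would mirror the writer's full coercion table:

use arrow::array::{Array, BooleanArray, Int32Array};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;
use parquet::bloom_filter::Sbbf;

/// Hypothetical helper: coerce probe values to the Parquet physical
/// representation the Arrow writer used, then check each one against the filter.
fn arrow_sbbf_check(array: &dyn Array, sbbf: &Sbbf) -> Result<BooleanArray, ArrowError> {
    // Only integer widening is sketched here; Date64, decimals, intervals,
    // etc. would need the same coercion the writer applies on write.
    let widened = cast(array, &DataType::Int32)?;
    let widened = widened
        .as_any()
        .downcast_ref::<Int32Array>()
        .expect("cast to Int32 yields an Int32Array");
    Ok(widened.iter().map(|v| v.map(|v| sbbf.check(&v))).collect())
}

This would keep the coercion logic in one place, next to the writer that defines it, instead of being re-implemented by every caller.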

The first two options are simple but still push some complexity to Arrow users. I think the last option is ideal but definitely a much larger effort. However, these options aren't mutually exclusive. Would love to hear other thoughts on this. :)

@alamb
Contributor

alamb commented Apr 1, 2024

FWIW it appears that the same issue applies to unsigned u8 and u16: apache/datafusion#9770

@alamb
Contributor

alamb commented Apr 1, 2024

Thank you for the analysis @mr-brobot. I think your suggestions 1 or 2 in #5550 (comment) sound reasonable to me (as that function would be needed even if we were to implement 3).

cc @Ted-Jiang

@mr-brobot
Contributor

Ok, I'll move forward with options 1 and 2. Will let folks decide if we keep both or just one in the PR discussion. 🙂

I'll also open a separate issue for option 3. I think that topic deserves a dedicated conversation thread.

I'm traveling this week but should have time to make progress this weekend. 🙂

@mr-brobot mr-brobot linked a pull request Apr 18, 2024 that will close this issue