Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnionArray::try_new is not strict enough when validating inputs #5577

Open
Jefffrey opened this issue Apr 1, 2024 · 0 comments
Open

UnionArray::try_new is not strict enough when validating inputs #5577

Jefffrey opened this issue Apr 1, 2024 · 0 comments
Labels

Comments

@Jefffrey
Copy link
Contributor

Jefffrey commented Apr 1, 2024

Describe the bug

We can create invalid UnionArrays when using try_new(...) as the validation checks done are not strict enough.

To Reproduce

Run test:

    #[test]
    fn test_invalid_type_ids() {
        let int_array = Arc::new(Int32Array::from(vec![5, 6]));
        let f32_array = Arc::new(Float64Array::from(vec![10.0]));

        let type_ids = [1_i8, 0, 2]; // child with type ID '2' does not exist
        let offsets = [0_i32, 0, 0];

        let type_id_buffer = Buffer::from_slice_ref(type_ids);
        let value_offsets_buffer = Buffer::from_slice_ref(offsets);

        let children: Vec<(Field, Arc<dyn Array>)> = vec![
            (Field::new("A", DataType::Int32, false), int_array),
            (Field::new("B", DataType::Float64, false), f32_array),
        ];
        let array = UnionArray::try_new(
            &[0, 1, 2],
            type_id_buffer,
            Some(value_offsets_buffer),
            children,
        )
        .unwrap();
        dbg!(&array);
        dbg!(array.len());
        dbg!(array.value(2));
    }

Get output:

arrow-rs$ cargo test -p arrow-array --lib array::union_array::tests::test_invalid_type_ids -- --exact --nocapture
    Finished test [unoptimized + debuginfo] target(s) in 0.05s
     Running unittests src/lib.rs (target/debug/deps/arrow_array-d885d14cd41402de)

running 1 test
[arrow-array/src/array/union_array.rs:851:9] &array = UnionArray(Dense)
[
-- type id buffer:
ScalarBuffer([1, 0, 2])
-- offsets buffer:
ScalarBuffer([0, 0, 0])
-- child 0: "A" (Int32)
PrimitiveArray<Int32>
[
  5,
  6,
]
-- child 1: "B" (Float64)
PrimitiveArray<Float64>
[
  10.0,
]
]

[arrow-array/src/array/union_array.rs:852:9] array.len() = 3
thread 'array::union_array::tests::test_invalid_type_ids' panicked at arrow-array/src/array/union_array.rs:232:9:
assertion failed: (type_id as usize) < self.fields.len()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test array::union_array::tests::test_invalid_type_ids ... FAILED

failures:

failures:
    array::union_array::tests::test_invalid_type_ids

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 427 filtered out; finished in 0.00s

error: test failed, to rerun pass `-p arrow-array --lib`

Expected behavior

Array should fail to create. It suggests an array with length 3 was created but panics when trying to fetch the 2nd index.

Additional context

Note that some of the documentation is misleading too. For example, on new_unchecked(...):

/// The `type_ids` `Buffer` should contain `i8` values. These values should be greater than
/// zero and must be less than the number of children provided in `child_arrays`. These values
/// are used to index into the `child_arrays`.

  • Though this is for new_unchecked(...), it implies these inputs are validated by try_new(...)

However this is contradicted by the following tests:

#[test]
fn test_dense_mixed_with_str() {
let string_array = StringArray::from(vec!["foo", "bar", "baz"]);
let int_array = Int32Array::from(vec![5, 6]);
let float_array = Float64Array::from(vec![10.0]);
let type_ids = [1_i8, 0, 0, 2, 0, 1];
let offsets = [0_i32, 0, 1, 0, 2, 1];
let type_id_buffer = Buffer::from_slice_ref(type_ids);
let value_offsets_buffer = Buffer::from_slice_ref(offsets);
let children: Vec<(Field, Arc<dyn Array>)> = vec![
(
Field::new("A", DataType::Utf8, false),
Arc::new(string_array),
),
(Field::new("B", DataType::Int32, false), Arc::new(int_array)),
(
Field::new("C", DataType::Float64, false),
Arc::new(float_array),
),
];
let array = UnionArray::try_new(
&[0, 1, 2],
type_id_buffer,
Some(value_offsets_buffer),
children,
)
.unwrap();

  • Here it shows we can provide 0-values in the type_ids buffer

let offsets = Buffer::from_slice_ref([0, 0, 1, 1, 2, 2, 3, 4i32]);
let type_ids = Buffer::from_slice_ref([42, 84, 42, 84, 84, 42, 84, 84i8]);
let array = UnionArray::try_new(
&[84, 42],
type_ids,
Some(offsets),
vec![
(Field::new("int", DataType::Int32, false), ints),
(Field::new("string", DataType::Utf8, false), strings),
],
)
.unwrap()

  • Here it shows we can provide values greater than the length of the Vec of child arrays passed in, for the type_ids buffer

Edit: another note is that field_type_ids isn't clearly documented on what exactly it represents

@Jefffrey Jefffrey added the bug label Apr 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant