Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ListArray::try_new disallows null elements if field.is_nullable() == false #4602

Closed
kawadakk opened this issue Jul 31, 2023 · 2 comments
Closed
Labels
bug development-process Related to development process of arrow-rs

Comments

@kawadakk
Copy link
Contributor

kawadakk commented Jul 31, 2023

Describe the bug
Even if a field is non-nullable in its schema definition, it's only semantic and I think this should not prevent the creation of arrays containing null values. Null values in semantically non-nullable fields may arise from, for example, being nested inside another nullable field (as in #3900) or a sparse union field.

Arrow Columnar Format:

The Field Flatbuffers type contains the metadata for a single array. This includes: [...] Whether the field is semantically nullable. While this has no bearing on the array’s physical layout, many systems distinguish nullable and non-nullable fields and we want to allow them to preserve this metadata to enable faithful schema round trips.

To Reproduce

let field = Arc::new(Field::new("element", DataType::Int32, false));
let offsets = OffsetBuffer::new(vec![0, 1, 4, 5].into());
let values = new_null_array(&DataType::Int32, 5);
ListArray::new(field, offsets, values, None);

The above code panics with message called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Non-nullable field of ListArray \"element\" cannot contain nulls")

Expected behavior
Not panicking

Additional context

@kawadakk kawadakk added the bug label Jul 31, 2023
@tustvold
Copy link
Contributor

tustvold commented Jul 31, 2023

This is intentional, and in keeping with the arrow specification. The specification is merely stating that the nullability of a field does not impact the physical layout of a type, which is entirely true, in much the same way as a timestamp timezone also does not. However, it is still incorrect for a non-nullable ListArray to contain nullable values, in much the same way as it would be incorrect for it to contain values of a different type.

For an example of why this is important, the record shredding when writing to parquet cares about schema nullability and will be incorrect if there is inconsistency. Similarly some of DataFusion's optimisation passes are incorrect if the arrays don't contain nulls. Checking this at runtime helps to avoid subtle bugs that can result in data corruption.

For more context see #1890

Perhaps if you can describe your use-case I can advise on how best to achieve this

@kawadakk
Copy link
Contributor Author

kawadakk commented Aug 1, 2023

Thank you for clarification!

@kawadakk kawadakk closed this as completed Aug 1, 2023
@tustvold tustvold added the development-process Related to development process of arrow-rs label Aug 1, 2023
@kawadakk kawadakk closed this as not planned Won't fix, can't repro, duplicate, stale Aug 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug development-process Related to development process of arrow-rs
Projects
None yet
Development

No branches or pull requests

2 participants