
Consistent Schema Enforcement #4801

Open
tustvold opened this issue Sep 8, 2023 · 6 comments
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@tustvold
Contributor

tustvold commented Sep 8, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Comparing primitive arrays for equality, perhaps in the context of a compute kernel, is relatively straightforward. A DataType::Int8 is equal to DataType::Int8 and not equal to DataType::UInt8.

For nested types such as StructArray, ListArray and RecordBatch this gets more complex: how strictly should we enforce that a schema is consistent? Should we allow an array to be of a different type from its schema? What about nullability or metadata?
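The contrast can be sketched with a toy model (plain Rust, not arrow-rs types): a derived PartialEq treats the child field's name and nullability as part of a nested type, while a looser check in the spirit of `DataType::equals_datatype` compares only the physical layout.

```rust
// Toy sketch, not arrow-rs code: a minimal DataType illustrating why
// nested equality is ambiguous. For List, the child field's name and
// nullability travel with the type, so strict PartialEq distinguishes
// types that a "same physical layout" check would treat as equal.
#[derive(Debug, PartialEq)]
enum DataType {
    Int8,
    UInt8,
    List { field_name: String, nullable: bool, item: Box<DataType> },
}

impl DataType {
    /// Loose comparison: ignore field names and nullability, compare
    /// only the physical layout (the spirit of `equals_datatype`).
    fn equals_datatype(&self, other: &DataType) -> bool {
        use DataType::*;
        match (self, other) {
            (Int8, Int8) | (UInt8, UInt8) => true,
            (List { item: a, .. }, List { item: b, .. }) => a.equals_datatype(b),
            _ => false,
        }
    }
}

fn main() {
    let a = DataType::List { field_name: "item".into(), nullable: true, item: Box::new(DataType::Int8) };
    let b = DataType::List { field_name: "element".into(), nullable: false, item: Box::new(DataType::Int8) };
    assert_ne!(a, b);               // strict: field name and nullability differ
    assert!(a.equals_datatype(&b)); // loose: same physical layout
}
```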

We currently have a range of approaches:

  • RowConverter accepts any arrays so long as they satisfy DataType::equals_datatype, ignoring metadata and field names, but validating nullability
  • Parquet's ArrowWriter accepts any RecordBatch so long as the writer's Schema::contains the provided batch schema; this forces nullability and metadata to be a subset
  • The arrow-ipc, arrow-json and arrow-csv writers ignore the schema entirely, potentially allowing it to change between batches
  • Most kernels and array constructors use PartialEq to enforce consistency of their inputs
  • Array constructors enforce that field nullability matches the presence of null buffers; this is non-trivial in the case of StructArray
  • MutableArrayData uses the types of the first batch and will panic if it encounters inconsistency
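The "subset" relationship in the parquet bullet can be modelled in a few lines. This is a hypothetical simplification of `Schema::contains` at field granularity, using a toy `Field` type rather than the arrow one:

```rust
use std::collections::HashMap;

// Toy field: name, nullability, and metadata only.
#[derive(Debug)]
struct Field {
    name: String,
    nullable: bool,
    metadata: HashMap<String, String>,
}

impl Field {
    /// `self` (the writer's field) can accept `other` (the batch's field)
    /// if `other` is no more permissive: it may only be nullable where
    /// `self` is, and its metadata must be a subset of `self`'s.
    /// (Simplified model of the `Schema::contains` idea, not arrow's code.)
    fn contains(&self, other: &Field) -> bool {
        self.name == other.name
            && (self.nullable || !other.nullable)
            && other.metadata.iter().all(|(k, v)| self.metadata.get(k) == Some(v))
    }
}

fn main() {
    let mut meta = HashMap::new();
    meta.insert("ext".to_string(), "uuid".to_string());
    let writer = Field { name: "id".into(), nullable: true, metadata: meta };
    let batch = Field { name: "id".into(), nullable: false, metadata: HashMap::new() };
    assert!(writer.contains(&batch));  // less permissive batch is accepted
    assert!(!batch.contains(&writer)); // the reverse is not
}
```

Note the asymmetry: containment is a one-way relation, unlike PartialEq.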

Describe the solution you'd like

I don't really know. Eagerly performing validation can help to catch bugs and issues, but on the flip side it is frustrating to be validating things like field names, metadata, or even nullability that in most cases won't make a difference to correctness

Describe alternatives you've considered

Additional context

#1888
#3226
#4799

@tustvold tustvold added the enhancement Any new improvement worthy of an entry in the changelog label Sep 8, 2023
@alamb
Contributor

alamb commented Sep 8, 2023

More context: #4800 (comment)

I wonder if we could have "strict" and "non strict" schema checking -- e.g. for some things like arrow-json, where there is a configuration object, there is a natural place to add enforce_schema_equality: bool, which users can then turn off if they don't want that definition of equality

However, functions like concat_batches don't have a natural way to configure behavior like this (ConcatOptions seems like maybe it would be overkill 🤔 )
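The opt-out flag suggested above might look like the following sketch. The names (`WriterOptions`, `enforce_schema_equality`, `validate_batch`) are illustrative, not arrow-json's actual API, and the toy `Schema` is just a list of column names:

```rust
// Hypothetical writer options with an opt-out for strict schema
// equality; defaults to strict checking.
#[derive(Clone, Debug, PartialEq)]
struct Schema(Vec<String>); // toy stand-in: column names only

struct WriterOptions {
    /// When true (the default), every batch must match the writer's
    /// schema exactly; when false, mismatches pass through.
    enforce_schema_equality: bool,
}

impl Default for WriterOptions {
    fn default() -> Self {
        WriterOptions { enforce_schema_equality: true }
    }
}

fn validate_batch(opts: &WriterOptions, writer: &Schema, batch: &Schema) -> Result<(), String> {
    if opts.enforce_schema_equality && writer != batch {
        return Err(format!("schema mismatch: {writer:?} vs {batch:?}"));
    }
    Ok(())
}

fn main() {
    let writer = Schema(vec!["a".into(), "b".into()]);
    let batch = Schema(vec!["a".into()]);
    assert!(validate_batch(&WriterOptions::default(), &writer, &batch).is_err());
    let loose = WriterOptions { enforce_schema_equality: false };
    assert!(validate_batch(&loose, &writer, &batch).is_ok());
}
```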

@tustvold
Contributor Author

tustvold commented Sep 8, 2023

Perhaps concat, etc... could take an explicit schema; this would also sidestep the issues around an empty slice...

I could also be convinced to do away with all the validation and just do explicit validation in the places where it matters to correctness - e.g. parquet and nullability

@alamb
Contributor

alamb commented Sep 8, 2023

concat_batches already does take an explicit schema 🤔

https://docs.rs/arrow/latest/arrow/compute/fn.concat_batches.html

pub fn concat_batches<'a>(
    schema: &Arc<Schema>,
    input_batches: impl IntoIterator<Item = &'a RecordBatch>
) -> Result<RecordBatch, ArrowError>

@tustvold
Contributor Author

tustvold commented Sep 8, 2023

In which case perhaps that is the answer to #4800, just use the provided schema and don't perform any additional validation?
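The "trust the provided schema" proposal can be sketched with toy types (not the arrow-rs implementation): the output schema comes from the caller, no per-batch validation is performed, and the empty-slice case falls out naturally because there is no need to infer a schema from a first batch.

```rust
// Toy concat: use the caller-provided schema and skip validation.
#[derive(Clone, Debug, PartialEq)]
struct Schema(Vec<String>); // column names only, for illustration

struct RecordBatch {
    schema: Schema,
    rows: usize,
}

fn concat_batches(schema: &Schema, batches: &[RecordBatch]) -> RecordBatch {
    // Trust the provided schema; this also sidesteps the empty-slice
    // case, where there is no first batch to take a schema from.
    RecordBatch {
        schema: schema.clone(),
        rows: batches.iter().map(|b| b.rows).sum(),
    }
}

fn main() {
    let schema = Schema(vec!["a".into()]);
    let batches = [
        RecordBatch { schema: schema.clone(), rows: 2 },
        RecordBatch { schema: schema.clone(), rows: 3 },
    ];
    assert_eq!(concat_batches(&schema, &batches).rows, 5);
    // Works for an empty slice too, because the schema is provided:
    assert_eq!(concat_batches(&schema, &[]).schema, schema);
}
```

The trade-off, as discussed above, is that a batch whose arrays silently disagree with the provided schema would no longer be caught here.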

@wjones127
Member

There are some field names that are kind of useless (such as those in map and list). In Arrow C++, we disabled checking those as part of equality unless specifically asked for. apache/arrow#14847

Field metadata definitely matters, since that may contain extension type information.

I'm not sure about top-level schema metadata. In many cases I think I'd be fine ignoring that by default, or at least I haven't encountered a situation where I really wanted it.

@waynexia
Member

There are some field names that are kind of useless (such as those in map and list)

What do you think of removing field names from those types? I find them a bit annoying sometimes. Or is there any place where they matter?
