
Consistent Schema Enforcement #4801

Open
tustvold opened this issue Sep 8, 2023 · 6 comments
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@tustvold
Contributor

tustvold commented Sep 8, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Comparing primitive arrays for equality, perhaps in the context of a compute kernel, is relatively straightforward. A DataType::Int8 is equal to DataType::Int8 and not equal to DataType::UInt8.

For nested types such as StructArray, ListArray and RecordBatch this gets more complex: how strictly should we enforce that a schema is consistent? Should we allow an array to be of a different type from its schema? What about nullability or metadata?
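The contrast can be sketched with a toy model (plain Rust, not arrow-rs types): a derived PartialEq treats the child field's name and nullability as part of a nested type, while a looser check in the spirit of `DataType::equals_datatype` compares only the physical layout.

```rust
// Toy sketch, not arrow-rs code: a minimal DataType illustrating why
// nested equality is ambiguous. For List, the child field's name and
// nullability travel with the type, so strict PartialEq distinguishes
// types that a "same physical layout" check would treat as equal.
#[derive(Debug, PartialEq)]
enum DataType {
    Int8,
    UInt8,
    List { field_name: String, nullable: bool, item: Box<DataType> },
}

impl DataType {
    /// Loose comparison: ignore field names and nullability, compare
    /// only the physical layout (the spirit of `equals_datatype`).
    fn equals_datatype(&self, other: &DataType) -> bool {
        use DataType::*;
        match (self, other) {
            (Int8, Int8) | (UInt8, UInt8) => true,
            (List { item: a, .. }, List { item: b, .. }) => a.equals_datatype(b),
            _ => false,
        }
    }
}

fn main() {
    let a = DataType::List { field_name: "item".into(), nullable: true, item: Box::new(DataType::Int8) };
    let b = DataType::List { field_name: "element".into(), nullable: false, item: Box::new(DataType::Int8) };
    assert_ne!(a, b);               // strict: field name and nullability differ
    assert!(a.equals_datatype(&b)); // loose: same physical layout
}
```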

We currently have a range of approaches:

  • RowConverter accepts any arrays so long as they satisfy DataType::equals_datatype, ignoring metadata and field names, but validating nullability
  • Parquet's ArrowWriter accepts any RecordBatch so long as the writer's Schema::contains the provided batch schema; this forces nullability and metadata to be a subset
  • The arrow-ipc, arrow-json and arrow-csv writers ignore the schema entirely, potentially allowing it to change between batches
  • Most kernels and array constructors use PartialEq to enforce consistency of their inputs
  • Array constructors enforce that field nullability matches the presence of null buffers; this is non-trivial in the case of StructArray
  • MutableArrayData uses the types of the first batch and will panic if it encounters inconsistency
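The "subset" relationship in the parquet bullet can be modelled in a few lines. This is a hypothetical simplification of `Schema::contains` at field granularity, using a toy `Field` type rather than the arrow one:

```rust
use std::collections::HashMap;

// Toy field: name, nullability, and metadata only.
#[derive(Debug)]
struct Field {
    name: String,
    nullable: bool,
    metadata: HashMap<String, String>,
}

impl Field {
    /// `self` (the writer's field) can accept `other` (the batch's field)
    /// if `other` is no more permissive: it may only be nullable where
    /// `self` is, and its metadata must be a subset of `self`'s.
    /// (Simplified model of the `Schema::contains` idea, not arrow's code.)
    fn contains(&self, other: &Field) -> bool {
        self.name == other.name
            && (self.nullable || !other.nullable)
            && other.metadata.iter().all(|(k, v)| self.metadata.get(k) == Some(v))
    }
}

fn main() {
    let mut meta = HashMap::new();
    meta.insert("ext".to_string(), "uuid".to_string());
    let writer = Field { name: "id".into(), nullable: true, metadata: meta };
    let batch = Field { name: "id".into(), nullable: false, metadata: HashMap::new() };
    assert!(writer.contains(&batch));  // less permissive batch is accepted
    assert!(!batch.contains(&writer)); // the reverse is not
}
```

Note the asymmetry: containment is a one-way relation, unlike PartialEq.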

Describe the solution you'd like

I don't really know. Eagerly performing validation can help to catch bugs and issues, but on the flip side it is frustrating to be validating things like field names, metadata, or even nullability that in most cases won't make a difference to correctness

Describe alternatives you've considered

Additional context

#1888
#3226
#4799

@tustvold tustvold added the enhancement Any new improvement worthy of an entry in the changelog label Sep 8, 2023
@alamb
Contributor

alamb commented Sep 8, 2023

More context: #4800 (comment)

I wonder if we could have "strict" and "non strict" schema checking -- e.g. for some things like arrow-json, where there is a configuration object, there is a natural place to add enforce_schema_equality: bool, which users can then turn off if they don't want that definition of equality

However, functions like concat_batches don't have a natural way to configure behavior like this (ConcatOptions seems like maybe it would be overkill 🤔 )
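The opt-out flag suggested above might look like the following sketch. The names (`WriterOptions`, `enforce_schema_equality`, `validate_batch`) are illustrative, not arrow-json's actual API, and the toy `Schema` is just a list of column names:

```rust
// Hypothetical writer options with an opt-out for strict schema
// equality; defaults to strict checking.
#[derive(Clone, Debug, PartialEq)]
struct Schema(Vec<String>); // toy stand-in: column names only

struct WriterOptions {
    /// When true (the default), every batch must match the writer's
    /// schema exactly; when false, mismatches pass through.
    enforce_schema_equality: bool,
}

impl Default for WriterOptions {
    fn default() -> Self {
        WriterOptions { enforce_schema_equality: true }
    }
}

fn validate_batch(opts: &WriterOptions, writer: &Schema, batch: &Schema) -> Result<(), String> {
    if opts.enforce_schema_equality && writer != batch {
        return Err(format!("schema mismatch: {writer:?} vs {batch:?}"));
    }
    Ok(())
}

fn main() {
    let writer = Schema(vec!["a".into(), "b".into()]);
    let batch = Schema(vec!["a".into()]);
    assert!(validate_batch(&WriterOptions::default(), &writer, &batch).is_err());
    let loose = WriterOptions { enforce_schema_equality: false };
    assert!(validate_batch(&loose, &writer, &batch).is_ok());
}
```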

@tustvold
Contributor Author

tustvold commented Sep 8, 2023

Perhaps concat, etc... could take an explicit schema; this would also sidestep the issues around an empty slice...

I could also be convinced to do away with all the validation and just do explicit validation in the places where it matters to correctness - e.g. parquet and nullability

@alamb
Contributor

alamb commented Sep 8, 2023

concat_batches already does take an explicit schema 🤔

https://docs.rs/arrow/latest/arrow/compute/fn.concat_batches.html

pub fn concat_batches<'a>(
    schema: &Arc<Schema>,
    input_batches: impl IntoIterator<Item = &'a RecordBatch>
) -> Result<RecordBatch, ArrowError>

@tustvold
Contributor Author

tustvold commented Sep 8, 2023

In which case perhaps that is the answer to #4800, just use the provided schema and don't perform any additional validation?
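The "trust the provided schema" proposal can be sketched with toy types (not the arrow-rs implementation): the output schema comes from the caller, no per-batch validation is performed, and the empty-slice case falls out naturally because there is no need to infer a schema from a first batch.

```rust
// Toy concat: use the caller-provided schema and skip validation.
#[derive(Clone, Debug, PartialEq)]
struct Schema(Vec<String>); // column names only, for illustration

struct RecordBatch {
    schema: Schema,
    rows: usize,
}

fn concat_batches(schema: &Schema, batches: &[RecordBatch]) -> RecordBatch {
    // Trust the provided schema; this also sidesteps the empty-slice
    // case, where there is no first batch to take a schema from.
    RecordBatch {
        schema: schema.clone(),
        rows: batches.iter().map(|b| b.rows).sum(),
    }
}

fn main() {
    let schema = Schema(vec!["a".into()]);
    let batches = [
        RecordBatch { schema: schema.clone(), rows: 2 },
        RecordBatch { schema: schema.clone(), rows: 3 },
    ];
    assert_eq!(concat_batches(&schema, &batches).rows, 5);
    // Works for an empty slice too, because the schema is provided:
    assert_eq!(concat_batches(&schema, &[]).schema, schema);
}
```

The trade-off, as discussed above, is that a batch whose arrays silently disagree with the provided schema would no longer be caught here.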

@wjones127
Member

There are some field names that are kind of useless (such as those in map and list). In Arrow C++, we disabled checking those as part of equality unless specifically asked for. apache/arrow#14847

Field metadata definitely matters, since that may contain extension type information.

I'm not sure about top-level schema metadata. In many cases I think I'd be fine ignoring that by default, or at least I haven't encountered a situation where I really wanted it.

@waynexia
Member

There are some field names that are kind of useless (such as those in map and list)

What do you think of removing field names from those types? I find them a bit annoying sometimes. Or is there any place where they matter?
