Allow concat_batches to work with RecordBatches that have different metadata #4800

Closed
wants to merge 1 commit into from

Conversation

alamb

@alamb alamb commented Sep 8, 2023

Which issue does this PR close?

Resolves #4799

Rationale for this change

See #4799

What changes are included in this PR?

  1. Only check datatype and field name when verifying compatibility -- ignore other details like metadata
  2. Tests for same
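As a rough illustration of the relaxed check in item 1, here is a minimal sketch using simplified stand-in types. `Field`, `Schema`, and the `field` helper below are hypothetical and are not the arrow crate's types; only the `concatable_schema` name and signature shape come from the diff. It compares field names and data types pairwise and deliberately ignores metadata.

```rust
use std::collections::HashMap;

/// Simplified stand-in for an Arrow field (hypothetical, not the arrow crate's type).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
    metadata: HashMap<String, String>,
}

/// Simplified stand-in for an Arrow schema (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<Field>,
    metadata: HashMap<String, String>,
}

/// Returns true if data with the `source` schema can be placed in a batch
/// with the `target` schema: same number of fields, and each pair of fields
/// agrees on name and data type. Field and schema metadata are not compared.
fn concatable_schema(target: &Schema, source: &Schema) -> bool {
    target.fields.len() == source.fields.len()
        && target
            .fields
            .iter()
            .zip(source.fields.iter())
            .all(|(t, s)| t.name == s.name && t.data_type == s.data_type)
}

/// Hypothetical helper to build a field with optional metadata entries.
fn field(name: &str, data_type: &str, meta: &[(&str, &str)]) -> Field {
    Field {
        name: name.to_string(),
        data_type: data_type.to_string(),
        metadata: meta
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
    }
}
```

Under this sketch, two schemas that differ only in metadata compare unequal with `==` but are still accepted by `concatable_schema`, while a differing data type is rejected.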

Are there any user-facing changes?

concat_batches will error in fewer cases: input batches whose schemas differ only in details such as metadata will no longer be rejected.

@alamb alamb changed the title from "Allow concat_batches to work with different metadata" to "Allow concat_batches to work with RecordBatches that have different metadata" Sep 8, 2023
@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 8, 2023
@alamb alamb marked this pull request as ready for review September 8, 2023 16:13
@tustvold

tustvold commented Sep 8, 2023

Is this not correct behaviour? If the metadata is inconsistent how does it know which metadata to preserve?

@@ -204,13 +204,32 @@ pub fn concat_batches<'a>(
RecordBatch::try_new(schema.clone(), arrays)
}

/// Returns true if data with the `source` Schema can be placed in a
/// record batch with `target` Schema
fn concatable_schema(target: &Schema, source: &Schema) -> bool {
@alamb

alamb commented Sep 8, 2023

If the metadata is inconsistent how does it know which metadata to preserve?

Right now the Schema of the output RecordBatch is the schema that was provided by the caller as the first argument to concat_batches.
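To make that behaviour concrete, here is a minimal sketch with simplified stand-in types. `Schema`, `Batch`, and this `concat_batches` are hypothetical models written for illustration, not the arrow crate's implementation. The output always carries the schema the caller passed in, regardless of what metadata the input batches had.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a schema (hypothetical, not the arrow crate's type).
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    field_names: Vec<String>,
    metadata: HashMap<String, String>,
}

/// Simplified stand-in for a record batch holding integer columns (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    schema: Schema,
    columns: Vec<Vec<i64>>,
}

/// Concatenates the columns of `batches` in order. The result uses the
/// caller-provided `schema`; the schemas of the inputs are not consulted,
/// so their metadata never reaches the output.
fn concat_batches(schema: &Schema, batches: &[Batch]) -> Batch {
    let mut columns = vec![Vec::new(); schema.field_names.len()];
    for batch in batches {
        for (out, col) in columns.iter_mut().zip(&batch.columns) {
            out.extend_from_slice(col);
        }
    }
    Batch {
        schema: schema.clone(),
        columns,
    }
}
```

In this model, passing a schema without metadata yields an output without metadata even if one input batch had some, which is the sense in which the caller decides which metadata is preserved.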

To summarize a conversation I had with @tustvold over Slack:

  1. I believe his core concern with this PR is that making this check more lax means that it is likely papering over what some people might perceive as a bug in the caller (in this case, inconsistent metadata)

  2. An alternative interpretation might be that by checking for exactly the same schema for all input batches, the concat_batches kernel is imposing a particular definition of schema equality and enforcing an invariant that might not be what other systems have in mind. From this point of view, removing the Schema equality check entirely might be appropriate

I am sure I can fix my particular problem (see https://github.com/influxdata/influxdb_iox/pull/8691/files#r1319044861) at another level of the stack (e.g. in DataFusion), but it didn't feel right to me that concat_batches was enforcing some particular invariant that is not enforced elsewhere.

@alamb

alamb commented Sep 8, 2023

See discussion on #4801

@alamb

alamb commented Sep 13, 2023

I think the conclusion of the discussion is that we would prefer to remove the error checking entirely. See #4815.

@alamb alamb marked this pull request as draft September 13, 2023 20:57
@alamb alamb closed this in #4815 Sep 16, 2023
Labels
arrow Changes to the arrow crate
Development

Successfully merging this pull request may close these issues.

concat_batches errors with "schema mismatch" error when only metadata differs
2 participants