Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unnecessary ownership makes it harder to use RecordBatch::schema and Schema::try_merge than it needs to be #5342

Closed
carols10cents opened this issue Jan 29, 2024 · 14 comments
Labels
arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted parquet Changes to the parquet crate

Comments

@carols10cents
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I have a bunch of record batches, and I'd like to unify their schemas. I would like to create one new Schema instance, which I will have ownership of, of course, but it seems like the schema merge should only need to borrow the record batches' schemas.

Describe the solution you'd like

I wish I could write:

use arrow::{datatypes::Schema, error::ArrowError, record_batch::RecordBatch};
fn schema_merge(batches: &[RecordBatch]) -> Result<Schema, ArrowError> {
    Schema::try_merge(batches.iter().map(|b| b.schema()))
}

but that doesn't compile because batch.schema() returns an owned Arc<Schema> and try_merge expects impl IntoIterator<Item = Schema>:

425 |         Schema::try_merge(batches.iter().map(|b| b.schema()))
    |         ----------------- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `Schema`, found `Arc<Schema>`
    |         |
    |         required by a bound introduced by this call
    |
    = note: expected struct `arrow::datatypes::Schema`
               found struct `std::sync::Arc<arrow::datatypes::Schema>`

It seems like Schema::try_merge could be changed to only need references to the schemas (or a new function if you didn't want to break backwards compatibility):

-    pub fn try_merge(schemas: impl IntoIterator<Item = Self>) -> Result<Self, ArrowError> {
+    pub fn try_merge<'a>(
+        schemas: impl IntoIterator<Item = &'a Schema>,
+    ) -> Result<Schema, ArrowError> {
...
            // merge metadata
            for (key, value) in metadata.into_iter() {
-                if let Some(old_val) = out_meta.get(&key) {
-                    if old_val != &value {
+                if let Some(old_val) = out_meta.get(key) {
+                    if old_val != value {
...
-                out_meta.insert(key, value);
+                out_meta.insert(key.clone(), value.clone());
            }
}

buuuuuut then you'd have this problem if you try to use as_ref to get &Schema from the Arc<Schema>s:

error[E0515]: cannot return value referencing temporary value
   --> bulk_ingester/src/fsm/create_intermediate_files.rs:425:40
    |
425 |    my_try_merge(batches.iter().map(|b| b.schema().as_ref()))
    |                                        ----------^^^^^^^^^
    |                                        |
    |                                        returns a value referencing data owned by the current function
    |                                        temporary value created here

so it seems like if there's a Schema::try_merge_from_borrowed_schemas, then it'd also be nice to have RecordBatch::borrow_schema that returns &Schema (not to be confused with SchemaRef, of course ;))

What do you think?

@carols10cents carols10cents added the enhancement Any new improvement worthy of a entry in the changelog label Jan 29, 2024
@tustvold
Copy link
Contributor

No objections to adding a schema_ref method to RecordBatch or even changing schema to return a reference.

W.r.t to merging, perhaps we could just add an appropriate method to SchemaBuilder?

@monkwire
Copy link
Contributor

I recently started working on changing RecordBatch::Schema so that it returns an &Schema instead of a SchemaRef. This is a breaking change, and in order to get this code to compile, I had to make changes to files where RecordBatch::Schema is called. These changes were pretty straightforward, and in many cases I just had to remove .as_ref() from places where the SchemaRef was being treated as a &Schema. In other cases, I wrapped the new &Schema return values in Arcs.

I will put up a PR for this soon. If these changes are too significant, I would be happy to make another PR that creates a schema_ref method and doesn't alter the functionality of Schema.

@tustvold
Copy link
Contributor

I think it should probably return a &SchemaRef as this will allow users to cheaply clone it if they want an owned SchemaRef

@monkwire
Copy link
Contributor

That makes sense. I will work on a PR for that tonight.

@alamb
Copy link
Contributor

alamb commented Mar 4, 2024

If the goal is to allow @carols10cents to write this (which I agree would be much more ergonomic):

fn schema_merge(batches: &[RecordBatch]) -> Result<Schema, ArrowError> {
    Schema::try_merge(batches.iter().map(|b| b.schema()))
}

Perhaps we can change try_merge from

pub fn try_merge(
    schemas: impl IntoIterator<Item = Schema>
) -> Result<Schema, ArrowError>

To take SchemaRef:

pub fn try_merge(
    schemas: impl IntoIterator<Item = SchemaRef>
) -> Result<Schema, ArrowError>

Or alternately make a new function Schema::try_merge_refs or something

I realize this would incur extra runtime overhead of additional Arc clones, but I think in the grand scheme of things it is pretty minimal compared to logic that Schema::try_merge is doing

@tustvold
Copy link
Contributor

tustvold commented Mar 4, 2024

If the goal is to allow @carols10cents to write this (which I agree would be much more ergonomic):

I think the broader goal is to enable borrow propagation to work correctly, try_merge is just an example of where this causes problems. In general returning owned references when not necessary is an anti-pattern, it's not only inefficient but breaks borrow propagation. FWIW we've made similar changes in the past, e.g. #2035 and #313

@alamb
Copy link
Contributor

alamb commented Mar 4, 2024

I think the broader goal is to enable borrow propagation to work correctly, try_merge is just an example of where this causes problems.

If that is the broader goal, I think we should track it under a different ticket given its potential for large breaking (albiet mechanical) API changes for users of arrow-rs.

The idea is good (avoid clone'ing Arcs) but I have't seen this ever show up in benchmarks

I will file a ticket to discuss

@tustvold
Copy link
Contributor

tustvold commented Mar 4, 2024

seen this ever show up in benchmarks

It is unlikely to because of inline and because synchonisation overheads are hard to observe, but IMO the justification is w.r.t borrow propogation, the performance is not a strong justification I agree.

FWIW I would suggest we just add a schema_ref method and everyone can then be happy 😄

@alamb
Copy link
Contributor

alamb commented Mar 4, 2024

FWIW I would suggest we just add a schema_ref method and everyone can then be happy 😄

This sounds good to me

@alamb
Copy link
Contributor

alamb commented Mar 4, 2024

I filed #5463 to track the larger discussion

@tustvold
Copy link
Contributor

tustvold commented Mar 16, 2024

Closed by #5474, SchemaBuilder::try_merge already accepts a &FieldRef so I don't believe further changes are necessary

@tustvold tustvold added the parquet Changes to the parquet crate label Apr 17, 2024
@tustvold
Copy link
Contributor

label_issue.py automatically added labels {'parquet'} from #5448

@tustvold tustvold added the arrow Changes to the arrow crate label Apr 17, 2024
@tustvold
Copy link
Contributor

label_issue.py automatically added labels {'arrow'} from #5448

@tustvold tustvold added the arrow-flight Changes to the arrow-flight crate label Apr 17, 2024
@tustvold
Copy link
Contributor

label_issue.py automatically added labels {'arrow-flight'} from #5448

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted parquet Changes to the parquet crate
Projects
None yet
Development

No branches or pull requests

4 participants