Unnecessary ownership makes it harder to use RecordBatch::schema and Schema::try_merge than it needs to be
#5342
Comments
No objections to adding a schema_ref method to RecordBatch or even changing schema to return a reference. W.r.t. merging, perhaps we could just add an appropriate method to SchemaBuilder?
I recently started working on changing I will put up a PR for this soon. If these changes are too significant, I would be happy to make another PR that creates a |
I think it should probably return a |
That makes sense. I will work on a PR for that tonight. |
If the goal is to allow @carols10cents to write this (which I agree would be much more ergonomic):

    fn schema_merge(batches: &[RecordBatch]) -> Result<Schema, ArrowError> {
        Schema::try_merge(batches.iter().map(|b| b.schema()))
    }

Perhaps we can change try_merge from

    pub fn try_merge(
        schemas: impl IntoIterator<Item = Schema>
    ) -> Result<Schema, ArrowError>

to take

    pub fn try_merge(
        schemas: impl IntoIterator<Item = SchemaRef>
    ) -> Result<Schema, ArrowError>

Or alternately make a new function I realize this would incur extra runtime overhead of additional
I think the broader goal is to enable borrow propagation to work correctly; try_merge is just an example of where this causes problems. In general, returning owned references when not necessary is an anti-pattern: it is not only inefficient but also breaks borrow propagation. FWIW we've made similar changes in the past, e.g. #2035 and #313
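To illustrate what "borrow propagation" means here, below is a minimal std-only sketch. The `Schema` and `Batch` types are hypothetical stand-ins, not the real arrow-rs types, and `first_field_name` is an invented helper. The sketch assumes only that a batch stores its schema behind an `Arc`, as discussed in this thread.

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's Schema (assumption: only field names matter here).
#[derive(Debug)]
struct Schema {
    field_names: Vec<String>,
}

// Hypothetical stand-in for RecordBatch: it holds its schema behind an Arc.
struct Batch {
    schema: Arc<Schema>,
}

impl Batch {
    // Owned return: every call clones the Arc, which bumps the reference count.
    fn schema(&self) -> Arc<Schema> {
        Arc::clone(&self.schema)
    }

    // Borrowed return: the result's lifetime is tied to `&self`.
    fn schema_ref(&self) -> &Arc<Schema> {
        &self.schema
    }
}

// Compiles: the returned &str borrows from `batch` through `schema_ref()`,
// so the borrow "propagates" from the argument to the return value.
fn first_field_name(batch: &Batch) -> Option<&str> {
    batch.schema_ref().field_names.first().map(|s| s.as_str())
}

// Would not compile if uncommented: `schema()` returns a temporary Arc that is
// dropped at the end of the function, so nothing borrowed from it can be returned.
// fn first_field_name_owned(batch: &Batch) -> Option<&str> {
//     batch.schema().field_names.first().map(|s| s.as_str())
// }
```

The same limitation is what makes `b.schema().as_ref()` unusable inside a `map` closure: the `&Schema` would point into an `Arc` that is dropped when the closure returns.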
If that is the broader goal, I think we should track it under a different ticket given its potential for large breaking (albeit mechanical) API changes for users of arrow-rs. The idea is good (avoid cloning Arcs), but I haven't seen this ever show up in benchmarks. I will file a ticket to discuss.
It is unlikely to because of inlining and because synchronisation overheads are hard to observe, but IMO the justification is w.r.t. borrow propagation; performance is not a strong justification, I agree. FWIW I would suggest we just add a
This sounds good to me
I filed #5463 to track the larger discussion
Closed by #5474,
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I have a bunch of record batches, and I'd like to unify their schemas. I would like to create one new Schema instance, which I will have ownership of, of course, but it seems like the schema merge should only need to borrow the record batches' schemas.
Describe the solution you'd like
I wish I could write:
but that doesn't compile because batch.schema() returns an owned Arc<Schema> and try_merge expects impl IntoIterator<Item = Schema>:
It seems like Schema::try_merge could be changed to only need references to the schemas (or a new function if you didn't want to break backwards compatibility):
buuuuuut then you'd have this problem if you try to use as_ref to get &Schema from the Arc<Schema>s:
so it seems like if there's a Schema::try_merge_from_borrowed_schemas, then it'd also be nice to have RecordBatch::borrow_schema that returns &Schema (not to be confused with SchemaRef, of course ;))
What do you think?
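As a rough sketch of the shape being asked for, the following std-only example pairs a borrowing accessor with a merge function that takes `&Schema` items. Everything here is a hypothetical stand-in: `Schema`, `Batch`, and `batch_with_fields` are toy types and helpers, and the merge rule (identical type names required for a shared field name) is a simplifying assumption, not arrow-rs's actual merge semantics. The names `try_merge_from_borrowed_schemas` and `borrow_schema` are taken from the proposal above.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

// Toy schema: field name -> type name (assumption, not the arrow-rs layout).
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: BTreeMap<String, String>,
}

// Toy record batch that owns its schema behind an Arc.
struct Batch {
    schema: Arc<Schema>,
}

impl Batch {
    // Borrowing accessor, as proposed in the issue: no Arc clone involved.
    fn borrow_schema(&self) -> &Schema {
        self.schema.as_ref()
    }
}

// Merge that only needs to borrow its inputs and returns one new owned Schema.
// Simplified rule: a field seen twice must have the same type name.
fn try_merge_from_borrowed_schemas<'a>(
    schemas: impl IntoIterator<Item = &'a Schema>,
) -> Result<Schema, String> {
    let mut merged: BTreeMap<String, String> = BTreeMap::new();
    for schema in schemas {
        for (name, ty) in &schema.fields {
            if let Some(existing) = merged.get(name) {
                if existing != ty {
                    return Err(format!(
                        "conflicting types for field {name}: {existing} vs {ty}"
                    ));
                }
            } else {
                merged.insert(name.clone(), ty.clone());
            }
        }
    }
    Ok(Schema { fields: merged })
}

// The ergonomic call site the issue wishes for.
fn schema_merge(batches: &[Batch]) -> Result<Schema, String> {
    try_merge_from_borrowed_schemas(batches.iter().map(|b| b.borrow_schema()))
}

// Hypothetical helper for building example batches from (name, type) pairs.
fn batch_with_fields(pairs: &[(&str, &str)]) -> Batch {
    let fields = pairs
        .iter()
        .map(|(name, ty)| (name.to_string(), ty.to_string()))
        .collect();
    Batch { schema: Arc::new(Schema { fields }) }
}
```

Because `borrow_schema` ties the returned `&Schema` to the batch itself, the closure in `schema_merge` can hand out references that outlive the closure call, which is exactly what the `as_ref` attempt on an owned `Arc<Schema>` cannot do.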