Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return reference from UnionArray::child #2035

Closed
kondziu opened this issue Jul 9, 2022 · 1 comment · Fixed by #2099
Closed

Return reference from UnionArray::child #2035

kondziu opened this issue Jul 9, 2022 · 1 comment · Fixed by #2099
Labels
good first issue Good for newcomers help wanted question Further information is requested

Comments

@kondziu
Copy link

kondziu commented Jul 9, 2022

Which part is this question about
The question is about the return type of UnionArray::child, how it compares to StructArray::column, and why they are different.

Describe your question
StructArray::column returns &ArrayRef, a reference to Arc<dyn Array> owned by the StructArray object. However, its spiritual equivalent in UnionArray, the method UnionArray::child returns an owned Arc<dyn Array>.

The implementations of both are relatively similar: StructArray::column and UnionArray::child essentially both look up an index in an array. The only difference that I see is that StructArray::column returns the reference to what it finds, whereas UnionArray::child calls clone on that reference to return an owned object.

// StructArray
pub fn column(&self, pos: usize) -> &ArrayRef {
    &self.boxed_fields[pos]
}
// Union
pub fn child(&self, type_id: i8) -> ArrayRef {
    assert!(0 <= type_id);
    assert!((type_id as usize) < self.boxed_fields.len());
    self.boxed_fields[type_id as usize].clone()
}

It makes sense to me to always return a reference. Returning a reference gives an additional amount of flexibility to the user, who can call clone on the reference to get an owned Arc, or who can retain the reference to maintain a connection between the returned column and the union/array from which it came. The contrary also has the downside of always engaging with the reference counter and paying its overhead, regardless of whether its necessary or not.

I am curious why this discrepancy exists. I am guessing there is a specific technical reason that I don't understand, in which case I'd love to find out what it is. On the other hand, I am surreptitiously hoping this is an oversight and I might possibly agitate to harmonize the return types :)

Additional context
I am writing a function that traverses a &'a RecordBatch following some user-defined path and returns a column it finds at the end. The column is meant to be downcast into a specific type (such as UInt32Array).

Since StructArray::column provides a reference with a lifetime of &'a, I can downcast it with .as_any().downcast() and receive a reference with a lifetime 'a. However, if the column is found inside a UnionArray, the method UnionArray::child produces an owned object. I can only downcast this object to a reference type, which will then only live as long as the Arc from which it came, which means only until the end of the function

A simplified example:

fn find_u32_column_of_struct(batch: &'a RecordBatch, path: &[&str]) -> &'a UInt32Array {
    // Find array
    let array: &'a StructArray = find_path_in_batch_somehow(batch, path[0..path.len() - 1])
  
    // Find column
    let schema = array.schema();
    let fields = array.fields().iter();
    let columns = array.columns().iter();
    let column: &'a Arc<dyn Array> = fields
        .zip(columns)
        .filter(|(field, _)| field.name() == path[path.len() - 1])
        .map(|(_, column)| column)
        .exactly_one()
        .unwrap();
  
    // Downcast to UInt32Array    
    column.as_ref()
        .as_any()
        .downcast_ref()
        .unwrap()
}
fn find_u32_child_of_union(batch: &'a RecordBatch, path: &[&str]) -> &'a UInt32Array {
    // Find array
    let union: &'a UnionArray = find_path_in_batch_somehow(batch, path[0..path.len() - 1])
  
    // Find child
    let index = union.type_names()
        .into_iter()
        .enumerate()
        .filter(|(_index, type_name)| type_name == &path[path.len() - 1])
        .map(|(index, _)| index as i8)
        .exactly_one()
        .unwrap();
    let child: Arc<dyn Array> = union.child(index);
  
    // Downcast to UInt32Array    
    child.as_ref()
       .as_any()
       .downcast_ref()
       .unwrap()
}
@kondziu kondziu added the question Further information is requested label Jul 9, 2022
@tustvold
Copy link
Contributor

tustvold commented Jul 16, 2022

I would support changing to always returning a reference and would happily approve a PR that did so. Historically a number of methods have returned owned references as a quirk of being based on equivalent methods in C++, where returning references is potentially error prone. In Rust I agree this is a code smell we should fix.

See #313 for prior art in this space

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers help wanted question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants