Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make slice work for nested types #389

Merged
merged 1 commit into from Jul 14, 2021
Merged

Conversation

nevi-me
Copy link
Contributor

@nevi-me nevi-me commented Jun 2, 2021

Which issue does this PR close?

Closes #554

Rationale for this change

ArrayData::slice() does not work for nested types, because only the ArrayData::buffers are updated with the new offset and length. This has caused a lot of issues in the past.

This blocks us from being able to implement RecordBatch::slice(), and has led to creating #381 to sidestep this issue.

What changes are included in this PR?

I propose a manual slice where we check if ArrayData::child_data is not empty, and then propagate the offset and length down to them. The only trick is with list and largelist, where we have to inspect the offset buffers to determine what the new offset and length should be.

Are there any user-facing changes?

No UFC

@nevi-me
Copy link
Contributor Author

nevi-me commented Jun 2, 2021

@jorgecarleitao I have the below failures, which are mostly related to MutableArrayData. I need your help when you have some time to spare.

I worked on this over the weekend out of curiousity, so I'll continue working on the other tests this week.

failures:
    array::array_union::tests::test_dense_mixed_with_nulls_and_offset
    array::array_union::tests::test_sparse_mixed_with_nulls_and_offset
    array::transform::tests::test_list_append
    array::transform::tests::test_list_nulls_append
    array::transform::tests::test_struct_offset
    compute::kernels::cast::tests::test_list_cast_offsets
    compute::kernels::concat::tests::test_concat_struct_array_slices
    json::writer::tests::write_dictionary
    json::writer::tests::write_nested_list

@jhorstmann
Copy link
Contributor

I think instead of switching of child_data.is_empty we rather need a special case for struct and union types. ListArray for example should work fine with the current logic. The arrays' offset applies to the offsets buffer which indexes into the child buffer, slicing just has to adjust that one offset without having to change the offset of child arrays. The same applies to Binary or StringArrays, which have the same layout as lists.

For struct (and probably union too) however we'd need to adjust the offset field of all the child arrays by calling slice on them.

@nevi-me
Copy link
Contributor Author

nevi-me commented Jun 5, 2021

@jhorstmann I'm still looking into this. Nested list slices currently don't work in the main branch; that's what made me come back to this issue. The primitive in <list<list<primitive>> only sees the offset and length of its parent list, which in master is not updated.

@jhorstmann
Copy link
Contributor

@nevi-me can you point me to a testcase and does this only involve arrow or the parquet roundtrip? I'd be happy to take a look. My thoughts are that a List<Utf8> is basically the same as List<List<u8>>, and the former one seems to work fine. So requiring special logic for nested lists but not nested strings seems a bit strange.

@jorgecarleitao
Copy link
Member

jorgecarleitao commented Jun 6, 2021

Isn't it possible to not offset the child data, and instead only offset the offsets, or do we have some logic around that assumes that offsets always start from 0?

In arrow2, I am doing

impl<O: Offset> ListArray<O> {
    pub fn slice(&self, offset: usize, length: usize) -> Self {
        let validity = self.validity.clone().map(|x| x.slice(offset, length));
        let offsets = self.offsets.clone().slice(offset, length + 1);
        Self {
            data_type: self.data_type.clone(),
            offsets,
            values: self.values.clone(),
            validity,
            offset: self.offset + offset,
        }
    }
}

note how the offsets are sliced and the values are cloned. The ListArray<O>::offset is only used for compatibility with the C data interface.

@jhorstmann
Copy link
Contributor

@jorgecarleitao I assume in your implementation the validity bitmap itself takes care of the offset? That seems like a clean solution. But we then have to ensure that no kernels access the underlying bytes/buffer of the bitmap directly and all access goes through a (chunked) iterator or some other abstraction.

@@ -85,12 +85,7 @@ impl From<ArrayData> for StructArray {
fn from(data: ArrayData) -> Self {
let mut boxed_fields = vec![];
for cd in data.child_data() {
let child_data = if data.offset() != 0 || data.len() != cd.len() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jhorstmann @jorgecarleitao the problem was here. We take data which has children and their own offsets, then slice that data correctly to populate boxed_fields, but then we still use the data with the child arrays that aren't offset.

The result's that the struct is correct when looking at it through boxed_fields (e.g. the print utility uses this), but once you need to do anything with data, it's as if the child values were never sliced.

The lightbulb turned on while i was trying to figure out why a test for #491 was failing.

There's a change that @bjchambers authored (9f96561) which addressed the child offsets, but my tests were failing because I needed to revert what they did.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to make sense -- my main interest was in getting the tests I added (concat on sliced struct arrays) to pass because I had run into problems with that. It seems like the tests are still here and passing, and it sounds like this may be more robust as well (addressing other issues I've hit when slicing StructArray such as the mentioned equality issues).

For my own curiosity -- If I understand correctly, slicing the child data doesn't actually copy the data, but just creates a reference to the same data with different offsets, so this shouldn't affect performance significantly, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to make sense -- my main interest was in getting the tests I added (concat on sliced struct arrays) to pass because I had run into problems with that. It seems like the tests are still here and passing, and it sounds like this may be more robust as well (addressing other issues I've hit when slicing StructArray such as the mentioned equality issues).

Yes, that's correct. I relied on your tests to verify that my changes make sense.

For my own curiosity -- If I understand correctly, slicing the child data doesn't actually copy the data, but just creates a reference to the same data with different offsets, so this shouldn't affect performance significantly, correct?

Yes. The cost should be negligible as we're cloning a bunch of arcs individually instead of all at once

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 11, 2021
@nevi-me nevi-me marked this pull request as ready for review July 11, 2021 10:32
@codecov-commenter
Copy link

Codecov Report

Merging #389 (f72c86e) into master (f1fb2b1) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #389   +/-   ##
=======================================
  Coverage   82.60%   82.60%           
=======================================
  Files         167      167           
  Lines       45984    45986    +2     
=======================================
+ Hits        37984    37986    +2     
  Misses       8000     8000           
Impacted Files Coverage Δ
arrow/src/array/array_struct.rs 89.11% <100.00%> (-0.14%) ⬇️
arrow/src/array/data.rs 73.60% <100.00%> (+0.93%) ⬆️
arrow/src/array/transform/structure.rs 80.00% <100.00%> (-3.34%) ⬇️
parquet/src/encodings/encoding.rs 94.85% <0.00%> (-0.20%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f1fb2b1...f72c86e. Read the comment docs.

@nevi-me
Copy link
Contributor Author

nevi-me commented Jul 11, 2021

@jorgecarleitao @jhorstmann @bjchambers this is ready for review, and partially fixes #514

revert changes made in ARROW-11394

See commit apache@9f96561

Only slice into structs
array.offset() + start + len,
)
})
mutable
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reverses 9f96561

@jhorstmann
Copy link
Contributor

This looks good and makes sense to me. StructArray (and probably union too) are a bit of a special case in how offsets are propagated. Since we currently do not push down offsets into the buffers, the offset of an array applies to the direct buffers of the array (including the validity buffer). That is still the case for StructArray, which has no other buffers, so Arrays are still working consistently in that regard.

A test for concatenation of sliced StructArrays would be nice to have.

@bjchambers
Copy link
Contributor

@jorgecarleitao @jhorstmann @bjchambers this is ready for review, and partially fixes #514

What parts of #514 doesn't this address? I have a branch with some tests for equality that I can try on-top of this MR, but it seems (possibly naively) that this should address the struct equality after slicing problem mentioned there.

@nevi-me
Copy link
Contributor Author

nevi-me commented Jul 13, 2021

A test for concatenation of sliced StructArrays would be nice to have.

@jhorstmann this already exists at https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/concat.rs#L367-L409, it was added by @bjchambers

@nevi-me
Copy link
Contributor Author

nevi-me commented Jul 13, 2021

@bjchambers I forgot to go back to #514 and add details there. I tried your test case on master and on this PR, and it passed on this PR, but with a slight change to the test case.

I made the below change, and the test passed.

        let a = make_struct(vec![
-           None,
+           Some((None, None)),
            Some((Some("joe"), Some(1))),
            Some((None, Some(2))),
            Some((None, None)),
            Some((Some("mark"), Some(4))),
            Some((Some("doe"), Some(5))),
        ]);

@nevi-me nevi-me requested a review from alamb July 14, 2021 14:47
@nevi-me
Copy link
Contributor Author

nevi-me commented Jul 14, 2021

@alamb I need this PR to fix a test failure on the map PR

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed #554 for this change

Looks good to me -- and it sounds like there is additional coverage coming in support of MapArray 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ArrayData::slice() does not work for nested types such as StructArray
6 participants