Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arrow-ord: lt and eq for nested list #5408

Draft
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

jayzhan211
Copy link
Contributor

Which issue does this PR close?

Ref #5407

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 19, 2024
@tustvold
Copy link
Contributor

I wonder if it might be more efficient to make use of DynComparator for this, especially for smaller list elements?

@jayzhan211
Copy link
Contributor Author

I wonder if it might be more efficient to make use of DynComparator for this, especially for smaller list elements?

I can apply DynComparator for non-nested part, for nested part, I did't find a good way to build something like compare_list

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
use arrow_schema::DataType::*;
if let (List(_), List(_)) = (l_t, r_t) {
// Process nested data types
match op {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to construct the DynComparator once for l.values() and r.values() and then use the offsets to drive the comparison?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure about using l.values and offsets for the comparison. Instead, I find looping through the index with .value(i) much clearer and straightforward for me.

let values = values.finish();
Ok(Some(values))
}
Op::Equal => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This duplication could be eliminated by extracting out a function that maps the Ordering result of DynComparator to the boolean result

@@ -198,12 +316,16 @@ fn compare_op(op: Op, lhs: &dyn Datum, rhs: &dyn Datum) -> Result<BooleanArray,
let r = r_v.map(|x| x.values().as_ref()).unwrap_or(r);
let r_t = r.data_type();

if l_t != r_t || l_t.is_nested() {
if l_t != r_t {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now allows hitting unreachable in the below code block

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I return an error in process_nested, so the nested type that is NYI will not go down there

@jayzhan211 jayzhan211 marked this pull request as draft February 20, 2024 12:10
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
let l = l.as_list::<i32>();
let r = r.as_list::<i32>();

Box::new(move |i, j| compare_list(l.value(i), r.value(j)))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got a lifetime error when moving this to fn that returns Dyncompartor. To avoid getting it too complex, I leave it here.

error: lifetime may not live long enough
  --> arrow-ord/src/ord.rs:36:9
   |
32 | fn compare_ist(l: &dyn Array, r: &dyn Array) -> DynComparator {
   |                                  - let's call the lifetime of this reference `'2`
...
36 |         Box::new(move |i, j| compare_list(l.value(i), r.value(j)))
   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ returning this value requires that `'2` must outlive `'static`
   |
help: to declare that the trait object captures data from argument `r`, you can add a lifetime parameter `'a` in the type alias
   |
30 | pub type DynComparator<'a> = Box<dyn Fn(usize, usize) -> Ordering + Send + Sync + 'a>;
   |                       ++++                                                      ++++

error: could not compile `arrow-ord` (lib test) due to 2 previous errors

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 marked this pull request as ready for review February 20, 2024 15:16
@tustvold
Copy link
Contributor

I will try to find some time over the next few days to see if i can't simplify this

@@ -702,4 +784,216 @@ mod tests {

neq(&col.slice(0, col.len() - 1), &col.slice(1, col.len() - 1)).unwrap();
}

Copy link
Contributor

@tustvold tustvold Feb 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to see some tests of

  • Scalar arguments
  • Nulls masking non-empty slices
  • DictionaryArray of ListArray (returning an error would be perfectly valid for this)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@tustvold
Copy link
Contributor

tustvold commented Feb 24, 2024

Ok, so I have had a play with this and I think I would recommend the following course of action:

  • Add support to DynComparator for List / LargeList / etc...
  • Potentially add support to DynComparator for Structs, etc...
  • Add a fallback mechanism to compare_op that makes use of DynComparator for types that lack a specialized implementation

I would recommend doing these as separate PRs.

This seems like the only way to support arbitrarily nested lists, including lists of lists.

@tustvold tustvold marked this pull request as draft February 24, 2024 02:17
@tustvold tustvold mentioned this pull request Feb 24, 2024
@tustvold
Copy link
Contributor

tustvold commented Feb 24, 2024

Actually I remember now why DynComparator lacks support for nested types, the ordering of nulls is not well defined. Tricky... 🤔

Edit: Filed #5426 with how I think we should proceed, feedback welcome

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211
Copy link
Contributor Author

I find a way to build dyncompare for list, but nulls handling is not considered yet.

@tustvold
Copy link
Contributor

tustvold commented Feb 25, 2024

Yes #5426 will be necessary to support nulls or lists containing lists or structs. I think we should probably do #5426 first

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants