feat(ord): Support equality of StructArray #5217

my-vegetable-has-exploded · 2023-12-17T10:47:48Z

Which issue does this PR close?

Closes #5199

Rationale for this change

What changes are included in this PR?

* Refactor the closure `values()` into the function `compare_op_struct_values()`
* Compare struct arrays by recursively checking each field

Are there any user-facing changes?

arrow-ord/src/cmp.rs

my-vegetable-has-exploded · 2023-12-17T10:58:43Z

There are also some points that I am not sure about:

* A `StructArray` has its own validity bitmap that is independent of its child arrays' validity bitmaps. So I don't handle the null buffer for each field.
* I couldn't find a way to construct a struct scalar, so I haven't tested the scalar case yet.

One more question: how can I make this test code shorter?

arrow-ord/src/cmp.rs

alamb

Thank you @my-vegetable-has-exploded -- this is looking pretty close to me

Can you please add tests for

1. `distinct` / `not_distinct`
2. A negative test that some operation like `lt` or `lt_eq` returns an error (not a panic) for struct arrays?
3. A negative test that a struct array like `{a: int, b:int}` doesn't return `true` when compared to a struct array with a prefix like `{a:int}`

Also @tustvold do you have any suggestions for what benchmarks to run this on?

arrow-ord/src/cmp.rs

tustvold · 2023-12-18T21:48:44Z

> So I don't handle the null buffer for each field.

I think the output should be the union of all the null buffers.

I'll try to review this in the next couple of days

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

my-vegetable-has-exploded · 2023-12-19T03:12:25Z

> I think the output should be the union of all the null buffers.

I think the null buffer of a child array is only valid for that child array itself.

Take the example layout in the documentation (https://arrow.apache.org/docs/format/Columnar.html#struct-layout): if we use the union of all the null buffers, the second slot also becomes null, which is a little different from my understanding.

[{'joe', 1}, {null, 2}, null, {'mark', 4}]

* Length: 4, Null count: 1
* Validity bitmap buffer:

  | Byte 0 (validity bitmap) | Bytes 1-63            |
  |--------------------------|-----------------------|
  | 00001011                 | 0 (padding)           |

* Children arrays:
  * field-0 array (`VarBinary`):
    * Length: 4, Null count: 2
    * Validity bitmap buffer:

      | Byte 0 (validity bitmap) | Bytes 1-63            |
      |--------------------------|-----------------------|
      | 00001001                 | 0 (padding)           |

    * Offsets buffer:

      | Bytes 0-19     | Bytes 20-63           |
      |----------------|-----------------------|
      | 0, 3, 3, 3, 7  | unspecified (padding) |

    * Value buffer:

      | Bytes 0-6      | Bytes 7-63            |
      |----------------|-----------------------|
      | joemark        | unspecified (padding) |

  * field-1 array (int32 array):
    * Length: 4, Null count: 1
    * Validity bitmap buffer:

      | Byte 0 (validity bitmap) | Bytes 1-63            |
      |--------------------------|-----------------------|
      | 00001011                 | 0 (padding)           |

    * Value Buffer:

      | Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-63           |
      |-------------|-------------|-------------|-------------|-----------------------|
      | 1           | 2           | unspecified | 4           | unspecified (padding) |
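To make the difference concrete, here is a minimal sketch. It assumes the bitmaps above are read least-significant-bit first, as in the Arrow columnar format, and uses a hypothetical helper `valid_slots` that is not part of arrow-rs. It compares the struct's own validity bitmap with the intersection of all three validity bitmaps, which is equivalent to taking the union of the nulls.

```rust
/// Hypothetical helper: expand the low `len` bits of a validity byte
/// into per-slot flags (bit i, least significant first, describes slot i).
fn valid_slots(bitmap: u8, len: usize) -> Vec<bool> {
    (0..len).map(|i| (bitmap >> i) & 1 == 1).collect()
}

fn main() {
    let struct_validity: u8 = 0b0000_1011; // struct-level bitmap
    let field0_validity: u8 = 0b0000_1001; // field-0 (VarBinary) bitmap
    let field1_validity: u8 = 0b0000_1011; // field-1 (int32) bitmap

    // Union of the null buffers == intersection (AND) of the validity bitmaps.
    let combined = struct_validity & field0_validity & field1_validity;

    println!("struct only: {:?}", valid_slots(struct_validity, 4));
    println!("all unioned: {:?}", valid_slots(combined, 4));
}
```

With only the struct-level bitmap, slot 1 (`{null, 2}`) is valid; once the child bitmaps are unioned in, slot 1 becomes null as well.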

thanks, @tustvold @alamb

tustvold · 2023-12-19T06:00:15Z

Correct, but the semantics of these kernels is that any comparison against a null results in a null output for that position.

my-vegetable-has-exploded · 2023-12-19T11:03:13Z

> Can you please add tests for
>
> 1. `distinct` / `not_distinct`
> 2. A negative test that some operation like `lt` or `lt_eq` returns an error (not a panic) for struct arrays?
> 3. A negative test that a struct array like `{a: int, b:int}` doesn't return `true` when compared to a struct array with a prefix like `{a:int}`

Sure.

> Correct, but the semantics of these kernels is that any comparison against a null results in a null output for that position.

I was wondering if it would be better to use Op::NotDistinct to check each field? More precisely, we need to go through the process in

arrow-rs/arrow-ord/src/cmp.rs

Lines 226 to 284 in 9e060dc

    
    (Some(l), true, Some(r), true) | (Some(l), false, Some(r), false) => {
        // Either both sides are scalar or neither side is scalar
        match op {
            Op::Distinct => {
                let values = values();
                let l = l.inner().bit_chunks().iter_padded();
                let r = r.inner().bit_chunks().iter_padded();
                let ne = values.bit_chunks().iter_padded();
                let c = |((l, r), n)| ((l ^ r) | (l & r & n));
                let buffer = l.zip(r).zip(ne).map(c).collect();
                BooleanBuffer::new(buffer, 0, len).into()
            }
            Op::NotDistinct => {
                let values = values();
                let l = l.inner().bit_chunks().iter_padded();
                let r = r.inner().bit_chunks().iter_padded();
                let e = values.bit_chunks().iter_padded();
                let c = |((l, r), e)| u64::not(l | r) | (l & r & e);
                let buffer = l.zip(r).zip(e).map(c).collect();
                BooleanBuffer::new(buffer, 0, len).into()
            }
            _ => BooleanArray::new(values(), NullBuffer::union(Some(&l), Some(&r))),
        }
    }
    (Some(_), true, Some(a), false) | (Some(a), false, Some(_), true) => {
        // Scalar is null, other side is non-scalar and nullable
        match op {
            Op::Distinct => a.into_inner().into(),
            Op::NotDistinct => a.into_inner().not().into(),
            _ => BooleanArray::new_null(len),
        }
    }
    (Some(nulls), is_scalar, None, _) | (None, _, Some(nulls), is_scalar) => {
        // Only one side is nullable
        match is_scalar {
            true => match op {
                // Scalar is null, other side is not nullable
                Op::Distinct => BooleanBuffer::new_set(len).into(),
                Op::NotDistinct => BooleanBuffer::new_unset(len).into(),
                _ => BooleanArray::new_null(len),
            },
            false => match op {
                Op::Distinct => {
                    let values = values();
                    let l = nulls.inner().bit_chunks().iter_padded();
                    let ne = values.bit_chunks().iter_padded();
                    let c = |(l, n)| u64::not(l) | n;
                    let buffer = l.zip(ne).map(c).collect();
                    BooleanBuffer::new(buffer, 0, len).into()
                }
                Op::NotDistinct => (nulls.inner() & &values()).into(),
                _ => BooleanArray::new(values(), Some(nulls)),
            },
        }
    }
    // Neither side is nullable
    (None, _, None, _) => BooleanArray::new(values(), None),

after getting BooleanBuffer.
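As a sanity check on the bit tricks in the quoted code, the following sketch applies the `Distinct` and `NotDistinct` closures to a single 64-bit chunk. It assumes that `l` and `r` are validity chunks (a set bit means valid) and that `ne` / `e` are chunks of the "not equal" / "equal" values comparison. The function names are hypothetical and only mirror the closures named `c` above.

```rust
/// Distinct: true where exactly one side is null, or where both sides
/// are valid and the values differ.
fn distinct_chunk(l: u64, r: u64, ne: u64) -> u64 {
    (l ^ r) | (l & r & ne)
}

/// NotDistinct: true where both sides are null, or where both sides
/// are valid and the values are equal.
fn not_distinct_chunk(l: u64, r: u64, e: u64) -> u64 {
    !(l | r) | (l & r & e)
}

fn main() {
    // Four slots, least significant bit first:
    //   slot 0: both valid, values equal
    //   slot 1: both valid, values differ
    //   slot 2: left null, right valid
    //   slot 3: both null
    let l = 0b0011u64;
    let r = 0b0111u64;
    let e = 0b0001u64; // values equal (only meaningful where both are valid)
    let ne = !e;

    let mask = 0b1111u64; // only the low four bits are real slots
    let d = distinct_chunk(l, r, ne) & mask;
    let nd = not_distinct_chunk(l, r, e) & mask;
    println!("distinct     = {:04b}", d);
    println!("not_distinct = {:04b}", nd);
    // Within the real slots the two results are exact complements.
    assert_eq!(d, !nd & mask);
}
```

Neither result needs a null mask of its own, which is why these branches convert a `BooleanBuffer` directly into the output array.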

tustvold · 2023-12-19T12:26:50Z

> I was wondering if it would be better to use Op::NotDistinct to check each field?

That would be a different kernel then. We definitely could/should support distinct/not_distinct for StructArray also; the difference from standard equality is how nulls are handled. Distinct follows the intuitive notion of equality, whereas the equality kernels follow the SQL formulation of equality and the somewhat perverse null semantics it has 😅

https://learn.microsoft.com/en-us/sql/t-sql/queries/is-distinct-from-transact-sql?view=sql-server-ver16#remarks
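The contrast can be sketched for a single pair of nullable values. This is an illustration using plain `Option<i32>`, not the arrow-rs kernels; `sql_eq` and `is_distinct` are hypothetical names.

```rust
/// SQL-style equality: any comparison against NULL yields NULL (`None`).
fn sql_eq(l: Option<i32>, r: Option<i32>) -> Option<bool> {
    match (l, r) {
        (Some(a), Some(b)) => Some(a == b),
        _ => None,
    }
}

/// IS DISTINCT FROM: never NULL, and two NULLs are not distinct.
fn is_distinct(l: Option<i32>, r: Option<i32>) -> bool {
    l != r
}

fn main() {
    let cases = [
        (Some(1), Some(1)),
        (Some(1), Some(2)),
        (None, Some(2)),
        (None, None),
    ];
    for (l, r) in cases {
        println!(
            "{:?} vs {:?}: eq = {:?}, distinct = {}",
            l,
            r,
            sql_eq(l, r),
            is_distinct(l, r)
        );
    }
}
```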

my-vegetable-has-exploded · 2023-12-20T04:40:53Z

> I was wondering if it would be better to use Op::NotDistinct to check each field?

> That would be a different kernel then. We definitely could/should support distinct/not_distinct for StructArray also; the difference from standard equality is how nulls are handled. Distinct follows the intuitive notion of equality, whereas the equality kernels follow the SQL formulation of equality and the somewhat perverse null semantics it has 😅
>
> https://learn.microsoft.com/en-us/sql/t-sql/queries/is-distinct-from-transact-sql?view=sql-server-ver16#remarks

I think I have caught your drift this time. Because a comparison between null and any value is unknown, `{null, 2}` is not comparable either. Thanks, I will change my code based on this suggestion.

tustvold

Had a review, I like where this is headed but I don't think the null mask handling is quite right yet.

FWIW I'm not sure that separating out the null mask and values comparison makes sense, instead I would expect the logic to just recurse across the fields and union the null masks of the results (if any), with a little bit of extra logic to handle any null mask in the struct array proper.
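A rough sketch of that shape, using a toy column model rather than the arrow-rs types: `Column` and this `eq` function are assumptions made for illustration, and the struct-level validity is folded in as the "little bit of extra logic". A `None` in the output stands for a null result.

```rust
/// Toy column model (not the arrow-rs types): either a nullable Int32
/// column or a struct with its own validity mask and child columns.
enum Column {
    Int(Vec<Option<i32>>),
    Struct { valid: Vec<bool>, fields: Vec<Column> },
}

/// SQL-style `eq`. For structs, recurse into each pair of fields,
/// AND the values, and propagate any null.
fn eq(l: &Column, r: &Column) -> Result<Vec<Option<bool>>, String> {
    match (l, r) {
        (Column::Int(a), Column::Int(b)) if a.len() == b.len() => Ok(a
            .iter()
            .zip(b)
            .map(|(x, y)| match (x, y) {
                (Some(x), Some(y)) => Some(x == y),
                _ => None, // comparison against null is null
            })
            .collect()),
        (
            Column::Struct { valid: lv, fields: lf },
            Column::Struct { valid: rv, fields: rf },
        ) if lf.len() == rf.len() && lv.len() == rv.len() => {
            // Start from the struct-level validity of both sides.
            let mut out: Vec<Option<bool>> = lv
                .iter()
                .zip(rv)
                .map(|(a, b)| if *a && *b { Some(true) } else { None })
                .collect();
            for (cl, cr) in lf.iter().zip(rf) {
                let child = eq(cl, cr)?;
                for (o, c) in out.iter_mut().zip(child) {
                    // Union of null masks: a null in any field nulls the slot.
                    *o = match (*o, c) {
                        (Some(a), Some(b)) => Some(a && b),
                        _ => None,
                    };
                }
            }
            Ok(out)
        }
        _ => Err("Invalid comparison operation: mismatched types".to_string()),
    }
}

fn main() {
    // left:  [{a: 0, b: 0}, {a: NULL, b: 1}, {a: 2, b: 20}, NULL]
    // right: [{a: 0, b: 0}, {a: 72,   b: 1}, {a: 2, b: 2},  {a: 3, b: 3}]
    let left = Column::Struct {
        valid: vec![true, true, true, false],
        fields: vec![
            Column::Int(vec![Some(0), None, Some(2), Some(3)]),
            Column::Int(vec![Some(0), Some(1), Some(20), Some(3)]),
        ],
    };
    let right = Column::Struct {
        valid: vec![true; 4],
        fields: vec![
            Column::Int(vec![Some(0), Some(72), Some(2), Some(3)]),
            Column::Int(vec![Some(0), Some(1), Some(2), Some(3)]),
        ],
    };
    println!("{:?}", eq(&left, &right));
}
```

In this example the second slot is null because field `a` is null on the left, regardless of the value stored on the right, and the last slot is null because the left struct itself is null.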

tustvold · 2023-12-27T10:58:08Z

arrow-ord/src/cmp.rs

+    if l_t.is_nested() {
+        if !l_t.equals_datatype(r_t) {
+            return Err(ArrowError::InvalidArgumentError(format!(
+                "Invalid comparison operation: {l_t} {op} {r_t}"
+            )));
+        }
+        match (l_t, op) {
+            (Struct(_), Op::Equal | Op::NotEqual | Op::Distinct | Op::NotDistinct) => {}
+            _ => {
+                return Err(ArrowError::InvalidArgumentError(format!(
+                    "Invalid comparison operation: {l_t} {op} {r_t}"
+                )));
+            }
+        }
+    } else if r_t != l_t {
        return Err(ArrowError::InvalidArgumentError(format!(
            "Invalid comparison operation: {l_t} {op} {r_t}"
        )));
    }


Suggested change

-    if l_t.is_nested() {
-        if !l_t.equals_datatype(r_t) {
-            return Err(ArrowError::InvalidArgumentError(format!(
-                "Invalid comparison operation: {l_t} {op} {r_t}"
-            )));
-        }
-        match (l_t, op) {
-            (Struct(_), Op::Equal | Op::NotEqual | Op::Distinct | Op::NotDistinct) => {}
-            _ => {
-                return Err(ArrowError::InvalidArgumentError(format!(
-                    "Invalid comparison operation: {l_t} {op} {r_t}"
-                )));
-            }
-        }
-    } else if r_t != l_t {
-        return Err(ArrowError::InvalidArgumentError(format!(
-            "Invalid comparison operation: {l_t} {op} {r_t}"
-        )));
-    }
+    if !l_t.equals_datatype(r_t) {
+        return Err(ArrowError::InvalidArgumentError(format!(
+            "Invalid comparison operation: {l_t} {op} {r_t}"
+        )));
+    }

tustvold · 2023-12-27T10:59:32Z

arrow-ord/src/cmp.rs

+    let l_t = l.data_type();
+    let r_t = r.data_type();
+    let l_nulls = l.logical_nulls().filter(|n| n.null_count() > 0);
+    let r_nulls = r.logical_nulls().filter(|n| n.null_count() > 0);
+    // for [not]Distinct, the result is never null
+    match op {
+        Op::Distinct | Op::NotDistinct => {
+            return Ok(None);
+        }
+        _ => {}
+    }


Suggested change

-    let l_t = l.data_type();
-    let r_t = r.data_type();
-    let l_nulls = l.logical_nulls().filter(|n| n.null_count() > 0);
-    let r_nulls = r.logical_nulls().filter(|n| n.null_count() > 0);
-    // for [not]Distinct, the result is never null
-    match op {
-        Op::Distinct | Op::NotDistinct => {
-            return Ok(None);
-        }
-        _ => {}
-    }
+    if matches!(op, Op::Distinct | Op::NotDistinct) {
+        // for [not]Distinct, the result is never null
+        return Ok(None)
+    }
+    let l_t = l.data_type();
+    let r_t = r.data_type();
+    let l_nulls = l.logical_nulls().filter(|n| n.null_count() > 0);
+    let r_nulls = r.logical_nulls().filter(|n| n.null_count() > 0);

tustvold · 2023-12-27T11:02:07Z

arrow-ord/src/cmp.rs

+    // when one of field is equal, the result is false for not equal
+    // so we use neg to reverse the result of equal when handle not equal


Why not just pass the operator into compare_op_values?

tustvold · 2023-12-27T11:02:50Z

arrow-ord/src/cmp.rs

+        .columns()
+        .iter()
+        .zip(r.columns().iter())
+        .map(|(col_l, col_r)| compare_op_values(Op::Equal, col_l, l_s, col_r, r_s, len))


I don't think this will correctly handle the null masks for a Distinct?

tustvold · 2023-12-27T11:08:43Z

arrow-ord/src/cmp.rs

+            Some(vec![true, false, true, true].into()),
+        ));
+        let right_a = Arc::new(Int32Array::new(
+            vec![0, 1, 2, 3].into(),


Suggested change

-            vec![0, 1, 2, 3].into(),
+            vec![0, 72, 2, 3].into(),

This helps verify the null mask comparison is correct, and not relying on the values comparison

tustvold · 2023-12-27T11:10:40Z

arrow-ord/src/cmp.rs

+            ],
+            Buffer::from([0b0111]),
+        ));
+        let right_struct = StructArray::from((


Suggested change

-        let right_struct = StructArray::from((
+        // right [{a: 0, b: 0}, {a: NULL, b: 1}, {a: 2, b: 2}, {a: 3, b: 3} ]
+        let right_struct = StructArray::from((

tustvold · 2023-12-27T11:10:50Z

arrow-ord/src/cmp.rs

+        ));
+        let field_a = Arc::new(Field::new("a", DataType::Int32, true));
+        let field_b = Arc::new(Field::new("b", DataType::Int32, true));
+        let left_struct = StructArray::from((


Suggested change

-        let left_struct = StructArray::from((
+        // [{a: 0, b: 0}, {a: NULL, b: 1}, {a: 2, b: 20}, {a: 3, b: 3}]
+        let left_struct = StructArray::from((

my-vegetable-has-exploded · 2023-12-27T13:46:04Z

> Had a review, I like where this is headed but I don't think the null mask handling is quite right yet.
>
> FWIW I'm not sure that separating out the null mask and values comparison makes sense, instead I would expect the logic to just recurse across the fields and union the null masks of the results (if any), with a little bit of extra logic to handle any null mask in the struct array proper.

I wanted to do that at first. The main reason I didn't is that I found it hard to reuse this code:

arrow-rs/arrow-ord/src/cmp.rs

Lines 225 to 285 in 3cd6da0

    
    Ok(match (l_nulls, l_s, r_nulls, r_s) {
        (Some(l), true, Some(r), true) | (Some(l), false, Some(r), false) => {
            // Either both sides are scalar or neither side is scalar
            match op {
                Op::Distinct => {
                    let values = values();
                    let l = l.inner().bit_chunks().iter_padded();
                    let r = r.inner().bit_chunks().iter_padded();
                    let ne = values.bit_chunks().iter_padded();
                    let c = |((l, r), n)| ((l ^ r) | (l & r & n));
                    let buffer = l.zip(r).zip(ne).map(c).collect();
                    BooleanBuffer::new(buffer, 0, len).into()
                }
                Op::NotDistinct => {
                    let values = values();
                    let l = l.inner().bit_chunks().iter_padded();
                    let r = r.inner().bit_chunks().iter_padded();
                    let e = values.bit_chunks().iter_padded();
                    let c = |((l, r), e)| u64::not(l | r) | (l & r & e);
                    let buffer = l.zip(r).zip(e).map(c).collect();
                    BooleanBuffer::new(buffer, 0, len).into()
                }
                _ => BooleanArray::new(values(), NullBuffer::union(Some(&l), Some(&r))),
            }
        }
        (Some(_), true, Some(a), false) | (Some(a), false, Some(_), true) => {
            // Scalar is null, other side is non-scalar and nullable
            match op {
                Op::Distinct => a.into_inner().into(),
                Op::NotDistinct => a.into_inner().not().into(),
                _ => BooleanArray::new_null(len),
            }
        }
        (Some(nulls), is_scalar, None, _) | (None, _, Some(nulls), is_scalar) => {
            // Only one side is nullable
            match is_scalar {
                true => match op {
                    // Scalar is null, other side is not nullable
                    Op::Distinct => BooleanBuffer::new_set(len).into(),
                    Op::NotDistinct => BooleanBuffer::new_unset(len).into(),
                    _ => BooleanArray::new_null(len),
                },
                false => match op {
                    Op::Distinct => {
                        let values = values();
                        let l = nulls.inner().bit_chunks().iter_padded();
                        let ne = values.bit_chunks().iter_padded();
                        let c = |(l, n)| u64::not(l) | n;
                        let buffer = l.zip(ne).map(c).collect();
                        BooleanBuffer::new(buffer, 0, len).into()
                    }
                    Op::NotDistinct => (nulls.inner() & &values()).into(),
                    _ => BooleanArray::new(values(), Some(nulls)),
                },
            }
        }
        // Neither side is nullable
        (None, _, None, _) => BooleanArray::new(values(), None),
    })

If there is a better way to organize this code, I'd be happy to give it a try! Thanks a lot!

tustvold · 2023-12-27T13:57:20Z

It should be possible to just call compare_op recursively

tustvold · 2023-12-27T15:47:10Z

I'll have a play later today/tomorrow and see if I can't simplify this a bit

my-vegetable-has-exploded · 2023-12-27T15:56:00Z

> I'll have a play later today/tomorrow and see if I can't simplify this a bit

Thanks a lot, and sorry for adding to your workload.

Jefffrey · 2024-04-26T13:39:02Z

Hey @tustvold & @my-vegetable-has-exploded, do we know the status of this PR now? It's been open for a while and it seems there has been another PR for the same issue in the meantime, #5423, so I'm wondering if efforts should be focused on a single PR? Otherwise we can keep both open but mark this as draft, as there hasn't been movement for a bit?

tustvold · 2024-04-26T13:46:03Z

Sorry, this is partly on me; I'm somewhat struggling to keep up with all the various things going on. I think my preference is towards something along the lines of #5672, which would allow us to handle StructArray more comprehensively in the comparison kernels, instead of having non-trivial logic just for the case of equality.

I think let's mark this as a draft and I will try to find some time next week to sort something out in this space

Support equality of StructArray

d9783dc

github-actions bot added the arrow Changes to the arrow crate label Dec 17, 2023

my-vegetable-has-exploded changed the title ~~feat: Support equality of StructArray~~ feat(ord): Support equality of StructArray Dec 17, 2023

This was referenced Dec 18, 2023

DataFusion weekly project plan (Andrew Lamb) - Dec 11, 2023 apache/datafusion#8490

Closed

DataFusion weekly project plan (Andrew Lamb) - Dec 18, 2023 apache/datafusion#8577

Closed

jayzhan211 reviewed Dec 18, 2023

use as_struct & collect.

73f2a56

alamb reviewed Dec 18, 2023

rm useless to_vec()

ddcd6f4

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

my-vegetable-has-exploded mentioned this pull request Dec 20, 2023

Support multi column IN lists apache/datafusion#8590

Closed

my-vegetable-has-exploded added 2 commits December 24, 2023 16:24

union nullsbuffer for struct & add tests

b925319

fix dict.

1b19ec5

tustvold reviewed Dec 27, 2023

fix distinct for struct.

4f8522d

my-vegetable-has-exploded mentioned this pull request Mar 15, 2024

eq for struct #5423

Draft

tustvold marked this pull request as draft April 26, 2024 13:46
