Support SQL-compliant NaN behavior on eq_dyn, neq_dyn, lt_dyn, lt_eq_dyn, gt_dyn, gt_eq_dyn #2570

viirya · 2022-08-24T05:42:14Z

Which issue does this PR close?

Closes #2569.

Rationale for this change

These comparison kernels behave differently from SQL semantics when handling NaN. Under IEEE 754, NaN is not equal to itself. Under SQL semantics, however, NaN is equal to NaN, and NaN is larger than any other numeric value.

Using the current comparison kernels in a SQL system therefore leads to different behavior and generates incorrect results.
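
To make the difference concrete, here is a minimal sketch in plain Rust, with no Arrow dependency. The helper names `ieee_eq`, `sql_eq`, and `sql_cmp` are hypothetical and are not part of this PR; they only illustrate the two semantics described above for scalar `f64` values.

```rust
use std::cmp::Ordering;

/// IEEE 754 equality: NaN is never equal to anything, including itself.
fn ieee_eq(a: f64, b: f64) -> bool {
    a == b
}

/// Hypothetical SQL-style equality: every NaN is equal to every other NaN.
fn sql_eq(a: f64, b: f64) -> bool {
    (a.is_nan() && b.is_nan()) || a == b
}

/// Hypothetical SQL-style ordering: NaN equals NaN and is greater than
/// every non-NaN value (including positive infinity).
fn sql_cmp(a: f64, b: f64) -> Ordering {
    match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater,
        (false, true) => Ordering::Less,
        // Neither side is NaN, so partial_cmp always returns Some.
        (false, false) => a.partial_cmp(&b).unwrap(),
    }
}
```

With these helpers, `ieee_eq(f64::NAN, f64::NAN)` is false while `sql_eq(f64::NAN, f64::NAN)` is true, and `sql_cmp(f64::NAN, f64::INFINITY)` is `Ordering::Greater`.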

What changes are included in this PR?

Are there any user-facing changes?

viirya · 2022-08-24T05:42:29Z

cc @sunchao

viirya · 2022-08-24T05:43:44Z

arrow/src/compute/kernels/comparison.rs

@@ -2386,7 +2408,30 @@ pub fn eq_dyn(left: &dyn Array, right: &dyn Array) -> Result<BooleanArray> {
        _ if matches!(right.data_type(), DataType::Dictionary(_, _)) => {
            typed_cmp_dict_non_dict!(right, left, |a, b| a == b, |a, b| a == b)


The dictionary/non-dictionary comparison should be updated too. I will add that in a follow-up PR, since this PR is already quite large.
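
As a rough picture of what a NaN-aware dictionary/non-dictionary comparison involves, the sketch below uses plain slices in place of Arrow arrays. It is not the `typed_cmp_dict_non_dict!` macro and not the follow-up PR's code; `nan_aware_eq` and `eq_dict_non_dict` are hypothetical names, and nulls are deliberately not modeled.

```rust
/// Hypothetical NaN-aware equality used as the element-wise comparison.
fn nan_aware_eq(a: f64, b: f64) -> bool {
    (a.is_nan() && b.is_nan()) || a == b
}

/// Compare a dictionary-encoded column (`keys` index into `values`)
/// against a plain column, element by element.
/// Plain slices stand in for Arrow arrays; nulls are not modeled.
fn eq_dict_non_dict(keys: &[usize], values: &[f64], plain: &[f64]) -> Vec<bool> {
    assert_eq!(keys.len(), plain.len(), "columns must have equal length");
    keys.iter()
        .zip(plain)
        // Look up the dictionary value for each key, then compare.
        .map(|(&k, &p)| nan_aware_eq(values[k], p))
        .collect()
}
```

The point of the sketch is only that the closure passed for element comparison is where the NaN rule lives; swapping `a == b` for a NaN-aware predicate changes the result for NaN entries and nothing else.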

tustvold · 2022-08-24T07:42:28Z

Have we considered just making this the default behaviour? If we don't want to do that, I think we should name the feature flag something like ordered_nan, to make clear that it controls NaN ordering and not something else.
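
For readers unfamiliar with how a Cargo feature can switch behavior at compile time, the following is a minimal sketch. The feature name `ordered_nan` is taken from the suggestion above purely for illustration; it is an assumption, not the name the PR settled on, and `float_eq` is a hypothetical helper rather than an Arrow API.

```rust
/// Element-wise equality whose NaN handling is chosen at compile time.
/// `ordered_nan` is a hypothetical Cargo feature name used only for
/// illustration.
fn float_eq(a: f64, b: f64) -> bool {
    if cfg!(feature = "ordered_nan") {
        // Opt-in behavior: NaN compares equal to NaN.
        (a.is_nan() && b.is_nan()) || a == b
    } else {
        // Default behavior: plain IEEE 754 equality, so NaN != NaN.
        a == b
    }
}
```

Because `cfg!` is resolved when the crate is built, one binary gets exactly one of the two behaviors. That is the trade-off behind the naming discussion: the flag name is the only hint downstream users get about what changes.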

sunchao

What is the current behavior for NaN? It may be worth adding some context in the PR description. #264 is a good reference on this topic too. Note there is no SQL standard for NaN, and different engines may have different behaviors. For instance, in PostgreSQL NaN is only considered equal to NaN in sort, but not other cases. Therefore, I think we should clearly document the behavior introduced here.

Similar question to @tustvold's: should we aim to make all the compute kernels SQL compliant? In that case we would no longer need a flag like this.

Also, could we have some tests for this too?

viirya · 2022-08-24T17:27:37Z

I'm open to suggestions on the feature flag naming. sql_compliant is used because the change follows SQL semantics for NaN handling. Maybe it sounds too broad, as currently it only treats NaN specially.

Have we considered just making this the default behaviour?

I'm not sure that this would make sense for use cases other than SQL.

Note there is no SQL standard for NaN, and different engines may have different behaviors. For instance, in PostgreSQL NaN is only considered equal to NaN in sort, but not other cases. Therefore, I think we should clearly document the behavior introduced here.

I think that PostgreSQL also treats NaN as equal to NaN, as Spark does.

Quote from https://www.postgresql.org/docs/current/datatype-numeric.html:

In order to allow numeric values to be sorted and used in tree-based indexes, PostgreSQL treats NaN values as equal, and greater than all non-NaN values.

I just ran SELECT double precision 'NaN' = double precision 'NaN'; in PostgreSQL, and it returns true.

I agree that we should document it clearly. I will update the documentation.

Similar question to @tustvold's: should we aim to make all the compute kernels SQL compliant? In that case we would no longer need a flag like this.

As I answered above for @tustvold's question, I think we need both behaviors in the compute kernels. For non-SQL use cases, the current behavior is correct. But under SQL semantics, NaN not being equal to NaN causes practical issues when processing data, so we need different behavior there. That said, I think that we cannot just change all compute kernels to follow SQL semantics.
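
One way to picture the practical issue is a DISTINCT-like step over sorted data. The comment above does not name a specific operation, so this is an assumed example; `dedup_ieee` and `dedup_nan_equal` are hypothetical helpers over plain `Vec<f64>`, not Arrow kernels.

```rust
/// Remove consecutive duplicates using IEEE 754 equality.
/// Adjacent NaNs are never merged, because NaN != NaN.
fn dedup_ieee(mut v: Vec<f64>) -> Vec<f64> {
    v.dedup_by(|a, b| *a == *b);
    v
}

/// Remove consecutive duplicates treating NaN as equal to NaN,
/// which is what a SQL engine would typically expect.
fn dedup_nan_equal(mut v: Vec<f64>) -> Vec<f64> {
    v.dedup_by(|a, b| (a.is_nan() && b.is_nan()) || *a == *b);
    v
}
```

For the input `[1.0, 1.0, NaN, NaN]`, the IEEE version keeps three elements (both NaNs survive), while the NaN-equal version keeps two.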

Also, could we have some tests for this too?

I have some tests for this already.

sunchao · 2022-08-24T18:04:54Z

I think that PostgreSQL also treats NaN as equal to NaN, as Spark does.

Actually Vertica is a better example there.

I have some tests for this already.

Oops didn't see them.

viirya · 2022-08-25T05:07:02Z

Renamed the feature flag and added documentation about it on these kernels.

viirya · 2022-08-25T20:37:58Z

More comments? @sunchao @tustvold

sunchao

LGTM

kazuyukitanimura

LGTM (non-binding)

viirya · 2022-08-26T07:22:35Z

Thank you for the review.

viirya · 2022-08-26T07:23:33Z

I will submit the missing pieces (dictionary array with non dictionary array, etc.) later.

ursabot · 2022-08-26T07:31:12Z

Benchmark runs are scheduled for baseline = a685c5f and contender = 63afe25. 63afe25 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions bot added the arrow Changes to the arrow crate label Aug 24, 2022

viirya force-pushed the sql_comparison branch 3 times, most recently from f4b0ae8 to 1a6ccbf Compare August 24, 2022 06:43

Add sql-compliant feature for enabling sql-compliant kernel behavior

a5af59e

viirya force-pushed the sql_comparison branch from 1a6ccbf to a5af59e Compare August 24, 2022 07:10

tustvold changed the title ~~Support SQL-compliant behavior on eq_dyn, neq_dyn, lt_dyn, lt_eq_dyn, gt_dyn, gt_eq_dyn~~ Support SQL-compliant NaN behavior on eq_dyn, neq_dyn, lt_dyn, lt_eq_dyn, gt_dyn, gt_eq_dyn Aug 24, 2022

sunchao reviewed Aug 24, 2022

Rename feature flag and add document

d6fb174

sunchao approved these changes Aug 25, 2022

kazuyukitanimura approved these changes Aug 25, 2022

viirya merged commit 63afe25 into apache:master Aug 26, 2022
