Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sql-compliant feature for enabling sql-compliant kernel behavior #2568

Closed
viirya opened this issue Aug 24, 2022 · 6 comments
Closed

Add sql-compliant feature for enabling sql-compliant kernel behavior #2568

viirya opened this issue Aug 24, 2022 · 6 comments
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@viirya
Copy link
Member

viirya commented Aug 24, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Some kernels behaves different with SQL semantics. For example, by definition, NaN is not equal to itself but NaN is equal to NaN with SQL semantics. Using current comparison kernels in SQL system leads to different behavior and generates incorrect results.

Describe the solution you'd like

We should provide SQL-compliant kernels which can be enabled by feature flag.

Describe alternatives you've considered

Additional context

@tustvold
Copy link
Contributor

tustvold commented Aug 29, 2022

There is an IEEE standard for total ordering of floats, I wonder if we could use that? Aside from being a standard, where I've not managed to find an authoritative SQL standard for floats, it can be implemented with relatively cheap bit manipulation, instead of more expensive branching.

There is a built-in implementation here, which also does a good job explaining its behaviour. Theoretically the performance may be good enough that we can just always have this behaviour without needing a feature flag?

What do you think?

Also tagging @alamb

@viirya
Copy link
Member Author

viirya commented Aug 29, 2022

Hmm, the NaN ordering of the IEEE standard treats NaN value larger than any other numbers. Looks it is consistent with the SQL-compliant ordering we need. I can rewrite these kernels with total_cmp. As without a flag, it is somehow breaking existing behavior but I think current behavior can be thought as wrong or not useful.

@viirya
Copy link
Member Author

viirya commented Aug 29, 2022

I think total_cmp looks promising as it treats NaN ordering the same way as Spark/Postgresql. cc @sunchao to take a look too in case if I miss any point here. 😄

@NGA-TRAN
Copy link
Contributor

NGA-TRAN commented Aug 29, 2022

@viirya

For example, by definition, NaN is not equal to itself but NaN is equal to NaN with SQL semantics

Can you provide a more specific example for NaN is equal to NaN with SQL semantics? This document says NaN is not equal to itself in SQL https://www.vertica.com/blog/vertica-quick-tip-query-nan-values/

@viirya
Copy link
Member Author

viirya commented Aug 29, 2022

We've discussed this in PRs. It depends on SQL implementations. As far as I know, Spark and postgresql treats NaNs equal.

@sunchao
Copy link
Member

sunchao commented Aug 29, 2022

I think total_cmp looks promising as it treats NaN ordering the same way as Spark/Postgresql. cc @sunchao to take a look too in case if I miss any point here. 😄

I took a look too, and yes it looks like total_cmp exhibits the same behavior as Spark/Postgres/Snowflake etc, so I think it could be good idea to replace the existing code with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

4 participants