Support casting Utf8 to Boolean #1738

Merged
merged 1 commit into from May 30, 2022

Conversation

MazterQyou
Contributor

@MazterQyou MazterQyou commented May 24, 2022

Which issue does this PR close?

Closes #1740.

Rationale for this change

Casting Utf8 to Boolean is useful when comparing values of the two types.
Several Utf8 inputs can be interpreted as Boolean true or false; this change implements conversion from the strings that PostgreSQL accepts.

What changes are included in this PR?

This PR implements Utf8 to Boolean cast, respecting the cast_options.safe value, and adds related tests.
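
To make the role of cast_options.safe concrete, here is a minimal sketch in plain Rust that does not use the arrow crate. The names `parse_bool` and `cast_utf8_to_boolean`, the use of `Vec<Option<bool>>` in place of a BooleanArray, the exact-match comparison, and the error text are all assumptions made for illustration; they are not the code merged in this PR. The accepted strings are the ones listed in the doc comment changed by this PR.

```rust
/// Hypothetical helper: maps one string to a boolean using the accepted
/// strings from the PR's doc comment. Returns None for anything else.
/// Exact, lowercase matching is an assumption of this sketch.
fn parse_bool(s: &str) -> Option<bool> {
    match s {
        "true" | "yes" | "on" | "1" => Some(true),
        "false" | "no" | "off" | "0" => Some(false),
        _ => None,
    }
}

/// Hypothetical stand-in for the cast kernel. `None` models a null slot.
/// With `safe == true`, unparseable strings become null;
/// with `safe == false`, the first unparseable string produces an error.
fn cast_utf8_to_boolean(
    values: &[Option<&str>],
    safe: bool,
) -> Result<Vec<Option<bool>>, String> {
    values
        .iter()
        .map(|v| match v {
            // Nulls pass through unchanged.
            None => Ok(None),
            Some(s) => match parse_bool(s) {
                Some(b) => Ok(Some(b)),
                // Safe mode: replace the bad value with null.
                None if safe => Ok(None),
                // Unsafe mode: report the bad value (illustrative message).
                None => Err(format!("cannot cast '{}' to Boolean", s)),
            },
        })
        .collect()
}
```

With this sketch, an input such as "maybe" yields a null when `safe` is true and an `Err` when `safe` is false, while null inputs stay null in both modes.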

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label May 24, 2022
@MazterQyou
Contributor Author

Should I submit a related issue?

@viirya
Member

viirya commented May 24, 2022

@MazterQyou Yes, we prefer to have a related issue submitted for tracking purposes. Thank you.

@codecov-commenter

codecov-commenter commented May 24, 2022

Codecov Report

Merging #1738 (e3e1f7a) into master (722fcfc) will increase coverage by 0.02%.
The diff coverage is 98.07%.

❗ Current head e3e1f7a differs from pull request most recent head ac19755. Consider uploading reports for the commit ac19755 to get more accurate results

@@            Coverage Diff             @@
##           master    #1738      +/-   ##
==========================================
+ Coverage   83.27%   83.30%   +0.02%     
==========================================
  Files         195      196       +1     
  Lines       55896    55998     +102     
==========================================
+ Hits        46549    46649     +100     
- Misses       9347     9349       +2     
| Impacted Files | Coverage Δ |
| --- | --- |
| arrow/src/compute/kernels/cast.rs | 95.74% <94.11%> (-0.04%) ⬇️ |
| arrow/src/compute/kernels/string.rs | 100.00% <100.00%> (ø) |
| arrow/src/datatypes/datatype.rs | 65.42% <0.00%> (-0.38%) ⬇️ |
| parquet_derive/src/parquet_field.rs | 65.75% <0.00%> (-0.23%) ⬇️ |
| arrow/src/array/transform/mod.rs | 86.85% <0.00%> (+0.11%) ⬆️ |
| parquet/src/encodings/encoding.rs | 93.65% <0.00%> (+0.19%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@MazterQyou
Contributor Author

Submitted and mentioned the related issue.

@@ -280,6 +280,8 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool {
///
/// Behavior:
/// * Boolean to Utf8: `true` => '1', `false` => `0`
/// * Utf8 to boolean: `true`, `yes`, `on`, `1` => `true`, `false`, `no`, `off`, `0` => `false`,
Member

I just looked at C++ Arrow: https://github.com/apache/arrow/blob/b8431fba68e2540b3e57def0bd0ad718652c4b98/cpp/src/arrow/util/value_parsing.h#L92. It seems to convert only "0", "1", "true", and "false". Should we follow it and remove "on", "off", "yes", and "no"?

Contributor

I haven't come across "on", "off", "yes", "no" in practice. They're language specific, unlike T and F which are computing primitives.

So I'd stick to "1", "true" and their inverses.

Contributor Author

I guess removing those rare variants makes sense.

> unlike T and F which are computing primitives

Should the shorter equivalents t and f be left in, then? If so, just t and f, or all of the shorter variants?
And what about whitespace trimming?
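
For reference, PostgreSQL's documented boolean input rules accept unique prefixes of the full words, ignore leading and trailing whitespace, and ignore case. A minimal sketch of that behavior in plain Rust follows; the name `parse_bool_prefix` is hypothetical, and this is an illustration of the question rather than the parsing code in this PR.

```rust
/// Hypothetical helper sketching PostgreSQL-style boolean input:
/// trims whitespace, ignores ASCII case, and accepts any prefix that
/// identifies exactly one of the true words or one of the false words.
fn parse_bool_prefix(input: &str) -> Option<bool> {
    const TRUE_WORDS: [&str; 3] = ["true", "yes", "on"];
    const FALSE_WORDS: [&str; 3] = ["false", "no", "off"];

    let s = input.trim().to_ascii_lowercase();
    if s.is_empty() {
        return None;
    }
    // The digits have no shorter forms, so match them exactly.
    if s == "1" {
        return Some(true);
    }
    if s == "0" {
        return Some(false);
    }

    let is_true = TRUE_WORDS.iter().any(|w| w.starts_with(s.as_str()));
    let is_false = FALSE_WORDS.iter().any(|w| w.starts_with(s.as_str()));
    match (is_true, is_false) {
        (true, false) => Some(true),
        (false, true) => Some(false),
        // Either unrecognized, or ambiguous: "o" is a prefix of both
        // "on" and "off", so it is rejected.
        _ => None,
    }
}
```

Under these rules " T " and "y" parse as true, "of" and "n" parse as false, and "o" is rejected because it does not identify a single word.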

Contributor Author

> I haven't come across "on", "off", "yes", "no" in practice

Actually, I just remembered that the YES and NO variants are used in ANSI information_schema views.

Contributor

@alamb alamb left a comment

Thank you @MazterQyou -- this is looking good. A little more polishing and hopefully this can make it into the next release in a day or two

arrow/src/compute/kernels/cast.rs (outdated review comment, resolved)
@MazterQyou MazterQyou force-pushed the upstream-patch/utf8-to-boolean-cast branch 2 times, most recently from aeb564c to ac19755 Compare May 25, 2022 21:35
@MazterQyou MazterQyou requested a review from alamb May 25, 2022 21:38
Contributor

@alamb alamb left a comment

I think the code looks good -- thank you @MazterQyou

I left some style comments, but I think this PR is looking ready to go now

arrow/src/compute/kernels/cast.rs (outdated review comment, resolved)
@MazterQyou MazterQyou force-pushed the upstream-patch/utf8-to-boolean-cast branch from ac19755 to f2d2e41 Compare May 26, 2022 17:56
@MazterQyou MazterQyou force-pushed the upstream-patch/utf8-to-boolean-cast branch from f2d2e41 to 65b2d7c Compare May 26, 2022 17:57
@MazterQyou MazterQyou requested a review from viirya May 26, 2022 17:58
Contributor

@alamb alamb left a comment

I think this looks good. Thanks @MazterQyou

Are we all happy to keep the on/off/yes/no?

@viirya
Member

viirya commented May 28, 2022

I prefer to stick with "1", "true", "0", "false" only, but not strong opinion.

Contributor

@liukun4515 liukun4515 left a comment

LGTM
This is the first time I've learned that so many strings can be accepted for the BOOLEAN type.

@nevi-me
Contributor

nevi-me commented May 30, 2022

> I prefer to stick with "1", "true", "0", "false" only, but not strong opinion.

I echo this, but also not a strong opinion

@nevi-me nevi-me merged commit 486118c into apache:master May 30, 2022
MazterQyou added a commit to cube-js/arrow-rs that referenced this pull request May 30, 2022
@MazterQyou MazterQyou deleted the upstream-patch/utf8-to-boolean-cast branch May 30, 2022 08:39
MazterQyou added a commit to cube-js/arrow-rs that referenced this pull request May 30, 2022
ovr pushed a commit to cube-js/arrow-rs that referenced this pull request Aug 15, 2022
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support casting from DataType::Utf8 to DataType::Boolean
7 participants