feat: Implement string cast operations for Time32 and Time64 #2251

Merged
merged 3 commits into from Aug 2, 2022

stuartcarnie
Contributor

Which issue does this PR close?

Closes #2053 and helps apache/datafusion#2883.

Rationale for this change

N/A

What changes are included in this PR?

Implements cast operations following the precedent of existing implementations.

Are there any user-facing changes?

The cast API now supports string -> Time32 and Time64 transformations.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 1, 2022
Contributor

@alamb alamb left a comment

Thank you @stuartcarnie -- this is looking great. I left some feedback but I ran out of time today -- I will complete my review first thing tomorrow.

cc @avantgardnerio

(Utf8,
Date32
| Date64
| Time32(TimeUnit::Second)
Contributor

❤️

fn seconds_since_midnight(time: &chrono::NaiveTime) -> i32 {
let sec = time.num_seconds_from_midnight();
let frac = time.nanosecond();
let adjust = if frac < 1_000_000_000 { 0 } else { 1 };
Contributor

It took me a while to grok this -- I think it is leap second handling
https://docs.rs/chrono/0.4.19/chrono/trait.Timelike.html#tymethod.nanosecond

Maybe we could add a comment explaining what was going on

Suggested change
- let adjust = if frac < 1_000_000_000 { 0 } else { 1 };
+ // handle leap second
+ // see https://docs.rs/chrono/0.4.19/chrono/trait.Timelike.html#tymethod.nanosecond
+ let adjust = if frac < 1_000_000_000 { 0 } else { 1 };

Contributor

I'm a bit confused about this myself. It was my understanding that these were added manually by timekeepers? Does chrono keep a list of historical leap seconds? https://en.wikipedia.org/wiki/Leap_second#:~:text=Between%201972%20and%202020%2C%20a,every%2021%20months%2C%20on%20average.

Contributor Author

It is possible to parse a time with a leap second, using the 60th second:

https://docs.rs/chrono/0.3.1/chrono/naive/time/index.html#reading-and-writing-leap-seconds

The leap second is stored as a fractional second of 1_000_000_000 nanoseconds, but I realise now that I don't need any logic to handle it. The code is much simpler 😂
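To make that representation concrete, here is a minimal std-only sketch. It assumes the convention described in the linked chrono documentation: second 60 is kept as second 59, with the extra second carried in a fractional part of at least 1_000_000_000 nanoseconds. The `TimeOfDay` struct and the `parse_hms` helper are hypothetical stand-ins for illustration; they are not chrono's API and not the code in this PR.

```rust
/// Hypothetical stand-in for a time of day; this is not chrono's `NaiveTime`.
#[derive(Debug, PartialEq)]
struct TimeOfDay {
    /// Whole seconds since midnight, always in 0..86_400.
    secs: u32,
    /// Nanoseconds; a value of 1_000_000_000 or more marks a leap second.
    frac: u32,
}

/// Parses "HH:MM:SS" with an optional ".fraction" of up to nine digits.
/// Second 60 is accepted and folded into `frac` instead of `secs`.
fn parse_hms(s: &str) -> Option<TimeOfDay> {
    let (hms, fraction) = match s.split_once('.') {
        Some((hms, fraction)) => (hms, fraction),
        None => (s, ""),
    };
    let mut parts = hms.split(':');
    let hour: u32 = parts.next()?.parse().ok()?;
    let min: u32 = parts.next()?.parse().ok()?;
    let sec: u32 = parts.next()?.parse().ok()?;
    if parts.next().is_some() || hour > 23 || min > 59 || sec > 60 {
        return None;
    }
    if fraction.len() > 9 || !fraction.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    // Right-pad the fraction to nine digits so ".5" means 500_000_000 ns.
    let mut frac: u32 = format!("{:0<9}", fraction).parse().ok()?;
    let sec = if sec == 60 {
        // Leap second: stay on second 59 and push the extra second into `frac`.
        frac += 1_000_000_000;
        59
    } else {
        sec
    };
    Some(TimeOfDay { secs: hour * 3600 + min * 60 + sec, frac })
}

fn main() {
    // 8 * 3600 + 8 * 60 + 59 = 29_339 whole seconds, plus an oversized fraction.
    println!("{:?}", parse_hms("08:08:60.091323414"));
    println!("{:?}", parse_hms("08:08:35.091323414"));
}
```

Because the leap second lives entirely in `frac`, any conversion that divides `frac` by the unit size picks it up without a separate branch.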

Comment on lines 1643 to 1648
let (frac, adjust) = if frac < 1_000_000_000 {
    (frac, 0)
} else {
    (frac - 1_000_000_000, MILLIS_PER_SEC)
};
(sec + adjust + frac / NANOS_PER_MILLI) as i32
Contributor

There is probably a good reason, but my feeble mind can't figure it out. Why is it important to break up frac and adjust?

For example, isn't this equivalent?

Suggested change
- let (frac, adjust) = if frac < 1_000_000_000 {
-     (frac, 0)
- } else {
-     (frac - 1_000_000_000, MILLIS_PER_SEC)
- };
- (sec + adjust + frac / NANOS_PER_MILLI) as i32
+ (sec + frac / NANOS_PER_MILLI) as i32

Contributor Author

Yes indeed, I concluded the same thing and pushed up the changes, thanks! I will move the functions back inline as I originally had them, before I overcomplicated it 😂
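The equivalence can be checked with a small std-only sketch. The quoted snippet does not show how `sec` is computed, so the sketch assumes it has already been scaled to milliseconds; the function names and the `u32` parameter types are illustrative assumptions, not the PR's code. The two forms agree because 1_000_000_000 is an exact multiple of `NANOS_PER_MILLI`, so subtracting it before the integer division and adding `MILLIS_PER_SEC` afterwards changes nothing.

```rust
const MILLIS_PER_SEC: u32 = 1_000;
const NANOS_PER_MILLI: u32 = 1_000_000;
const NANOS_PER_SEC: u32 = 1_000_000_000;

/// Earlier form: peel the leap second off `frac`, then add it back as `adjust`.
fn millis_with_adjust(secs: u32, frac: u32) -> i32 {
    let sec = secs * MILLIS_PER_SEC; // assumption: `sec` is in milliseconds
    let (frac, adjust) = if frac < NANOS_PER_SEC {
        (frac, 0)
    } else {
        (frac - NANOS_PER_SEC, MILLIS_PER_SEC)
    };
    (sec + adjust + frac / NANOS_PER_MILLI) as i32
}

/// Simplified form: a single integer division already covers the leap second.
fn millis_direct(secs: u32, frac: u32) -> i32 {
    let sec = secs * MILLIS_PER_SEC; // same assumption as above
    (sec + frac / NANOS_PER_MILLI) as i32
}

fn main() {
    // Ordinary fractions and leap-second fractions (>= 1_000_000_000).
    let fracs = [0, 999_999, 91_323_414, 999_999_999, 1_000_000_000, 1_091_323_414, 1_999_999_999];
    for secs in [0u32, 29_339, 86_399] {
        for frac in fracs {
            assert_eq!(millis_with_adjust(secs, frac), millis_direct(secs, frac));
        }
    }
    println!("both forms agree on every sampled input");
}
```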

@@ -1584,6 +1625,303 @@ fn cast_string_to_date64<Offset: OffsetSizeTrait>(
Ok(Arc::new(array) as ArrayRef)
}

fn seconds_since_midnight(time: &chrono::NaiveTime) -> i32 {
Contributor

https://github.com/apache/arrow-rs/blob/master/arrow/src/temporal_conversions.rs may be a good place to put these functions too (so they have a chance of being found / reused)

Comment on lines 1662 to 1672
let iter = (0..string_array.len()).map(|i| {
    if string_array.is_null(i) {
        None
    } else {
        string_array
            .value(i)
            .parse::<chrono::NaiveTime>()
            .map(|time| seconds_since_midnight(&time))
            .ok()
    }
});
Contributor

I think this will be faster as it will not require checking bounds for calls to is_null or value:

Suggested change
- let iter = (0..string_array.len()).map(|i| {
-     if string_array.is_null(i) {
-         None
-     } else {
-         string_array
-             .value(i)
-             .parse::<chrono::NaiveTime>()
-             .map(|time| seconds_since_midnight(&time))
-             .ok()
-     }
- });
+ let iter = string_array
+     .iter()
+     .flat_map(|v| {
+         v.map(|v| {
+             v.parse::<chrono::NaiveTime>()
+                 .map(|time| seconds_since_midnight(&time))
+                 .ok()
+         })
+     });

Contributor

I think the same type of transformation can be applied to the iterator below this as well.

Contributor Author

Sounds good – it looks like the string to date transformation functions would benefit from the same treatment, but I'll leave that to another PR so as to not cloud this one.

Contributor Author

I tried, but unfortunately it fails at runtime with a panic:

thread 'compute::kernels::cast::tests::test_cast_string_to_time32second' panicked at 'trusted_len_unzip requires an upper limit', arrow/src/array/array_primitive.rs:470:25

at

let (_, upper) = iterator.size_hint();
let len = upper.expect("trusted_len_unzip requires an upper limit");

The previous iter was ultimately of type Range<usize>, which returns Some(_) as the upper bound of its size_hint:

https://github.com/rust-lang/rust/blob/fe3342816a282949f014caa05ea2e669ff9d3d3c/library/core/src/iter/range.rs#L714-L722

whereas our new iter does not, as it does not know the size.

There are probably good reasons, but it is unfortunate that size_hint was not a separate trait, so this could be caught at compile time.
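The difference can be reproduced with std-only iterators. In the sketch below, `parse_secs` is a hypothetical stand-in for the chrono parsing and `require_upper` mimics the quoted `expect` check, returning an error instead of panicking. The `map` + `and_then` variant is only one possible shape that keeps one output per input; it is an assumption for illustration, not necessarily what any follow-up change does.

```rust
/// Mimics the quoted check: the caller needs a known upper bound.
fn require_upper<I: Iterator>(iter: &I) -> Result<usize, &'static str> {
    iter.size_hint().1.ok_or("iterator has no upper limit")
}

/// Hypothetical stand-in for parsing a time string into seconds.
fn parse_secs(s: &str) -> Option<i32> {
    s.parse::<i32>().ok()
}

fn main() {
    let values: Vec<Option<&str>> = vec![Some("1"), None, Some("x"), Some("4")];

    // Index-based: a mapped `Range<usize>` reports its exact length.
    let by_index = (0..values.len()).map(|i| values[i].and_then(parse_secs));
    println!("index-based:    {:?}", require_upper(&by_index));

    // `map` yields exactly one item per input, so the bound is preserved.
    let by_map = values.iter().map(|v| v.and_then(parse_secs));
    println!("map + and_then: {:?}", require_upper(&by_map));

    // `flat_map` may yield zero or one item per input here, so the adapter
    // cannot promise an exact length; print what this toolchain reports.
    let by_flat_map = values.iter().flat_map(|v| v.map(parse_secs));
    println!("flat_map:       {:?}", require_upper(&by_flat_map));

    // In this sketch `flat_map` also drops the slot for the `None` input.
    let kept: Vec<Option<i32>> = by_map.collect();
    let flattened: Vec<Option<i32>> = by_flat_map.collect();
    println!("{} rows vs {} rows", kept.len(), flattened.len());
}
```

The last two lines show a second difference worth watching for: flattening an `Option<Option<_>>` removes the outer `None` entirely, so the output has fewer rows than the input.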

Contributor

Thanks for checking @stuartcarnie -- I wonder if flat_map is the problem -- let me take another crack at this

Contributor

I found a way to do this cleanly -- in #2284

} else {
let string = string_array
.value(i);
chrono::Duration::days(3);
Contributor

The

                    chrono::Duration::days(3);

Seems like a leftover?

Contributor Author

Indeed – it is gone now, thanks for the spot

Remove the unnecessary conditionals to extract the leap second, as it is
already handled when converting to a time unit relative to midnight 🤦🏻‍♂️
Contributor

@avantgardnerio avantgardnerio left a comment

Thanks for the clarifications! Hopefully having them on this PR will help someone in the future - they definitely helped me.

LGTM

@codecov-commenter

Codecov Report

Merging #2251 (7340b19) into master (3032a52) will decrease coverage by 0.03%.
The diff coverage is 73.44%.

@@            Coverage Diff             @@
##           master    #2251      +/-   ##
==========================================
- Coverage   82.29%   82.26%   -0.04%     
==========================================
  Files         243      245       +2     
  Lines       62443    62863     +420     
==========================================
+ Hits        51387    51713     +326     
- Misses      11056    11150      +94     
Impacted Files Coverage Δ
arrow/src/compute/kernels/cast.rs 94.12% <73.44%> (-1.71%) ⬇️
...row/src/array/builder/string_dictionary_builder.rs 90.64% <0.00%> (-0.72%) ⬇️
parquet/src/encodings/encoding/dict_encoder.rs 90.74% <0.00%> (-0.49%) ⬇️
...rquet/src/arrow/record_reader/definition_levels.rs 88.60% <0.00%> (-0.43%) ⬇️
parquet/src/column/writer/mod.rs 92.85% <0.00%> (-0.15%) ⬇️
arrow/src/array/mod.rs 100.00% <0.00%> (ø)
parquet/src/arrow/arrow_writer/byte_array.rs 76.72% <0.00%> (ø)
arrow/src/array/array_fixed_size_list.rs 92.91% <0.00%> (ø)
parquet/src/arrow/arrow_writer/mod.rs 97.66% <0.00%> (+0.01%) ⬆️
parquet/src/arrow/arrow_reader.rs 95.12% <0.00%> (+0.12%) ⬆️
... and 10 more

Contributor

@alamb alamb left a comment

I think this code's logic and tests look good. It would be nice to clean up the iterator logic a bit, though I think we can do that in a follow-on PR.

])) as ArrayRef;
let a2 = Arc::new(LargeStringArray::from(vec![
Some("08:08:35.091323414"),
Some("08:08:60.091323414"), // leap second
Contributor

💯 for the leap second

@@ -2854,6 +3172,102 @@ mod tests {
}
}

#[test]
fn test_cast_string_to_time32second() {
Contributor

👍 these are great tests.

assert!(c.is_null(2));
assert!(c.is_null(3));
assert!(c.is_null(4));
}
Contributor

I didn't see any tests for the fallible path (i.e. the one that returns an error). I want to have another crack at using the iterators, so I'll add those tests in a follow-up.
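For readers unfamiliar with the distinction, the two paths can be sketched with std-only code. This assumes the behaviour implied by the tests above (unparseable strings become nulls) and by this comment (an alternative path that reports an error). `parse_secs`, `cast_lenient`, `cast_strict`, and the error message are hypothetical names for illustration and are not the arrow cast API.

```rust
/// Hypothetical stand-in for parsing a time string into seconds.
fn parse_secs(s: &str) -> Option<i32> {
    s.parse::<i32>().ok()
}

/// Lenient path: an unparseable value becomes a null (`None`).
fn cast_lenient(values: &[Option<&str>]) -> Vec<Option<i32>> {
    values.iter().map(|v| v.and_then(parse_secs)).collect()
}

/// Fallible path: the first unparseable value aborts the whole cast.
fn cast_strict(values: &[Option<&str>]) -> Result<Vec<Option<i32>>, String> {
    values
        .iter()
        .map(|v| match v {
            // Nulls pass through unchanged in both paths.
            None => Ok(None),
            Some(s) => parse_secs(s)
                .map(Some)
                .ok_or_else(|| format!("cannot cast '{}' to a time", s)),
        })
        .collect()
}

fn main() {
    let input = [Some("35"), None, Some("Not a valid time")];
    println!("{:?}", cast_lenient(&input));
    println!("{:?}", cast_strict(&input));
}
```

A test for the fallible path would then assert on the `Err` value rather than on `is_null`.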

@alamb alamb merged commit 9a4b1c9 into apache:master Aug 2, 2022
@ursabot

ursabot commented Aug 2, 2022

Benchmark runs are scheduled for baseline = ed9fc56 and contender = 9a4b1c9. 9a4b1c9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels
arrow Changes to the arrow crate
Development

Successfully merging this pull request may close these issues.

Support for casting from Utf8/String to Time32 / Time64
5 participants