Parse Time32/Time64 from formatted string #3101

Jefffrey · 2022-11-13T10:41:26Z

Which issue does this PR close?

Closes #3100.

Rationale for this change

What changes are included in this PR?

Enable parsing Time32/Time64 from formatted string.

Enable reading Time32/Time64 from CSV files.

Are there any user-facing changes?

Able to parse Time32/Time64 types from formatted string, and from CSV.

tustvold · 2022-11-14T05:21:57Z

This looks good to me, I especially love the test coverage.

Before merging I would like to get consensus on whether we want to support such a broad range of representations. I could definitely see an argument for supporting only RFC3339-style times, i.e. %H:%M:%S and %H:%M:%S%.f.

Thoughts @waitingkuo @alamb ?

Jefffrey · 2022-11-14T05:32:27Z

It's a fair point, as I was unsure which to support too. I was planning to base it on other Arrow implementations, like pyarrow, but had difficulty tracking down the actual parsing code to reference, so I just went with a bunch of formats which seemed reasonable. Happy to cut down any if it's too broad.

tustvold · 2022-11-14T05:41:51Z

One other option would be to support a default representation, but support specifying a custom format string by using Parse::parse_formatted originally added by @sum12 in #1451

Jefffrey · 2022-11-14T09:03:46Z

One other option would be to support a default representation, but support specifying a custom format string by using Parse::parse_formatted originally added by @sum12 in #1451

Can definitely add the support for Parse::parse_formatted, but still would leave the question of what the default(s) should be, especially for the original use case of parsing from csv. Ideally wouldn't want to have to slap parse_formatted for all the different representations in here:

arrow-rs/arrow-csv/src/reader.rs

Lines 587 to 604 in 3ca41f5

    
    DataType::Time32(TimeUnit::Second) => {
        build_primitive_array::<Time32SecondType>(line_number, rows, i, None)
    }
    DataType::Time32(TimeUnit::Millisecond) => build_primitive_array::<
        Time32MillisecondType,
    >(
        line_number, rows, i, None
    ),
    DataType::Time64(TimeUnit::Microsecond) => build_primitive_array::<
        Time64MicrosecondType,
    >(
        line_number, rows, i, None
    ),
    DataType::Time64(TimeUnit::Nanosecond) => build_primitive_array::<
        Time64NanosecondType,
    >(
        line_number, rows, i, None
    ),

alamb

I think this looks great personally -- thanks @Jefffrey

My opinion on the wide range of formats is that it is a good feature. My rationale is:

It is consistent with the variety of timestamp handling we have
It is a better user experience (if I have a Time made by Excel or some other program, I don't want to have to specify a custom timestamp format to deal with it).

If the speed of parsing a CSV file is the core issue, perhaps @tustvold's suggestion of allowing a custom format string (#3101 (comment)) would be one way to speed it up.

alamb · 2022-11-14T20:31:41Z

arrow-cast/src/parse.rs

-parser_primitive!(Time32MillisecondType);
-parser_primitive!(Time32SecondType);
+impl Parser for Time64NanosecondType {
+    fn parse(string: &str) -> Option<Self::Native> {


I think accepting a wide range of formats is consistent with string_to_timestamp_nanos for better or worse

https://github.com/Jefffrey/arrow-rs/blob/3ca41f50d0e8b6da95d83e5bf0b09fd518e2110f/arrow-cast/src/parse.rs#L71-L133

The only thing I recommend is adding docstring documentation (that will show up on docs.rs) for the types of formats accepted. We could follow the example of string_to_timestamp_nanos :
https://github.com/Jefffrey/arrow-rs/blob/3ca41f50d0e8b6da95d83e5bf0b09fd518e2110f/arrow-cast/src/parse.rs#L23-L54

One other option to speed this up might be to do a pass through the string and compute

Number of colons

Presence of space

Presence of capital M

Presence of decimal point

And use this to prune the list of candidates
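The pre-pass described above could look roughly like the following. This is only a sketch under assumptions: `TimeShape`, `scan_time_string`, and `candidate_formats` are hypothetical names, and the format lists are illustrative chrono-style strings rather than the set the PR actually tries.

```rust
/// Hypothetical summary of a time string, gathered in one pass over its bytes.
#[derive(Debug, PartialEq)]
struct TimeShape {
    colons: usize,
    has_space: bool,
    has_capital_m: bool,
    has_decimal: bool,
}

/// Scan the string once and record the four features listed above.
fn scan_time_string(s: &str) -> TimeShape {
    let mut shape = TimeShape {
        colons: 0,
        has_space: false,
        has_capital_m: false,
        has_decimal: false,
    };
    for b in s.bytes() {
        match b {
            b':' => shape.colons += 1,
            b' ' => shape.has_space = true,
            b'M' => shape.has_capital_m = true,
            b'.' => shape.has_decimal = true,
            _ => {}
        }
    }
    shape
}

/// Map a shape to the formats still worth trying.
/// The lists here are illustrative assumptions, not the PR's actual lists.
fn candidate_formats(shape: &TimeShape) -> &'static [&'static str] {
    match (shape.colons, shape.has_decimal, shape.has_space, shape.has_capital_m) {
        (2, false, false, _) => &["%H:%M:%S"],
        (2, true, false, _) => &["%H:%M:%S%.f"],
        (2, false, true, true) => &["%I:%M:%S %p"],
        (2, false, true, false) => &["%I:%M:%S %P"],
        (2, true, true, true) => &["%I:%M:%S%.f %p"],
        (2, true, true, false) => &["%I:%M:%S%.f %P"],
        (1, false, false, _) => &["%H:%M"],
        // Anything else is left unparsed in this sketch.
        _ => &[],
    }
}
```

Because a 24-hour string never contains a space in this scheme, the 12-hour and 24-hour formats end up in disjoint candidate lists, so their relative ordering no longer matters.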

Thanks for the feedback. I'll update the documentation and take a shot at implementing that string pre-pass to prune the formats we attempt to parse.

Implemented the parse_formatted function, and also implemented a preprocess on the string to prune the formats to attempt parsing for

waitingkuo

Thank you @Jefffrey, here are my comments:

1. I wonder whether there's any public API that other crates could use (like string_to_timestamp_nanos for timestamps)?
2. Move the default display format to the first place (commented below).
3. Consider supporting the leap second 23:59:60 (commented below). This is what PostgreSQL does:

willy=# select time '23:59:60';
   time   
----------
 24:00:00
(1 row)

4. Consider the timezone. We could discuss whether to drop the timezone directly, or shift it to UTC and then take the time. PostgreSQL drops the timezone directly:

willy=# select time '00:00:00+08:00';
   time   
----------
 00:00:00
(1 row)

willy=# select timestamp '2000-01-01T00:00:00+08:00';
      timestamp      
---------------------
 2000-01-01 00:00:00
(1 row)

while DataFusion's timestamp shifts it to UTC first:

❯ select timestamp '2000-01-01T00:00:00+08:00';
+-----------------------------------+
| Utf8("2000-01-01T00:00:00+08:00") |
+-----------------------------------+
| 1999-12-31T16:00:00               |
+-----------------------------------+
1 row in set. Query took 0.003 seconds.

We could submit another issue/PR for 3 or 4 later if it's too large or we need time to discuss.

waitingkuo · 2022-11-15T03:11:07Z

arrow-cast/src/parse.rs

+            "%I:%M:%S%.9f %p",
+            "%l:%M:%S%.9f %P",
+            "%l:%M:%S%.9f %p",
+            "%H:%M:%S%.9f",


I suggest moving this one (%H:%M:%S%.9f) to the first place, as it is our default display format (23:59:59.123456789).

Sure thing, will do

Should no longer be relevant as I've implemented a preprocess pass which should ensure there isn't a priority clash between 24 and 12 hour time formats now

waitingkuo · 2022-11-15T03:18:56Z

arrow-cast/src/parse.rs

+        .map(|nt| {
+            nt.num_seconds_from_midnight() as i64 * 1_000_000
+                + (nt.nanosecond() as i64) / 1_000
+        })


we could consider whether to support leap second 23:59:60

%S already captures "Second number (00–60), zero-padded to 2 digits."

22:59:60 is parsed as 82800000000000 nanos which works.

23:59:60 is parsed as 86400000000000 nanos, which overflows when we construct the array with Time64NanosecondArray::from(vec![86400000000000]), so it'll return a null.
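The arithmetic behind those two numbers can be checked with a small std-only sketch. The helper names are hypothetical, and the half-open range `[0, NANOS_PER_DAY)` is an assumption about what counts as a valid time-of-day value, not something taken from the arrow-rs source.

```rust
const NANOS_PER_SECOND: i64 = 1_000_000_000;
const NANOS_PER_DAY: i64 = 86_400 * NANOS_PER_SECOND;

/// Nanoseconds since midnight for h:m:s, allowing second == 60
/// so that a leap second can be represented in the input.
fn nanos_since_midnight(hour: i64, minute: i64, second: i64) -> Option<i64> {
    let valid = (0..24).contains(&hour)
        && (0..60).contains(&minute)
        && (0..=60).contains(&second);
    if !valid {
        return None;
    }
    Some((hour * 3_600 + minute * 60 + second) * NANOS_PER_SECOND)
}

/// Assumed validity check: a time of day lies in [0, NANOS_PER_DAY).
fn fits_in_one_day(nanos: i64) -> bool {
    (0..NANOS_PER_DAY).contains(&nanos)
}
```

Under these assumptions 22:59:60 lands on 23:00:00 and stays in range, while 23:59:60 lands exactly on the day boundary and falls outside it.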

All four should now support having a leap second, per what chrono NaiveTime also supports

tustvold · 2022-11-15T07:07:04Z

consider the timezone

I originally was going to suggest this, but I decided against it as the semantics are actually a little bit funky. In particular, courtesy of the wonders of daylight savings, a non-FixedOffset timezone requires the date in order to be interpreted. I think it is acceptable to only handle timezones for timestamps, and not for times. FWIW this is the same approach taken by chrono - there is DateTime<Tz> and NaiveDateTime but only NaiveTime and no Time<Tz>

Jefffrey · 2022-11-15T09:39:46Z

A behaviour I've changed which is worth noting is that it is valid to pass in fractions of a second, even if the type you're parsing for doesn't support that precision; it'll simply be truncated from the final representation.

See:

assert_eq!(Time32SecondType::parse("02:10:01.1"), Some(7_801));

This technically was already happening for milli/micro/nano seconds anyway, but has been extended to seconds as well, to centralize all the behaviour. Let me know any thoughts on if instead it should be stricter and fail the parsing, rather than passing and truncating.
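To make the truncating behaviour concrete, here is a minimal std-only sketch of a seconds-precision parser that accepts a fractional part and discards it. `parse_seconds_truncating` is a hypothetical stand-in and only handles the `HH:MM:SS[.fraction]` shape; it is not how `Time32SecondType::parse` is implemented.

```rust
/// Hypothetical stand-in for a seconds-precision time parser:
/// accepts "HH:MM:SS" with an optional ".fraction" and drops the fraction.
fn parse_seconds_truncating(s: &str) -> Option<i32> {
    // Split off the optional fractional part.
    let (whole, fraction) = match s.split_once('.') {
        Some((w, f)) => (w, Some(f)),
        None => (s, None),
    };
    // If present, the fraction must be digits only; its value is discarded.
    if let Some(f) = fraction {
        if f.is_empty() || !f.bytes().all(|b| b.is_ascii_digit()) {
            return None;
        }
    }
    let mut parts = whole.split(':');
    let hour: i32 = parts.next()?.parse().ok()?;
    let minute: i32 = parts.next()?.parse().ok()?;
    let second: i32 = parts.next()?.parse().ok()?;
    // Reject trailing components and out-of-range fields.
    if parts.next().is_some() {
        return None;
    }
    if !(0..24).contains(&hour) || !(0..60).contains(&minute) || !(0..60).contains(&second) {
        return None;
    }
    Some(hour * 3_600 + minute * 60 + second)
}
```

A stricter variant would return `None` whenever `fraction` is `Some`, which is the alternative raised in the comment above.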

tustvold

Left some minor suggestions, but also happy for this to go in as is. Nice work 👍

tustvold · 2022-11-15T19:38:46Z

arrow-cast/src/parse.rs

+            .fold((0, false, false), |tup, char| match char {
+                ':' => (tup.0.saturating_add(1), tup.1, tup.2),


Suggested change

-            .fold((0, false, false), |tup, char| match char {
-                ':' => (tup.0.saturating_add(1), tup.1, tup.2),
+            .fold((0_usize, false, false), |tup, char| match char {
+                ':' => (tup.0 + 1, tup.1, tup.2),

Using a usize means this can't actually overflow

tustvold · 2022-11-15T19:41:37Z

arrow-cast/src/parse.rs

+    // colon count, presence of decimal, presence of whitespace
+    fn preprocess_time_string(string: &str) -> (u8, bool, bool) {
+        string
+            .chars()


Suggested change

-            .chars()
+            .as_bytes()
+            .iter()

And then match using b':'

We don't actually need to use chars here, as the nature of the UTF-8 encoding is such that ASCII can be compared without ambiguity - https://en.wikipedia.org/wiki/UTF-8#Encoding
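A short sketch of why the byte-wise comparison is safe: every byte of a multi-byte UTF-8 sequence has its high bit set, so it can never equal an ASCII byte such as `b':'`. The two helper functions are hypothetical and exist only to compare the approaches.

```rust
/// Count colons by decoding the string into chars.
fn count_colons_chars(s: &str) -> usize {
    s.chars().filter(|&c| c == ':').count()
}

/// Count colons by comparing raw bytes, with no UTF-8 decoding.
fn count_colons_bytes(s: &str) -> usize {
    s.as_bytes().iter().filter(|&&b| b == b':').count()
}
```

Both functions agree on any valid `&str`, including input that mixes ASCII with non-ASCII text such as `"12:34:56 午後"`.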

tustvold · 2022-11-15T19:42:21Z

arrow-cast/src/parse.rs

+            Some(86_400_000_000_000)
+        );
+
+        // custom format


tustvold · 2022-11-15T19:44:01Z

arrow-cast/src/parse.rs

+            })
+    }
+
+    fn naive_time_parser(string: &str, formats: &[&str]) -> Option<NaiveTime> {


It might be cleaner to make the match block evaluate to &[&str] and then have this follow.

e.g. something like

    let formats = match preprocess_time_string(s.trim()) { ... };
    formats
        .iter()
        .find_map(|f| NaiveTime::parse_from_str(string, f).ok())

ursabot · 2022-11-15T22:13:27Z

Benchmark runs are scheduled for baseline = c99d2f3 and contender = c95eb4c. c95eb4c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

* Parse Time32/Time64 from formatted string * PR comments * PR comments refactoring

github-actions bot added the arrow Changes to the arrow crate label Nov 13, 2022

alamb approved these changes Nov 14, 2022


tustvold approved these changes Nov 14, 2022


waitingkuo reviewed Nov 15, 2022


Jefffrey force-pushed the read_time_from_csv branch from 3ca41f5 to 1fba65a Compare November 15, 2022 09:35

tustvold approved these changes Nov 15, 2022


Jefffrey added 3 commits November 16, 2022 08:26

Parse Time32/Time64 from formatted string

8d2e7c6

PR comments

1720c45

PR comments refactoring

dea411d

Jefffrey force-pushed the read_time_from_csv branch from 1fba65a to dea411d Compare November 15, 2022 21:26

tustvold approved these changes Nov 15, 2022


tustvold merged commit c95eb4c into apache:master Nov 15, 2022

Jefffrey deleted the read_time_from_csv branch November 15, 2022 22:10

Jefffrey mentioned this pull request Nov 15, 2022

Cannot import Time64 from CSV apache/datafusion#3176

Closed

Jimexist pushed a commit that referenced this pull request Nov 16, 2022

Parse Time32/Time64 from formatted string (#3101)

3baf6eb

* Parse Time32/Time64 from formatted string * PR comments * PR comments refactoring

Jimexist pushed a commit that referenced this pull request Nov 16, 2022

Parse Time32/Time64 from formatted string (#3101)

b45790b

* Parse Time32/Time64 from formatted string * PR comments * PR comments refactoring

alamb mentioned this pull request Nov 25, 2022

Be able to parse time formatted strings #3100

Closed

Jefffrey mentioned this pull request Jan 27, 2024

Explicitly don't support leap seconds for Time32/Time64 arrays when parsing #5335

Open
