Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support SET timezone to non-UTC time zone #4106

Closed
waitingkuo opened this issue Nov 4, 2022 · 16 comments · Fixed by #4107
Closed

Support SET timezone to non-UTC time zone #4106

waitingkuo opened this issue Nov 4, 2022 · 16 comments · Fixed by #4107
Labels
enhancement New feature or request

Comments

@waitingkuo
Copy link
Contributor

waitingkuo commented Nov 4, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I'd like to add the support for SET TIMEZONE.
I'd like to have now() and TimestampTz use the timezone in config_options instead of the fixed UTC

Describe the solution you'd like

  1. enable SET TIME TO [SOME_TIMEZONE]
    It's currently disabled on purpose

https://github.com/apache/arrow-datafusion/blob/7e944ede86457fe0f43be44e0e5550229ecaf008/datafusion/sql/src/planner.rs#L2484-L2489

i'd like to remove it (probably replaced by some tz verification) and have this as the result

set time zone to '+08:00';
0 rows in set. Query took 0.000 seconds.
❯ show time zone;
+--------------------------------+---------+
| name                           | setting |
+--------------------------------+---------+
| datafusion.execution.time_zone | +08:00  |
+--------------------------------+---------+
1 row in set. Query took 0.007 seconds.
  1. now() returns the Timestamp<TimeUnit::Nanosecond, Some("SOME_TIMEZONE") according to the time zone we have. it's currently fixed as "UTC"
    https://github.com/apache/arrow-datafusion/blob/7e944ede86457fe0f43be44e0e5550229ecaf008/datafusion/physical-expr/src/datetime_expressions.rs#L176-L186

Some("UTC".to_owned()) should be modified. we need to pass config_options or time zone into here this function

e.g.
current version

set time zone to '+08:00';
0 rows in set. Query took 0.000 seconds.
❯ select arrow_typeof(now());
+------------------------------------+
| arrowtypeof(now())                 |
+------------------------------------+
| Timestamp(Nanosecond, Some("UTC")) |
+------------------------------------+
1 row in set. Query took 0.003 seconds.

becomes

set time zone to '+08:00';
0 rows in set. Query took 0.000 seconds.
❯ select arrow_typeof(now());
+---------------------------------------+
| arrowtypeof(now())                    |
+---------------------------------------+
| Timestamp(Nanosecond, Some("+08:00")) |
+---------------------------------------+
1 row in set. Query took 0.003 seconds.
  1. TimestampTz should use the timezone from config_options

https://github.com/apache/arrow-datafusion/blob/7e944ede86457fe0f43be44e0e5550229ecaf008/datafusion/sql/src/planner.rs#L2832-L2841

we currently fixed TimestampTz as "UTC" without considering config_options's timezone.
to enable it to consider config_options, we need to

3-1. make convert_simple_data_type as a method of SqlToRel

https://github.com/apache/arrow-datafusion/blob/7e944ede86457fe0f43be44e0e5550229ecaf008/datafusion/sql/src/planner.rs#L2811-L2817

3-2. Add get_config_option in ContextProvider so that we could get the time zone

https://github.com/apache/arrow-datafusion/blob/7e944ede86457fe0f43be44e0e5550229ecaf008/datafusion/sql/src/planner.rs#L2832-L2841

3-3 then we can use the timezone in config_options here to replace fixed UTC
https://github.com/apache/arrow-datafusion/blob/7e944ede86457fe0f43be44e0e5550229ecaf008/datafusion/sql/src/planner.rs#L2832-L2841

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@waitingkuo waitingkuo added the enhancement New feature or request label Nov 4, 2022
@waitingkuo
Copy link
Contributor Author

@alamb @liukun4515 @avantgardnerio @tustvold would you mind helping to check whether this make sense?

@waitingkuo
Copy link
Contributor Author

waitingkuo commented Nov 4, 2022

i added a POC to showcase some functionalities
#4107

perhaps we should break it to several pr later

@avantgardnerio
Copy link
Contributor

avantgardnerio commented Nov 4, 2022

@waitingkuo thank you for your clear communication as always! You have the most reviewable PRs and issues I've seen 😄 .

The issue looks good to me, but I would propose tackling in stages:

  1. Allow things like set time zone to '+08:00'; as described
  2. Allow set time zone to 'US/Mountain'; or set time zone to 'US/Denver'; or set time zone to 'MT';

This is due to the extra complication of what to do when someone does a select TO_TIMESTAMP('2022-11-06T01:30:00'); and the server timezone is 'US/Mountain' and there are two distinct 1:30AMs one in MDT and one in MST. Do we know what postgres does here? The only sensible thing I can think to do is fail the query with an error.

@alamb
Copy link
Contributor

alamb commented Nov 4, 2022

You have the most reviewable PRs and issues I've seen 😄 .

💯 agree -- thank you so much @waitingkuo

@waitingkuo
Copy link
Contributor Author

@waitingkuo thank you for your clear communication as always! You have the most reviewable PRs and issues I've seen 😄 .

❤️

The issue looks good to me, but I would propose tackling in stages:

  1. Allow things like set time zone to '+08:00'; as described
  2. Allow set time zone to 'US/Mountain'; or set time zone to 'US/Denver'; or set time zone to 'MT';

This is due to the extra complication of what to do when someone does a select TO_TIMESTAMP('2022-11-06T01:30:00'); and the server timezone is 'US/Mountain' and there are two distinct 1:30AMs one in MDT and one in MST. Do we know what postgres does here? The only sensible thing I can think to do is fail the query with an error.

PostgreSQL's to_timestamp ouputs timestamp with time zone
https://www.postgresql.org/docs/current/functions-formatting.html

to_timestamp ( text, text ) → timestamp with time zone
Converts string to time stamp according to the given format. 

to_timestamp('05 Dec 2000', 'DD Mon YYYY') → 2000-12-05 00:00:00-05
willy=# set timezone to 'America/Denver';
SET
willy=# select to_timestamp('05 Dec 2000', 'DD Mon YYYY');
      to_timestamp      
------------------------
 2000-12-05 00:00:00-07
(1 row)

it shows the fixed offset timezone. i think there's no ambiguity in this approach.

The TO_TIMESTAMP for datafusion for now outputs Timestamp<TimeUnit, None>.
https://github.com/apache/arrow-datafusion/blob/6d00bd990ce5644181ad1549a6c70c8406219070/datafusion/expr/src/function.rs#L206-L220
I think it will make more sense if these TO_TIMESTSMP_XX functions change the signatures to Timestamp with Timezone. This is in my backlog as well . I was working on #2979 to align TO_TIMESTAMP to Postgrseql's but found that we don't have timezone support. So I moved the focus to finish the time zone support (from sqlparser, arrow-rs, and finally datafusion 😃 )

@waitingkuo
Copy link
Contributor Author

i think we might need to refactor the return_type to support dynamical time zone determine
https://github.com/apache/arrow-datafusion/blob/6d00bd990ce5644181ad1549a6c70c8406219070/datafusion/expr/src/function.rs#L221-L223

now()'s return type is TImestamp<TimeUnt, Some("+08:00")> for now. need to get the timezone from config_options to and return TImestamp<TimeUnt, Some("SOME_TIMEZONE")> instead. I'll prefer move the now() support to another issue/pr.

i.e. 2. now() returns the Timestamp<TimeUnit::Nanosecond, Some("SOME_TIMEZONE") according to the time zone we have. it's currently fixed as "UTC"
will be implemented in this pr. we can do this instead for now

set time zone to '+08:00';
0 rows in set. Query took 0.000 seconds.

❯ select now();
+----------------------------------+
| now()                            |
+----------------------------------+
| 2022-11-05T11:11:51.713660+00:00 |
+----------------------------------+
1 row in set. Query took 0.003 seconds.

❯ select now()::timestamptz;
+----------------------------------+
| now()                            |
+----------------------------------+
| 2022-11-05T19:11:14.245835+08:00 |
+----------------------------------+
1 row in set. Query took 0.003 seconds.

@avantgardnerio
Copy link
Contributor

no ambiguity in this approach

I disagree. The problem is that America/Denver is on a statuary (legally constructed) timezone called Mountain Time. Because Mountain Time (presently) observes Daylight Savings Time, it is two "real" timezones: MDT or UTC-6 and MST UTC-7.

image

Unfortunately, this Sunday (2022-11-06), MT will be switching from UTC-6 to UTC-7 at 02:00 MDT, which will cause the clocks to go to 01:00 MST, and about half an hour later it will be 1:30AM MT for the second time this year.

So while I agree there is no ambiguity for to_timestamp('05 Dec 2000', 'DD Mon YYYY'), I think we need a test case specifically for select TO_TIMESTAMP('2022-11-06T01:30:00');, because given only that the only information known is:

  1. 2022-11-06T01:30:00
  2. and America/Denver

There is no way to infer if this means there is no way to know if that corresponds with 1667745000 (UTC-6) or 1667748600 (UTC-7).

image

https://www.unixtimestamp.com/index.php

@waitingkuo
Copy link
Contributor Author

waitingkuo commented Nov 6, 2022

@avantgardnerio thank you. I didn't aware this before. never live in the area that has the timezone switch.

I did some research in Postgrseql and Chrono-tz

MST timezone offset: -7
MDT timezone offset: -6

In 2022, MDT
began from Sunday, 13 March, 02:00 (changed from -7 to -6)
ended at Sunday, 6 November, 02:00 (changed from -6 to -7)

willy=# set timezone to 'America/Denver';
SET

this is valid (right before the timezone shift):

willy=# select to_timestamp('2022-03-13 01:00', 'YYYY-MM-DD HH24:MI');
      to_timestamp      
------------------------
 2022-03-13 01:00:00-07
(1 row)

willy=# select to_timestamp('2022-03-13 01:59', 'YYYY-MM-DD HH24:MI');
      to_timestamp      
------------------------
 2022-03-13 01:59:00-07
(1 row)

begin from 2:00, it switches to MDT (-6)
i think the logic behind for this hour is:
parse as it's MST(-7), and then switch to MDT (-6), that's why we have an one hour shift here

willy=# select to_timestamp('2022-03-13 02:00', 'YYYY-MM-DD HH24:MI');
      to_timestamp      
------------------------
 2022-03-13 03:00:00-06
(1 row)

willy=# select to_timestamp('2022-03-13 02:30', 'YYYY-MM-DD HH24:MI');
      to_timestamp      
------------------------
 2022-03-13 03:30:00-06
(1 row)

willy=# select to_timestamp('2022-03-13 02:59', 'YYYY-MM-DD HH24:MI');
      to_timestamp      
------------------------
 2022-03-13 03:59:00-06
(1 row)

and then the next hour it's parsed as MDT

willy=# select to_timestamp('2022-03-13 03:00', 'YYYY-MM-DD HH24:MI');
      to_timestamp      
------------------------
 2022-03-13 03:00:00-06
(1 row)

In November 6th 2am MDT, time zone is switched back to MST (1am MST)

let's begin with something unambiguous

willy=# select to_timestamp('2022-11-06 00:59', 'YYYY-MM-DD HH24:MI');
      to_timestamp      
------------------------
 2022-11-06 00:59:00-06
(1 row)

for the next 1 hour, it's ambiguous since both MST and MDT has 1 am. and this is what postgresql does, it uses MST even though 1am MDT is valid as well

willy=# select to_timestamp('2022-11-06 01:00', 'YYYY-MM-DD HH24:MI');
      to_timestamp      
------------------------
 2022-11-06 01:00:00-07
(1 row)
willy=# select to_timestamp('2022-11-06 01:59', 'YYYY-MM-DD HH24:MI');
      to_timestamp      
------------------------
 2022-11-06 01:59:00-07
(1 row)

after that, there's no ambiguity

willy=# select to_timestamp('2022-11-06 02:00', 'YYYY-MM-DD HH24:MI');
      to_timestamp      
------------------------
 2022-11-06 02:00:00-07
(1 row)

conclusion for Postgrseql: when it's ambiguous or invalid, it's parsed as MST, and then switches to MDT if needed

now let's see the behavior for chrono-tz

     let tz: Tz = "America/Denver".parse().unwrap();
    let dt = tz.datetime_from_str("2022-03-13T01:59", "%Y-%m-%dT%H:%M");
    println!("2022-03-13T01:59 -> {:?} / {}", dt, dt.unwrap().to_rfc3339());
    let dt = tz.datetime_from_str("2022-03-13T02:00", "%Y-%m-%dT%H:%M");
    println!("2022-03-13T02:00 -> {:?}", dt);
    let dt = tz.datetime_from_str("2022-03-13T02:59", "%Y-%m-%dT%H:%M");
    println!("2022-03-13T02:59 -> {:?}", dt);
    let dt = tz.datetime_from_str("2022-03-13T03:00", "%Y-%m-%dT%H:%M");
    println!("2022-03-13T03:00 -> {:?} / {}", dt, dt.unwrap().to_rfc3339());
2022-03-13T01:59 -> Ok(2022-03-13T01:59:00MST) / 2022-03-13T01:59:00-07:00
2022-03-13T02:00 -> Err(ParseError(Impossible))
2022-03-13T02:59 -> Err(ParseError(Impossible))
2022-03-13T03:00 -> Ok(2022-03-13T03:00:00MDT) / 2022-03-13T03:00:00-06:00
    let dt = tz.datetime_from_str("2022-11-06T00:59", "%Y-%m-%dT%H:%M");
    println!("2022-11-06T00:59 -> {:?} / {}", dt, dt.unwrap().to_rfc3339());
    let dt = tz.datetime_from_str("2022-11-06T01:00", "%Y-%m-%dT%H:%M");
    println!("2022-11-06T01:00 -> {:?}", dt);
    let dt = tz.datetime_from_str("2022-11-06T01:59", "%Y-%m-%dT%H:%M");
    println!("2022-11-06T01:59 -> {:?}", dt);
    let dt = tz.datetime_from_str("2022-11-06T02:00", "%Y-%m-%dT%H:%M");
    println!("2022-11-06T02:00 -> {:?} / {}", dt, dt.unwrap().to_rfc3339());
2022-11-06T00:59 -> Ok(2022-11-06T00:59:00MDT) / 2022-11-06T00:59:00-06:00
2022-11-06T01:00 -> Err(ParseError(NotEnough))
2022-11-06T01:59 -> Err(ParseError(NotEnough))
2022-11-06T02:00 -> Ok(2022-11-06T02:00:00MST) / 2022-11-06T02:00:00-07:00

chrono-tz raises errors for these cases for the ambiguous or invalid cases

@waitingkuo
Copy link
Contributor Author

waitingkuo commented Nov 6, 2022

@waitingkuo thank you for your clear communication as always! You have the most reviewable PRs and issues I've seen 😄 .

The issue looks good to me, but I would propose tackling in stages:

  1. Allow things like set time zone to '+08:00'; as described
  2. Allow set time zone to 'US/Mountain'; or set time zone to 'US/Denver'; or set time zone to 'MT';

This is due to the extra complication of what to do when someone does a select TO_TIMESTAMP('2022-11-06T01:30:00'); and the server timezone is 'US/Mountain' and there are two distinct 1:30AMs one in MDT and one in MST. Do we know what postgres does here? The only sensible thing I can think to do is fail the query with an error.

i'll limit this pr to support fixed offset timezone only. thank you @avantgardnerio

@avantgardnerio
Copy link
Contributor

never live in the area that has the timezone switch

@waitingkuo I think this is good advice :) Hopefully the US ends this nonsense next year.

I find it troubling that postgres happily parses ambiguous timestamps and chooses a default arbitrarily, but it seems like the emerging trend for DataFusion is "just do what postgres does", so perhaps we just copy their flawed implementation.

What's even more troubling than it defaulting an ambiguous timestamp is that it happily parses a non-existent 2:30AM in March!

I guess the only real choice we have is:

  1. do we follow postgres' arguably flawed behavior?
  2. or do we throw errors for ambiguous/invalid timestamps?

Finally, I even wonder how chrono deals with all this? DST didn't always start or end in March/November, so a database is required to keep track of when various legal jurisdictions implement/modify/retract DST. In unix-based OSes I think this is a tzinfo file, but I know of no equivalent on Windows.

If we do embrace US/MT, I think we should account for the historical shifts i.e. we need tests for April < 1986 rather than March.

@waitingkuo
Copy link
Contributor Author

@avantgardnerio
TLDR:
chrono-tz use


IANA timezone database maintains the offset & leapseconds
https://www.iana.org/time-zones

this repo has the database and parser (written in c)
https://github.com/eggert/tz
here's the sample for North America
https://github.com/eggert/tz/blob/main/northamerica

chrono-tz use
https://github.com/eggert/tz as the timezone database
and https://github.com/chronotope/parse-zoneinfo as parser code

@waitingkuo
Copy link
Contributor Author

waitingkuo commented Nov 8, 2022

@avantgardnerio

perhaps we could force user to add the timezone offset while casting string to timestamptz as the initial pr for extending named timezone

e.g.
correct
2000-03-13T02:00:00+00:00
2000-03-13T02:00:00-06:00
2000-03-13T02:00:00-07:00
2000-03-13T02:00:00Z (this is a special case)

note that if we set time zone as 'America/Denver'
2000-03-13T01:00:00-07:00::Timestamptz becomes 2000-03-13T01:00:00-07:00
2000-03-13T02:00:00-07:00::Timestamptz becomes 2000-03-13T03:00:00-06:00
2000-03-13T02:00:00-06:00::Timestamptz becomes 2000-03-13T02:00:00-06:00

incorrect
2000-03-13T02:30:00 -> this is the one might be illegal in some timezone
2000-03-01T02:30:00 -> this should work for all the timezone but still disallow it (until we decide the behavior)

@avantgardnerio
Copy link
Contributor

chrono-tz use

Very informative, ty!

@avantgardnerio
Copy link
Contributor

perhaps we could force user to add the timezone offset

That would definitely eliminate the ambiguity. I'd love to see if others have opinions about this.

@waitingkuo
Copy link
Contributor Author

waitingkuo commented Nov 8, 2022

https://arrow.apache.org/docs/cpp/api/compute.html#structarrow_1_1compute_1_1_assume_timezone_options

@avantgardnerio adding these options looks great as well.
note that what chronos does is AMBIGUOUS_RAISE and posgresql is AMBIGUOUS_EARLIEST
but then we need to implement more 😮

@alamb
Copy link
Contributor

alamb commented Nov 9, 2022

or do we throw errors for ambiguous/invalid timestamps?

This would be my preference I think -- no one likes how postgres handles things.

What I would really prefer is that DataFusion always deals with UTC internally (like all internal math, expressions like now(), etc always use a UTC timestamp). I think this is what happens now

If the input data doesn't specify a timezone, we should treat it like UTC (I know that is not what the arrow spec seems to say)

If the user wants to see times / dates in their current locale's timezone, that can be handled by the end client (e.g. datafusion-cli) rather than forcing the entire processing chain to handle arbitrary timezones just for output display.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants