Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support parsing timezone-aware datetimes in constructors when data type is also timezone-aware #16297

Open
stinodego opened this issue May 17, 2024 · 2 comments · May be fixed by #16828
Open
Labels
A-api Area: changes to the public API A-timeseries Area: date/time functionality enhancement New feature or an improvement of an existing feature needs decision Awaiting decision by a maintainer
Milestone

Comments

@stinodego
Copy link
Member

Description

I ran into this today, and I think we can improve behavior here.

Consider this code:

from datetime import datetime
from zoneinfo import ZoneInfo
import polars as pl

data = [datetime(1970, 1, 1, tzinfo=ZoneInfo("Iran"))]
s = pl.Series(data, dtype=pl.Datetime(time_zone="Iran"), strict=False)
ValueError: time-zone-aware datetimes are converted to UTC

Please either drop the time zone from the dtype, or set it to 'UTC'. To convert to a different time zone, please use `.dt.convert_time_zone`.

This is odd. If we're casting timezone-aware data anyway, might as well cast it to desired time zone, right?

One of the benefits of doing this is that timezone-aware data can then be roundtripped, like in one of our tests:

@given(s=series())
def test_to_list(s: pl.Series) -> None:
    values = s.to_list()
    result = pl.Series(values, dtype=s.dtype)
    assert_series_equal(s, result, categorical_as_str=True)

For reference, PyArrow seems to handle this a bit differently from us and they do support this:

import pyarrow as pa

data = [datetime(1970, 1, 1, tzinfo=ZoneInfo("Iran"))]
arr = pa.array(data, type=pa.timestamp("us", tz="Iran"))

print(arr)  # 1969-12-31 20:30:00.000000Z
print(arr.type)  # timestamp[us, tz=Iran]

I may be missing something here, but I thought I'd throw this out there. Let's see what @MarcoGorelli thinks 😄

@stinodego stinodego added enhancement New feature or an improvement of an existing feature A-timeseries Area: date/time functionality labels May 17, 2024
@MarcoGorelli
Copy link
Collaborator

MarcoGorelli commented May 17, 2024

thanks for the ping

so, for .str.to_datetime, the current rules are:

  • naive string, time zone => parse into the given time zone
    In [17]: pl.Series(['2020-01-01T01:23:45']).str.to_datetime(time_zone='Asia/Kathmandu')
    Out[17]:
    shape: (1,)
    Series: '' [datetime[μs, Asia/Kathmandu]]
    [
            2020-01-01 01:23:45 +0545
    ]
  • offset-aware string, no time zone set: convert to UTC:
    In [18]: pl.Series(['2020-01-01T01:23:45+05:45']).str.to_datetime()
    Out[18]:
    shape: (1,)
    Series: '' [datetime[μs, UTC]]
    [
            2019-12-31 19:38:45 UTC
    ]
  • naive, no time zone: parse as naive
  • offset-aware string, time zone set: raise
    ComputeError: offset-aware datetimes are converted to UTC. Please either drop the time zone from the function call, or set it to UTC. To convert to a different time zone, please use `convert_time_zone`.

The constructor should probably not be too different. Let's see:

  • naive datetime, time zone => parse into given time zone. ✅ same
    In [20]: pl.Series([datetime(2020, 1, 1)], dtype=pl.Datetime('us', 'Asia/Kathmandu'))
    Out[20]:
    shape: (1,)
    Series: '' [datetime[μs, Asia/Kathmandu]]
    [
            2020-01-01 00:00:00 +0545
    ]
  • time zone aware datetime, no time zone given => convert to UTC. ✅ same
    In [21]: pl.Series([datetime(2020, 1, 1, tzinfo=ZoneInfo('Asia/Kathmandu'))], dtype=pl.Datetime)
    <ipython-input-21-3f19ea488109>:1: TimeZoneAwareConstructorWarning: Constructing a Series with time-zone-aware datetimes results in a Series with UTC time zone. To silence this warning, you can filter warnings of class TimeZoneAwareConstructorWarning, or set 'UTC' as the time zone of your datatype.
     pl.Series([datetime(2020, 1, 1, tzinfo=ZoneInfo('Asia/Kathmandu'))], dtype=pl.Datetime)
    Out[21]:
    shape: (1,)
    Series: '' [datetime[μs, UTC]]
    [
            2019-12-31 18:15:00 UTC
    ]
  • naive, no time zone: parse as naive. ✅ same
  • time-zone-aware, time zone set: also raises (as you found), so it's also ✅ same

For the last one, I think you're suggesting to convert to the given time zone. So long as it's clearly documented, and it's done for both the Series constructor and .str.to_datetime, I think it makes sense

@MarcoGorelli
Copy link
Collaborator

MarcoGorelli commented Jun 3, 2024

I've taken another look at PyArrow, and there is something else probably worth mirroring

For .str.to_datetime, if the strings are offset-aware, then they convert to UTC. Just like Polars does 👍:

In [8]: pc.strptime(pa.array(['2020-01-01T01:02:03+01:00']), unit='us', format='%Y-%m-%dT%H:%M:%S%z').type
Out[8]: TimestampType(timestamp[us, tz=UTC])

In [11]: pl.Series(['2020-01-01T01:02:03+01:00']).str.to_datetime(time_unit='us').dtype
Out[11]: Datetime(time_unit='us', time_zone='UTC')

But, for in the constuctor, when starting from a tz-aware stdlib datetime object, they take the time zone of the first such object:

In [14]: pa.array([datetime(2020, 1, 1, tzinfo=timezone(timedelta(hours=1))), datetime(2020, 1, 2)]).type
Out[14]: TimestampType(timestamp[us, tz=+01:00])

In [15]: pl.Series([datetime(2020, 1, 1, tzinfo=timezone(timedelta(hours=1))), datetime(2020, 1, 2)]).dtype
<ipython-input-15-9332dc369f5e>:1: TimeZoneAwareConstructorWarning: Constructing a Series with time-zone-aware datetimes results in a Series with UTC time zone. To silence this warning, you can filter warnings of class TimeZoneAwareConstructorWarning, or set 'UTC' as the time zone of your datatype.
  pl.Series([datetime(2020, 1, 1, tzinfo=timezone(timedelta(hours=1))), datetime(2020, 1, 2)]).dtype
Out[15]: Datetime(time_unit='us', time_zone='UTC')

whereas Polars still converts to UTC

One suggestion could be:

  • if constructing a Series from datetime objects, then take the timezone of the first one
  • if the first one has a timezone which Polars doesn't support (e.g. timezone(timedelta(minutes=47))), then raise

There's a further difference though. If the user specifies the time zone as part of the dtype, then Polars sets that as the dtype, whereas PyArrow converts as if starting from UTC:

In [25]: pa.array([datetime(2020, 1, 1), datetime(2020, 1, 2)], type=pa.timestamp('us', 'Iran'))
Out[25]:
<pyarrow.lib.TimestampArray object at 0x7f2d3b4e5c60>
[
  2020-01-01 00:00:00.000000Z,
  2020-01-02 00:00:00.000000Z
]

In [26]: pl.Series([datetime(2020, 1, 1), datetime(2020, 1, 2)], dtype=pl.Datetime('us', 'Iran'))
Out[26]:
shape: (2,)
Series: '' [datetime[μs, Iran]]
[
        2020-01-01 00:00:00 +0330
        2020-01-02 00:00:00 +0330
]

It looks like their rule is:

  • consider all inputs as UTC (even if they're tz-naive)
  • if the user specified a time zone in pa.timestamp, then convert (don't replace) to that
  • if the user didn't specify type, then convert to the time zone of the first non-null element

Where does this leave Polars? Not totally sure, just wanted to leave these findings here for now


Something which currently doesn't look great (and is unintuitive?) is this:

In [22]: pl.Series([datetime(2020, 1, 1), datetime(2020, 1, 1, tzinfo=ZoneInfo('Asia/Kathmandu'))], dtype=pl.Datetime('us', 'Europe/Amsterdam'))
Out[22]:
shape: (2,)
Series: '' [datetime[μs, Europe/Amsterdam]]
[
        2020-01-01 00:00:00 CET
        2019-12-31 18:15:00 CET
]

The second element gets converted to 'UTC', but then its time zone is replaced with 'Europe/Amsterdam'


OK, got a concrete proposal in #16828. It addresses several inconsistencies, but in doing so is unfortunately breaking for some people. In those cases, however, a clear warning is issued, advising the user about what to do instead

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-api Area: changes to the public API A-timeseries Area: date/time functionality enhancement New feature or an improvement of an existing feature needs decision Awaiting decision by a maintainer
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants