New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Only use UTC datetimes #115
Conversation
a9de99f
to
b8babdf
Compare
This comment was marked as outdated.
This comment was marked as outdated.
0bc9019
to
4a8cac8
Compare
src/pyo3_extensions/chrono_tz.rs
Outdated
// Python assumes that naive datetimes are in the local timezone, so this | ||
// is not like assuming it was in UTC. | ||
// https://docs.python.org/3/library/datetime.html#datetime.datetime.astimezone | ||
// TODO: Should we instead assume utc, and not convert? Here we are converting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this approach is fine. Do our warning messages show up even when we haven't configured python logging levels?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my vote here would be to make this panic to not accidentally assume incorrect behavior: The chrono types are explicitly separated into naive vs zoned, and if we're writing our Bytewax time code assuming zoned, we should punt to the producer of the timestamps what zone should be applied. I don't think it's too much to ask people to write datetime.now(timezone.utc)
instead of just datetime.now()
.
src/lib.rs
Outdated
@@ -37,7 +37,7 @@ fn sleep_release_gil(py: Python, secs: u64) { | |||
#[pymodule] | |||
#[pyo3(name = "bytewax")] | |||
fn mod_bytewax(py: Python, m: &PyModule) -> PyResult<()> { | |||
pyo3_log::init(); | |||
// pyo3_log::init(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: Don't forget to uncomment this when it's available again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I might be coming into this halfway and perhaps you already hashed these out, but I have some questions. Thanks!
src/pyo3_extensions/chrono_tz.rs
Outdated
// Python assumes that naive datetimes are in the local timezone, so this | ||
// is not like assuming it was in UTC. | ||
// https://docs.python.org/3/library/datetime.html#datetime.datetime.astimezone | ||
// TODO: Should we instead assume utc, and not convert? Here we are converting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my vote here would be to make this panic to not accidentally assume incorrect behavior: The chrono types are explicitly separated into naive vs zoned, and if we're writing our Bytewax time code assuming zoned, we should punt to the producer of the timestamps what zone should be applied. I don't think it's too much to ask people to write datetime.now(timezone.utc)
instead of just datetime.now()
.
src/pyo3_extensions/chrono_tz.rs
Outdated
} | ||
|
||
// Implement serde (de)serialization manually here, since there is no standard | ||
// way to serialize datetimes with IANA timezone info (yet?). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious. Why is this a special serialization needed in our case? It feels a little strange that we've encountered a use case that has such unique persistence needs compared to the built in serde in the Chrono crate.
My understanding of the argument is "normal serialization will just store the UTC offset of the time, so we lose the full TZ of the original data". But since we're only storing points in time, we don't lose the ability to compare them to any other points by just storing the correct UTC offset at that one point, so I'm not sure this is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your understanding is correct, the point of using tz rather than a fixed offset is to keep the same information if the user provided a timezone info rather than an offset, see answers below for more context.
pytests/test_datetime.py
Outdated
run_main(flow_2) | ||
|
||
assert len(out_1) == len(out_2) | ||
assert out_1 == out_2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason this test is not comparing to some fixed expected data? What is it trying to assert here? That if the TZ of the start at and the items themselves are compared across TZ correctly? This feels like it's only testing the same thing as our tumbling window operator test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a regression test (maybe should add a comment about it). It is testing a behaviour that is supposed to be always correct, but if you run this test using the pyo3-chrono
integration we used before, it fails because even if start_at
and start_at_1
represents the same date in python, the incorrect conversion in pyo3-chrono
made the two flow run with a different start_at
parameter, and gave two different results. This is basically what caused this PR to be opened
pytests/test_datetime.py
Outdated
start_at = datetime.now(ZoneInfo("America/Los_Angeles")) | ||
except ImportError: | ||
py39 = False | ||
start_at = datetime.now(tz=timezone(timedelta(hours=-8))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick, but I think these except ImportError
blocks might read a little nicer if you isolate the TZ determination from the rest of the datetime creation:
try:
from zoneinfo import ZoneInfo
tz = ZoneInfo("America/Los_Angeles")
except ImportError:
tz = timezone(timedelta(hours=-8))
xxx = datetime.yyy(tz=tz)
pytests/test_datetime.py
Outdated
# But if we are working with fixed offsets in python<3.9, | ||
# we convert everything to UTC, so the tzinfo won't be the same. | ||
if py39: | ||
assert start_at.tzinfo == wc.start_at.tzinfo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought < Py3.9 just means we don't have pre-packaged ZoneInfo
types, not that we can't actually assert on the TZ itself. Does timezone(timedelta(hours=-8)) != timezone(timedelta(hours=-8))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would fail due to the choice of always having timezoneaware datetimes internally, and the way I do this conversion (see comments below).
Since I made the ChronoDateTime be chrono::DateTime<chrono_tz::Tz>
, I can't use a chrono::FixedOffset
. So when I receive a datetime with a fixed offset from python, i convert it to utc first. This means the tzinfo
of the object that comes back from rust actually changes, even though it still represents the same date.
pytests/test_datetime.py
Outdated
|
||
run_main(flow_1) | ||
run_main(flow_2) | ||
assert len(out_1) == len(out_2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar questions as above test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar concept, maybe this test is reduntant, but the idea is that settings start_at
to now
in python, and not doing it (so setting it in rust) should give the same result, but it didn't with pyo3-chrono
if the now
in python had different timezone info
src/pyo3_extensions/chrono_tz.rs
Outdated
} | ||
|
||
impl ToPyObject for ChronoDateTime { | ||
fn to_object(&self, py: Python<'_>) -> PyObject { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to piggyback on the PyO3 built-in TZ-ignoring conversions and then fill in the TZ info here, rather than re-creating the rest of the conversion? Or does that make things more confusing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look into it, it makes sense to reuse it if we can
src/pyo3_extensions/chrono_tz.rs
Outdated
.getattr("timezone")? | ||
.getattr("utc")?; | ||
dt = dt.getattr("astimezone")?.call1((utc,))?.cast_as()?; | ||
chrono_tz::Tz::UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's your thinking behind doing the conversion to UTC and not trying to create a chrono::offset::FixedOffset
in this case? It feels like it would be nice to round-trip a TZ datetime without this conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This answers your previous comment on why I only check start_at.tzinfo == wc.start_at.tzinfo
if we are on python3.9.
I made datetimes in bytewax be chrono::DateTime<chrono_tz::Tz>
rather than chrono::DateTime<chrono::Tz>
, so I can't really use a FixedOffset
as the parameter for chrono::DateTime
, because chrono_tz::Tz
is an enum implementing the Tz
trait, not the trait itself.
I would have liked to keep the chrono::Tz
trait as the generic param and still use chrono_tz
's stuff, but couldn't find a way.
I agree that would be better as it would allow me to use both FixedOffset and timezone infos, I'm going to have a second look at this and let you know.
So, I'm having a second look at this PR, and after some thinking I agree with @davidselassie here that we should force users to send us UTC datetimes, at least for the Supporting any possible kind of datetime here makes everything more complicated than it needs to be. Only accepting UTC datetimes will make this PR much simpler, I can remove the whole I'm going to update this PR and make it ready for review so that we can move on |
0cf8971
to
bb629e2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't suggesting this with my comment, but this seems like a great way forward! UTC4ever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we ended up with the better approach.
Feel free to merge when logging is available again.
bb629e2
to
3434e3e
Compare
So, to have everything working we just need PyO3 |
de7ffb3
to
684ee4a
Compare
This PR forces datetimes passed to bytewax as config parameters to be in UTC, using the new chrono integration added in PyO3 (not merged yet).
We'll need to wait for the PR on PyO3 to be merged so that we can open a PR on pyo3-log to make everything work.
The PyO3 author said he wants to add the chrono integration in PyO3 0.17.2 which is coming soon, so this should not take too long.
Previous description (ignore this)
This PR makes all datetimes in Bytewax timezone aware.
It uses a non-released version of PyO3 that adds the
chrono
integration, and removes the dependency frompyo3-chrono
, which was doing the python->rust conversion deleting timezone info, without properly converting to UTC first.The approach taken in this PR is to add a
chrono_tz
integration on top of the one implemented in PyO3, so that any datetime in the codebase has timezone info attached, and can be correctly compared to any other datetime.The only edge case we have to handle, is when we receive naive datetimes from the user in python.
Right now the approach is to convert the datetime to utc using the python interpreter.
This means that any naive datetime coming from python will be assumed to be local (relative to where the dataflow is ran) and converted to UTC before being passed down to rust. This could be wrong in case the datetime comes from data generated somewhere with a different timezone, but we don't have enough informations to make a proper conversion anyway. Another possible approach would be to assume the datetime is UTC rather than local, and we would attach the UTC timezone info without converting first, but this would be equally wrong most of the times. The application should warn anytime we receive a naive datetime, so that users are aware of the fact and can fix the data if needed.
For (de)serialization of datetimes, I had to implement serde's traits manually, since there is no standard way to serialize datetimes with IANA timezone info. To do this I just serialize the local naivedatetime and the timezone name separately, so that we can rebuild the same
ChronoDateTime
when deserializing.The PR is still in Draft because I'm waiting for PyO3 to release the new version, and depending on the versioning chosen (0.17.2 or 0.18) we might also need to wait for an update on
pyo3-log
which depends on PyO3 0.17.To make things easier, for the next release I propose to implement Consensus Time:
https://xkcd.com/2594/
Original description (ignore this too)
The test at the end of the file should explain the situation I'm trying to fix here:
pyo3-chrono
ignores timezone info even if they are present.I think we should instead take that into account.
With this conversion traits, we always convert datetimes to UTC internally.
If the user sends us timezone aware datetimes, we do the conversion.
If the user sends us naive datetimes, we assume they are already UTC (debatable).
This is not the only possible approach, and might not be the best one, so I'm opening this PR to discuss what you think about this.
My point is that I don't think we should ignore tzinfo if present, but the flow should still work if we receive naive datetimes (and maybe warn the user about the assumption).
This PR is rough, and if we want to replace
pyo3-chrono
I'll need to add the conversion traits forTime
andDate
too, and some more tests.edit: I just found this PR: chronotope/chrono#542 which does the conversion properly. Reading the discussion, it looks like they plan to integrate this into PyO3 (rather than in chrono) under a feature flag, so maybe we'll have this for free in the future