Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only use UTC datetimes #115

Merged
merged 6 commits into from Sep 27, 2022
Merged

Only use UTC datetimes #115

merged 6 commits into from Sep 27, 2022

Conversation

Psykopear
Copy link
Contributor

@Psykopear Psykopear commented Aug 30, 2022

This PR forces datetimes passed to bytewax as config parameters to be in UTC, using the new chrono integration added in PyO3 (not merged yet).

We'll need to wait for the PR on PyO3 to be merged so that we can open a PR on pyo3-log to make everything work.
The PyO3 author said he wants to add the chrono integration in PyO3 0.17.2 which is coming soon, so this should not take too long.

Previous description (ignore this)

This PR makes all datetimes in Bytewax timezone aware.

It uses a non-released version of PyO3 that adds the chrono integration, and removes the dependency from pyo3-chrono, which was doing the python->rust conversion deleting timezone info, without properly converting to UTC first.

The approach taken in this PR is to add a chrono_tz integration on top of the one implemented in PyO3, so that any datetime in the codebase has timezone info attached, and can be correctly compared to any other datetime.

The only edge case we have to handle, is when we receive naive datetimes from the user in python.
Right now the approach is to convert the datetime to utc using the python interpreter.
This means that any naive datetime coming from python will be assumed to be local (relative to where the dataflow is ran) and converted to UTC before being passed down to rust. This could be wrong in case the datetime comes from data generated somewhere with a different timezone, but we don't have enough informations to make a proper conversion anyway. Another possible approach would be to assume the datetime is UTC rather than local, and we would attach the UTC timezone info without converting first, but this would be equally wrong most of the times. The application should warn anytime we receive a naive datetime, so that users are aware of the fact and can fix the data if needed.

For (de)serialization of datetimes, I had to implement serde's traits manually, since there is no standard way to serialize datetimes with IANA timezone info. To do this I just serialize the local naivedatetime and the timezone name separately, so that we can rebuild the same ChronoDateTime when deserializing.

The PR is still in Draft because I'm waiting for PyO3 to release the new version, and depending on the versioning chosen (0.17.2 or 0.18) we might also need to wait for an update on pyo3-log which depends on PyO3 0.17.

To make things easier, for the next release I propose to implement Consensus Time:
https://xkcd.com/2594/

Original description (ignore this too)

The test at the end of the file should explain the situation I'm trying to fix here: pyo3-chrono ignores timezone info even if they are present.
I think we should instead take that into account.
With this conversion traits, we always convert datetimes to UTC internally.
If the user sends us timezone aware datetimes, we do the conversion.
If the user sends us naive datetimes, we assume they are already UTC (debatable).

This is not the only possible approach, and might not be the best one, so I'm opening this PR to discuss what you think about this.

My point is that I don't think we should ignore tzinfo if present, but the flow should still work if we receive naive datetimes (and maybe warn the user about the assumption).

This PR is rough, and if we want to replace pyo3-chrono I'll need to add the conversion traits for Time and Date too, and some more tests.

edit: I just found this PR: chronotope/chrono#542 which does the conversion properly. Reading the discussion, it looks like they plan to integrate this into PyO3 (rather than in chrono) under a feature flag, so maybe we'll have this for free in the future

@Psykopear

This comment was marked as outdated.

Cargo.toml Outdated Show resolved Hide resolved
@Psykopear Psykopear force-pushed the explore-datetimes-issues branch 2 times, most recently from 0bc9019 to 4a8cac8 Compare September 8, 2022 11:24
@Psykopear Psykopear changed the title Explore datetimes issues Make all datetimes in bytewax timezone aware Sep 8, 2022
// Python assumes that naive datetimes are in the local timezone, so this
// is not like assuming it was in UTC.
// https://docs.python.org/3/library/datetime.html#datetime.datetime.astimezone
// TODO: Should we instead assume utc, and not convert? Here we are converting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this approach is fine. Do our warning messages show up even when we haven't configured python logging levels?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my vote here would be to make this panic to not accidentally assume incorrect behavior: The chrono types are explicitly separated into naive vs zoned, and if we're writing our Bytewax time code assuming zoned, we should punt to the producer of the timestamps what zone should be applied. I don't think it's too much to ask people to write datetime.now(timezone.utc) instead of just datetime.now().

src/lib.rs Outdated
@@ -37,7 +37,7 @@ fn sleep_release_gil(py: Python, secs: u64) {
#[pymodule]
#[pyo3(name = "bytewax")]
fn mod_bytewax(py: Python, m: &PyModule) -> PyResult<()> {
pyo3_log::init();
// pyo3_log::init();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Don't forget to uncomment this when it's available again.

@Psykopear Psykopear mentioned this pull request Sep 13, 2022
4 tasks
Copy link
Contributor

@davidselassie davidselassie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I might be coming into this halfway and perhaps you already hashed these out, but I have some questions. Thanks!

// Python assumes that naive datetimes are in the local timezone, so this
// is not like assuming it was in UTC.
// https://docs.python.org/3/library/datetime.html#datetime.datetime.astimezone
// TODO: Should we instead assume utc, and not convert? Here we are converting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my vote here would be to make this panic to not accidentally assume incorrect behavior: The chrono types are explicitly separated into naive vs zoned, and if we're writing our Bytewax time code assuming zoned, we should punt to the producer of the timestamps what zone should be applied. I don't think it's too much to ask people to write datetime.now(timezone.utc) instead of just datetime.now().

}

// Implement serde (de)serialization manually here, since there is no standard
// way to serialize datetimes with IANA timezone info (yet?).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious. Why is this a special serialization needed in our case? It feels a little strange that we've encountered a use case that has such unique persistence needs compared to the built in serde in the Chrono crate.

My understanding of the argument is "normal serialization will just store the UTC offset of the time, so we lose the full TZ of the original data". But since we're only storing points in time, we don't lose the ability to compare them to any other points by just storing the correct UTC offset at that one point, so I'm not sure this is necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your understanding is correct, the point of using tz rather than a fixed offset is to keep the same information if the user provided a timezone info rather than an offset, see answers below for more context.

run_main(flow_2)

assert len(out_1) == len(out_2)
assert out_1 == out_2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this test is not comparing to some fixed expected data? What is it trying to assert here? That if the TZ of the start at and the items themselves are compared across TZ correctly? This feels like it's only testing the same thing as our tumbling window operator test.

Copy link
Contributor Author

@Psykopear Psykopear Sep 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a regression test (maybe should add a comment about it). It is testing a behaviour that is supposed to be always correct, but if you run this test using the pyo3-chrono integration we used before, it fails because even if start_at and start_at_1 represents the same date in python, the incorrect conversion in pyo3-chrono made the two flow run with a different start_at parameter, and gave two different results. This is basically what caused this PR to be opened

start_at = datetime.now(ZoneInfo("America/Los_Angeles"))
except ImportError:
py39 = False
start_at = datetime.now(tz=timezone(timedelta(hours=-8)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick, but I think these except ImportError blocks might read a little nicer if you isolate the TZ determination from the rest of the datetime creation:

try:
    from zoneinfo import ZoneInfo
    tz = ZoneInfo("America/Los_Angeles")
except ImportError:
    tz = timezone(timedelta(hours=-8))

xxx = datetime.yyy(tz=tz)

# But if we are working with fixed offsets in python<3.9,
# we convert everything to UTC, so the tzinfo won't be the same.
if py39:
assert start_at.tzinfo == wc.start_at.tzinfo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought < Py3.9 just means we don't have pre-packaged ZoneInfo types, not that we can't actually assert on the TZ itself. Does timezone(timedelta(hours=-8)) != timezone(timedelta(hours=-8))?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would fail due to the choice of always having timezoneaware datetimes internally, and the way I do this conversion (see comments below).
Since I made the ChronoDateTime be chrono::DateTime<chrono_tz::Tz>, I can't use a chrono::FixedOffset. So when I receive a datetime with a fixed offset from python, i convert it to utc first. This means the tzinfo of the object that comes back from rust actually changes, even though it still represents the same date.


run_main(flow_1)
run_main(flow_2)
assert len(out_1) == len(out_2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar questions as above test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar concept, maybe this test is reduntant, but the idea is that settings start_at to now in python, and not doing it (so setting it in rust) should give the same result, but it didn't with pyo3-chrono if the now in python had different timezone info

}

impl ToPyObject for ChronoDateTime {
fn to_object(&self, py: Python<'_>) -> PyObject {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to piggyback on the PyO3 built-in TZ-ignoring conversions and then fill in the TZ info here, rather than re-creating the rest of the conversion? Or does that make things more confusing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into it, it makes sense to reuse it if we can

.getattr("timezone")?
.getattr("utc")?;
dt = dt.getattr("astimezone")?.call1((utc,))?.cast_as()?;
chrono_tz::Tz::UTC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's your thinking behind doing the conversion to UTC and not trying to create a chrono::offset::FixedOffset in this case? It feels like it would be nice to round-trip a TZ datetime without this conversion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This answers your previous comment on why I only check start_at.tzinfo == wc.start_at.tzinfo if we are on python3.9.
I made datetimes in bytewax be chrono::DateTime<chrono_tz::Tz> rather than chrono::DateTime<chrono::Tz>, so I can't really use a FixedOffset as the parameter for chrono::DateTime, because chrono_tz::Tz is an enum implementing the Tz trait, not the trait itself.
I would have liked to keep the chrono::Tz trait as the generic param and still use chrono_tz's stuff, but couldn't find a way.
I agree that would be better as it would allow me to use both FixedOffset and timezone infos, I'm going to have a second look at this and let you know.

@davidselassie davidselassie mentioned this pull request Sep 20, 2022
@Psykopear
Copy link
Contributor Author

Psykopear commented Sep 21, 2022

So, I'm having a second look at this PR, and after some thinking I agree with @davidselassie here that we should force users to send us UTC datetimes, at least for the start_at parameter and the return type of the dt_getter function in the EventTimeClock.

Supporting any possible kind of datetime here makes everything more complicated than it needs to be.

Only accepting UTC datetimes will make this PR much simpler, I can remove the whole chrono_tz integration and just use the one on PyO3 once it's merged, and we can just use DateTime<Utc> instead of having a wrapper type in our structures.
It's a minor annoyance for users since a crash would happen either at the initialization of the flow (in case of a start_at parameter) or at the first event received (in case of the EventTimeClock), so it shouldn't lead to unexpected crashes when the flow is deployed to production (which was my worry when I started thinking about this).

I'm going to update this PR and make it ready for review so that we can move on

Copy link
Contributor

@davidselassie davidselassie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't suggesting this with my comment, but this seems like a great way forward! UTC4ever.

Copy link
Contributor

@whoahbot whoahbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we ended up with the better approach.

Feel free to merge when logging is available again.

@Psykopear Psykopear changed the title Make all datetimes in bytewax timezone aware Only use UTC datetimes Sep 22, 2022
@Psykopear Psykopear marked this pull request as ready for review September 23, 2022 08:16
@Psykopear
Copy link
Contributor Author

So, to have everything working we just need PyO3 0.17.2. The chrono integration will be merged then, and this should happen soon.
For the pyo3-log crate, we won't need to do anything once PyO3 0.17.2 is released, since it only depends on ~0.17.
But, since this PR is blocking another one, I temporarily forked pyo3-log to make it compatible with the yet unreleased version of PyO3 (simply pinning its PyO3 dependency to the same commit we use here), so that we can merge this now.
I added 2 TODO notes in the Cargo.toml with what to do as soon as PyO3 0.17.2 is released, so that we can just swap the uncommented lines with the commented ones and we should be good to go.

@Psykopear Psykopear merged commit 4defb35 into main Sep 27, 2022
@Psykopear Psykopear deleted the explore-datetimes-issues branch September 27, 2022 15:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants