Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse unix timestamps consistently regardless of timezones #954

Merged
merged 4 commits into from Sep 20, 2022

Conversation

onlynone
Copy link
Contributor

@onlynone onlynone commented Jul 27, 2021

When dateparser parses a unix timestamp, and the user has set a TIMEZONE that is not local (or the same as the user's system local timezone), it gets the wrong time. This is because datetime.fromtimestamp(timestamp) will return a datetime in the user's local timezone as configured by their system (influenced by the TZ env var, /etc/localtime, etc). But, it will be a naive datetime without any timezone information . The dateparser code will try to interpret that datetime object as if it was in the settings.TIMEZONE timezone, which is incorrect.

The fix is to use datetime.fromtimestamp(timestamp, timezone_that_matches_settings_TIMEZONE) to get a timezone aware datetime object with a timezone that matches dateparser's settings.TIMEZONE (or use the local timezone if that is unset). Then strip the tzinfo from that datetime object, because apply_timezone_from_settings(date_obj, settings) requires date_obj to be a naive datetime without timezone information. So when apply_timezone_from_settings interprets that object, it does so correctly.

Otherwise datetime.fromtimestamp will return a naive datetime in the
local timezone and applying timezone modifications later with TIMEZONE
and TO_TIMEZONE won't do the right thing.

Parsing a unix timestamp value should always result in the exact same
instant in time regardless of current time zone, so it shouldn't matter
what the current TIMEZONE setting is, whether 'UTC', 'local', any other
timezone, or even unset.
@codecov
Copy link

codecov bot commented Jul 27, 2021

Codecov Report

Merging #954 (6934648) into master (0ed979e) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #954   +/-   ##
=======================================
  Coverage   98.29%   98.29%           
=======================================
  Files         234      234           
  Lines        2702     2705    +3     
=======================================
+ Hits         2656     2659    +3     
  Misses         46       46           
Impacted Files Coverage Δ
dateparser/date.py 99.27% <100.00%> (+<0.01%) ⬆️
dateparser/utils/__init__.py 95.27% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@onlynone
Copy link
Contributor Author

By the way, when I was adding tests, I tried to follow the pattern of the other tests in TestTimestampParser by calling date.get_date_from_timestamp directly, and passing in a custom value for settings. I also saw a pattern for creating a new Settings object, in another test in tests/test_date.py of doing something like:

        settings_mod = copy(settings)
        settings_mod.RETURN_AS_TIMEZONE_AWARE = True
        settings_mod.TIMEZONE = timezone
        settings_mod.TO_TIMEZONE = to_timezone

        date.get_date_from_timestamp(timestamp, settings_mod)

But, when I do that, it seems like I end up modifying the global settings object imported from dateparser.conf and that caused other tests to fail because they weren't expecting RETURN_AS_TIMEZONE_AWARE to be True. Is dateparser.conf.Settings a singleton? Should the other test that tries to do a copy be fixed? Should there be a more standard way to create a custom/one-off settings object for use in tests?

@onlynone onlynone changed the title test coverage for how unix timestamps are parsed Parse unix timestamps consistently regardless of timezones Jul 28, 2021
@onlynone
Copy link
Contributor Author

@Gallaecio Is this good to merge?

dateparser/date.py Outdated Show resolved Hide resolved
Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

@Gallaecio Gallaecio self-requested a review August 31, 2022 14:31
@gutsytechster
Copy link
Collaborator

TBH, I am not able to understand the issue here. Maybe an example may help.

But if we are applying the timezone and stripping it off, to eventually apply the timezone from settings through apply_timezone_from_settings , how does the behavior change?

Also the statement in the initial explanation

The dateparser code will try to interpret that datetime object as if it was in the settings.TIMEZONE timezone, which is incorrect

Where do we see that behavior, since the apply_tiemzone_from_settings would apply the timezone from the settings, I believe?

@onlynone
Copy link
Contributor Author

onlynone commented Sep 1, 2022

The examples in the added tests should show the issue. You could try running those tests on the main line code without my changes.

But if we are applying the timezone and stripping it off, to eventually apply the timezone from settings through apply_timezone_from_settings , how does the behavior change?

The problem is that there's not just one time zone that's being stripped and reapplied. The date is initially constructed with the time zone of the system (regardless of any dateparser settings), but then that time zone is not kept on the datetime. Then later in the code, the datetime is interpreted as if it was in dateparser's settings.TIMEZONE. If the system's time zone happens to be the same as the one used for dateparser's settings, then I think everything is fine. But if my system is set to 'America/New_York', and I set the dateparser settings time zone to 'UTC' and ask it to parse '1661996156', I believe dateparser will currently give me 'August 31, 2022 9:35:56 PM UTC' when it should give me 'September 1, 2022 1:35:56 AM UTC'.

Sorry if there are any errors in the above comment, I'm on mobile and it's also been a while since I first made this branch. Please test it the examples and see if what I'm saying makes sense.

@onlynone
Copy link
Contributor Author

onlynone commented Sep 1, 2022

Where do we see that behavior, since the apply_tiemzone_from_settings would apply the timezone from the settings, I believe?

Yes, but with the new code, the naive datetime object is created in the same time zone as settings.TIMEZONE. so that later, when it's interpreted as if it was in settings.TIMEZONE, that's now a correct assumption. With the old code, the naive datetime is created relative to the system's time zone.

@onlynone
Copy link
Contributor Author

onlynone commented Sep 1, 2022

Here's an example. With the current code:

>>> dateparser.parse('1661996156', settings={'TIMEZONE': 'UTC', 'TO_TIMEZONE': 'UTC', 'RETURN_AS_TIMEZONE_AWARE': True}).timestamp()
1661981756.0

The returned unix timestamp doesn't match the original timestamp. And with my branch:

>>> dateparser.parse('1661996156', settings={'TIMEZONE': 'UTC', 'TO_TIMEZONE': 'UTC', 'RETURN_AS_TIMEZONE_AWARE': True}).timestamp()
1661996156.0

It does match.

@gutsytechster
Copy link
Collaborator

Thanks @onlynone!

It's clear now. I understand where the difference comes from.

As far as I understood, previous to, when the date goes to the apply_timezone_from_settings, it got converted into a date object with a local timezone(without timezone included), and that date is further used for applying timezone from the mentioned method.

dateparser/date.py Outdated Show resolved Hide resolved
dateparser/date.py Outdated Show resolved Hide resolved
@onlynone
Copy link
Contributor Author

@gutsytechster @Gallaecio I think everything has been resolved, can it be merged?

@Gallaecio Gallaecio merged commit 28cadc1 into scrapinghub:master Sep 20, 2022
@Gallaecio
Copy link
Member

Thank you!

@ghazel
Copy link

ghazel commented Oct 22, 2022

Using an old version of tzlocal, 2.1 made dateparser return Local Mean Time instead of the correct timezone as the result of this change. After they switched away from pytz to zoneinfo that problem went away. You might want to bump your minimum tzlocal version dependency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants