Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix date_parser with prefer_month_of_year wrong results #1224

Merged
merged 4 commits into from Apr 8, 2024

Conversation

benbuzbee
Copy link
Contributor

@benbuzbee benbuzbee commented Apr 2, 2024

Fix two problems

  1. Parser would use current month even if prefer_month_of_year was not current when relative_base was not none

  2. Parser would use current month to derive 'what is the last day of this month' - for example, with prefer_month=last and prefer_day=last, but current_month=april, it would return december 30th, because it would use april to find that the last day was the 30th, when it should have used December (Parsing using {"PREFER_DAY_OF_MONTH": "last", "PREFER_MONTH_OF_YEAR": "last"} doesn't use last day of the last month #1217)

Additionally, add a test to test_date_parser that uses prefer_month

Fixes #1217

Fix two problems
1. Parser would use current month even if prefer_month_of_year was not
   current when relative_base was not none

2. Parser would use current month to derive 'what is the last day of
   this month' - for example, with prefer_month=last and
   prefer_day=past, but current_month=april, it would return december
   30th, because it would use april to find that the last day was the
   30th, when it should use the month.

Additionally, add a test to test_date_parser that uses prefer_month
@benbuzbee
Copy link
Contributor Author

Tagging folks here who were involved in the original PR #1146

@adnan-awan @Gallaecio @serhii73 could I have your comments and reviews please

@Gallaecio
Copy link
Member

Could you have a look at the 2 failing tests?

It is parsing "0:4", french, with settings
```
                "PREFER_DATES_FROM": "past",
                "PREFER_DAY_OF_MONTH": "first",
                "PREFER_LOCALE_DATE_ORDER": True,
                "PREFER_MONTH_OF_YEAR": "current",
                "RELATIVE_BASE": datetime(
                    year=1970, month=1, day=1, hour=0, minute=0, second=0
                ),
```
It used to expect to get `expected_date=datetime(1969, 12, 31, 14, 4)` but after my change it gets `datetime(1969, 1, 31, 14, 4)`
I would argue that with PREFER_MONTH_OF_YEAR set to "Current", and "Current" being January 1st 1970, that `datetime(1969, 1, 31, 14, 4)` is a better result
However with this particular set of configuration, I am not exactly 100% sure what to expect. These settings were generated by a fuzzer so perhaps they don't really make a ton of sense together anyway; rather than change the settings (and thus deviate from what the parser caught) I have opted to update the test expectation to accept January.
'Die'

It is searching a German string for dates and asserting that when it finds the word "Die" in the string, it should be parsed as `datetime.datetime(1999, 12, 28, 0, 0)`
Similarly, my change makes this `datetime.datetime(1999, 1, 28, 0, 0)` instead. I don't speak German, but as far as I can tell "Die" just means "The" so I have no idea why it is even matching it. In my opinion, this could be a bug with the search identifying a non-date word, and so I can't really guess as to what a sensible result would be. For the sake of simplicity, I also just updated this test to accept January,
@benbuzbee
Copy link
Contributor Author

Certainly @Gallaecio - sorry I missed that locally - tox / all tests do not run well on Windows where I developed this so I just ran select tests and expected to check all of them in CI

In the first case (test_dates_parse_utc_offset_does_not_throw), it is parsing "0:4", French, with settings

                "PREFER_DATES_FROM": "past",
                "PREFER_DAY_OF_MONTH": "first",
                "PREFER_LOCALE_DATE_ORDER": True,
                "PREFER_MONTH_OF_YEAR": "current",
                "RELATIVE_BASE": datetime(
                    year=1970, month=1, day=1, hour=0, minute=0, second=0
                ),

it used to expect to get expected_date=datetime(1969, 12, 31, 14, 4) but after my change it gets datetime(1969, 1, 31, 14, 4)
I would argue that with PREFER_MONTH_OF_YEAR set to "Current", and "Current" being January 1st 1970, that datetime(1969, 1, 31, 14, 4) is a better result
However with this particular set of configuration, I am not exactly 100% sure what to expect. These settings were generated by a fuzzer so perhaps they don't really make a ton of sense together anyway; rather than change the settings (and thus deviate from what the parser caught) I have opted to update the test expectation to accept January. CC @ennamarie19 who made that change in case I missed something.

The next one is even more perplexing, it is searching a German string for dates and asserting that when it finds the word "Die" in the string, it should be parsed as datetime.datetime(1999, 12, 28, 0, 0)
Similarly, my change makes this datetime.datetime(1999, 1, 28, 0, 0) instead. I don't speak German, but as far as I can tell "Die" just means "The" so I have no idea why it is even matching it. In my opinion, this could be a bug with the search identifying a non-date word, and so I can't really guess as to what a sensible result would be. For the sake of simplicity, I also just updated this test to accept January. CC @eszakharova who added this in 7e7e224

@Gallaecio Gallaecio requested a review from wRAR April 5, 2024 09:59
@wRAR wRAR merged commit 748e48a into scrapinghub:master Apr 8, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants