Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Locale strings not expanded in get format #701

Closed
mredaelli opened this issue Oct 28, 2019 · 14 comments · Fixed by #720
Closed

Locale strings not expanded in get format #701

mredaelli opened this issue Oct 28, 2019 · 14 comments · Fixed by #720
Labels

Comments

@mredaelli
Copy link

This code:

arrow.get("Montag, 9. September 2019, 16:15-20:00", "dddd, D. MMMM YYYY", locale="de")   

works perfectly in arrow 1.13.1, but in 1.15.2 I ge an Exceptiont:

 Failed to match 'dddd, D. MMMM YYYY' when parsing 'Montag, 9. September 2019, 16:15-20:00'

where the pattern is not expanded.

Am I doing something wrong?

@jadchaar
Copy link
Member

@mredaelli it seems that you forgot to include the time and timezone tokens in your format string. Parsing was improved and revamped in Arrow v0.15.0. This should work:

venv ❯ python3
Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:52:21)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import arrow
>>> arrow.__version__
'0.15.2'
>>> arrow.get("Montag, 9. September 2019, 16:15-20:00", "dddd, D. MMMM YYYY, HH:mmZZ", locale="de")
<Arrow [2019-09-09T16:15:00-20:00]>

@mredaelli
Copy link
Author

Thank you @jadchaar , but I'm not sure I understand. What was the change exactly?

Because in the documentation I still read

Search a date in a string:

>>> arrow.get('June was born in May 1980', 'MMMM YYYY')
<Arrow [1980-05-01T00:00:00+00:00]>

and that's exactly what I'm doing. I don't want to parse the time, that's all.

I really hope the "search" part was not removed intentionally (and without a clear indication in the changelog, as far as I can see), because I rely on that for a lot of strings coming from websites of all types :)

Besides, the ticket that prompted this, #612, talks about "partial matches", but in my case the match is complete, and so there should be no ambiguity, right?

(Small thing: I liked very much the error message earlier, which made debugging easier by printing the expanded regex.)

@mredaelli
Copy link
Author

Hmm, I see that

arrow.get("Montag, 9. September 2019, 16:15-20:00", "dddd, D. MMMM YYYY[.*]", locale="de")

seems to work, so it would seem that indeed you do not search anymore, but just match the regex.

@systemcatch
Copy link
Collaborator

@mredaelli I think the comma in your example is causing the problem, without it everything seems to be working ok.

>>> arrow.get("Montag, 9. September 2019 16:15-20:00", "dddd, D. MMMM YYYY", locale="de")
<Arrow [2019-09-09T00:00:00+00:00]>

Further investigation needed.

@jadchaar jadchaar added the bug label Oct 30, 2019
@jadchaar
Copy link
Member

Ah apologies, I did not realize that you were trying to search--I thought you were trying to parse it.

@andrewchouman
Copy link
Contributor

andrewchouman commented Nov 17, 2019

Do you mind if I try to fix this?

@jadchaar
Copy link
Member

@andrewchouman for sure! The issue is most likely in parser.py. I recommend using a debugger to step through the parsing logic with the problematic example.

@andrewchouman
Copy link
Contributor

andrewchouman commented Nov 18, 2019

I believe I found a solution to the issue, but this seems to be the correct defined behavior based on test cases in versions >= 0.15.0.

In current test cases, any substring connected to the string (not separated by whitespace) that isn't part of the pattern should trigger an exception. See below:

def test_parse_with_extra_words_at_start_and_end_invalid(self):
        input_format_pairs = [
            ("blah2016", "YYYY"),
            ("blah2016blah", "YYYY"),
            ("2016blah", "YYYY"),
            ("2016-05blah", "YYYY-MM"),
            ("2016-05-16blah", "YYYY-MM-DD"),
            ("2016-05-16T04:05:06.789120blah", "YYYY-MM-DDThh:mm:ss.S"),
            ("2016-05-16T04:05:06.789120ZblahZ", "YYYY-MM-DDThh:mm:ss.SZ"),
            ("2016-05-16T04:05:06.789120Zblah", "YYYY-MM-DDThh:mm:ss.SZ"),
            ("2016-05-16T04:05:06.789120blahZ", "YYYY-MM-DDThh:mm:ss.SZ"),
        ]

        for pair in input_format_pairs:
            with self.assertRaises(ParserError):
                self.parser.parse(pair[0], pair[1])


    def test_parse_with_extra_words_at_start_and_end_valid(self):
        # Spaces surrounding the parsable date are ok because we
        # allow the parsing of natural language input
        self.assertEqual(
            self.parser.parse("blah 2016 blah", "YYYY"), datetime(2016, 1, 1)
        )

        self.assertEqual(self.parser.parse("blah 2016", "YYYY"), datetime(2016, 1, 1))

        self.assertEqual(self.parser.parse("2016 blah", "YYYY"), datetime(2016, 1, 1))

Before, 0.15.0, those test cases didn't exist and defined behavior was that any connected substring that wasn't part of the pattern was fine, as long as the pattern was there. I can make a pull request that reverts back just that little bit of functionality and fix the test cases so that the "invalid strings" are defined as valid, or the docs should be changed to reflect the differences.

@jadchaar
Copy link
Member

The issue seems to stem from this line of code where we add the word boundary. Currently, we wrap the formatting regex in a custom white space word boundary. Here is a demonstration of the regex that is formed: https://regex101.com/r/EJ2oJJ/2. If we remove the custom word boundary we search for the entry successfully: https://regex101.com/r/5bUeAq/1. We need to adjust the word boundary slightly to allow for partial match cases like the one in the original post.

@andrewchouman
Copy link
Contributor

@jadchaar Do we want to allow punctuation perhaps (i.e. comma, period, dash)? I can work on this and add to the test cases.

@jadchaar
Copy link
Member

Yeah I think adding appropriate punctuation that may be possible in a datetime string to the regex word boundary could be an appropriate solution to this. Feel free to play around with regex101 to help fix this issue, it is an amazing tool for visualizing regular expressions.

@andrewchouman
Copy link
Contributor

I should have it done within the next week or two (adding to documentation/testing included)

@andrewchouman
Copy link
Contributor

@jadchaar I think this regex works for our purposes; what do you think? Also when I ran nosetests this test didn't work:

    def test_YYYY_MM_DDDD(self):
        with self.assertRaises(ParserError):
            self.parser.parse_iso("2014-05-125")

This is because this line causes a ParserMatchError now as opposed to a ParserError due to my regex change. ParserMatchErrors get caught within the function so the exception is never raised to this top level stack even if I were to switch the exception name in that test case. Should I remove the test case?

@jadchaar
Copy link
Member

@andrewchouman it may be better to supplement the current regex: \S is the same as ^\s, so we can add some punctuation to the set like this: ^\s\, (may need to add more punctuation).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants