Locale strings not expanded in `get` format #701

mredaelli · 2019-10-28T21:07:59Z

This code:

arrow.get("Montag, 9. September 2019, 16:15-20:00", "dddd, D. MMMM YYYY", locale="de")

works perfectly in arrow 1.13.1, but in 1.15.2 I ge an Exceptiont:

 Failed to match 'dddd, D. MMMM YYYY' when parsing 'Montag, 9. September 2019, 16:15-20:00'

where the pattern is not expanded.

Am I doing something wrong?

The text was updated successfully, but these errors were encountered:

jadchaar · 2019-10-28T21:46:23Z

@mredaelli it seems that you forgot to include the time and timezone tokens in your format string. Parsing was improved and revamped in Arrow v0.15.0. This should work:

venv ❯ python3
Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:52:21)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import arrow
>>> arrow.__version__
'0.15.2'
>>> arrow.get("Montag, 9. September 2019, 16:15-20:00", "dddd, D. MMMM YYYY, HH:mmZZ", locale="de")
<Arrow [2019-09-09T16:15:00-20:00]>

mredaelli · 2019-10-29T08:04:16Z

Thank you @jadchaar , but I'm not sure I understand. What was the change exactly?

Because in the documentation I still read

Search a date in a string:

>>> arrow.get('June was born in May 1980', 'MMMM YYYY')
<Arrow [1980-05-01T00:00:00+00:00]>

and that's exactly what I'm doing. I don't want to parse the time, that's all.

I really hope the "search" part was not removed intentionally (and without a clear indication in the changelog, as far as I can see), because I rely on that for a lot of strings coming from websites of all types :)

Besides, the ticket that prompted this, #612, talks about "partial matches", but in my case the match is complete, and so there should be no ambiguity, right?

(Small thing: I liked very much the error message earlier, which made debugging easier by printing the expanded regex.)

mredaelli · 2019-10-29T08:21:16Z

Hmm, I see that

arrow.get("Montag, 9. September 2019, 16:15-20:00", "dddd, D. MMMM YYYY[.*]", locale="de")

seems to work, so it would seem that indeed you do not search anymore, but just match the regex.

systemcatch · 2019-10-29T12:27:34Z

@mredaelli I think the comma in your example is causing the problem, without it everything seems to be working ok.

>>> arrow.get("Montag, 9. September 2019 16:15-20:00", "dddd, D. MMMM YYYY", locale="de")
<Arrow [2019-09-09T00:00:00+00:00]>

Further investigation needed.

jadchaar · 2019-10-30T01:22:41Z

Ah apologies, I did not realize that you were trying to search--I thought you were trying to parse it.

andrewchouman · 2019-11-17T20:26:18Z

Do you mind if I try to fix this?

jadchaar · 2019-11-17T20:34:05Z

@andrewchouman for sure! The issue is most likely in parser.py. I recommend using a debugger to step through the parsing logic with the problematic example.

andrewchouman · 2019-11-18T01:57:17Z

I believe I found a solution to the issue, but this seems to be the correct defined behavior based on test cases in versions >= 0.15.0.

In current test cases, any substring connected to the string (not separated by whitespace) that isn't part of the pattern should trigger an exception. See below:

def test_parse_with_extra_words_at_start_and_end_invalid(self):
        input_format_pairs = [
            ("blah2016", "YYYY"),
            ("blah2016blah", "YYYY"),
            ("2016blah", "YYYY"),
            ("2016-05blah", "YYYY-MM"),
            ("2016-05-16blah", "YYYY-MM-DD"),
            ("2016-05-16T04:05:06.789120blah", "YYYY-MM-DDThh:mm:ss.S"),
            ("2016-05-16T04:05:06.789120ZblahZ", "YYYY-MM-DDThh:mm:ss.SZ"),
            ("2016-05-16T04:05:06.789120Zblah", "YYYY-MM-DDThh:mm:ss.SZ"),
            ("2016-05-16T04:05:06.789120blahZ", "YYYY-MM-DDThh:mm:ss.SZ"),
        ]

        for pair in input_format_pairs:
            with self.assertRaises(ParserError):
                self.parser.parse(pair[0], pair[1])


    def test_parse_with_extra_words_at_start_and_end_valid(self):
        # Spaces surrounding the parsable date are ok because we
        # allow the parsing of natural language input
        self.assertEqual(
            self.parser.parse("blah 2016 blah", "YYYY"), datetime(2016, 1, 1)
        )

        self.assertEqual(self.parser.parse("blah 2016", "YYYY"), datetime(2016, 1, 1))

        self.assertEqual(self.parser.parse("2016 blah", "YYYY"), datetime(2016, 1, 1))

Before, 0.15.0, those test cases didn't exist and defined behavior was that any connected substring that wasn't part of the pattern was fine, as long as the pattern was there. I can make a pull request that reverts back just that little bit of functionality and fix the test cases so that the "invalid strings" are defined as valid, or the docs should be changed to reflect the differences.

jadchaar · 2019-11-18T18:43:14Z

The issue seems to stem from this line of code where we add the word boundary. Currently, we wrap the formatting regex in a custom white space word boundary. Here is a demonstration of the regex that is formed: https://regex101.com/r/EJ2oJJ/2. If we remove the custom word boundary we search for the entry successfully: https://regex101.com/r/5bUeAq/1. We need to adjust the word boundary slightly to allow for partial match cases like the one in the original post.

andrewchouman · 2019-11-18T18:59:57Z

@jadchaar Do we want to allow punctuation perhaps (i.e. comma, period, dash)? I can work on this and add to the test cases.

jadchaar · 2019-11-18T19:04:46Z

Yeah I think adding appropriate punctuation that may be possible in a datetime string to the regex word boundary could be an appropriate solution to this. Feel free to play around with regex101 to help fix this issue, it is an amazing tool for visualizing regular expressions.

andrewchouman · 2019-11-18T19:06:25Z

I should have it done within the next week or two (adding to documentation/testing included)

andrewchouman · 2019-11-21T19:24:29Z

@jadchaar I think this regex works for our purposes; what do you think? Also when I ran nosetests this test didn't work:

    def test_YYYY_MM_DDDD(self):
        with self.assertRaises(ParserError):
            self.parser.parse_iso("2014-05-125")

This is because this line causes a ParserMatchError now as opposed to a ParserError due to my regex change. ParserMatchErrors get caught within the function so the exception is never raised to this top level stack even if I were to switch the exception name in that test case. Should I remove the test case?

jadchaar · 2019-11-23T20:39:40Z

@andrewchouman it may be better to supplement the current regex: \S is the same as ^\s, so we can add some punctuation to the set like this: ^\s\, (may need to add more punctuation).

jadchaar added the bug label Oct 30, 2019

andrewchouman added a commit to andrewchouman/arrow that referenced this issue Nov 25, 2019

Fixed issue arrow-py#701

f434f7a

andrewchouman mentioned this issue Nov 25, 2019

Improve parsing of local strings with punctuation #720

Merged

jadchaar closed this as completed in #720 Dec 3, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Locale strings not expanded in `get` format #701

Locale strings not expanded in `get` format #701

mredaelli commented Oct 28, 2019

jadchaar commented Oct 28, 2019

mredaelli commented Oct 29, 2019

mredaelli commented Oct 29, 2019

systemcatch commented Oct 29, 2019

jadchaar commented Oct 30, 2019

andrewchouman commented Nov 17, 2019 •

edited

jadchaar commented Nov 17, 2019

andrewchouman commented Nov 18, 2019 •

edited

jadchaar commented Nov 18, 2019

andrewchouman commented Nov 18, 2019

jadchaar commented Nov 18, 2019

andrewchouman commented Nov 18, 2019

andrewchouman commented Nov 21, 2019

jadchaar commented Nov 23, 2019

Locale strings not expanded in get format #701

Locale strings not expanded in get format #701

Comments

mredaelli commented Oct 28, 2019

jadchaar commented Oct 28, 2019

mredaelli commented Oct 29, 2019

mredaelli commented Oct 29, 2019

systemcatch commented Oct 29, 2019

jadchaar commented Oct 30, 2019

andrewchouman commented Nov 17, 2019 • edited

jadchaar commented Nov 17, 2019

andrewchouman commented Nov 18, 2019 • edited

jadchaar commented Nov 18, 2019

andrewchouman commented Nov 18, 2019

jadchaar commented Nov 18, 2019

andrewchouman commented Nov 18, 2019

andrewchouman commented Nov 21, 2019

jadchaar commented Nov 23, 2019

Locale strings not expanded in `get` format #701

Locale strings not expanded in `get` format #701

andrewchouman commented Nov 17, 2019 •

edited

andrewchouman commented Nov 18, 2019 •

edited