Added support for fractional units #876

DenverCoder1 · 2021-02-07T15:31:08Z

Fixes #753

Added support for fractional units by making the regex pattern support digits before a decimal point and converting num to a float instead of an int on line 150.

What this changes

Before this fix, parse("in 0.5 hours") would return the time 5 hours in the future instead of 0.5 hours in the future.

With this fix, parsing fractional units are no longer parsed incorrectly.

Tests

>>> parse("now")
datetime.datetime(2021, 2, 7, 18, 50, 10, 107501)
>>> parse("in 0.5 hours")
datetime.datetime(2021, 2, 7, 19, 20, 15, 773800)
>>> parse("in 3.75 hours") 
datetime.datetime(2021, 2, 7, 22, 35, 20, 463238)
>>> parse("2.5 hours ago")  
datetime.datetime(2021, 2, 7, 16, 20, 29, 614937)

What was changed

To make it easier for future reviewers, the change is much smaller than it appears to be. The reason there are 131+ files changed is because the regex in write_complete_data.py was updated and ran.

Very few lines of code were changed in this PR:

Regex updated in (freshness_date_parser.py line 14)

- PATTERN = re.compile(r'(\d+)\s*(%s)\b' % _UNITS, re.I | re.S | re.U)
+ PATTERN = re.compile(r'(\d+[.,]?\d*)\s*(%s)\b' % _UNITS, re.I | re.S | re.U)

Result converted to float instead of int in (freshness_date_parser.py line 150)

- kwargs[unit + 's'] = int(num)
+ kwargs[unit + 's'] = float(num.replace(",", "."))

Number regex changed in (write_complete_data.py line 34)

- string = RELATIVE_PATTERN.sub(r'(\\d+)', string)
+ string = RELATIVE_PATTERN.sub(r'(\\d+[.,]?\\d*)', string)

Number regex changed in supplementary yaml files for custom relative-type-regex sections (multiple locations)

In all occurrences in supplementary yaml:
- (\d+)
+ (\d+[.,]?\d*)

Fix for Croatian dates conflicting with decimal pattern (date.py line 110)

line 40
+ RE_SANITIZE_CROATIAN = re.compile(r'(\d+)\.\s?(\d+)\.\s?(\d+)\.( u)?', flags=re.I | re.U)
line 110
+ date_string = RE_SANITIZE_CROATIAN.sub(r'\1.\2.\3 ', date_string)  # extra '.' and 'u' interferes with parsing relative fractional dates

Test case for number regex in data was updated in (test_data.py line 68)

- return not (all([isinstance(x, str) for x in value])
-                and all(['(\\d+)' in x for x in value]))
+ return not (all([isinstance(x, str) for x in value])
+                and all(['(\\d+[.,]?\\d*)' in x for x in value]))

Test cases for fractional units added to (test_freshness_date_parser.py)

line 99
+ # Fractional units
+ param('2.5 hours', ago={'hours': 2.5}, period='day'),
+ param('10.75 minutes', ago={'minutes': 10.75}, period='day'),
+ param('1.5 days', ago={'days': 1.5}, period='day'),
line 616
+ # Fractional units
+ param('2.5 hours', ago={'hours': 2.5}, period='day'),
+ param('10.75 minutes', ago={'minutes': 10.75}, period='day'),
+ param('1.5 days', ago={'days': 1.5}, period='day'),
line 1108
+ # Fractional units
+ param('in 2.5 hours', in_future={'hours': 2.5}, period='day'),
+ param('in 10.75 minutes', in_future={'minutes': 10.75}, period='day'),
+ param('in 1.5 days', in_future={'days': 1.5}, period='day'),
+ param('in 0,5 hours', in_future={'hours': 0.5}, period='day'),
line 1635
+ # Fractional units
+ param('2.5 hours ago', date(2010, 6, 4), time(10, 45)),
+ param('in 10.75 minutes', date(2010, 6, 4), time(13, 25, 45)),
+ param('in 1.5 days', date(2010, 6, 6), time(1, 15)),
+ param('0,5 hours ago', date(2010, 6, 4), time(12, 45)),

See these comments for shortcuts to view the changes

Gallaecio

Could you also extend tests to cover this new feature? I’m think we should at least test 1 such value, and assuming we include comma support the same value with a comma instead of a period.

dateparser/freshness_date_parser.py

noviluni · 2021-02-08T12:10:23Z

HI @DenverCoder1, thank you for this PR.

Could you add those tests you mention in the description in the project tests? Thanks!

DenverCoder1 · 2021-02-08T17:46:06Z

@Gallaecio Added some tests for fractional units. Also added some support for numbers with commas instead of periods. Adding comma support was not as simple because of the way the Dictionary splits the strings.

Gallaecio · 2021-02-08T18:08:41Z

I see you’ve had to modify the per-language Python files. Those are not meant to be modified manually, and are instead generated from other files in the repository. See https://dateparser.readthedocs.io/en/latest/contributing.html#guidelines-for-editing-translation-data . On the bright side, it may mean that you just need to edit the script that generates them.

DenverCoder1 · 2021-02-08T18:35:07Z

@Gallaecio

I see. So I should update write_complete_data.py and the yaml files then rerun write_complete_data.py?

Gallaecio · 2021-02-08T20:26:47Z

I am not yet familiar with that process, but I would assume you would have to edit either write_complete_data.py (hopefully) or the yaml files (if they include the \d pattern), and then rerun write_complete_data.py.

This is also assuming that you can get the desired result editing one of those two files. If the patterns came directly from CLDR (unlikely but possible), it may get messier.

Please have a look at en.yml and write_complete_data.py, see if you can figure out where the change should go. Otherwise, @noviluni is the person most familiar with the process, so he probably knows.

DenverCoder1 · 2021-02-08T20:38:10Z

It seems to me that some of the data in the .py files is coming from the date_translation_data JSON files and some parts (such as decades) are coming from the supplementary YAML files. I made changes in both places to get the .py files to fully update.

Gallaecio · 2021-02-08T21:33:38Z

The JSON files are not meant to be modified (they come from http://cldr.unicode.org/), so this may be not as trivial as I had hoped.

DenverCoder1 · 2021-02-08T21:45:00Z

The JSON files are not meant to be modified (they come from http://cldr.unicode.org), so this may be not as trivial as I had hoped.

I didn't need to modify the JSON, only had to change the way it is converted in write_complete_data.py.

Gallaecio

✔️

@noviluni Mind that tests did not run in CI for some reason.

Gallaecio · 2021-02-09T08:07:54Z

Looking at #859 maybe it’s just a matter or reopening.

codecov · 2021-02-10T08:47:10Z

Codecov Report

Base: 98.23% // Head: 98.23% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (5c5b3f1) compared to base (d052d3e).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #876   +/-   ##
=======================================
  Coverage   98.23%   98.23%           
=======================================
  Files         232      232           
  Lines        2602     2604    +2     
=======================================
+ Hits         2556     2558    +2     
  Misses         46       46

Impacted Files	Coverage Δ
dateparser/data/date_translation_data/af.py	`100.00% <ø> (ø)`
dateparser/data/date_translation_data/am.py	`100.00% <ø> (ø)`
dateparser/data/date_translation_data/ar.py	`100.00% <ø> (ø)`
dateparser/data/date_translation_data/ast.py	`100.00% <ø> (ø)`
dateparser/data/date_translation_data/az-Latn.py	`100.00% <ø> (ø)`
dateparser/data/date_translation_data/az.py	`100.00% <ø> (ø)`
dateparser/data/date_translation_data/be.py	`100.00% <ø> (ø)`
dateparser/data/date_translation_data/bg.py	`100.00% <ø> (ø)`
dateparser/data/date_translation_data/bn.py	`100.00% <ø> (ø)`
dateparser/data/date_translation_data/br.py	`100.00% <ø> (ø)`
... and 105 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

noviluni · 2021-02-10T08:56:59Z

Hi @DenverCoder1 @Gallaecio Thank you for the PR.

I like the approach, however, I'm don't want to merge this as-is, because not all the locale/regions accept both (period and comma) as decimal separator. This could lead to some errors.

For example, if we have 1.500 s we don't know if it's 1 point 5 seconds or 1 thousand 500 seconds. In UK and US, they normally use comma (,) as a thousand separator, while in Spanish we usually use the period (.) as a thousand separator.

I tried with this code, and this is what I got:

>>> datetime.now() - dateparser.parse('1,5000 s', languages=['en'])
datetime.timedelta(0, 4999, 995131)

>>> datetime.now() - dateparser.parse('1.5000 s', languages=['en'])
datetime.timedelta(0, 1, 496419)

>>> datetime.now() - dateparser.parse('1,5000 s', languages=['es'])
datetime.timedelta(0, 4999, 996924)

>>> datetime.now() - dateparser.parse('1.5000 s', languages=['es'])
datetime.timedelta(0, 1, 496970)

I personally think that the decimal/comma separator should be configured by default by locale and we should probably be able to configure it manually using the settings, so this approach wouldn't probably work.

Let me know what you think / your ideas :)

Gallaecio · 2021-02-10T10:51:40Z

I think you have a point, but I also imagine that those examples are interpreted as 5000 with master. If so, I imagine this change would not break anything that is parsed fine at the moment, but only extend support for fractional numbers without thousand separators.

If so, I would not mind merging as is and creating 2 separate tickets (1) for restricting supported decimal separators based on locale (we can hopefully use data from CLDR to choose the right ones) and (2) for supporting thousand separators.

But yes, we could also aim to solve all of that in this patch. I’m just worried about the extra time that could require.

DenverCoder1 · 2021-03-21T15:20:25Z

@Gallaecio @noviluni What's the status on this? I don't think I have the time now to try making it locale-based as suggested. I agree that it could be made a separate PR.

As of now, it will make things such as "2.5 hours", etc. parse correctly when right now they don't ("2.5" gets treated as "5").

To make it work with periods only required a couple simple changes. Making it work with commas as well was a bit more complicated due to the way the parser ignores commas, but it should also be working properly with these changes.

If there's a worry that parsing commas as decimal points could break things, that part could be removed for now until the locale settings are in place, or if not, it could be merged as is.

Let me know if there's anything else needed from me.

repl-samuel · 2022-02-09T15:52:30Z

Could I ask the status of this PR?
It's approved almost a year ago but still not get merged?

Gallaecio · 2022-02-09T19:56:09Z

We usually follow a 2-maintainer-approval policy before merging. The next step is for another maintainer to have a look. However, most (all?) dateparser maintainers are also maintainers of many other open source libraries, and our time is limited.

DenverCoder1

To make it easier for future reviewers, the change is much smaller than it appears to be. The reason there are many files changed is because the regex in write_complete_data.py was updated and ran.

See below for shortcuts to the code changes.

DenverCoder1 · 2022-02-09T23:05:43Z

dateparser/freshness_date_parser.py

@@ -11,7 +11,7 @@


 _UNITS = r'decade|year|month|week|day|hour|minute|second'
-PATTERN = re.compile(r'(\d+)\s*(%s)\b' % _UNITS, re.I | re.S | re.U)
+PATTERN = re.compile(r'(\d+[.,]?\d*)\s*(%s)\b' % _UNITS, re.I | re.S | re.U)


Added decimal point to unit parser regex

DenverCoder1 · 2022-02-09T23:08:01Z

dateparser/freshness_date_parser.py

@@ -147,7 +147,7 @@ def get_kwargs(self, date_string):

        kwargs = {}
        for num, unit in m:
-            kwargs[unit + 's'] = int(num)
+            kwargs[unit + 's'] = float(num.replace(",", "."))


Converted unit type to float instead of int, (replacing comma with period for multi-language support)

DenverCoder1 · 2022-02-09T23:08:38Z

dateparser_scripts/write_complete_data.py

@@ -31,7 +31,7 @@ def _modify_relative_data(relative_data):
    modified_relative_data = OrderedDict()
    for key, value in relative_data.items():
        for i, string in enumerate(value):
-            string = RELATIVE_PATTERN.sub(r'(\\d+)', string)
+            string = RELATIVE_PATTERN.sub(r'(\\d+[.,]?\\d*)', string)


Updated the data converter script to include decimal places in regexes

While we wait for a 2nd maintainer approval, I wonder if we should go for a more strict regular expression here. Something like (\d+(?:[.,]\d+)? maybe?

@Gallaecio I've spent hours trying more specific regexes but for some reason, any regex I try besides the current one ends up with hundreds of failed tests and strings not translating. Feel free to try some things out and see if you can get it to work, but I think sticking to this regex that works is easiest.

DenverCoder1 · 2022-02-09T23:09:01Z

tests/test_data.py

@@ -65,7 +65,7 @@ def is_invalid_relative_regex_mapping(relative_regex_mapping):
    if '\\1' not in key:
        return True
    return not (all([isinstance(x, str) for x in value])
-                and all(['(\\d+)' in x for x in value]))
+                and all(['(\\d+[.,]?\\d*)' in x for x in value]))


Updated the test for the converted data

DenverCoder1 · 2022-02-09T23:10:48Z

tests/test_freshness_date_parser.py

+        # Fractional units
+        param('2.5 hours', ago={'hours': 2.5}, period='day'),
+        param('10.75 minutes', ago={'minutes': 10.75}, period='day'),
+        param('1.5 days', ago={'days': 1.5}, period='day'),


Added test cases for fractional units (tests added to 4 places within this file including with commas)

jgayfer · 2022-06-13T20:14:14Z

Would love to get some more visibility on this! ♥️

DenverCoder1 · 2022-10-23T21:01:53Z

Updating the base branch will take some extra work due to some changes in the language data.

#1079 specifically has caused some issues since it changed the CLDR data instead of the supplementary yaml. I can open a separate PR to fix that.

DenverCoder1 · 2022-10-24T00:19:52Z

All languages should be working again now.

Note: The fix in #1090 is also incorporated in this PR.

A code diff overview has been added to the PR description to make it easier to review.

serhii73 · 2022-10-31T16:32:01Z

Thank you.

Added support for fractional units

d585d45

Gallaecio reviewed Feb 8, 2021

View reviewed changes

dateparser/freshness_date_parser.py Outdated Show resolved Hide resolved

Added tests and comma decimal support

cbb0b58

Updated write_complete_data for fractional units

2f2d5c9

DenverCoder1 closed this Feb 8, 2021

DenverCoder1 reopened this Feb 8, 2021

Gallaecio approved these changes Feb 9, 2021

View reviewed changes

Gallaecio closed this Feb 9, 2021

Gallaecio reopened this Feb 9, 2021

noviluni closed this Feb 10, 2021

noviluni reopened this Feb 10, 2021

DenverCoder1 commented Feb 9, 2022

View reviewed changes

DenverCoder1 added 2 commits October 20, 2022 20:10

Merge branch 'master' into support-for-fractional-units

3ca6540

Rerun write_complete_data.py

3c9c05a

Merge branch 'master' into support-for-fractional-units

5a4a936

DenverCoder1 mentioned this pull request Oct 23, 2022

Fix Ukrainian supplemental data to be in the yaml and restored CLDR data #1090

Merged

DenverCoder1 added 3 commits October 23, 2022 21:35

update new language data

d184989

fix for croatian dates

6def952

add case insensitive to croatian fix

78217bd

Merge branch 'master' into support-for-fractional-units

5c5b3f1

serhii73 requested review from wRAR and serhii73 October 24, 2022 07:28

serhii73 approved these changes Oct 31, 2022

View reviewed changes

serhii73 merged commit 6e8a132 into scrapinghub:master Oct 31, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added support for fractional units #876

Added support for fractional units #876

DenverCoder1 commented Feb 7, 2021 •

edited

Gallaecio left a comment

noviluni commented Feb 8, 2021

DenverCoder1 commented Feb 8, 2021

Gallaecio commented Feb 8, 2021

DenverCoder1 commented Feb 8, 2021

Gallaecio commented Feb 8, 2021

DenverCoder1 commented Feb 8, 2021

Gallaecio commented Feb 8, 2021

DenverCoder1 commented Feb 8, 2021

Gallaecio left a comment

Gallaecio commented Feb 9, 2021

codecov bot commented Feb 10, 2021 •

edited

noviluni commented Feb 10, 2021

Gallaecio commented Feb 10, 2021 •

edited

DenverCoder1 commented Mar 21, 2021

repl-samuel commented Feb 9, 2022

Gallaecio commented Feb 9, 2022

DenverCoder1 left a comment •

edited

DenverCoder1 Feb 9, 2022

DenverCoder1 Feb 9, 2022 •

edited

DenverCoder1 Feb 9, 2022 •

edited

Gallaecio Oct 24, 2022

DenverCoder1 Oct 24, 2022

DenverCoder1 Feb 9, 2022 •

edited

DenverCoder1 Feb 9, 2022 •

edited

jgayfer commented Jun 13, 2022

DenverCoder1 commented Oct 23, 2022 •

edited

DenverCoder1 commented Oct 24, 2022 •

edited

serhii73 commented Oct 31, 2022

Added support for fractional units #876

Added support for fractional units #876

Conversation

DenverCoder1 commented Feb 7, 2021 • edited

What this changes

Tests

What was changed

Gallaecio left a comment

Choose a reason for hiding this comment

noviluni commented Feb 8, 2021

DenverCoder1 commented Feb 8, 2021

Gallaecio commented Feb 8, 2021

DenverCoder1 commented Feb 8, 2021

Gallaecio commented Feb 8, 2021

DenverCoder1 commented Feb 8, 2021

Gallaecio commented Feb 8, 2021

DenverCoder1 commented Feb 8, 2021

Gallaecio left a comment

Choose a reason for hiding this comment

Gallaecio commented Feb 9, 2021

codecov bot commented Feb 10, 2021 • edited

Codecov Report

noviluni commented Feb 10, 2021

Gallaecio commented Feb 10, 2021 • edited

DenverCoder1 commented Mar 21, 2021

repl-samuel commented Feb 9, 2022

Gallaecio commented Feb 9, 2022

DenverCoder1 left a comment • edited

Choose a reason for hiding this comment

DenverCoder1 Feb 9, 2022

Choose a reason for hiding this comment

DenverCoder1 Feb 9, 2022 • edited

Choose a reason for hiding this comment

DenverCoder1 Feb 9, 2022 • edited

Choose a reason for hiding this comment

Gallaecio Oct 24, 2022

Choose a reason for hiding this comment

DenverCoder1 Oct 24, 2022

Choose a reason for hiding this comment

DenverCoder1 Feb 9, 2022 • edited

Choose a reason for hiding this comment

DenverCoder1 Feb 9, 2022 • edited

Choose a reason for hiding this comment

jgayfer commented Jun 13, 2022

DenverCoder1 commented Oct 23, 2022 • edited

DenverCoder1 commented Oct 24, 2022 • edited

serhii73 commented Oct 31, 2022

DenverCoder1 commented Feb 7, 2021 •

edited

codecov bot commented Feb 10, 2021 •

edited

Gallaecio commented Feb 10, 2021 •

edited

DenverCoder1 left a comment •

edited

DenverCoder1 Feb 9, 2022 •

edited

DenverCoder1 Feb 9, 2022 •

edited

DenverCoder1 Feb 9, 2022 •

edited

DenverCoder1 Feb 9, 2022 •

edited

DenverCoder1 commented Oct 23, 2022 •

edited

DenverCoder1 commented Oct 24, 2022 •

edited