Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for fractional units #876

Merged

Conversation

DenverCoder1
Copy link
Contributor

@DenverCoder1 DenverCoder1 commented Feb 7, 2021

Fixes #753

Added support for fractional units by making the regex pattern support digits before a decimal point and converting num to a float instead of an int on line 150.

What this changes

Before this fix, parse("in 0.5 hours") would return the time 5 hours in the future instead of 0.5 hours in the future.

With this fix, parsing fractional units are no longer parsed incorrectly.

Tests

>>> parse("now")
datetime.datetime(2021, 2, 7, 18, 50, 10, 107501)
>>> parse("in 0.5 hours")
datetime.datetime(2021, 2, 7, 19, 20, 15, 773800)
>>> parse("in 3.75 hours") 
datetime.datetime(2021, 2, 7, 22, 35, 20, 463238)
>>> parse("2.5 hours ago")  
datetime.datetime(2021, 2, 7, 16, 20, 29, 614937)

What was changed

To make it easier for future reviewers, the change is much smaller than it appears to be. The reason there are 131+ files changed is because the regex in write_complete_data.py was updated and ran.

Very few lines of code were changed in this PR:

- PATTERN = re.compile(r'(\d+)\s*(%s)\b' % _UNITS, re.I | re.S | re.U)
+ PATTERN = re.compile(r'(\d+[.,]?\d*)\s*(%s)\b' % _UNITS, re.I | re.S | re.U)
- kwargs[unit + 's'] = int(num)
+ kwargs[unit + 's'] = float(num.replace(",", "."))
- string = RELATIVE_PATTERN.sub(r'(\\d+)', string)
+ string = RELATIVE_PATTERN.sub(r'(\\d+[.,]?\\d*)', string)
  • Number regex changed in supplementary yaml files for custom relative-type-regex sections (multiple locations)
In all occurrences in supplementary yaml:
- (\d+)
+ (\d+[.,]?\d*)
line 40
+ RE_SANITIZE_CROATIAN = re.compile(r'(\d+)\.\s?(\d+)\.\s?(\d+)\.( u)?', flags=re.I | re.U)
line 110
+ date_string = RE_SANITIZE_CROATIAN.sub(r'\1.\2.\3 ', date_string)  # extra '.' and 'u' interferes with parsing relative fractional dates
- return not (all([isinstance(x, str) for x in value])
-                and all(['(\\d+)' in x for x in value]))
+ return not (all([isinstance(x, str) for x in value])
+                and all(['(\\d+[.,]?\\d*)' in x for x in value]))
line 99
+ # Fractional units
+ param('2.5 hours', ago={'hours': 2.5}, period='day'),
+ param('10.75 minutes', ago={'minutes': 10.75}, period='day'),
+ param('1.5 days', ago={'days': 1.5}, period='day'),
line 616
+ # Fractional units
+ param('2.5 hours', ago={'hours': 2.5}, period='day'),
+ param('10.75 minutes', ago={'minutes': 10.75}, period='day'),
+ param('1.5 days', ago={'days': 1.5}, period='day'),
line 1108
+ # Fractional units
+ param('in 2.5 hours', in_future={'hours': 2.5}, period='day'),
+ param('in 10.75 minutes', in_future={'minutes': 10.75}, period='day'),
+ param('in 1.5 days', in_future={'days': 1.5}, period='day'),
+ param('in 0,5 hours', in_future={'hours': 0.5}, period='day'),
line 1635
+ # Fractional units
+ param('2.5 hours ago', date(2010, 6, 4), time(10, 45)),
+ param('in 10.75 minutes', date(2010, 6, 4), time(13, 25, 45)),
+ param('in 1.5 days', date(2010, 6, 6), time(1, 15)),
+ param('0,5 hours ago', date(2010, 6, 4), time(12, 45)),

See these comments for shortcuts to view the changes

Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also extend tests to cover this new feature? I’m think we should at least test 1 such value, and assuming we include comma support the same value with a comma instead of a period.

dateparser/freshness_date_parser.py Outdated Show resolved Hide resolved
@noviluni
Copy link
Collaborator

noviluni commented Feb 8, 2021

HI @DenverCoder1, thank you for this PR.

Could you add those tests you mention in the description in the project tests? Thanks!

@DenverCoder1
Copy link
Contributor Author

@Gallaecio Added some tests for fractional units. Also added some support for numbers with commas instead of periods. Adding comma support was not as simple because of the way the Dictionary splits the strings.

@Gallaecio
Copy link
Member

I see you’ve had to modify the per-language Python files. Those are not meant to be modified manually, and are instead generated from other files in the repository. See https://dateparser.readthedocs.io/en/latest/contributing.html#guidelines-for-editing-translation-data . On the bright side, it may mean that you just need to edit the script that generates them.

@DenverCoder1
Copy link
Contributor Author

@Gallaecio

I see. So I should update write_complete_data.py and the yaml files then rerun write_complete_data.py?

@Gallaecio
Copy link
Member

I am not yet familiar with that process, but I would assume you would have to edit either write_complete_data.py (hopefully) or the yaml files (if they include the \d pattern), and then rerun write_complete_data.py.

This is also assuming that you can get the desired result editing one of those two files. If the patterns came directly from CLDR (unlikely but possible), it may get messier.

Please have a look at en.yml and write_complete_data.py, see if you can figure out where the change should go. Otherwise, @noviluni is the person most familiar with the process, so he probably knows.

@DenverCoder1
Copy link
Contributor Author

It seems to me that some of the data in the .py files is coming from the date_translation_data JSON files and some parts (such as decades) are coming from the supplementary YAML files. I made changes in both places to get the .py files to fully update.

@Gallaecio
Copy link
Member

The JSON files are not meant to be modified (they come from http://cldr.unicode.org/), so this may be not as trivial as I had hoped.

@DenverCoder1
Copy link
Contributor Author

The JSON files are not meant to be modified (they come from http://cldr.unicode.org), so this may be not as trivial as I had hoped.

I didn't need to modify the JSON, only had to change the way it is converted in write_complete_data.py.

@DenverCoder1 DenverCoder1 reopened this Feb 8, 2021
Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✔️

@noviluni Mind that tests did not run in CI for some reason.

@Gallaecio
Copy link
Member

Looking at #859 maybe it’s just a matter or reopening.

@Gallaecio Gallaecio closed this Feb 9, 2021
@Gallaecio Gallaecio reopened this Feb 9, 2021
@noviluni noviluni closed this Feb 10, 2021
@noviluni noviluni reopened this Feb 10, 2021
@codecov
Copy link

codecov bot commented Feb 10, 2021

Codecov Report

Base: 98.23% // Head: 98.23% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (5c5b3f1) compared to base (d052d3e).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #876   +/-   ##
=======================================
  Coverage   98.23%   98.23%           
=======================================
  Files         232      232           
  Lines        2602     2604    +2     
=======================================
+ Hits         2556     2558    +2     
  Misses         46       46           
Impacted Files Coverage Δ
dateparser/data/date_translation_data/af.py 100.00% <ø> (ø)
dateparser/data/date_translation_data/am.py 100.00% <ø> (ø)
dateparser/data/date_translation_data/ar.py 100.00% <ø> (ø)
dateparser/data/date_translation_data/ast.py 100.00% <ø> (ø)
dateparser/data/date_translation_data/az-Latn.py 100.00% <ø> (ø)
dateparser/data/date_translation_data/az.py 100.00% <ø> (ø)
dateparser/data/date_translation_data/be.py 100.00% <ø> (ø)
dateparser/data/date_translation_data/bg.py 100.00% <ø> (ø)
dateparser/data/date_translation_data/bn.py 100.00% <ø> (ø)
dateparser/data/date_translation_data/br.py 100.00% <ø> (ø)
... and 105 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@noviluni
Copy link
Collaborator

Hi @DenverCoder1 @Gallaecio Thank you for the PR.

I like the approach, however, I'm don't want to merge this as-is, because not all the locale/regions accept both (period and comma) as decimal separator. This could lead to some errors.

For example, if we have 1.500 s we don't know if it's 1 point 5 seconds or 1 thousand 500 seconds. In UK and US, they normally use comma (,) as a thousand separator, while in Spanish we usually use the period (.) as a thousand separator.

I tried with this code, and this is what I got:

>>> datetime.now() - dateparser.parse('1,5000 s', languages=['en'])
datetime.timedelta(0, 4999, 995131)

>>> datetime.now() - dateparser.parse('1.5000 s', languages=['en'])
datetime.timedelta(0, 1, 496419)
>>> datetime.now() - dateparser.parse('1,5000 s', languages=['es'])
datetime.timedelta(0, 4999, 996924)

>>> datetime.now() - dateparser.parse('1.5000 s', languages=['es'])
datetime.timedelta(0, 1, 496970)

I personally think that the decimal/comma separator should be configured by default by locale and we should probably be able to configure it manually using the settings, so this approach wouldn't probably work.

Let me know what you think / your ideas :)

@Gallaecio
Copy link
Member

Gallaecio commented Feb 10, 2021

I think you have a point, but I also imagine that those examples are interpreted as 5000 with master. If so, I imagine this change would not break anything that is parsed fine at the moment, but only extend support for fractional numbers without thousand separators.

If so, I would not mind merging as is and creating 2 separate tickets (1) for restricting supported decimal separators based on locale (we can hopefully use data from CLDR to choose the right ones) and (2) for supporting thousand separators.

But yes, we could also aim to solve all of that in this patch. I’m just worried about the extra time that could require.

@DenverCoder1
Copy link
Contributor Author

@Gallaecio @noviluni What's the status on this? I don't think I have the time now to try making it locale-based as suggested. I agree that it could be made a separate PR.

As of now, it will make things such as "2.5 hours", etc. parse correctly when right now they don't ("2.5" gets treated as "5").

To make it work with periods only required a couple simple changes. Making it work with commas as well was a bit more complicated due to the way the parser ignores commas, but it should also be working properly with these changes.

If there's a worry that parsing commas as decimal points could break things, that part could be removed for now until the locale settings are in place, or if not, it could be merged as is.

Let me know if there's anything else needed from me.

@repl-samuel
Copy link

Could I ask the status of this PR?
It's approved almost a year ago but still not get merged?

@Gallaecio
Copy link
Member

We usually follow a 2-maintainer-approval policy before merging. The next step is for another maintainer to have a look. However, most (all?) dateparser maintainers are also maintainers of many other open source libraries, and our time is limited.

Copy link
Contributor Author

@DenverCoder1 DenverCoder1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it easier for future reviewers, the change is much smaller than it appears to be. The reason there are many files changed is because the regex in write_complete_data.py was updated and ran.

See below for shortcuts to the code changes.

@@ -11,7 +11,7 @@


_UNITS = r'decade|year|month|week|day|hour|minute|second'
PATTERN = re.compile(r'(\d+)\s*(%s)\b' % _UNITS, re.I | re.S | re.U)
PATTERN = re.compile(r'(\d+[.,]?\d*)\s*(%s)\b' % _UNITS, re.I | re.S | re.U)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Added decimal point to unit parser regex

@@ -147,7 +147,7 @@ def get_kwargs(self, date_string):

kwargs = {}
for num, unit in m:
kwargs[unit + 's'] = int(num)
kwargs[unit + 's'] = float(num.replace(",", "."))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Converted unit type to float instead of int, (replacing comma with period for multi-language support)

@@ -31,7 +31,7 @@ def _modify_relative_data(relative_data):
modified_relative_data = OrderedDict()
for key, value in relative_data.items():
for i, string in enumerate(value):
string = RELATIVE_PATTERN.sub(r'(\\d+)', string)
string = RELATIVE_PATTERN.sub(r'(\\d+[.,]?\\d*)', string)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Updated the data converter script to include decimal places in regexes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we wait for a 2nd maintainer approval, I wonder if we should go for a more strict regular expression here. Something like (\d+(?:[.,]\d+)? maybe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Gallaecio I've spent hours trying more specific regexes but for some reason, any regex I try besides the current one ends up with hundreds of failed tests and strings not translating. Feel free to try some things out and see if you can get it to work, but I think sticking to this regex that works is easiest.

@@ -65,7 +65,7 @@ def is_invalid_relative_regex_mapping(relative_regex_mapping):
if '\\1' not in key:
return True
return not (all([isinstance(x, str) for x in value])
and all(['(\\d+)' in x for x in value]))
and all(['(\\d+[.,]?\\d*)' in x for x in value]))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Updated the test for the converted data

Comment on lines +99 to +102
# Fractional units
param('2.5 hours', ago={'hours': 2.5}, period='day'),
param('10.75 minutes', ago={'minutes': 10.75}, period='day'),
param('1.5 days', ago={'days': 1.5}, period='day'),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Added test cases for fractional units (tests added to 4 places within this file including with commas)

@jgayfer
Copy link

jgayfer commented Jun 13, 2022

Would love to get some more visibility on this! ♥️

@DenverCoder1
Copy link
Contributor Author

DenverCoder1 commented Oct 23, 2022

Updating the base branch will take some extra work due to some changes in the language data.

#1079 specifically has caused some issues since it changed the CLDR data instead of the supplementary yaml. I can open a separate PR to fix that.

@DenverCoder1
Copy link
Contributor Author

DenverCoder1 commented Oct 24, 2022

All languages should be working again now.

Note: The fix in #1090 is also incorporated in this PR.

A code diff overview has been added to the PR description to make it easier to review.

@serhii73 serhii73 merged commit 6e8a132 into scrapinghub:master Oct 31, 2022
@serhii73
Copy link
Collaborator

Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parsing of fractional hours is not supported
6 participants