New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added support for fractional units #876
Added support for fractional units #876
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also extend tests to cover this new feature? I’m think we should at least test 1 such value, and assuming we include comma support the same value with a comma instead of a period.
HI @DenverCoder1, thank you for this PR. Could you add those tests you mention in the description in the project tests? Thanks! |
@Gallaecio Added some tests for fractional units. Also added some support for numbers with commas instead of periods. Adding comma support was not as simple because of the way the |
I see you’ve had to modify the per-language Python files. Those are not meant to be modified manually, and are instead generated from other files in the repository. See https://dateparser.readthedocs.io/en/latest/contributing.html#guidelines-for-editing-translation-data . On the bright side, it may mean that you just need to edit the script that generates them. |
I see. So I should update |
I am not yet familiar with that process, but I would assume you would have to edit either This is also assuming that you can get the desired result editing one of those two files. If the patterns came directly from CLDR (unlikely but possible), it may get messier. Please have a look at |
It seems to me that some of the data in the .py files is coming from the date_translation_data JSON files and some parts (such as decades) are coming from the supplementary YAML files. I made changes in both places to get the .py files to fully update. |
The JSON files are not meant to be modified (they come from http://cldr.unicode.org/), so this may be not as trivial as I had hoped. |
I didn't need to modify the JSON, only had to change the way it is converted in write_complete_data.py. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✔️
@noviluni Mind that tests did not run in CI for some reason.
Looking at #859 maybe it’s just a matter or reopening. |
Codecov ReportBase: 98.23% // Head: 98.23% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## master #876 +/- ##
=======================================
Coverage 98.23% 98.23%
=======================================
Files 232 232
Lines 2602 2604 +2
=======================================
+ Hits 2556 2558 +2
Misses 46 46
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
Hi @DenverCoder1 @Gallaecio Thank you for the PR. I like the approach, however, I'm don't want to merge this as-is, because not all the locale/regions accept both (period and comma) as decimal separator. This could lead to some errors. For example, if we have I tried with this code, and this is what I got: >>> datetime.now() - dateparser.parse('1,5000 s', languages=['en'])
datetime.timedelta(0, 4999, 995131)
>>> datetime.now() - dateparser.parse('1.5000 s', languages=['en'])
datetime.timedelta(0, 1, 496419) >>> datetime.now() - dateparser.parse('1,5000 s', languages=['es'])
datetime.timedelta(0, 4999, 996924)
>>> datetime.now() - dateparser.parse('1.5000 s', languages=['es'])
datetime.timedelta(0, 1, 496970) I personally think that the decimal/comma separator should be configured by default by locale and we should probably be able to configure it manually using the settings, so this approach wouldn't probably work. Let me know what you think / your ideas :) |
I think you have a point, but I also imagine that those examples are interpreted as If so, I would not mind merging as is and creating 2 separate tickets (1) for restricting supported decimal separators based on locale (we can hopefully use data from CLDR to choose the right ones) and (2) for supporting thousand separators. But yes, we could also aim to solve all of that in this patch. I’m just worried about the extra time that could require. |
@Gallaecio @noviluni What's the status on this? I don't think I have the time now to try making it locale-based as suggested. I agree that it could be made a separate PR. As of now, it will make things such as "2.5 hours", etc. parse correctly when right now they don't ("2.5" gets treated as "5"). To make it work with periods only required a couple simple changes. Making it work with commas as well was a bit more complicated due to the way the parser ignores commas, but it should also be working properly with these changes. If there's a worry that parsing commas as decimal points could break things, that part could be removed for now until the locale settings are in place, or if not, it could be merged as is. Let me know if there's anything else needed from me. |
Could I ask the status of this PR? |
We usually follow a 2-maintainer-approval policy before merging. The next step is for another maintainer to have a look. However, most (all?) dateparser maintainers are also maintainers of many other open source libraries, and our time is limited. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make it easier for future reviewers, the change is much smaller than it appears to be. The reason there are many files changed is because the regex in write_complete_data.py
was updated and ran.
See below for shortcuts to the code changes.
@@ -11,7 +11,7 @@ | |||
|
|||
|
|||
_UNITS = r'decade|year|month|week|day|hour|minute|second' | |||
PATTERN = re.compile(r'(\d+)\s*(%s)\b' % _UNITS, re.I | re.S | re.U) | |||
PATTERN = re.compile(r'(\d+[.,]?\d*)\s*(%s)\b' % _UNITS, re.I | re.S | re.U) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Added decimal point to unit parser regex
@@ -147,7 +147,7 @@ def get_kwargs(self, date_string): | |||
|
|||
kwargs = {} | |||
for num, unit in m: | |||
kwargs[unit + 's'] = int(num) | |||
kwargs[unit + 's'] = float(num.replace(",", ".")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Converted unit type to float instead of int, (replacing comma with period for multi-language support)
@@ -31,7 +31,7 @@ def _modify_relative_data(relative_data): | |||
modified_relative_data = OrderedDict() | |||
for key, value in relative_data.items(): | |||
for i, string in enumerate(value): | |||
string = RELATIVE_PATTERN.sub(r'(\\d+)', string) | |||
string = RELATIVE_PATTERN.sub(r'(\\d+[.,]?\\d*)', string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Updated the data converter script to include decimal places in regexes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we wait for a 2nd maintainer approval, I wonder if we should go for a more strict regular expression here. Something like (\d+(?:[.,]\d+)?
maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Gallaecio I've spent hours trying more specific regexes but for some reason, any regex I try besides the current one ends up with hundreds of failed tests and strings not translating. Feel free to try some things out and see if you can get it to work, but I think sticking to this regex that works is easiest.
@@ -65,7 +65,7 @@ def is_invalid_relative_regex_mapping(relative_regex_mapping): | |||
if '\\1' not in key: | |||
return True | |||
return not (all([isinstance(x, str) for x in value]) | |||
and all(['(\\d+)' in x for x in value])) | |||
and all(['(\\d+[.,]?\\d*)' in x for x in value])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Updated the test for the converted data
# Fractional units | ||
param('2.5 hours', ago={'hours': 2.5}, period='day'), | ||
param('10.75 minutes', ago={'minutes': 10.75}, period='day'), | ||
param('1.5 days', ago={'days': 1.5}, period='day'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Added test cases for fractional units (tests added to 4 places within this file including with commas)
Would love to get some more visibility on this! |
Updating the base branch will take some extra work due to some changes in the language data. #1079 specifically has caused some issues since it changed the CLDR data instead of the supplementary yaml. I can open a separate PR to fix that. |
All languages should be working again now. Note: The fix in #1090 is also incorporated in this PR. A code diff overview has been added to the PR description to make it easier to review. |
Thank you. |
Fixes #753
Added support for fractional units by making the regex pattern support digits before a decimal point and converting
num
to afloat
instead of anint
on line 150.What this changes
Before this fix,
parse("in 0.5 hours")
would return the time 5 hours in the future instead of 0.5 hours in the future.With this fix, parsing fractional units are no longer parsed incorrectly.
Tests
What was changed
To make it easier for future reviewers, the change is much smaller than it appears to be. The reason there are 131+ files changed is because the regex in
write_complete_data.py
was updated and ran.Very few lines of code were changed in this PR:
freshness_date_parser.py
line 14)float
instead ofint
in (freshness_date_parser.py
line 150)write_complete_data.py
line 34)relative-type-regex
sections (multiple locations)date.py
line 110)test_data.py
line 68)test_freshness_date_parser.py
)See these comments for shortcuts to view the changes