Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement dehumanize method #956

Merged
merged 25 commits into from Apr 25, 2021
Merged

Conversation

anishnya
Copy link
Member

@anishnya anishnya commented Apr 8, 2021

Pull Request Checklist

Thank you for taking the time to improve Arrow! Before submitting your pull request, please check all appropriate boxes:

  • [ x] 🧪 Added tests for changed code.
  • [ x] 🛠️ All tests pass when run locally (run tox or make test to find out!).
  • [ x] 🧹 All linting checks pass when run locally (run tox -e lint or make lint to find out!).
  • [ x] 📚 Updated documentation for changed code.
  • [ x] ⏩ Code is up-to-date with the master branch.

If you have any questions about your code changes or any of the points above, please submit your questions along with the pull request and we will try our best to help!

Description of Changes

This PR isn't expected to be merged, however, we would like to start a dialogue on how to proceed with some of the challenges ahead in implementing a reverse_humanize/dehumanize method as proposed in #360. We would first like to thank #772 and #895 for giving us a good base to work off of.

This attempt so far works with 37 of the 54 current locales in Arrow, a big step forward from previous attempts. There are a couple issues we would like to address and discuss.

1) Input Validation

Currently, this solution has little input validation. For example, a user can "in 3 months and 3 lightseconds" and dehumanize will return an object that is 3 months ahead. After some experimenting, creating a generalized method for input validation across locales seems unfeasible.

For example, one method we attempted was to simply delete any match from the input string and then at the very end of search, raise an error if the string was not empty (after removing the locale's and word and removing any whitespace). This creates some issues in languages such as Azerbaijani, the plural for a unit is the same as a single value for a unit. For example, the phrase for "a second" in Azerbaijani is "saniyə" whereas the phrase for "5 seconds" is "5 saniyə". This type of situation has only been encountered in 3 languages so far. There is an easy fix here, but we worry there may be another locale introduced that breaks those rules entirely.

Some dialogue is needed here on how much input validation is or isn't needed. Potentially we can label dehumanize as fuzzy_dehumanize in the interim and revisit the question of input validation later.

2) Remaining Locales and the Need for Standardization Across Locales

In order to support the remaining locales not supported by this feature, there needs to be some element of standardization across locales. This pertains to what we will be referring to as special time frame values (a special word conveying two days ago).

Some locales incorporate these special time frame values very differently. Some like the Slavic languages used the regular timeframe object and have the key as some list.

Bulgarian:
"minutes": ["{0} минута", "{0} минути", "{0} минути"]

Some have their timeframe key as a dictionary.

Arabic:
"seconds": {"double": "ثانيتين", "ten": "{0} ثوان", "higher": "{0} ثانية"},

Some have some nested combination of both.

Czech:
"seconds": {"past": "{0} sekundami", "future": ["{0} sekundy", "{0} sekund"]}

Some introduce their own separate objects.

Korean:
special_dayframes = { -3: "그끄제", -2: "그제", -1: "어제", 1: "내일", 2: "모레", 3: "글피", 4: "그글피", }

A temporary solution here would be to create a special time frame dictionary for each of these locales, allowing dehumanize to function on these languages and then later down the line figuring out a way to standardize across locales.
A more permanent solution maybe to standardized like the following.

Temporary Solution:
special_timeframes = {"string for two months ago": ["months", 2, True], "{0} special plural unit": ["seconds", 1, False]}

The first element in the list represents the unit that string modifies, the second the value modified and the third is a bool if we need to obtain the value from the string or not.

Permanent Solution:
"seconds": {1: "special value one", 5: "special value two", "general": "general value one {0}", "special case one": "special case string {0}"}

We could check if a key is an int and know the value associated with that phrase for the particular unit. If the key is some string, we know there the value that can be associated with that phrase is arbitrary and we need to obtain the value.

I believe these issues aren't too complex to solve, but I think we need some discussion to choose from multiple potential soltuons.

@codecov
Copy link

codecov bot commented Apr 8, 2021

Codecov Report

Merging #956 (ef81148) into master (f18be7f) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #956   +/-   ##
=======================================
  Coverage   99.74%   99.75%           
=======================================
  Files          10       10           
  Lines        1944     2014   +70     
  Branches      312      323   +11     
=======================================
+ Hits         1939     2009   +70     
  Misses          4        4           
  Partials        1        1           
Impacted Files Coverage Δ
arrow/api.py 100.00% <100.00%> (ø)
arrow/arrow.py 99.09% <100.00%> (+0.09%) ⬆️
arrow/constants.py 100.00% <100.00%> (ø)
arrow/factory.py 100.00% <100.00%> (ø)
arrow/formatter.py 100.00% <100.00%> (ø)
arrow/locales.py 100.00% <100.00%> (ø)
arrow/parser.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f18be7f...ef81148. Read the comment docs.

@systemcatch systemcatch changed the title Reverse humanize Implement dehumanize method Apr 13, 2021
@systemcatch
Copy link
Collaborator

Hey @anishnya from a quick skim over (more detailed review will follow) I really like this PR and think merging it is quite possible. If it only works for 37/54 locales at the moment we can encode a locale check and work on improving the method. With dehumanize and all the PR attempts before I think we've let perfect be the enemy of good sometime.

ALee008 and others added 3 commits April 14, 2021 21:07
* added parameter weekday and thus ability to choose start day when using span in combination with frame == "week"

* changed default value for new parameter weekday.

* added test for new parameter weekday when using span and frame == "week"

* added comma after new parameter weekday.

* Update arrow/arrow.py

Co-authored-by: Chris <30196510+systemcatch@users.noreply.github.com>

* changed parameter name from weekday to week_start.

* extended doc string to highlight new parameter week_start.

* raise ValueError if 1 <= week_start <= 7 is not met

* renamed weekday to week_start in comment.

* changed if-else block to one liner. Updated comments.

* Update arrow/arrow.py

add blank line at the end of the docstring

Co-authored-by: Chris <30196510+systemcatch@users.noreply.github.com>

* Remove trailing whitespace

* added new line at end of doc string.

* Fix doc issue

* extended test_span_week by further test cases.

Co-authored-by: Jad Chaar <jadchaar@users.noreply.github.com>
Co-authored-by: Chris <30196510+systemcatch@users.noreply.github.com>
Co-authored-by: Jad Chaar <jad.chaar@gmail.com>
* Progress toward locale validation

* standardize and validate locales

* Add tests

* Clean up tests

* Use new locale name in error

* Remove useless comments

* Address comments

* English

* Remove default locale from test_parser
Copy link
Collaborator

@systemcatch systemcatch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @anishnya I've done a first pass with some comments. I think you should create a list of supported locales in constants.py and perform a check with the provided locale, raising an error if it's not supported.

arrow/arrow.py Outdated
Comment on lines 1287 to 1289
"""Returns an arrow object that represents the
date/time represented by a humanzied arrow time string.
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this docstring to be like the humanize one with params and usage examples.

arrow/arrow.py Outdated
Comment on lines 1311 to 1312
# Sign logic
sign_val = 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed (it is set below or throws an error).

arrow/arrow.py Outdated
locale_obj = locales.get_locale(locale)

# Create a regex pattern object for numbers
num_pattern = re.compile(r"[0-9]+")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be able to use \d here instead.

arrow/arrow.py Outdated
# If no number matches set change value to be one
if not num_match:
change_value = 1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for a space here.

Comment on lines 2573 to 2591
# Errors with humanize cause this test case to fail (issue reported)
""" def test_mixed_granularity_month(self, locale_list_no_weeks):

for lang in locale_list_no_weeks:

arw = arrow.Arrow(2000, 1, 10, 5, 55, 0)
past = arw.shift(months=-3, days=-23, seconds=-1)
future = arw.shift(months=3, days=23, seconds=1)

past_string = past.humanize(arw, locale=lang, granularity=["month","day","second"])
future_string = future.humanize(
arw, locale=lang, granularity=["month","day","second"]
)

print(future_string)
print(past_string)

assert arw.dehumanize(past_string, locale=lang) == past
assert arw.dehumanize(future_string, locale=lang) == future """
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed now the issue is filed.

arrow/arrow.py Outdated
Comment on lines 1371 to 1380

return current_time.shift(
seconds=sign_val * time_object_info["seconds"],
minutes=sign_val * time_object_info["minutes"],
hours=sign_val * time_object_info["hours"],
days=sign_val * time_object_info["days"],
weeks=sign_val * time_object_info["weeks"],
months=sign_val * time_object_info["months"],
years=sign_val * time_object_info["years"],
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return current_time.shift(
seconds=sign_val * time_object_info["seconds"],
minutes=sign_val * time_object_info["minutes"],
hours=sign_val * time_object_info["hours"],
days=sign_val * time_object_info["days"],
weeks=sign_val * time_object_info["weeks"],
months=sign_val * time_object_info["months"],
years=sign_val * time_object_info["years"],
)
time_changes = {k: sign_val*v for k, v in time_object_info.items()}
return current_time.shift(**time_changes)

Or something along those lines, saves having to do all the multiplications in the return.

arrow/arrow.py Outdated
Comment on lines 1330 to 1331
else:
raise ValueError(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need lots of tests where garbage/bad info is passed to dehumanize.

arrow/arrow.py Outdated
Comment on lines 1295 to 1303
time_object_info = {
"seconds": 0,
"minutes": 0,
"hours": 0,
"days": 0,
"weeks": 0,
"months": 0,
"years": 0,
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
time_object_info = {
"seconds": 0,
"minutes": 0,
"hours": 0,
"days": 0,
"weeks": 0,
"months": 0,
"years": 0,
}
time_object_info = dict.fromkeys(["seconds", "minutes", "hours", "days", "weeks", "months", "years"], 0)

Purely personal preference, feel free to ignore.

@anishnya
Copy link
Member Author

Will get on these issues asap @systemcatch. How much input validation is needed? Just input that is completely bad, like for example a random string? Or more nuanced input validation like rejecting "2 seconds 10 seconds ago"? As mentioned in the original PR post, the latter is proving to be quite difficult to generalize across locales. The current solution will reject any string that does not contain some kind of time reference like "ago" or "in".

@systemcatch
Copy link
Collaborator

Let's start with simple validation and work from there.

@jadchaar jadchaar added this to the v1.1.0 milestone Apr 19, 2021
@anishnya
Copy link
Member Author

I've implemented all the suggestions and added additional input validation test cases. Although the input validation isn't perfect, it does reject complete nonsense. Apologies in advance for the weird merge issues in this PR. I'm not too sure why that ended up happening.

@systemcatch
Copy link
Collaborator

I've implemented all the suggestions and added additional input validation test cases. Although the input validation isn't perfect, it does reject complete nonsense. Apologies in advance for the weird merge issues in this PR. I'm not too sure why that ended up happening.

Not to worry we'll squash all the commits to make it look nice and clean.

Copy link
Collaborator

@systemcatch systemcatch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good @anishnya, I've got a couple more comments which is mostly small stuff.

We'll need dehumanize to be mentioned in the documentation, how about we put one or two examples after the humanize section?

arrow/arrow.py Outdated
normalized_locale_name = locale.lower().replace("_", "-")

if normalized_locale_name not in DEHUMANIZE_LOCALES:
raise ValueError(f"Dehumanize does not currently support locale {locale}.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise ValueError(f"Dehumanize does not currently support locale {locale}.")
raise ValueError(f"Dehumanize does not currently support the {locale} locale, please consider making a contribution to add support for this locale.")

arrow/arrow.py Outdated
if normalized_locale_name not in DEHUMANIZE_LOCALES:
raise ValueError(f"Dehumanize does not currently support locale {locale}.")

# Create current time object
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't really need this comment.

Comment on lines +1419 to +1421
"""Invalid input String. String does not contain any relative time information.
String should either represent a time in the future or a time in the past.
Ex: "in 5 seconds" or "5 seconds ago". """
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good error message 👍

arrow/arrow.py Outdated
unit_visited[unit] = True
continue

# Add change value to the correct unit (incorporates the plualirty that exists within timeframe i.e second v.s seconds)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plurality

arrow/arrow.py Outdated
if not any([True for k, v in unit_visited.items() if v]):
raise ValueError(
"""Input string not valid. Note: Some locales do not support the week granulairty in Arrow.
If you are attempting to use the week granulairty on an unsupported locale, this could be the cause of this error."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

granularity

@anishnya
Copy link
Member Author

anishnya commented Apr 23, 2021

Looking good @anishnya, I've got a couple more comments which is mostly small stuff.

We'll need dehumanize to be mentioned in the documentation, how about we put one or two examples after the humanize section?

I've pushed the suggested changes and added the examples to the docs.

docs/index.rst Outdated
Dehumanize
~~~~~~~~

Dehumanize a string representing a past time:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "Take a human readable string and use it to shift into a past time."

Copy link
Collaborator

@systemcatch systemcatch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last tweak and we should be good to merge @anishnya

docs/index.rst Outdated Show resolved Hide resolved
Copy link
Collaborator

@systemcatch systemcatch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done @anishnya, thanks for your effort on this.

@systemcatch systemcatch merged commit 699198c into arrow-py:master Apr 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Release 1.1.0
Awaiting triage
Development

Successfully merging this pull request may close these issues.

None yet

5 participants