support for dates with dots and spaces #1028

Merged

atharmohammad
Contributor

@atharmohammad atharmohammad commented Dec 18, 2021

Closes #1010
This adds support for dates such as 26 .10.21 or 26 . 10.21. In the date 26 . 10.21, the first period was being removed during sanitization even though it sits between numerals (separated from them only by spaces), which was not needed. That is why I changed the period sanitization regex slightly.
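
To illustrate the behaviour described above, here is a minimal sketch using only the standard `re` module. It assumes the sanitization step simply deletes every period the pattern matches; `strip_periods` is a hypothetical helper, not part of dateparser's API. The old pattern is written as `(?<=\D)\.` because the standard library rejects the variable-width lookbehind `(?<=\D+)`; both forms match the same periods, since only the character immediately before the period decides the match.

```python
import re

# Old pattern: a period preceded by any non-digit, whitespace included.
OLD_PERIOD = re.compile(r'(?<=\D)\.')
# New pattern from this PR: a period preceded by a character that is
# neither an ASCII digit nor whitespace.
NEW_PERIOD = re.compile(r'(?<=[^0-9\s])\.')


def strip_periods(pattern, date_string):
    """Hypothetical helper: delete every period the pattern matches."""
    return pattern.sub('', date_string)


for text in ('26 . 10.21', '26 .10.21', 'Dec. 18, 2021'):
    print(repr(text), '->',
          repr(strip_periods(OLD_PERIOD, text)),
          'vs',
          repr(strip_periods(NEW_PERIOD, text)))
```

With the old pattern, the first separator in `26 . 10.21` is dropped while the second is kept. The new pattern leaves both separators in place and still strips the period after `Dec`.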

@codecov

codecov bot commented Dec 18, 2021

Codecov Report

Merging #1028 (cabeaf4) into master (4490ca6) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1028   +/-   ##
=======================================
  Coverage   98.29%   98.29%           
=======================================
  Files         234      234           
  Lines        2694     2694           
=======================================
  Hits         2648     2648           
  Misses         46       46           
Impacted Files          Coverage Δ
dateparser/date.py      99.24% <100.00%> (ø)
dateparser/parser.py    99.01% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4490ca6...cabeaf4.

@jc-louis

@Gallaecio sorry to bother you, but could you look into this PR? This is a regression for us (#1010) 🙏 Many thanks!

Member

@Gallaecio Gallaecio left a comment

Looks good to me, great job!

@@ -223,7 +223,7 @@ class _parser:

     def __init__(self, tokens, settings):
         self.settings = settings
-        self.tokens = list(tokens)
+        self.tokens = [(t[0].strip(), t[1]) for t in list(tokens)]

I wonder if we can drop the list cast now:

Suggested change
-        self.tokens = [(t[0].strip(), t[1]) for t in list(tokens)]
+        self.tokens = [(t[0].strip(), t[1]) for t in tokens]
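
A quick sketch of why the cast should be redundant: a list comprehension accepts any iterable, so it consumes a list or a generator equally well and always produces a new list. The token tuples below are made-up sample data, and `strip_tokens` is a hypothetical stand-in for the assignment in `__init__`.

```python
def strip_tokens(tokens):
    """Hypothetical stand-in: strip whitespace from each token's text."""
    return [(t[0].strip(), t[1]) for t in tokens]


# Made-up (text, type) pairs; the real token types may differ.
sample = [('26 ', 0), (' . ', 1), ('10', 0)]

from_list = strip_tokens(sample)
from_generator = strip_tokens(pair for pair in sample)

print(from_list)
print(from_list == from_generator)
```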

@@ -35,7 +35,7 @@

 RE_SANITIZE_SKIP = re.compile(r'\t|\n|\r|\u00bb|,\s\u0432\b|\u200e|\xb7|\u200f|\u064e|\u064f', flags=re.M)
 RE_SANITIZE_RUSSIAN = re.compile(r'([\W\d])\u0433\.', flags=re.I | re.U)
-RE_SANITIZE_PERIOD = re.compile(r'(?<=\D+)\.', flags=re.U)
+RE_SANITIZE_PERIOD = re.compile(r'(?<=[^0-9\s])\.', flags=re.U)

💄

Suggested change
-RE_SANITIZE_PERIOD = re.compile(r'(?<=[^0-9\s])\.', flags=re.U)
+RE_SANITIZE_PERIOD = re.compile(r'(?<=[^\d\s])\.', flags=re.U)
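
One detail that may be worth checking, sketched here with the standard `re` module: for `str` patterns, `\d` matches any Unicode decimal digit, while `[0-9]` covers ASCII digits only, so the two character classes can differ on non-ASCII numerals. The Arabic-Indic sample string is an illustrative assumption, not a case taken from the test suite.

```python
import re

ASCII_DIGITS = re.compile(r'(?<=[^0-9\s])\.')
UNICODE_DIGITS = re.compile(r'(?<=[^\d\s])\.')

# ASCII numerals: neither pattern matches a period that follows a digit.
ascii_date = '26.10.21'
# The same date written with Arabic-Indic digits (U+0660 to U+0669).
arabic_date = '\u0662\u0666.\u0661\u0660.\u0662\u0661'

for label, pattern in (('[0-9]', ASCII_DIGITS), ('\\d', UNICODE_DIGITS)):
    print(label,
          repr(pattern.sub('', ascii_date)),
          repr(pattern.sub('', arabic_date)))
```

For ASCII input the two patterns behave identically. For the Arabic-Indic sample, the `[0-9]` variant removes the periods because the preceding characters are not ASCII digits, whereas the `\d` variant keeps them.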

@jc-louis

Can someone merge this to fix #1010? 🙏

@Gallaecio Gallaecio merged commit ffb9a2d into scrapinghub:master Aug 23, 2022
Successfully merging this pull request may close these issues.

Regression dates with dots and spaces
4 participants