doesn't work in: italian, french and spanish #1305

Open
sergenti opened this issue Sep 3, 2023 · 1 comment


sergenti commented Sep 3, 2023

I'm trying to scrape Amazon reviews. For some strange reason, Amazon merges the date and the region into a single string in the HTML, which makes parsing really difficult.

dateutil seems to work with Chinese, Arabic, Japanese, and other languages, but it does not work with Italian, French, or Spanish.

For example, take the string "Recensito in Italia il 18 agosto 2023". It should be parsed as 18/08/2023, but instead it raises dateutil.parser._parser.ParserError: bad month number 18; must be 1-12.

The same goes for "Commenté en France le 30 août 2023" and "Revisado en España el 29 de agosto de 2023".

full logs (ita)

Traceback (most recent call last):
  File "/home/fyll/Documents/GitHub/glaut-back/env/lib/python3.10/site-packages/dateutil/parser/_parser.py", line 649, in parse
    ret = self._build_naive(res, default)
  File "/home/fyll/Documents/GitHub/glaut-back/env/lib/python3.10/site-packages/dateutil/parser/_parser.py", line 1232, in _build_naive
    if cday > monthrange(cyear, cmonth)[1]:
  File "/usr/lib/python3.10/calendar.py", line 126, in monthrange
    raise IllegalMonthError(month)
calendar.IllegalMonthError: bad month number 18; must be 1-12

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/fyll/Documents/GitHub/glaut-back/src/integrations/amazon.py", line 286, in amazon
    r = await extract_info_from(soup, geo)
  File "/home/fyll/Documents/GitHub/glaut-back/src/integrations/amazon.py", line 108, in extract_info_from
    date = parser.parse(date_string, fuzzy=True)
  File "/home/fyll/Documents/GitHub/glaut-back/env/lib/python3.10/site-packages/dateutil/parser/_parser.py", line 1368, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/fyll/Documents/GitHub/glaut-back/env/lib/python3.10/site-packages/dateutil/parser/_parser.py", line 651, in parse
    six.raise_from(ParserError(str(e) + ": %s", timestr), e)
  File "<string>", line 3, in raise_from
dateutil.parser._parser.ParserError: bad month number 18; must be 1-12: Recensito in Italia il 18 agosto 2023

full logs (fr)

Traceback (most recent call last):
  File "/home/fyll/Documents/GitHub/glaut-back/env/lib/python3.10/site-packages/dateutil/parser/_parser.py", line 649, in parse
    ret = self._build_naive(res, default)
  File "/home/fyll/Documents/GitHub/glaut-back/env/lib/python3.10/site-packages/dateutil/parser/_parser.py", line 1232, in _build_naive
    if cday > monthrange(cyear, cmonth)[1]:
  File "/usr/lib/python3.10/calendar.py", line 126, in monthrange
    raise IllegalMonthError(month)
calendar.IllegalMonthError: bad month number 30; must be 1-12

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/fyll/Documents/GitHub/glaut-back/src/integrations/amazon.py", line 286, in amazon
    r = await extract_info_from(soup, geo)
  File "/home/fyll/Documents/GitHub/glaut-back/src/integrations/amazon.py", line 108, in extract_info_from
    date = parser.parse(date_string, fuzzy=True)
  File "/home/fyll/Documents/GitHub/glaut-back/env/lib/python3.10/site-packages/dateutil/parser/_parser.py", line 1368, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/fyll/Documents/GitHub/glaut-back/env/lib/python3.10/site-packages/dateutil/parser/_parser.py", line 651, in parse
    six.raise_from(ParserError(str(e) + ": %s", timestr), e)
  File "<string>", line 3, in raise_from
dateutil.parser._parser.ParserError: bad month number 30; must be 1-12: Commenté en France le 30 août 2023

full logs (es)

Traceback (most recent call last):
  File "/home/fyll/Documents/GitHub/glaut-back/env/lib/python3.10/site-packages/dateutil/parser/_parser.py", line 649, in parse
    ret = self._build_naive(res, default)
  File "/home/fyll/Documents/GitHub/glaut-back/env/lib/python3.10/site-packages/dateutil/parser/_parser.py", line 1232, in _build_naive
    if cday > monthrange(cyear, cmonth)[1]:
  File "/usr/lib/python3.10/calendar.py", line 126, in monthrange
    raise IllegalMonthError(month)
calendar.IllegalMonthError: bad month number 29; must be 1-12

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/fyll/Documents/GitHub/glaut-back/src/integrations/amazon.py", line 286, in amazon
    r = await extract_info_from(soup, geo)
  File "/home/fyll/Documents/GitHub/glaut-back/src/integrations/amazon.py", line 108, in extract_info_from
    date = parser.parse(date_string, fuzzy=True)
  File "/home/fyll/Documents/GitHub/glaut-back/env/lib/python3.10/site-packages/dateutil/parser/_parser.py", line 1368, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/fyll/Documents/GitHub/glaut-back/env/lib/python3.10/site-packages/dateutil/parser/_parser.py", line 651, in parse
    six.raise_from(ParserError(str(e) + ": %s", timestr), e)
  File "<string>", line 3, in raise_from
dateutil.parser._parser.ParserError: bad month number 29; must be 1-12: Revisado en España el 29 de agosto de 2023
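
One possible workaround, not taken from this thread: dateutil's default parser info only lists English month names, which is consistent with the errors above, where the day number ends up being treated as the month. The sketch below sidesteps dateutil entirely and uses only the standard library: a regular expression pulls out the day, month name, and year, and a lookup table maps Italian, French, and Spanish month names to numbers. The names MONTHS, DATE_RE, and parse_review_date are hypothetical, and the pattern assumes the "day month year" layout seen in the three strings above.

```python
import re
import unicodedata
from datetime import date

# Hypothetical lookup table: full month names only, lowercase.
# Names shared between languages (e.g. "agosto") map to the same number.
MONTHS = {
    # Italian
    "gennaio": 1, "febbraio": 2, "marzo": 3, "aprile": 4,
    "maggio": 5, "giugno": 6, "luglio": 7, "agosto": 8,
    "settembre": 9, "ottobre": 10, "novembre": 11, "dicembre": 12,
    # French
    "janvier": 1, "février": 2, "mars": 3, "avril": 4,
    "mai": 5, "juin": 6, "juillet": 7, "août": 8,
    "septembre": 9, "octobre": 10, "décembre": 12,
    # Spanish
    "enero": 1, "febrero": 2, "abril": 4, "mayo": 5,
    "junio": 6, "julio": 7, "septiembre": 9, "octubre": 10,
    "noviembre": 11, "diciembre": 12,
}

# Day (optionally French "1er"), optional "de", month name,
# optional "de", four-digit year.
DATE_RE = re.compile(
    r"(\d{1,2})(?:er)?\s+(?:de\s+)?([^\W\d_]+)\s+(?:de\s+)?(\d{4})"
)


def parse_review_date(text):
    """Extract a date from a string such as 'Recensito in Italia il 18 agosto 2023'."""
    # Scraped text may carry decomposed accents; NFC keeps "août" one word.
    text = unicodedata.normalize("NFC", text)
    match = DATE_RE.search(text)
    if match is None:
        raise ValueError(f"no date found in: {text!r}")
    day, month_name, year = match.groups()
    month = MONTHS.get(month_name.lower())
    if month is None:
        raise ValueError(f"unknown month name: {month_name!r}")
    return date(int(year), month, int(day))


if __name__ == "__main__":
    for s in (
        "Recensito in Italia il 18 agosto 2023",
        "Commenté en France le 30 août 2023",
        "Revisado en España el 29 de agosto de 2023",
    ):
        print(parse_review_date(s).isoformat())
```

The table only covers full month names in these three languages; abbreviations and other locales would need their own entries. Staying with dateutil should also be possible by subclassing its parserinfo class with a localized MONTHS list, but that variant is not shown or tested here.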

jaboto commented Apr 7, 2024

How did you approach this in the end? I am facing a similar issue while parsing French emails.
