Library is not thread safe #441

Open
meownoid opened this issue Aug 7, 2018 · 10 comments

@meownoid

meownoid commented Aug 7, 2018

Info:

Linux 4.15.0-24-generic #26-Ubuntu SMP Wed Jun 13 08:44:47 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Python 3.7.0b3
>>> import dateparser
>>> dateparser.__version__
'0.7.0'

Code to reproduce error:

import dateparser
from concurrent.futures.thread import ThreadPoolExecutor

fs = []
with ThreadPoolExecutor(16) as executor:
    for _ in range(100):
        fs.append(executor.submit(lambda: dateparser.parse('tomorrow')))

for f in fs:
    print(f.result())

Error:

Traceback (most recent call last):
  File "test.py", line 10, in <module>
    print(f.result())
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "test.py", line 7, in <lambda>
    fs.append(executor.submit(lambda: dateparser.parse('tomorrow')))
  File "/home/egor/Projects/rekko/.env/lib/python3.7/site-packages/dateparser/conf.py", line 81, in wrapper
    return f(*args, **kwargs)
  File "/home/egor/Projects/rekko/.env/lib/python3.7/site-packages/dateparser/__init__.py", line 53, in parse
    data = parser.get_date_data(date_string, date_formats)
  File "/home/egor/Projects/rekko/.env/lib/python3.7/site-packages/dateparser/date.py", line 404, in get_date_data
    locale, date_string, date_formats, settings=self._settings)
  File "/home/egor/Projects/rekko/.env/lib/python3.7/site-packages/dateparser/date.py", line 177, in parse
    return instance._parse()
  File "/home/egor/Projects/rekko/.env/lib/python3.7/site-packages/dateparser/date.py", line 187, in _parse
    date_obj = parser()
  File "/home/egor/Projects/rekko/.env/lib/python3.7/site-packages/dateparser/date.py", line 200, in _try_freshness_parser
    return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
  File "/home/egor/Projects/rekko/.env/lib/python3.7/site-packages/dateparser/freshness_date_parser.py", line 147, in get_date_data
    date, period = self.parse(date_string, settings)
  File "/home/egor/Projects/rekko/.env/lib/python3.7/site-packages/dateparser/freshness_date_parser.py", line 96, in parse
    date, period = self._parse_date(date_string)
  File "/home/egor/Projects/rekko/.env/lib/python3.7/site-packages/dateparser/freshness_date_parser.py", line 130, in _parse_date
    date = self.now + td
@lopuhin
Member

lopuhin commented Aug 7, 2018

Good catch! FWIW the exception message is TypeError: unsupported operand type(s) for +: 'NoneType' and 'relativedelta'

@wRAR
Member

wRAR commented Sep 19, 2018

This particular test detects only one concurrency problem, so I cannot be sure there are no others. Still, this one is pretty clear.

dateparser.freshness_date_parser.freshness_date_parser is a singleton with internal state, and it is used concurrently from different threads in _DateLocaleParser._try_freshness_parser(). The shared state is FreshnessDateDataParser.now, which is written in parse() and read in _parse_date(). Putting it into threading.local() fixes this particular problem and allows the test to pass (I'm not saying this is the correct solution, though).
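
A minimal sketch of the threading.local() idea, using a hypothetical stand-in class rather than dateparser's actual FreshnessDateDataParser. The class name ThreadLocalStateParser and the simplified parse()/_parse_date() split are assumptions made for illustration only:

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta


class ThreadLocalStateParser:
    """Hypothetical stand-in for a singleton parser that keeps `now` as state."""

    def __init__(self):
        # Attributes set on a threading.local() object are private to each thread.
        self._local = threading.local()

    def _parse_date(self, days):
        # Reads the value that parse() set in the *same* thread.
        return self._local.now + timedelta(days=days)

    def parse(self, days, now):
        self._local.now = now
        try:
            return self._parse_date(days)
        finally:
            # Clearing affects only the current thread's copy.
            self._local.now = None


parser = ThreadLocalStateParser()  # one shared instance, as in the issue
base = datetime(2018, 8, 7)

with ThreadPoolExecutor(16) as executor:
    results = list(executor.map(lambda _: parser.parse(1, base), range(100)))
```

With a plain instance attribute in place of self._local, one thread's reset to None can land between another thread's write and read, which would be consistent with the 'NoneType' and 'relativedelta' TypeError reported above. Passing now as an argument to _parse_date() would avoid the shared state altogether.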

Actually, I don't understand why this variable is needed: it is used only once, right after it is set (though in a separate private method), and it is cleared after that. It is also not documented as a public attribute. The git history shows a lot of changes to how this variable is handled, but the current code seems strange to me.

I've also noticed there is a lock in pytz acquired inside FreshnessDateDataParser.get_local_tz().

@bzamecnik

Another error possibly caused by thread non-safety:

date = dateparser.parse(date_to_parse)

File "/usr/local/lib/python2.7/dist-packages/dateparser/conf.py" line 81 in wrapper
return f(*args, **kwargs)

File "/usr/local/lib/python2.7/dist-packages/dateparser/__init__.py" line 53 in parse
data = parser.get_date_data(date_string, date_formats)

File "/usr/local/lib/python2.7/dist-packages/dateparser/date.py" line 402 in get_date_data
for locale in self._get_applicable_locales(date_string):

File "/usr/local/lib/python2.7/dist-packages/dateparser/date.py" line 421 in _get_applicable_locales
if self._is_applicable_locale(locale, date_string):

File "/usr/local/lib/python2.7/dist-packages/dateparser/date.py" line 432 in _is_applicable_locale
locale.is_applicable(date_string, strip_timezone=False, settings=self._settings) or

File "/usr/local/lib/python2.7/dist-packages/dateparser/languages/locale.py" line 75 in is_applicable
date_tokens = dictionary.split(date_string)

File "/usr/local/lib/python2.7/dist-packages/dateparser/languages/dictionary.py" line 142 in split
tokens[i] = self._split_by_known_words(token, keep_formatting)

File "/usr/local/lib/python2.7/dist-packages/dateparser/languages/dictionary.py" line 161 in _split_by_known_words
splitted.extend(self._split_by_known_words(unknown, keep_formatting))

File "/usr/local/lib/python2.7/dist-packages/dateparser/languages/dictionary.py" line 150 in _split_by_known_words
regex = self._get_split_regex_cache()

File "/usr/local/lib/python2.7/dist-packages/dateparser/languages/dictionary.py" line 192 in _get_split_regex_cache
return self._split_regex_cache[self._settings.registry_key][self.info['name']]

KeyError: u'cs'

The app was running in Flask in uwsgi with 8 threads.

@bzamecnik

Possibly we could change dateparser._default_parser from a plain module variable to a thread-local variable.
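
A rough sketch of what a thread-local default parser could look like. StandInParser and get_default_parser are hypothetical names invented for this example; in dateparser the factory would presumably be dateparser.date.DateDataParser, but that wiring is an assumption and is not shown here:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()


class StandInParser:
    """Hypothetical placeholder for a parser object that is not thread safe."""

    def __init__(self):
        # Remember which thread created this instance.
        self.owner = threading.get_ident()


def get_default_parser(factory=StandInParser):
    """Return the calling thread's parser, creating it on first use."""
    parser = getattr(_local, "parser", None)
    if parser is None:
        parser = factory()
        _local.parser = parser
    return parser


def check_ownership(_):
    # Each thread should only ever see a parser that it created itself.
    return get_default_parser().owner == threading.get_ident()


with ThreadPoolExecutor(8) as executor:
    owned = list(executor.map(check_ownership, range(50)))
```

This isolates only the state held on the parser instance. State kept in other module-level singletons, such as the freshness parser described above, would still be shared between threads.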

@Gallaecio
Member

Related to #276

@mprzydatek

Is this going to get fixed?

@sheikware

There are other concurrency issues when parsing from multiple threads while some of the datetime strings are invalid. The following code reproduces one of them:

import dateparser
from concurrent.futures import ThreadPoolExecutor

try_dates = [
    "2020-04-20T03:02:16.8633333",
    "2021-04-20T03:02:16.8633333Z",
    "2020-06-21T08:00:00.000-07:00",
    "2020-06-29T0000:00",  # invalid; expected to parse to None
    "2020-06-24T16:38:13.193748",
]
fs = []

def t3(j):
    return dateparser.parse(try_dates[j])

# Submit 100 parse calls that cycle through the five strings above.
with ThreadPoolExecutor(16) as executor:
    for i in range(100):
        j = i % len(try_dates)
        res = executor.submit(t3, j)
        fs.append((res, j))

# Print "<index>, <parsed value>"; result() re-raises any worker exception.
for f in fs:
    print(f"{f[1]}, {f[0].result()}")

Expected output (which you get if you guard the call with a thread lock):

0, 2020-04-20 03:02:16.863333
1, 2021-04-20 03:02:16.863333+00:00
2, 2020-06-21 08:00:00-07:00
3, None
4, 2020-06-24 16:38:13.193748
0, 2020-04-20 03:02:16.863333
1, 2021-04-20 03:02:16.863333+00:00
2, 2020-06-21 08:00:00-07:00
3, None
4, 2020-06-24 16:38:13.193748
0, 2020-04-20 03:02:16.863333
1, 2021-04-20 03:02:16.863333+00:00
2, 2020-06-21 08:00:00-07:00
3, None
4, 2020-06-24 16:38:13.193748
0, 2020-04-20 03:02:16.863333
1, 2021-04-20 03:02:16.863333+00:00
2, 2020-06-21 08:00:00-07:00
3, None
4, 2020-06-24 16:38:13.193748

Error output:

Traceback (most recent call last):
  File "/Users/csheikho/src/../test_loadtest.py", line 110, in test_old_mixed_dates
    print(f"{f[1]}, {f[0].result()}")
  File "/Users/csheikho/.pyenv/versions/3.6.5/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/Users/csheikho/.pyenv/versions/3.6.5/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/Users/csheikho/.pyenv/versions/3.6.5/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/csheikho/src/.../test_loadtest.py", line 101, in t3
    return dateparser.parse(try_dates[j])
  File "/Users/csheikho/venv/.../lib/python3.6/site-packages/dateparser/conf.py", line 84, in wrapper
    return f(*args, **kwargs)
  File "/Users/csheikho/venv/.../lib/python3.6/site-packages/dateparser/__init__.py", line 40, in parse
    data = parser.get_date_data(date_string, date_formats)
  File "/Users/csheikho/venv/.../lib/python3.6/site-packages/dateparser/date.py", line 369, in get_date_data
    date_string, modify=True, settings=self._settings):
  File "/Users/csheikho/venv/.../lib/python3.6/site-packages/dateparser/languages/detection.py", line 9, in wrapped
    for language in method(self, *args, **kwargs):
  File "/Users/csheikho/venv/.../lib/python3.6/site-packages/dateparser/languages/detection.py", line 49, in iterate_applicable_languages
    for language in self._filter_languages(date_string, languages, settings=settings):
  File "/Users/csheikho/venv/.../lib/python3.6/site-packages/dateparser/languages/detection.py", line 36, in _filter_languages
    languages.pop(0)
IndexError: pop from empty list

@nex2hex

nex2hex commented Jul 2, 2020

Not yet. For now, replace the dateparser.parse() call with a custom function:

import dateparser.date
import dateparser.conf

@dateparser.conf.apply_settings
def parse_date_thread_safe(date_string, date_formats=None, languages=None, locales=None, region=None, settings=None):
    # Build a fresh DateDataParser on every call instead of reusing the
    # shared module-level dateparser._default_parser.
    parser = dateparser.date.DateDataParser(languages=languages, locales=locales, region=region, settings=settings)
    data = parser.get_date_data(date_string, date_formats)
    if data:
        return data['date_obj']
    return None

@zegerius

@nex2hex I tried using the custom function, but had the same issues as before (TypeError: unsupported operand type(s) for +: 'NoneType' and 'relativedelta'). I resorted to storing the records in Redis and processing them serially.

@doctaphred

To expand on the above reports: the concurrency issues don't just cause exceptions; they also cause wrong results:

In [1]: import dateparser

In [2]: from concurrent.futures.thread import ThreadPoolExecutor

In [3]: executor = ThreadPoolExecutor(64)

In [4]: def f():
   ...:     # Adding some "French" causes dateparser to use Y-D-M order,
   ...:     # even though the timestamp is in ISO format.
   ...:     dateparser.parse('le 2021-02-05T05:47:15+00:00')
   ...:     # This one should get parsed correctly.
   ...:     return dateparser.parse('2021-02-05T05:47:15+00:00')
   ...:

In [5]: {future.result() for future in [executor.submit(f) for _ in range(100)]}
Out[5]:
{datetime.datetime(2021, 2, 5, 5, 47, 15, tzinfo=<StaticTzInfo 'UTC\+00:00'>),
 datetime.datetime(2021, 5, 2, 5, 47, 15, tzinfo=<StaticTzInfo 'UTC\+00:00'>)}

In [6]: dateparser.__version__
Out[6]: '1.0.0'

Looks like this behavior is already being fixed and verified by TestConcurrency in #834, but FYI to future confused devs attempting to diagnose data errors.
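
A rough sketch of the kind of check that catches this class of bug. This is not the TestConcurrency from #834, whose contents are not shown here; distinct_results is a hypothetical helper, and int is only a stand-in for the callable under test:

```python
from concurrent.futures import ThreadPoolExecutor


def distinct_results(fn, arg, workers=64, calls=100):
    """Call fn(arg) concurrently and return the set of distinct results."""
    with ThreadPoolExecutor(workers) as executor:
        futures = [executor.submit(fn, arg) for _ in range(calls)]
        # result() re-raises any exception raised in a worker thread.
        return {future.result() for future in futures}


# Stand-in for a thread-safe callable; swap in the real parser to test it.
seen = distinct_results(int, "2021")
```

A thread-safe callable given the same input should produce exactly one distinct result; more than one, as in Out[5] above, points to shared state leaking between calls.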
