
nltk.tokenize.word_tokenize crash with leading terminal punctuation #2932

Closed
itzsimpl opened this issue Jan 21, 2022 · 3 comments
itzsimpl commented Jan 21, 2022

If text starts with a terminal punctuation mark (`.`, `!`, or `?`) followed by a space, `nltk.tokenize.word_tokenize(text)` will crash.

>>> word_tokenize('? who')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize
    for sentence in slices:
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
    for sentence1, sentence2 in _pair_iter(slices):
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
    prev = next(iterator)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
    for match, context in self._match_potential_end_contexts(text):
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
    before_words[match] = split[-1]
IndexError: list index out of range
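The traceback ends at `before_words[match] = split[-1]`. A minimal sketch of that failure mode (not NLTK's actual code, just an illustration of the mechanism): when the matched punctuation is the very first character, the text preceding the match is empty, `.split()` on an empty string returns an empty list, and indexing that list with `[-1]` raises the `IndexError` seen above.

```python
# Sketch of the failure mode behind the traceback (illustrative only):
text = '? who'
match_start = 0                # the '?' is matched at the start of the text
before = text[:match_start]    # '' -- nothing precedes the match
split = before.split()         # ''.split() yields an empty list
print(split)                   # []

try:
    last_word = split[-1]      # analogous to `before_words[match] = split[-1]`
except IndexError as e:
    print(f"IndexError: {e}")  # list index out of range
```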

It will not crash, however, if there is no space after the punctuation:

>>> word_tokenize('?who')
['?', 'who']
>>> word_tokenize('.who')
['.who']
>>> word_tokenize('!who')
['!', 'who']
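For anyone pinned to an affected NLTK version, a hypothetical pre-processing workaround (my own helper, not part of NLTK) is to strip the leading terminal punctuation plus whitespace before tokenizing, since that is exactly the pattern that triggers the crash. The real fix is upgrading, as noted below.

```python
import re

def strip_leading_terminal_punct(text):
    # Hypothetical workaround for affected NLTK versions: drop leading
    # terminal punctuation (. ! ?) followed by whitespace, the pattern
    # that triggers the IndexError in sent_tokenize.
    return re.sub(r'^[.!?]+\s+', '', text)

print(strip_leading_terminal_punct('? who'))  # 'who'
print(strip_leading_terminal_punct('?who'))   # '?who' (unchanged, no space)
```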
tomaarsen (Member) commented Jan 21, 2022

Hello @itzsimpl!

This has been resolved in #2922, which has been published in NLTK 3.6.7. You are (presumably) using NLTK 3.6.6, and I would highly recommend upgrading to NLTK 3.6.7.

I hope this helps.

itzsimpl (Author) commented:
@tomaarsen Thank you for the quick reply. Upgrading to NLTK 3.6.7 solved my problem.

tomaarsen (Member) commented:
Wonderful! Glad to hear.
