
nltk.tokenize.word_tokenize crash with leading terminal punctuation #2932

Closed
itzsimpl opened this issue Jan 21, 2022 · 3 comments
itzsimpl commented Jan 21, 2022

If text starts with a terminal punctuation mark (`.`, `!`, or `?`) followed by a space, `nltk.tokenize.word_tokenize(text)` will crash.

>>> word_tokenize('? who')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize
    for sentence in slices:
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
    for sentence1, sentence2 in _pair_iter(slices):
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
    prev = next(iterator)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
    for match, context in self._match_potential_end_contexts(text):
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
    before_words[match] = split[-1]
IndexError: list index out of range
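The traceback ends at `before_words[match] = split[-1]`. A minimal sketch of that failure mode (not NLTK's actual code, just an illustration of the mechanism): when the matched punctuation is the very first character, the text preceding the match is empty, `.split()` on an empty string returns an empty list, and indexing that list with `[-1]` raises the `IndexError` seen above.

```python
# Sketch of the failure mode behind the traceback (illustrative only):
text = '? who'
match_start = 0                # the '?' is matched at the start of the text
before = text[:match_start]    # '' -- nothing precedes the match
split = before.split()         # ''.split() yields an empty list
print(split)                   # []

try:
    last_word = split[-1]      # analogous to `before_words[match] = split[-1]`
except IndexError as e:
    print(f"IndexError: {e}")  # list index out of range
```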

It will not crash, however, if there is no space after the punctuation:

>>> word_tokenize('?who')
['?', 'who']
>>> word_tokenize('.who')
['.who']
>>> word_tokenize('!who')
['!', 'who']
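For anyone pinned to an affected NLTK version, a hypothetical pre-processing workaround (my own helper, not part of NLTK) is to strip the leading terminal punctuation plus whitespace before tokenizing, since that is exactly the pattern that triggers the crash. The real fix is upgrading, as noted below.

```python
import re

def strip_leading_terminal_punct(text):
    # Hypothetical workaround for affected NLTK versions: drop leading
    # terminal punctuation (. ! ?) followed by whitespace, the pattern
    # that triggers the IndexError in sent_tokenize.
    return re.sub(r'^[.!?]+\s+', '', text)

print(strip_leading_terminal_punct('? who'))  # 'who'
print(strip_leading_terminal_punct('?who'))   # '?who' (unchanged, no space)
```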
tomaarsen (Member) commented Jan 21, 2022

Hello @itzsimpl!

This has been resolved in #2922, which has been published in NLTK 3.6.7. You are (presumably) using NLTK 3.6.6, and I would highly recommend upgrading to NLTK 3.6.7.

I hope this helps.

itzsimpl (Author) commented:
@tomaarsen Thank you for the quick reply. Upgrading to NLTK 3.6.7 solved my problem.

tomaarsen (Member) commented:
Wonderful! Glad to hear.
