If text starts with a terminal punctuation mark (., !, or ?) followed by a space, nltk.tokenize.word_tokenize(text) crashes with an IndexError.
>>> word_tokenize('? who')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize
return tokenizer.tokenize(text)
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize
for sentence in slices:
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
for sentence1, sentence2 in _pair_iter(slices):
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
prev = next(iterator)
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
for match, context in self._match_potential_end_contexts(text):
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
before_words[match] = split[-1]
IndexError: list index out of range
It does not crash, however, if there is no space after the punctuation.
This has been resolved in #2922, which has been published in NLTK 3.6.7. You are (presumably) using NLTK 3.6.6, and I would highly recommend upgrading to NLTK 3.6.7.
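If your code must tolerate older installations, you could guard on the installed version before relying on the fix. A minimal sketch, assuming a plain dotted version string like `nltk.__version__` provides:

```python
def has_punkt_crash(version):
    """Return True if this NLTK version predates 3.6.7, where the
    leading-punctuation IndexError (fixed in #2922) is still present.

    Hypothetical helper; compares only the numeric dotted components.
    """
    parts = tuple(int(p) for p in version.split('.')[:3])
    return parts < (3, 6, 7)
```

Usage would be something like `import nltk; if has_punkt_crash(nltk.__version__): ...` to decide whether to sanitize input first.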