NLTK word_tokenize throws IndexError: list index out of range #2925

Closed

Bernhard-Steindl opened this issue Dec 24, 2021 · 6 comments


Bernhard-Steindl commented Dec 24, 2021

I am working on some NLP experiments in which I want to tokenize texts from users.
For that I am currently using NLTK, but I noticed unexpected behavior when tokenizing a raw user input string.
I am not sure whether this is a bug in NLTK or whether I should supply a pre-processed string. Previously I had no problems using NLTK to tokenize pre-processed datasets, but with raw user input I run into problems.

Do you have an explanation of the problem or can you give me a hint on how to pre-process my user inputs before applying NLTK word_tokenize?

I have provided a minimal reproducible example. I set up a conda environment with Python 3.9 and nltk==3.6.6.

conda create -n "example_nltk" python=3.9 -y
conda activate example_nltk
pip install nltk==3.6.6

Then I create and run the following Python file:

import nltk
from nltk import word_tokenize
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("omw-1.4")

text = '? so ein schwachsinn! rot für: dummes post. salzburg gewinnt öfb-cup gegen rapid'
word_tokenize(text, language='german')

The script throws an IndexError: list index out of range when running word_tokenize(text, language='german').
The error occurs in punkt.py, in the function _match_potential_end_contexts, at the line before_words[match] = split[-1], because the variable split is empty ([]).
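For illustration, here is a rough sketch (not the actual punkt.py code) of why split ends up empty: the word before a potential sentence boundary is obtained by splitting the text that precedes the match, and when the match sits at the very start of the string that prefix is the empty string:

import re

text = '? so ein schwachsinn! rot für: dummes post.'
match = re.search(r'[.?!]', text)  # the "?" at position 0 is a potential sentence end
before = text[:match.start()]      # text before the match: ''
split = before.split()             # splitting '' yields []
split[-1]                          # IndexError: list index out of range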

Do you have a suggestion for how I should proceed? Am I doing something wrong? Should I process the raw user input before supplying it to NLTK's word_tokenize?

Thank you for your support!

Here is the full traceback for details:

Traceback (most recent call last):
  File "nltk_test.py", line 8, in <module>
    word_tokenize(text, language='german')
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize
    for sentence in slices:
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
    for sentence1, sentence2 in _pair_iter(slices):
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
    prev = next(iterator)
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
    for match, context in self._match_potential_end_contexts(text):
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
    before_words[match] = split[-1]
IndexError: list index out of range
@tomaarsen
Member

This is a bug introduced in NLTK 3.6.6, and it has been resolved in #2922 after #2921 reported it. In short, any sentence starting with a potential "end of sentence" character (i.e. ., ? or !) followed by a space will throw an IndexError.
For the time being I would recommend installing the unofficial development version via

pip install git+https://github.com/nltk/nltk

or using an older NLTK version. Alternatively, you can pre-process the input so that no sentence starts with one of these characters followed by a space; a sketch follows below.
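A minimal pre-processing sketch for the case reported here (assuming it is acceptable to strip sentence-end punctuation from the very start of the input; the helper name is just for illustration):

import re
from nltk import word_tokenize

def tokenize_user_input(text, language='german'):
    # Drop a leading ".", "?" or "!" followed by whitespace, which is
    # the pattern that triggers the IndexError in NLTK 3.6.6.
    cleaned = re.sub(r'^[.?!]+\s+', '', text)
    return word_tokenize(cleaned, language=language)

tokenize_user_input('? so ein schwachsinn! rot für: dummes post. salzburg gewinnt öfb-cup gegen rapid')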

We'll publish a new NLTK version soon with this fix in place.

Thank you for the report! I'll keep this open for the time being to notify other users that this has been solved in the develop branch.

@Bernhard-Steindl
Author

> This is a bug introduced in NLTK 3.6.6, and has been resolved in #2922 after #2921 reported it.

Thanks for your fast response!
I am looking forward to the new release. 😊

@borzunov

This has broken our pipelines as well; I hope the new release will come out soon :)

Thank you for the quick fix!

@tomaarsen
Member

Apologies for the inconvenience, we're working on it!

borzunov added a commit to learning-at-home/hivemind that referenced this issue Dec 28, 2021
This should save us from:

- A bug in the latest nltk version: nltk/nltk#2925
- Incompatibilities introduced by `transformers` and `datasets` updates
@tomaarsen
Member

@Bernhard-Steindl @borzunov
NLTK 3.6.7 has been released, which includes the fix for this issue. Thank you for reporting it!
I'll close this as it should be resolved now.
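
Upgrading the package is enough to pick up the fix, for example:

pip install --upgrade nltk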

@samibulti

Thanks.
