NLTK word_tokenize throws IndexError: list index out of range #2925

Closed

Bernhard-Steindl opened this issue Dec 24, 2021 · 6 comments


Bernhard-Steindl commented Dec 24, 2021

I am working on some NLP experiments in which I want to tokenize texts from users.
For that I am currently using NLTK, but I noticed unexpected behavior when tokenizing a raw user input string.
I am not sure whether this is a bug in NLTK or whether I should supply a pre-processed string. Previously I had no problems using NLTK to tokenize pre-processed datasets, but with raw user input I run into problems.

Do you have an explanation of the problem or can you give me a hint on how to pre-process my user inputs before applying NLTK word_tokenize?

I have provided a minimal reproducible example. I set up a conda environment with Python 3.9 and nltk==3.6.6.

conda create -n "example_nltk" python=3.9 -y
conda activate example_nltk
pip install nltk==3.6.6

Then I create and run the following Python file:

import nltk
from nltk import word_tokenize
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("omw-1.4")

text = '? so ein schwachsinn! rot für: dummes post. salzburg gewinnt öfb-cup gegen rapid'
word_tokenize(text, language='german')

The script throws an IndexError: list index out of range when running word_tokenize(text, language='german').
The error occurs in punkt.py, in the function _match_potential_end_contexts, at the line before_words[match] = split[-1], because the variable split is empty ([]).
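For illustration, here is a rough sketch (not the actual punkt.py code) of why split ends up empty: the word before a potential sentence boundary is obtained by splitting the text that precedes the match, and when the match sits at the very start of the string that prefix is the empty string:

import re

text = '? so ein schwachsinn! rot für: dummes post.'
match = re.search(r'[.?!]', text)  # the "?" at position 0 is a potential sentence end
before = text[:match.start()]      # text before the match: ''
split = before.split()             # splitting '' yields []
split[-1]                          # IndexError: list index out of range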

Do you have a suggestion for how I should proceed? Am I doing something wrong? Should I process the raw user input before supplying it to NLTK's word_tokenize?

Thank you for your support!

Here is the full traceback for details:

Traceback (most recent call last):
  File "nltk_test.py", line 8, in <module>
    word_tokenize(text, language='german')
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize
    for sentence in slices:
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
    for sentence1, sentence2 in _pair_iter(slices):
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
    prev = next(iterator)
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
    for match, context in self._match_potential_end_contexts(text):
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
    before_words[match] = split[-1]
IndexError: list index out of range
@tomaarsen
Member

This is a bug introduced in NLTK 3.6.6, and it has been resolved in #2922 after #2921 reported it. In short, any sentence starting with a potential "end of sentence" character (i.e. ., ? or !) followed by a space will throw an IndexError.
For the time being I would recommend installing the unofficial development version via

pip install git+https://github.com/nltk/nltk

or using an older NLTK version. Alternatively, you can pre-process the input so that no sentence starts with one of these characters followed by a space; a sketch follows below.
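A minimal pre-processing sketch for the case reported here (assuming it is acceptable to strip sentence-end punctuation from the very start of the input; the helper name is just for illustration):

import re
from nltk import word_tokenize

def tokenize_user_input(text, language='german'):
    # Drop a leading ".", "?" or "!" followed by whitespace, which is
    # the pattern that triggers the IndexError in NLTK 3.6.6.
    cleaned = re.sub(r'^[.?!]+\s+', '', text)
    return word_tokenize(cleaned, language=language)

tokenize_user_input('? so ein schwachsinn! rot für: dummes post. salzburg gewinnt öfb-cup gegen rapid')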

We'll publish a new NLTK version soon with this fix in place.

Thank you for the report! I'll keep this open for the time being to notify other users that this has been solved in the develop branch.

@Bernhard-Steindl
Author

> This is a bug introduced in NLTK 3.6.6, and has been resolved in #2922 after #2921 reported it.

Thanks for your fast response!
I am looking forward to the new release. 😊

@borzunov

This has broken our pipelines as well; I hope the new release will come out soon :)

Thank you for the quick fix!

@tomaarsen
Member

Apologies for the inconvenience, we're working on it!

borzunov added a commit to learning-at-home/hivemind that referenced this issue Dec 28, 2021
This should save us from:

- A bug in the latest nltk version: nltk/nltk#2925
- Incompatibilities introduced by `transformers` and `datasets` updates
@tomaarsen
Member

@Bernhard-Steindl @borzunov
NLTK 3.6.7 has been released, which includes the fix for this issue. Thank you for reporting it!
I'll close this as it should be resolved now.
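
Upgrading the package is enough to pick up the fix, for example:

pip install --upgrade nltk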

@samibulti

Thanks.
