NLTK word_tokenize throws IndexError: list index out of range #2925
This is a bug introduced in NLTK 3.6.6, and it has been resolved in #2922 after #2921 reported it. In short, any sentence starting with a potential "end of sentence" character (i.e. ".", "?" or "!") triggers this error. You can avoid it by installing NLTK from the `develop` branch, which already contains the fix, or by using an older NLTK version. Alternatively, you can pre-process your input so that it does not start with such a character. We'll publish a new NLTK version soon with this fix in place. Thank you for the report! I'll keep this open for the time being to notify other users that this has been solved in the `develop` branch.
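Until the release is out, a pre-processing wrapper along these lines can serve as a stop-gap (a sketch, not an official NLTK API; the set of stripped characters is an assumption):

```python
# Stop-gap sketch (not part of NLTK): strip leading "end of sentence"
# characters and whitespace so the first period-context match can no
# longer start at index 0, which is what triggers the IndexError on 3.6.6.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # Punkt models are required for tokenization

def tokenize_workaround(text, language="german"):
    cleaned = text.lstrip(".?!… \t\r\n")  # assumed set of leading characters to drop
    return word_tokenize(cleaned, language=language) if cleaned else []

print(tokenize_workaround("? Das ist ein Test."))
```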
This has broken our pipelines as well; I hope the new release will come out soon :) Thank you for the quick fix!
Apologies for the inconvenience, we're working on it!
This should save us from:
- A bug in the latest nltk version: nltk/nltk#2925
- Incompatibilities introduced by `transformers` and `datasets` updates
@Bernhard-Steindl @borzunov
Thanks.
I am working on some NLP experiments, where I want to tokenize some texts from users.
For that I am using NLTK right now, but I noticed an unexpected behavior when tokenizing a raw user input string.
I am not sure whether this is a bug in NLTK or whether I should provide a pre-processed string. Previously, I had no problems using NLTK to tokenize pre-processed datasets, but with raw user input I run into problems.
Do you have an explanation of the problem, or can you give me a hint on how to pre-process my user inputs before applying NLTK's `word_tokenize`?
I have provided a minimal reproducible example. I set up a conda environment for Python 3.9 and `nltk==3.6.6`:

    conda create -n "example_nltk" python=3.9 -y
    conda activate example_nltk
    pip install nltk==3.6.6
Then I create and run the following Python file:
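A minimal sketch of such a script (the exact input string is an assumption; any text beginning with a potential end-of-sentence character such as "?" triggers the error):

```python
# repro.py -- minimal sketch reproducing the IndexError on nltk 3.6.6
# (the input string is an assumption; any text that begins with a potential
# "end of sentence" character such as "?", "!" or "." triggers the crash)
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # Punkt models are needed for sentence/word tokenization

text = "? Das ist eine Eingabe eines Benutzers."  # raw user input
tokens = word_tokenize(text, language="german")   # raises IndexError on nltk 3.6.6
print(tokens)
```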
The script throws an `IndexError: list index out of range` when running the function `word_tokenize(text, language='german')`.
The error occurs in the `punkt.py` file, in the function `_match_potential_end_contexts`, at the line `before_words[match] = split[-1]`, because the variable `split` is empty (`[]`).
Do you have a suggestion on how I should proceed? Am I doing something wrong? Should I process the raw user input before supplying it to NLTK's `word_tokenize`?
Thank you for your support!
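For reference, a paraphrased sketch (not the verbatim NLTK source) of the failing logic in `_match_potential_end_contexts` shows why `split` ends up empty; the regex here is a simplified stand-in for Punkt's period-context pattern:

```python
import re

# Simplified stand-in for Punkt's period-context regex: optional word material,
# a potential sentence-end character, and a lookahead for what follows.
period_context_re = re.compile(r"\S*[.?!](?=\s+\S+|[^\w\s])")

def before_words_demo(text):
    before_words = {}
    for match in period_context_re.finditer(text):
        # Take everything before the match and split off the last word.
        split = text[: match.start()].rsplit(maxsplit=1)
        # If the match starts at index 0, text[:0] == "" and split == [],
        # so split[-1] raises IndexError: list index out of range.
        before_words[match.start()] = split[-1]
    return before_words

try:
    before_words_demo("? Das ist ein Test.")
except IndexError as err:
    print("IndexError:", err)  # list index out of range
```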
Here is the full traceback for details: