Sentence tokenizer fails when sentence.startswith('. ') #2921

zqwerty · 2021-12-21T13:03:05Z

from nltk.tokenize import sent_tokenize
sent_tokenize('. a')

This raises an IndexError in punkt.py:_realign_boundaries, whereas sent_tokenize('. ') does not.

Environment:
python=3.6
nltk=3.6.6


tomaarsen · 2021-12-21T14:13:46Z

Thank you for the bug report!
I believe I have resolved the issue in #2922. If you are experiencing this issue, please install the develop branch instead:

pip install git+https://github.com/nltk/nltk

I'll reopen this so others can see this.

tomaarsen · 2021-12-28T23:38:31Z

@zqwerty
NLTK 3.6.7 has been released, which includes the fix for this issue. Thank you for reporting it! I'll close this now.
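
For anyone who wants to check programmatically whether an installed NLTK already contains this fix, here is a minimal sketch. It assumes only what this thread states, namely that the fix first shipped in 3.6.7. The names `FIXED_IN`, `parse_version` and `has_fix` are hypothetical helpers for illustration and are not part of NLTK. Note that releases older than 3.6.6 will also report False, even though #2942 suggests 3.6.2 did not show the error.

```python
import re

# Release named in this thread as the first one containing the fix.
FIXED_IN = (3, 6, 7)


def parse_version(version):
    """Turn a string such as '3.6.7' into a tuple such as (3, 6, 7).

    Only the leading digits of the first three dot-separated pieces are
    used, so a suffix like 'rc1' is ignored. Missing pieces count as 0.
    """
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_fix(version):
    """Return True if the given version is 3.6.7 or newer."""
    return parse_version(version) >= FIXED_IN
```

With NLTK installed, this could be called as `has_fix(nltk.__version__)`. If it returns False on 3.6.6, upgrading to 3.6.7 or later should avoid the IndexError described above.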

tomaarsen added the bug label Dec 21, 2021

tomaarsen mentioned this issue Dec 21, 2021

Resolve IndexError in sent_tokenize #2922

Merged

tomaarsen closed this as completed in #2922 Dec 21, 2021

tomaarsen reopened this Dec 21, 2021

tomaarsen added the resolved label Dec 21, 2021

tomaarsen mentioned this issue Dec 24, 2021

NLTK word_tokenize throws IndexError: list index out of range #2925

Closed

tomaarsen closed this as completed Dec 28, 2021

tomaarsen mentioned this issue Jan 7, 2022

error in nltk tokenize #2927

Closed

This was referenced Jan 25, 2022

Possible tokenization issue with PunktSentenceTokenizer (nltk version is: 3.6.6) #2937

Closed

word_tokenize raises IndexError if . is in beginning of text in 3.6.6 but not in 3.6.2 #2942

Closed
