
Possible tokenization issue with PunktSentenceTokenizer (nltk version: 3.6.6) #2937

Closed
sstojanoska opened this issue Jan 25, 2022 · 1 comment

@sstojanoska

Hi, I am uncertain whether this is expected behavior, but I have encountered this issue on real-world data. Some of my input strings begin with EOS (end-of-sentence: !?.) punctuation, which causes the _match_potential_end_contexts method to fail.
To reproduce:

from nltk import sent_tokenize

# Both inputs start with EOS punctuation; on nltk 3.6.6 each call raises an error
text1 = "!!Fails to be tokenized."
text2 = "! Fails to be tokenized."

tokenized_text1 = sent_tokenize(text1)
tokenized_text2 = sent_tokenize(text2)

Is string preprocessing before tokenization the only solution? Thanks.
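
(For anyone who cannot upgrade, one possible preprocessing workaround is sketched below. tokenize_stripped is a hypothetical helper, and dropping leading EOS characters may not be acceptable for every use case.)

from nltk import sent_tokenize

def tokenize_stripped(text):
    # Hypothetical workaround: drop leading EOS characters before tokenizing
    return sent_tokenize(text.lstrip("!?. "))

print(tokenize_stripped("!!Fails to be tokenized."))  # ['Fails to be tokenized.']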

@tomaarsen
Member

tomaarsen commented Jan 25, 2022

Hello! This is indeed a bug, but luckily it has already been resolved: NLTK version 3.6.7 fixes this issue.

I would very much recommend upgrading to NLTK 3.6.7:

pip install -U nltk

(or the equivalent with Anaconda; see the note below)
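
For conda-managed environments, the equivalent is presumably:

conda update nltk

(an assumption on my part; adjust to however nltk was installed in your environment)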

See #2925 for more information. (Other similar issues: #2921, #2927.)

I hope this helped!
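
After upgrading, a quick sanity check might look like this (a sketch; the exact output is my assumption of the fixed behavior):

import nltk
from nltk import sent_tokenize

print(nltk.__version__)  # should print 3.6.7 or later
print(sent_tokenize("!!Fails to be tokenized."))  # no longer raises an exception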
