
Possible tokenization issue with PunktSentenceTokenizer (nltk version: 3.6.6) #2937

Closed
sstojanoska opened this issue Jan 25, 2022 · 1 comment

@sstojanoska

Hi, I am uncertain whether this is expected behavior, but I have encountered this issue on real-world data. Some of my input strings begin with EOS (end-of-sentence: !?.) punctuation, which causes the _match_potential_end_contexts method to fail.
To reproduce:

from nltk import sent_tokenize

# Both inputs start with EOS punctuation; on nltk 3.6.6 each call raises an error
text1 = "!!Fails to be tokenized."
text2 = "! Fails to be tokenized."

tokenized_text1 = sent_tokenize(text1)
tokenized_text2 = sent_tokenize(text2)

Is string preprocessing before tokenization the only solution? Thanks.
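
(For anyone who cannot upgrade, one possible preprocessing workaround is sketched below. tokenize_stripped is a hypothetical helper, and dropping leading EOS characters may not be acceptable for every use case.)

from nltk import sent_tokenize

def tokenize_stripped(text):
    # Hypothetical workaround: drop leading EOS characters before tokenizing
    return sent_tokenize(text.lstrip("!?. "))

print(tokenize_stripped("!!Fails to be tokenized."))  # ['Fails to be tokenized.']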

@tomaarsen
Member

tomaarsen commented Jan 25, 2022

Hello! This is indeed a bug, but luckily it has already been resolved: NLTK version 3.6.7 fixes this issue.

I would very much recommend upgrading to NLTK 3.6.7:

pip install -U nltk

(or the equivalent with Anaconda; see the note below)
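
For conda-managed environments, the equivalent is presumably:

conda update nltk

(an assumption on my part; adjust to however nltk was installed in your environment)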

See #2925 for more information. (Other similar issues: #2921, #2927.)

I hope this helped!
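
After upgrading, a quick sanity check might look like this (a sketch; the exact output is my assumption of the fixed behavior):

import nltk
from nltk import sent_tokenize

print(nltk.__version__)  # should print 3.6.7 or later
print(sent_tokenize("!!Fails to be tokenized."))  # no longer raises an exception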
