Sentence tokenizer fails when sentence.startswith('. ') #2921

Closed
zqwerty opened this issue Dec 21, 2021 · 2 comments · Fixed by #2922

Comments

@zqwerty

zqwerty commented Dec 21, 2021

from nltk.tokenize import sent_tokenize
sent_tokenize('. a')

This raises an IndexError in punkt.py:_realign_boundaries, but sent_tokenize('. ') does not.

Envs:
python=3.6
nltk=3.6.6
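
For anyone who cannot upgrade right away, here is a minimal sketch of a defensive wrapper around the failing call. `safe_sent_tokenize` is a hypothetical helper, not part of NLTK, and the fallback (returning the whole input as a single sentence) is an assumption that may not suit every pipeline. The tokenizer is passed in as an argument so the sketch does not depend on any particular NLTK version.

```python
from typing import Callable, List


def safe_sent_tokenize(text: str, tokenizer: Callable[[str], List[str]]) -> List[str]:
    """Call `tokenizer` on `text`, falling back if it raises IndexError.

    Hypothetical helper for illustration only; it is not part of NLTK.
    """
    try:
        return tokenizer(text)
    except IndexError:
        # Assumed fallback: treat the whole input as one sentence,
        # or return no sentences when the input is only whitespace.
        stripped = text.strip()
        return [stripped] if stripped else []
```

With NLTK installed, this could be called as `safe_sent_tokenize('. a', sent_tokenize)` after `from nltk.tokenize import sent_tokenize`.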

@tomaarsen
Member

Thank you for the bug report!
I believe I have resolved the issue in #2922. If you are experiencing this issue, please use the develop branch in the meantime:

pip install git+https://github.com/nltk/nltk

I'll reopen this issue so that others can find it.

@tomaarsen
Member

tomaarsen commented Dec 28, 2021

@zqwerty
NLTK 3.6.7 has been released, which includes the fix for this issue. Thank you for reporting it! I'll close this now.
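
To check whether an installed version is at or past the release mentioned above, something like the following sketch may help. `parse_version` and `includes_fix` are hypothetical helpers, and the comparison only looks at the leading numeric part of the version string, so pre-release suffixes are ignored. It makes no claim about which earlier versions are affected; the thread only confirms the error on 3.6.6 and the fix in 3.6.7.

```python
import re
from typing import Tuple

# Release named in this thread as containing the fix.
FIXED_IN = (3, 6, 7)


def parse_version(version: str) -> Tuple[int, ...]:
    """Return the leading numeric components of a version string."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def includes_fix(version: str) -> bool:
    """Return True if `version` is at or after FIXED_IN."""
    parts = parse_version(version)
    # Pad short versions so that "3.7" compares as (3, 7, 0).
    padded = parts + (0,) * (len(FIXED_IN) - len(parts))
    return padded >= FIXED_IN
```

With NLTK installed, `includes_fix(nltk.__version__)` would apply the check to the running environment.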
