
Resolve IndexError in sent_tokenize #2922

Merged
merged 2 commits into nltk:develop from bugfix/punkt-patch on Dec 21, 2021

Conversation

tomaarsen
Member

Closes #2921

Hello!

Pull request overview

  • Prevent `IndexError` when the input of `sent_tokenize` starts with a dot.
  • Wrote tests to prevent this from regressing (sketched below).

Details

`split` can be empty whenever a `.` is matched at the very start of the text, i.e. whenever the input of `sent_tokenize` starts with a `.`. In that case `text[: match.start()]` is the empty string, `rsplit` returns an empty list, and `split[-1]` in the following snippet raises an `IndexError`:

nltk/nltk/tokenize/punkt.py

Lines 1380 to 1382 in dd1494e

```python
split = text[: match.start()].rsplit(maxsplit=1)  # [] when text[: match.start()] is ""
before_start = len(split[0]) if len(split) == 2 else 0
before_words[match] = split[-1]  # IndexError: list index out of range when split is []
```
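
For context, here is a minimal reproduction of the failure, assuming an NLTK version without this fix; the exact input string is illustrative, since any text starting with `'. '` triggers it (see #2921):

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # the Punkt models are required by sent_tokenize

# On an affected version, the leading ". " makes text[: match.start()] empty
# inside punkt.py, so rsplit() returns [] and split[-1] raises
# IndexError: list index out of range.
print(sent_tokenize(". This is a sentence. Here is another one."))
```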

This PR ought to resolve this issue.
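
As a sketch of the kind of regression test meant above (the test name, framework, and input are illustrative, not the actual tests added in this PR):

```python
from nltk.tokenize import sent_tokenize

def test_sent_tokenize_leading_dot():
    # Before this fix, any input starting with ". " raised an IndexError
    # inside nltk/tokenize/punkt.py; afterwards it should tokenize cleanly.
    sentences = sent_tokenize(". This is a sentence. Here is another one.")
    assert isinstance(sentences, list)
    assert all(isinstance(s, str) for s in sentences)
```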

  • Tom Aarsen

@tomaarsen tomaarsen merged commit d4d99b4 into nltk:develop Dec 21, 2021
@tomaarsen tomaarsen deleted the bugfix/punkt-patch branch December 21, 2021 14:08
ExplorerFreda added a commit to ExplorerFreda/nltk that referenced this pull request Dec 24, 2021
…develop

* 'develop' of https://github.com/ExplorerFreda/nltk:
  Temporarily pause Python 3.10 CI tests due to scikit-learn issues with Windows
  Resolve IndexError in `sent_tokenize` (nltk#2922)
  Drop support for Python 3.6, support Python 3.10 (nltk#2920)
  updates for 3.6.6
  minor clean ups
  updates for 3.6.6
Development

Successfully merging this pull request may close these issues.

Sentence tokenizer fails when sentence.startswith('. ')