Tackle performance and accuracy regression of sentence tokenizer since NLTK 3.6.6 #3014

tomaarsen · 2022-06-21T14:32:12Z

Resolves #3013, resolves #2981 and resolves #2934.

Hello!

Pull request overview

Improve accuracy of sent_tokenize, i.e. resolve issues from Potential bug in sentence tokenizer since 3.6.6 #2934.
Improve time-efficiency of sent_tokenize (and the underlying PunktTokenizer), as reported by Performance regression in sent_tokenize #3013.
Consequently, improve efficiency of word_tokenize which relies on sent_tokenize whenever preserve_line=False. Issues with this performance were reported in TreebankWordTokenizer potentially slower tokenizing text going from 3.6 to 3.7 #2981.
Add test cases based on provided samples from Potential bug in sentence tokenizer since 3.6.6 #2934.
Added type hinting to some punkt.py methods.

Changes

The primary changes refer to a complete overhaul of the _match_potential_end_contexts method introduced in #2869. This method was introduced to combat a nasty ReDoS for long words, but it unknowingly introduced some other issues. For example, the rsplit in the following line removed the whitespace that it was splitting on, which caused . . to be parsed equivalently as .. in some cases.

nltk/nltk/tokenize/punkt.py

Line 1380 in 6de8254

split = text[: match.start()].rsplit(maxsplit=1)

Furthermore, the text[: match.start()] ends up being quite expensive with enormous strings. The new implementation also removes the need for fully computing this list in order to reverse it:

nltk/nltk/tokenize/punkt.py

Line 1375 in 6de8254

    
           for match in reversed(list(self._lang_vars.period_context_re().finditer(text))):

Now, we can simply iterate over the iterator like normal, which should also be more memory-efficient.

Experiments

Accuracy

In the end, we're most interested in what effects these changes have. First of all, let's talk accuracy. The changes from NLTK 3.6.6 introduced some small issues, for which tests have been developed now:

    @pytest.mark.parametrize(
        "sentences, expected",
        [
            (
                "this is a test. . new sentence.",
                ["this is a test.", ".", "new sentence."],
            ),
            ("This. . . That", ["This.", ".", ".", "That"]),
            ("This..... That", ["This..... That"]),
            ("This... That", ["This... That"]),
            ("This.. . That", ["This.. .", "That"]),
            ("This. .. That", ["This.", ".. That"]),
            ("This. ,. That", ["This.", ",.", "That"]),
            ("This!!! That", ["This!!!", "That"]),
            ("This! That", ["This!", "That"]),
            (
                "1. This is R .\n2. This is A .\n3. That's all",
                ["1.", "This is R .", "2.", "This is A .", "3.", "That's all"],
            ),
            (
                "1. This is R .\t2. This is A .\t3. That's all",
                ["1.", "This is R .", "2.", "This is A .", "3.", "That's all"],
            ),
            ("Hello.\tThere", ["Hello.", "There"]),
        ],
    )
    def test_sent_tokenize(self, sentences: str, expected: List[str]):
        assert sent_tokenize(sentences) == expected

Thank you @radcheb, @griverorz and @davidmezzetti for helping provide some of these previously broken test cases. As you can see via the CI, all of these tests pass now. These results correspond exactly with the results from NLTK 3.6.5 and before. Note that I'm open to receive more hand-crafted test cases!

I generated approximately 20k test cases combining different types of punctuation, out of which some of the new results still differ compared to NLTK 3.6.5 and before. However, these were all cases with two different types of sequential punctuation, like so:

>>> nltk.__version__
'3.6.5'
>>> nltk.sent_tokenize(".!,?a")
['.', '!,?a']
>>> nltk.sent_tokenize(".!,? a")
['.!,?', 'a']

In 3.6.5, for some reason, adding a space before a causes the other punctuation marks to group together?
The new behaviour is:

>>> nltk.sent_tokenize(".!,?a")
['.', '!,?a']
>>> nltk.sent_tokenize(".!,? a")
['.', '!,?', 'a']

This seems to be more consistent, but it's difficult to determine what the correct result should even be, in odd cases like these. In short, I have no issues with these tiny changes relative to NLTK 3.6.5 and before.

Efficiency

As we recognize that NLTK is used for large text processing, efficiency is of high priority. I've ran some experiments to verify the new efficiencies.

word_tokenize efficiency summary

To generate the results, we use the following simple script:

from nltk import word_tokenize
import time

n = 8
for element in ["a", "a ", "abc ", "a.", "abc.", "a. ", "abc. "]:
    print(f"Running experiments with repeated occurrences of {element!r}.")
    for length in [10**i for i in range(2, n)]:
        text = element * length
        start_t = time.time()
        out = word_tokenize(text)
        print(f"A length of {length:<{n}} takes {time.time() - start_t:.4f}s (len={len(out)})")
    print()

Which creates an input string for word_tokenize by repeating element some number of times. This is ran for NLTK 3.6.5, NLTK 3.7, and after this PR. Note that some variations are normal, as I'm only running it once.

Baseline efficiency (NLTK 3.6.5)

Running experiments with repeated occurrences of 'a'.
A length of 100      takes 0.0100s (len=1)
A length of 1000     takes 0.0050s (len=1)
A length of 10000    takes 0.6000s (len=1)
...
# Non-linear

Running experiments with repeated occurrences of 'a '.
A length of 100      takes 0.0050s (len=100)
A length of 1000     takes 0.0010s (len=1000)
A length of 10000    takes 0.0090s (len=10000)
A length of 100000   takes 0.0939s (len=100000)
A length of 1000000  takes 1.0470s (len=1000000)
A length of 10000000 takes 10.7118s (len=10000000)
# Linear

Running experiments with repeated occurrences of 'abc '.
A length of 100      takes 0.0220s (len=100)
A length of 1000     takes 0.0030s (len=1000)
A length of 10000    takes 0.0190s (len=10000)
A length of 100000   takes 0.2420s (len=100000)
A length of 1000000  takes 1.7701s (len=1000000)
A length of 10000000 takes 19.7902s (len=10000000)
# Linear

Running experiments with repeated occurrences of 'a.'.
A length of 100      takes 0.2211s (len=2)
A length of 1000     takes 0.0730s (len=2)
A length of 10000    takes 7.0727s (len=2)
...
# Non-linear

Running experiments with repeated occurrences of 'abc.'.
A length of 100      takes 0.0080s (len=2)
A length of 1000     takes 0.1940s (len=2)
A length of 10000    takes 14.9596s (len=2)
...
# Non-linear

Running experiments with repeated occurrences of 'a. '.
A length of 100      takes 0.0080s (len=101)
A length of 1000     takes 0.0250s (len=1001)
A length of 10000    takes 0.2290s (len=10001)
A length of 100000   takes 1.4147s (len=100001)
A length of 1000000  takes 13.8890s (len=1000001)
...
# Linear

Running experiments with repeated occurrences of 'abc. '.
A length of 100      takes 0.0160s (len=200)
A length of 1000     takes 0.0670s (len=2000)
A length of 10000    takes 0.5420s (len=20000)
A length of 100000   takes 5.0774s (len=200000)
...
# Linear

The comments were manually added, based on whether it seems like the performance is linear or not.
As described in #2869, a really long single word caused a serious ReDoS. So, the first experiment with just repeated a had to be terminated manually. Furthermore, a. and abc. also seem to cause worse-than-linear time complexities. Lastly, a. and abc. seem to be linear again, but just slow.

Baseline efficiency (NLTK 3.7)

Running experiments with repeated occurrences of 'a'.
A length of 100      takes 0.0050s (len=1)
A length of 1000     takes 0.0010s (len=1)
A length of 10000    takes 0.0030s (len=1)
A length of 100000   takes 0.0320s (len=1)
A length of 1000000  takes 0.3201s (len=1)
A length of 10000000 takes 2.9845s (len=1)
# Linear

Running experiments with repeated occurrences of 'a '.
A length of 100      takes 0.0010s (len=100)
A length of 1000     takes 0.0010s (len=1000)
A length of 10000    takes 0.0080s (len=10000)        
A length of 100000   takes 0.0797s (len=100000)
A length of 1000000  takes 0.8070s (len=1000000)
A length of 10000000 takes 8.2493s (len=10000000)
# Linear

Running experiments with repeated occurrences of 'abc '.
A length of 100      takes 0.0240s (len=100)
A length of 1000     takes 0.0020s (len=1000)
A length of 10000    takes 0.0160s (len=10000)
A length of 100000   takes 0.1660s (len=100000)
A length of 1000000  takes 1.4884s (len=1000000)
A length of 10000000 takes 14.7242s (len=10000000)
# Linear

Running experiments with repeated occurrences of 'a.'.
A length of 100      takes 0.2760s (len=2)
A length of 1000     takes 0.0010s (len=2)
A length of 10000    takes 0.0090s (len=2)
A length of 100000   takes 0.1050s (len=2)
A length of 1000000  takes 0.9317s (len=2)
A length of 10000000 takes 8.8467s (len=2)
# Linear

Running experiments with repeated occurrences of 'abc.'.
A length of 100      takes 0.0010s (len=2)
A length of 1000     takes 0.0010s (len=2)
A length of 10000    takes 0.0160s (len=2)
A length of 100000   takes 0.1900s (len=2)
A length of 1000000  takes 1.4831s (len=2)
A length of 10000000 takes 14.7677s (len=2)
# Linear

Running experiments with repeated occurrences of 'a. '.
A length of 100      takes 0.0060s (len=101)
A length of 1000     takes 0.0170s (len=1001)
A length of 10000    takes 0.1841s (len=10001)
A length of 100000   takes 2.5929s (len=100001)
...
# Non-linear (I waited several minutes)

Running experiments with repeated occurrences of 'abc. '.
A length of 100      takes 0.0100s (len=200)
A length of 1000     takes 0.0509s (len=2000)
A length of 10000    takes 0.5420s (len=20000)
A length of 100000   takes 6.8013s (len=200000)
...
# Non-linear (I waited several minutes)

As can be seen here, a. and abc. seem to have been solved since NLTK 3.6.5. The tests for a. and abc. again show O(n), but it's slightly worse than 10x as slow for 10x as much data.

New efficiency

Running experiments with repeated occurrences of 'a'.
A length of 100      takes 0.0060s (len=1)
A length of 1000     takes 0.0010s (len=1)
A length of 10000    takes 0.0030s (len=1)
A length of 100000   takes 0.0320s (len=1)
A length of 1000000  takes 0.3006s (len=1)
A length of 10000000 takes 3.3780s (len=1)
# Linear

Running experiments with repeated occurrences of 'a '.
A length of 100      takes 0.0010s (len=100)
A length of 1000     takes 0.0010s (len=1000)
A length of 10000    takes 0.0080s (len=10000)
A length of 100000   takes 0.1020s (len=100000)
A length of 1000000  takes 0.7950s (len=1000000)
A length of 10000000 takes 9.4390s (len=10000000)
# Linear

Running experiments with repeated occurrences of 'abc '.
A length of 100      takes 0.0210s (len=100)
A length of 1000     takes 0.0020s (len=1000)
A length of 10000    takes 0.0160s (len=10000)
A length of 100000   takes 0.1920s (len=100000)
A length of 1000000  takes 1.7082s (len=1000000)
A length of 10000000 takes 17.7505s (len=10000000)
# Linear

Running experiments with repeated occurrences of 'a.'.
A length of 100      takes 0.2178s (len=2)
A length of 1000     takes 0.0010s (len=2)
A length of 10000    takes 0.0100s (len=2)
A length of 100000   takes 0.1360s (len=2)
A length of 1000000  takes 0.9942s (len=2)
A length of 10000000 takes 10.4847s (len=2)
# Linear

Running experiments with repeated occurrences of 'abc.'.
A length of 100      takes 0.0020s (len=2)
A length of 1000     takes 0.0020s (len=2)
A length of 10000    takes 0.0170s (len=2)
A length of 100000   takes 0.2130s (len=2)
A length of 1000000  takes 1.7320s (len=2)
A length of 10000000 takes 18.1585s (len=2)
# Linear

Running experiments with repeated occurrences of 'a. '.
A length of 100      takes 0.0060s (len=101)
A length of 1000     takes 0.0180s (len=1001)
A length of 10000    takes 0.2560s (len=10001)
A length of 100000   takes 1.6629s (len=100001)
A length of 1000000  takes 16.1289s (len=1000001)
...
# Linear

Running experiments with repeated occurrences of 'abc. '.
A length of 100      takes 0.0170s (len=200)
A length of 1000     takes 0.1070s (len=2000)
A length of 10000    takes 0.6226s (len=20000)
A length of 100000   takes 5.0704s (len=200000)
A length of 1000000  takes 50.7210s (len=2000000)
...
# Linear

All of these computation times are linear to the size of the input, which is satisfactory. The first experiment works unlike for NLTK 3.6.5, thanks to #2869. Beyond that, there is perhaps a small speedup for some (e.g. 16s -> 14s for abc), and perhaps some slowdown for others (e.g. 14s -> 16s for a. ), but I can't conclusively say that without some (overkill) significance testing.

To conclude, there is no longer any non-linear time complexity in these tests.

sent_tokenize efficiency summary

To generate the results, we use a very similar script as with word_tokenize:

from nltk import sent_tokenize
import time

n = 8
for element in ["a", "a ", "abc ", "a.", "abc.", "a. ", "abc. "]:
    print(f"Running experiments with repeated occurrences of {element!r}.")
    for length in [10**i for i in range(2, n)]:
        text = element * length
        start_t = time.time()
        out = sent_tokenize(text)
        print(f"A length of {length:<{n}} takes {time.time() - start_t:.4f}s (len={len(out)})")
    print()

Which creates an input string for sent_tokenize by repeating element some number of times. This is ran for NLTK 3.6.5, NLTK 3.7, and after this PR. Note that some variations are normal, as I'm only running it once.

Baseline efficiency (NLTK 3.6.5)

Running experiments with repeated occurrences of 'a'.
A length of 100      takes 0.0050s (len=1)
A length of 1000     takes 0.0060s (len=1)
A length of 10000    takes 0.6029s (len=1)
...
# Non-linear

Running experiments with repeated occurrences of 'a '.
A length of 100      takes 0.0050s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0000s (len=1)
A length of 100000   takes 0.0060s (len=1)
A length of 1000000  takes 0.0840s (len=1)
A length of 10000000 takes 0.6043s (len=1)
# Linear

Running experiments with repeated occurrences of 'abc '.
A length of 100      takes 0.0020s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0010s (len=1)
A length of 100000   takes 0.0180s (len=1)
A length of 1000000  takes 0.1700s (len=1)
A length of 10000000 takes 1.6243s (len=1)
# Linear

Running experiments with repeated occurrences of 'a.'.
A length of 100      takes 0.0030s (len=1)
A length of 1000     takes 0.0570s (len=1)
A length of 10000    takes 5.0565s (len=1)
...
# Non-linear

Running experiments with repeated occurrences of 'abc.'.
A length of 100      takes 0.0110s (len=1)
A length of 1000     takes 0.2360s (len=1)
A length of 10000    takes 14.6740s (len=1)
...
# Non-linear

Running experiments with repeated occurrences of 'a. '.
A length of 100      takes 0.0080s (len=1)
A length of 1000     takes 0.0190s (len=1)
A length of 10000    takes 0.2060s (len=1)
A length of 100000   takes 1.7631s (len=1)
A length of 1000000  takes 12.7064s (len=1)
...
# Linear

Running experiments with repeated occurrences of 'abc. '.
A length of 100      takes 0.0070s (len=100)
A length of 1000     takes 0.0160s (len=1000)
A length of 10000    takes 0.1630s (len=10000)
A length of 100000   takes 1.4806s (len=100000)
A length of 1000000  takes 13.8376s (len=1000000)
# Linear

All of a., abc. and a had poor performance for this version.

Baseline efficiency (NLTK 3.7)

Running experiments with repeated occurrences of 'a'.
A length of 100      takes 0.0060s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0010s (len=1)
A length of 100000   takes 0.0000s (len=1)
A length of 1000000  takes 0.0040s (len=1)
A length of 10000000 takes 0.0470s (len=1)
# Linear

Running experiments with repeated occurrences of 'a '.
A length of 100      takes 0.0010s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0010s (len=1)
A length of 100000   takes 0.0020s (len=1)
A length of 1000000  takes 0.0150s (len=1)
A length of 10000000 takes 0.1120s (len=1)
# Linear

Running experiments with repeated occurrences of 'abc '.
A length of 100      takes 0.0020s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0000s (len=1)
A length of 100000   takes 0.0020s (len=1)
A length of 1000000  takes 0.0260s (len=1)
A length of 10000000 takes 0.2830s (len=1)
# Linear

Running experiments with repeated occurrences of 'a.'.
A length of 100      takes 0.0030s (len=1)
A length of 1000     takes 0.0010s (len=1)
A length of 10000    takes 0.0000s (len=1)
A length of 100000   takes 0.0050s (len=1)
A length of 1000000  takes 0.0520s (len=1)
A length of 10000000 takes 0.4911s (len=1)
# Linear

Running experiments with repeated occurrences of 'abc.'.
A length of 100      takes 0.0020s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0010s (len=1)
A length of 100000   takes 0.0050s (len=1)
A length of 1000000  takes 0.0550s (len=1)
A length of 10000000 takes 0.5165s (len=1)
# Linear

Running experiments with repeated occurrences of 'a. '.
A length of 100      takes 0.0050s (len=1)
A length of 1000     takes 0.0120s (len=1)
A length of 10000    takes 0.1700s (len=1)
A length of 100000   takes 2.2183s (len=1)
...
# Non-linear (I waited for approximately 15 minutes here)

Running experiments with repeated occurrences of 'abc. '.
A length of 100      takes 0.0070s (len=100)
A length of 1000     takes 0.0170s (len=1000)
A length of 10000    takes 0.1736s (len=10000)
A length of 100000   takes 3.3012s (len=100000)
...
# Non-linear

Here, a. and abc. are non-linear, which is indicative of the performance issues people are having.

New efficiency

Running experiments with repeated occurrences of 'a'.
A length of 100      takes 0.0070s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0000s (len=1)
A length of 100000   takes 0.0010s (len=1)
A length of 1000000  takes 0.0040s (len=1)
A length of 10000000 takes 0.0430s (len=1)
# Linear

Running experiments with repeated occurrences of 'a '.
A length of 100      takes 0.0020s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0010s (len=1)
A length of 100000   takes 0.0010s (len=1)
A length of 1000000  takes 0.0130s (len=1)
A length of 10000000 takes 0.1480s (len=1)
# Linear

Running experiments with repeated occurrences of 'abc '.
A length of 100      takes 0.0020s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0010s (len=1)
A length of 100000   takes 0.0040s (len=1)
A length of 1000000  takes 0.0230s (len=1)
A length of 10000000 takes 0.2340s (len=1)
# Linear

Running experiments with repeated occurrences of 'a.'.
A length of 100      takes 0.0030s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0000s (len=1)
A length of 100000   takes 0.0060s (len=1)
A length of 1000000  takes 0.0480s (len=1)
A length of 10000000 takes 0.4950s (len=1)
# Linear

Running experiments with repeated occurrences of 'abc.'.
A length of 100      takes 0.0010s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0010s (len=1)
A length of 100000   takes 0.0050s (len=1)
A length of 1000000  takes 0.0585s (len=1)
A length of 10000000 takes 0.5450s (len=1)
# Linear

Running experiments with repeated occurrences of 'a. '.
A length of 100      takes 0.0060s (len=1)
A length of 1000     takes 0.0170s (len=1)
A length of 10000    takes 0.1670s (len=1)
A length of 100000   takes 1.4250s (len=1)
A length of 1000000  takes 14.6518s (len=1)
...
# Linear

Running experiments with repeated occurrences of 'abc. '.
A length of 100      takes 0.0090s (len=100)
A length of 1000     takes 0.0180s (len=1000)
A length of 10000    takes 0.2020s (len=10000)
A length of 100000   takes 1.8661s (len=100000)
A length of 1000000  takes 18.4968s (len=1000000)
...
# Linear

Again, all the performances seem to be linear now. The performance seems to be as good as pre-NLTK 3.6.5, or even better (e.g. for a).

To conclude, the performance in terms of time-efficiency is as good, or better, than it ever has been.

In short

It seems like this PR should:

correctly tackle the (accuracy) regressions reported in Potential bug in sentence tokenizer since 3.6.6 #2934, as shown via new test cases.
correctly tackle the (efficiency) regressions reported in Performance regression in sent_tokenize #3013 (sent_tokenize) and TreebankWordTokenizer potentially slower tokenizing text going from 3.6 to 3.7 #2981 (word_tokenize), as shown with time-efficiency experiments.

I'm open to receive new test cases to help improve the robustness of both word_tokenize and sent_tokenize. The goal of this PR was to make both of these functions linear time complexity. Future PRs could then try to improve the efficiency of certain sections of code to decrease the "slope" of the linear increase.

Thanks to @12mohaned for pointing me to #2934, although I was too busy to work on it at the time. I also want to thank @radcheb, @juhoinkinen and @ViktorDrugge for raising their respective issues, and those who contributed in them.

Tom Aarsen

…ults Some minor inconsistencies remain, e.g. for '.!, a'. Furthermore, the time-efficiency was improved. Lastly, some methods were supplemented with type hints.

osma · 2022-06-22T11:29:31Z

Whoa! This must be the most awesome PR description I've ever seen. Not only is this excellent and very timely work in itself, it's also extremely well described and analyzed from many angles. Kudos!

stevenbird · 2022-07-04T05:40:14Z

Wonderful work @tomaarsen!

tomaarsen added 5 commits June 21, 2022 10:04

Add new sent_tokenize tests that I'd like to see pass

952119e

Further expand previously broken tests

bf952d6

Align _match_potential_end_contexts with NLTK 3.6.5 sent_tokenize res…

92edd2b

…ults Some minor inconsistencies remain, e.g. for '.!, a'. Furthermore, the time-efficiency was improved. Lastly, some methods were supplemented with type hints.

Remove leftover print

cf944c1

Added and modified some comments

4073f2b

tomaarsen self-assigned this Jun 21, 2022

stevenbird merged commit 86b11fb into nltk:develop Jul 4, 2022

tomaarsen deleted the perf/tokenize branch July 4, 2022 09:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tackle performance and accuracy regression of sentence tokenizer since NLTK 3.6.6 #3014

Tackle performance and accuracy regression of sentence tokenizer since NLTK 3.6.6 #3014

tomaarsen commented Jun 21, 2022

osma commented Jun 22, 2022

stevenbird commented Jul 4, 2022

Tackle performance and accuracy regression of sentence tokenizer since NLTK 3.6.6 #3014

Tackle performance and accuracy regression of sentence tokenizer since NLTK 3.6.6 #3014

Conversation

tomaarsen commented Jun 21, 2022

Pull request overview

Changes

Experiments

Accuracy

Efficiency

Baseline efficiency (NLTK 3.6.5)

Baseline efficiency (NLTK 3.7)

New efficiency

Baseline efficiency (NLTK 3.6.5)

Baseline efficiency (NLTK 3.7)

New efficiency

In short

osma commented Jun 22, 2022

stevenbird commented Jul 4, 2022