
Tackle performance and accuracy regression of sentence tokenizer since NLTK 3.6.6 #3014

Merged
merged 5 commits into nltk:develop from perf/tokenize on Jul 4, 2022

Conversation

tomaarsen
Member

Resolves #3013, resolves #2981 and resolves #2934.

Hello!

Pull request overview

- Overhauls the _match_potential_end_contexts method to fix the accuracy and performance regressions introduced in NLTK 3.6.6.
- Adds test cases for the previously broken tokenization behaviour.
- Benchmarks word_tokenize and sent_tokenize on NLTK 3.6.5, NLTK 3.7 and this PR.

Changes

The primary change is a complete overhaul of the _match_potential_end_contexts method introduced in #2869. That method was added to combat a nasty ReDoS for long words, but it unknowingly introduced some other issues. For example, the rsplit in the following line removed the whitespace it split on, which caused '. .' to be parsed equivalently to '..' in some cases.

split = text[: match.start()].rsplit(maxsplit=1)
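To illustrate (a minimal REPL example, not the actual Punkt code): rsplit with the default separator discards the whitespace it splits on, so the text preceding a '. .' and the text preceding a '..' yield the same "previous token":

>>> "This is a test. ".rsplit(maxsplit=1)   # a space precedes the cut point (". ." case)
['This is a', 'test.']
>>> "This is a test.".rsplit(maxsplit=1)    # no space precedes the cut point (".." case)
['This is a', 'test.']

Because both calls return the same result, the tokenizer could no longer tell whether whitespace separated the two periods.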

Furthermore, the slicing in text[: match.start()] ends up being quite expensive for enormous strings. The new implementation also removes the need to fully compute the list of matches just to reverse it:
for match in reversed(list(self._lang_vars.period_context_re().finditer(text))):

Now, we can simply iterate over the match iterator directly, which should also be more memory-efficient.
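As a rough sketch of the difference (using a simplified stand-in for self._lang_vars.period_context_re(), not the real Punkt pattern):

import re

# Simplified stand-in pattern; the real period_context_re is more involved.
period_context_re = re.compile(r"\S*[.!?](?:\s+\S+)?")

text = "First sentence. Second sentence. Third sentence."

# Old approach: materialize every match up front, just to walk the list backwards.
for match in reversed(list(period_context_re.finditer(text))):
    print("old:", match.group())

# New approach: consume the iterator lazily, front to back, with no intermediate list.
for match in period_context_re.finditer(text):
    print("new:", match.group())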

Experiments

Accuracy

In the end, we're most interested in what effects these changes have. First of all, let's talk accuracy. The changes from NLTK 3.6.6 introduced some small issues, for which tests have now been developed:

import pytest
from typing import List

from nltk import sent_tokenize


class TestTokenize:
    @pytest.mark.parametrize(
        "sentences, expected",
        [
            (
                "this is a test. . new sentence.",
                ["this is a test.", ".", "new sentence."],
            ),
            ("This. . . That", ["This.", ".", ".", "That"]),
            ("This..... That", ["This..... That"]),
            ("This... That", ["This... That"]),
            ("This.. . That", ["This.. .", "That"]),
            ("This. .. That", ["This.", ".. That"]),
            ("This. ,. That", ["This.", ",.", "That"]),
            ("This!!! That", ["This!!!", "That"]),
            ("This! That", ["This!", "That"]),
            (
                "1. This is R .\n2. This is A .\n3. That's all",
                ["1.", "This is R .", "2.", "This is A .", "3.", "That's all"],
            ),
            (
                "1. This is R .\t2. This is A .\t3. That's all",
                ["1.", "This is R .", "2.", "This is A .", "3.", "That's all"],
            ),
            ("Hello.\tThere", ["Hello.", "There"]),
        ],
    )
    def test_sent_tokenize(self, sentences: str, expected: List[str]):
        assert sent_tokenize(sentences) == expected

Thank you @radcheb, @griverorz and @davidmezzetti for helping provide some of these previously broken test cases. As you can see from the CI, all of these tests now pass. These results correspond exactly to the results from NLTK 3.6.5 and before. Note that I'm open to receiving more hand-crafted test cases!

I generated approximately 20k test cases combining different types of punctuation; for some of these, the new results still differ from NLTK 3.6.5 and before. However, these were all cases with two different types of sequential punctuation, like so:

>>> nltk.__version__
'3.6.5'
>>> nltk.sent_tokenize(".!,?a")
['.', '!,?a']
>>> nltk.sent_tokenize(".!,? a")
['.!,?', 'a']

In 3.6.5, for some reason, adding a space before the 'a' causes the other punctuation marks to group together.
The new behaviour is:

>>> nltk.sent_tokenize(".!,?a")
['.', '!,?a']
>>> nltk.sent_tokenize(".!,? a")
['.', '!,?', 'a']

This seems to be more consistent, but it's difficult to determine what the correct result should even be, in odd cases like these. In short, I have no issues with these tiny changes relative to NLTK 3.6.5 and before.
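For reference, the roughly 20k generated cases mentioned above were built by combining punctuation marks with and without spaces before a word. A hypothetical sketch of that kind of brute-force generation (not the exact script used):

import itertools
from nltk import sent_tokenize

marks = [".", "!", "?", ","]
for combo in itertools.product(marks, repeat=3):
    for suffix in ("a", " a"):
        case = "".join(combo) + suffix
        print(repr(case), sent_tokenize(case))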

Efficiency

Since NLTK is used for large-scale text processing, efficiency is a high priority. I've run some experiments to verify the new implementation's efficiency.

word_tokenize efficiency summary

To generate the results, we use the following simple script:

from nltk import word_tokenize
import time

n = 8
for element in ["a", "a ", "abc ", "a.", "abc.", "a. ", "abc. "]:
    print(f"Running experiments with repeated occurrences of {element!r}.")
    for length in [10**i for i in range(2, n)]:
        text = element * length
        start_t = time.time()
        out = word_tokenize(text)
        print(f"A length of {length:<{n}} takes {time.time() - start_t:.4f}s (len={len(out)})")
    print()

This script creates an input string for word_tokenize by repeating element some number of times. It is run for NLTK 3.6.5, NLTK 3.7, and after this PR. Note that some variation is normal, as I'm only running each configuration once.

Baseline efficiency (NLTK 3.6.5)

Running experiments with repeated occurrences of 'a'.
A length of 100      takes 0.0100s (len=1)
A length of 1000     takes 0.0050s (len=1)
A length of 10000    takes 0.6000s (len=1)
...
# Non-linear

Running experiments with repeated occurrences of 'a '.
A length of 100      takes 0.0050s (len=100)
A length of 1000     takes 0.0010s (len=1000)
A length of 10000    takes 0.0090s (len=10000)
A length of 100000   takes 0.0939s (len=100000)
A length of 1000000  takes 1.0470s (len=1000000)
A length of 10000000 takes 10.7118s (len=10000000)
# Linear

Running experiments with repeated occurrences of 'abc '.
A length of 100      takes 0.0220s (len=100)
A length of 1000     takes 0.0030s (len=1000)
A length of 10000    takes 0.0190s (len=10000)
A length of 100000   takes 0.2420s (len=100000)
A length of 1000000  takes 1.7701s (len=1000000)
A length of 10000000 takes 19.7902s (len=10000000)
# Linear

Running experiments with repeated occurrences of 'a.'.
A length of 100      takes 0.2211s (len=2)
A length of 1000     takes 0.0730s (len=2)
A length of 10000    takes 7.0727s (len=2)
...
# Non-linear

Running experiments with repeated occurrences of 'abc.'.
A length of 100      takes 0.0080s (len=2)
A length of 1000     takes 0.1940s (len=2)
A length of 10000    takes 14.9596s (len=2)
...
# Non-linear

Running experiments with repeated occurrences of 'a. '.
A length of 100      takes 0.0080s (len=101)
A length of 1000     takes 0.0250s (len=1001)
A length of 10000    takes 0.2290s (len=10001)
A length of 100000   takes 1.4147s (len=100001)
A length of 1000000  takes 13.8890s (len=1000001)
...
# Linear

Running experiments with repeated occurrences of 'abc. '.
A length of 100      takes 0.0160s (len=200)
A length of 1000     takes 0.0670s (len=2000)
A length of 10000    takes 0.5420s (len=20000)
A length of 100000   takes 5.0774s (len=200000)
...
# Linear

The comments were added manually, based on whether the performance appears linear or not.
As described in #2869, a really long single word caused a serious ReDoS, so the first experiment with just repeated 'a' had to be terminated manually. Furthermore, 'a.' and 'abc.' also seem to cause worse-than-linear time complexity. Lastly, 'a. ' and 'abc. ' (with a trailing space) seem to be linear, but just slow.
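One quick sanity check for these hand-added comments is the ratio of consecutive timings: for linear behaviour, 10x as much input should take roughly 10x as long. For example, using the 'a. ' timings above:

# Timings for 'a. ' on NLTK 3.6.5, copied from the output above.
timings = {100: 0.0080, 1_000: 0.0250, 10_000: 0.2290, 100_000: 1.4147, 1_000_000: 13.8890}

sizes = sorted(timings)
for small, large in zip(sizes, sizes[1:]):
    print(f"{small} -> {large}: {timings[large] / timings[small]:.1f}x slower for 10x the input")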

Baseline efficiency (NLTK 3.7)

Running experiments with repeated occurrences of 'a'.
A length of 100      takes 0.0050s (len=1)
A length of 1000     takes 0.0010s (len=1)
A length of 10000    takes 0.0030s (len=1)
A length of 100000   takes 0.0320s (len=1)
A length of 1000000  takes 0.3201s (len=1)
A length of 10000000 takes 2.9845s (len=1)
# Linear

Running experiments with repeated occurrences of 'a '.
A length of 100      takes 0.0010s (len=100)
A length of 1000     takes 0.0010s (len=1000)
A length of 10000    takes 0.0080s (len=10000)        
A length of 100000   takes 0.0797s (len=100000)
A length of 1000000  takes 0.8070s (len=1000000)
A length of 10000000 takes 8.2493s (len=10000000)
# Linear

Running experiments with repeated occurrences of 'abc '.
A length of 100      takes 0.0240s (len=100)
A length of 1000     takes 0.0020s (len=1000)
A length of 10000    takes 0.0160s (len=10000)
A length of 100000   takes 0.1660s (len=100000)
A length of 1000000  takes 1.4884s (len=1000000)
A length of 10000000 takes 14.7242s (len=10000000)
# Linear

Running experiments with repeated occurrences of 'a.'.
A length of 100      takes 0.2760s (len=2)
A length of 1000     takes 0.0010s (len=2)
A length of 10000    takes 0.0090s (len=2)
A length of 100000   takes 0.1050s (len=2)
A length of 1000000  takes 0.9317s (len=2)
A length of 10000000 takes 8.8467s (len=2)
# Linear

Running experiments with repeated occurrences of 'abc.'.
A length of 100      takes 0.0010s (len=2)
A length of 1000     takes 0.0010s (len=2)
A length of 10000    takes 0.0160s (len=2)
A length of 100000   takes 0.1900s (len=2)
A length of 1000000  takes 1.4831s (len=2)
A length of 10000000 takes 14.7677s (len=2)
# Linear

Running experiments with repeated occurrences of 'a. '.
A length of 100      takes 0.0060s (len=101)
A length of 1000     takes 0.0170s (len=1001)
A length of 10000    takes 0.1841s (len=10001)
A length of 100000   takes 2.5929s (len=100001)
...
# Non-linear (I waited several minutes)

Running experiments with repeated occurrences of 'abc. '.
A length of 100      takes 0.0100s (len=200)
A length of 1000     takes 0.0509s (len=2000)
A length of 10000    takes 0.5420s (len=20000)
A length of 100000   takes 6.8013s (len=200000)
...
# Non-linear (I waited several minutes)

As can be seen here, the problems with 'a.' and 'abc.' have been solved since NLTK 3.6.5: those runs now show O(n) behaviour, although slightly worse than 10x as slow for 10x as much data. However, 'a. ' and 'abc. ' (with a trailing space) have regressed to non-linear behaviour.

New efficiency

Running experiments with repeated occurrences of 'a'.
A length of 100      takes 0.0060s (len=1)
A length of 1000     takes 0.0010s (len=1)
A length of 10000    takes 0.0030s (len=1)
A length of 100000   takes 0.0320s (len=1)
A length of 1000000  takes 0.3006s (len=1)
A length of 10000000 takes 3.3780s (len=1)
# Linear

Running experiments with repeated occurrences of 'a '.
A length of 100      takes 0.0010s (len=100)
A length of 1000     takes 0.0010s (len=1000)
A length of 10000    takes 0.0080s (len=10000)
A length of 100000   takes 0.1020s (len=100000)
A length of 1000000  takes 0.7950s (len=1000000)
A length of 10000000 takes 9.4390s (len=10000000)
# Linear

Running experiments with repeated occurrences of 'abc '.
A length of 100      takes 0.0210s (len=100)
A length of 1000     takes 0.0020s (len=1000)
A length of 10000    takes 0.0160s (len=10000)
A length of 100000   takes 0.1920s (len=100000)
A length of 1000000  takes 1.7082s (len=1000000)
A length of 10000000 takes 17.7505s (len=10000000)
# Linear

Running experiments with repeated occurrences of 'a.'.
A length of 100      takes 0.2178s (len=2)
A length of 1000     takes 0.0010s (len=2)
A length of 10000    takes 0.0100s (len=2)
A length of 100000   takes 0.1360s (len=2)
A length of 1000000  takes 0.9942s (len=2)
A length of 10000000 takes 10.4847s (len=2)
# Linear

Running experiments with repeated occurrences of 'abc.'.
A length of 100      takes 0.0020s (len=2)
A length of 1000     takes 0.0020s (len=2)
A length of 10000    takes 0.0170s (len=2)
A length of 100000   takes 0.2130s (len=2)
A length of 1000000  takes 1.7320s (len=2)
A length of 10000000 takes 18.1585s (len=2)
# Linear

Running experiments with repeated occurrences of 'a. '.
A length of 100      takes 0.0060s (len=101)
A length of 1000     takes 0.0180s (len=1001)
A length of 10000    takes 0.2560s (len=10001)
A length of 100000   takes 1.6629s (len=100001)
A length of 1000000  takes 16.1289s (len=1000001)
...
# Linear

Running experiments with repeated occurrences of 'abc. '.
A length of 100      takes 0.0170s (len=200)
A length of 1000     takes 0.1070s (len=2000)
A length of 10000    takes 0.6226s (len=20000)
A length of 100000   takes 5.0704s (len=200000)
A length of 1000000  takes 50.7210s (len=2000000)
...
# Linear

All of these computation times are linear in the size of the input, which is satisfactory. Unlike on NLTK 3.6.5, the first experiment (repeated 'a') now completes, thanks to #2869. Beyond that, there is perhaps a small speedup for some cases (e.g. 16s -> 14s for abc) and perhaps some slowdown for others (e.g. 14s -> 16s for 'a. '), but I can't say that conclusively without some (overkill) significance testing.
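If a more robust comparison were needed, something like timeit.repeat would reduce the noise of single measurements (a sketch, not part of this PR):

import timeit
from nltk import word_tokenize

text = "a. " * 100_000

# Best-of-five wall-clock time is less noisy than a single measurement.
times = timeit.repeat(lambda: word_tokenize(text), number=1, repeat=5)
print(f"best of 5: {min(times):.4f}s")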

To conclude, there is no longer any non-linear time complexity in these tests.

sent_tokenize efficiency summary

To generate the results, we use a script very similar to the one used for word_tokenize:

from nltk import sent_tokenize
import time

n = 8
for element in ["a", "a ", "abc ", "a.", "abc.", "a. ", "abc. "]:
    print(f"Running experiments with repeated occurrences of {element!r}.")
    for length in [10**i for i in range(2, n)]:
        text = element * length
        start_t = time.time()
        out = sent_tokenize(text)
        print(f"A length of {length:<{n}} takes {time.time() - start_t:.4f}s (len={len(out)})")
    print()

This script creates an input string for sent_tokenize by repeating element some number of times. It is run for NLTK 3.6.5, NLTK 3.7, and after this PR. Note that some variation is normal, as I'm only running each configuration once.

Baseline efficiency (NLTK 3.6.5)

Running experiments with repeated occurrences of 'a'.
A length of 100      takes 0.0050s (len=1)
A length of 1000     takes 0.0060s (len=1)
A length of 10000    takes 0.6029s (len=1)
...
# Non-linear

Running experiments with repeated occurrences of 'a '.
A length of 100      takes 0.0050s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0000s (len=1)
A length of 100000   takes 0.0060s (len=1)
A length of 1000000  takes 0.0840s (len=1)
A length of 10000000 takes 0.6043s (len=1)
# Linear

Running experiments with repeated occurrences of 'abc '.
A length of 100      takes 0.0020s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0010s (len=1)
A length of 100000   takes 0.0180s (len=1)
A length of 1000000  takes 0.1700s (len=1)
A length of 10000000 takes 1.6243s (len=1)
# Linear

Running experiments with repeated occurrences of 'a.'.
A length of 100      takes 0.0030s (len=1)
A length of 1000     takes 0.0570s (len=1)
A length of 10000    takes 5.0565s (len=1)
...
# Non-linear

Running experiments with repeated occurrences of 'abc.'.
A length of 100      takes 0.0110s (len=1)
A length of 1000     takes 0.2360s (len=1)
A length of 10000    takes 14.6740s (len=1)
...
# Non-linear

Running experiments with repeated occurrences of 'a. '.
A length of 100      takes 0.0080s (len=1)
A length of 1000     takes 0.0190s (len=1)
A length of 10000    takes 0.2060s (len=1)
A length of 100000   takes 1.7631s (len=1)
A length of 1000000  takes 12.7064s (len=1)
...
# Linear

Running experiments with repeated occurrences of 'abc. '.
A length of 100      takes 0.0070s (len=100)
A length of 1000     takes 0.0160s (len=1000)
A length of 10000    takes 0.1630s (len=10000)
A length of 100000   takes 1.4806s (len=100000)
A length of 1000000  takes 13.8376s (len=1000000)
# Linear

All of 'a', 'a.' and 'abc.' had poor (non-linear) performance on this version.

Baseline efficiency (NLTK 3.7)

Running experiments with repeated occurrences of 'a'.
A length of 100      takes 0.0060s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0010s (len=1)
A length of 100000   takes 0.0000s (len=1)
A length of 1000000  takes 0.0040s (len=1)
A length of 10000000 takes 0.0470s (len=1)
# Linear

Running experiments with repeated occurrences of 'a '.
A length of 100      takes 0.0010s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0010s (len=1)
A length of 100000   takes 0.0020s (len=1)
A length of 1000000  takes 0.0150s (len=1)
A length of 10000000 takes 0.1120s (len=1)
# Linear

Running experiments with repeated occurrences of 'abc '.
A length of 100      takes 0.0020s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0000s (len=1)
A length of 100000   takes 0.0020s (len=1)
A length of 1000000  takes 0.0260s (len=1)
A length of 10000000 takes 0.2830s (len=1)
# Linear

Running experiments with repeated occurrences of 'a.'.
A length of 100      takes 0.0030s (len=1)
A length of 1000     takes 0.0010s (len=1)
A length of 10000    takes 0.0000s (len=1)
A length of 100000   takes 0.0050s (len=1)
A length of 1000000  takes 0.0520s (len=1)
A length of 10000000 takes 0.4911s (len=1)
# Linear

Running experiments with repeated occurrences of 'abc.'.
A length of 100      takes 0.0020s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0010s (len=1)
A length of 100000   takes 0.0050s (len=1)
A length of 1000000  takes 0.0550s (len=1)
A length of 10000000 takes 0.5165s (len=1)
# Linear

Running experiments with repeated occurrences of 'a. '.
A length of 100      takes 0.0050s (len=1)
A length of 1000     takes 0.0120s (len=1)
A length of 10000    takes 0.1700s (len=1)
A length of 100000   takes 2.2183s (len=1)
...
# Non-linear (I waited for approximately 15 minutes here)

Running experiments with repeated occurrences of 'abc. '.
A length of 100      takes 0.0070s (len=100)
A length of 1000     takes 0.0170s (len=1000)
A length of 10000    takes 0.1736s (len=10000)
A length of 100000   takes 3.3012s (len=100000)
...
# Non-linear

Here, 'a. ' and 'abc. ' (with a trailing space) are non-linear, which is indicative of the performance issues people have been reporting.
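A minimal reproduction of that regression, based on the timings above (on NLTK 3.7 the 'a. ' case did not finish within roughly 15 minutes; after this PR it takes around 15 seconds):

from nltk import sent_tokenize

text = "a. " * 1_000_000
print(len(sent_tokenize(text)))  # prints 1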

New efficiency

Running experiments with repeated occurrences of 'a'.
A length of 100      takes 0.0070s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0000s (len=1)
A length of 100000   takes 0.0010s (len=1)
A length of 1000000  takes 0.0040s (len=1)
A length of 10000000 takes 0.0430s (len=1)
# Linear

Running experiments with repeated occurrences of 'a '.
A length of 100      takes 0.0020s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0010s (len=1)
A length of 100000   takes 0.0010s (len=1)
A length of 1000000  takes 0.0130s (len=1)
A length of 10000000 takes 0.1480s (len=1)
# Linear

Running experiments with repeated occurrences of 'abc '.
A length of 100      takes 0.0020s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0010s (len=1)
A length of 100000   takes 0.0040s (len=1)
A length of 1000000  takes 0.0230s (len=1)
A length of 10000000 takes 0.2340s (len=1)
# Linear

Running experiments with repeated occurrences of 'a.'.
A length of 100      takes 0.0030s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0000s (len=1)
A length of 100000   takes 0.0060s (len=1)
A length of 1000000  takes 0.0480s (len=1)
A length of 10000000 takes 0.4950s (len=1)
# Linear

Running experiments with repeated occurrences of 'abc.'.
A length of 100      takes 0.0010s (len=1)
A length of 1000     takes 0.0000s (len=1)
A length of 10000    takes 0.0010s (len=1)
A length of 100000   takes 0.0050s (len=1)
A length of 1000000  takes 0.0585s (len=1)
A length of 10000000 takes 0.5450s (len=1)
# Linear

Running experiments with repeated occurrences of 'a. '.
A length of 100      takes 0.0060s (len=1)
A length of 1000     takes 0.0170s (len=1)
A length of 10000    takes 0.1670s (len=1)
A length of 100000   takes 1.4250s (len=1)
A length of 1000000  takes 14.6518s (len=1)
...
# Linear

Running experiments with repeated occurrences of 'abc. '.
A length of 100      takes 0.0090s (len=100)
A length of 1000     takes 0.0180s (len=1000)
A length of 10000    takes 0.2020s (len=10000)
A length of 100000   takes 1.8661s (len=100000)
A length of 1000000  takes 18.4968s (len=1000000)
...
# Linear

Again, all of the timings appear linear now. Performance seems to be as good as on NLTK 3.6.5 and before, or even better (e.g. for 'a').

To conclude, the time-efficiency is as good as, or better than, it has ever been.

In short

It seems like this PR should:

- fix the accuracy regressions of sent_tokenize introduced in NLTK 3.6.6, and
- remove the non-linear time complexity of word_tokenize and sent_tokenize for the inputs tested above,

thereby resolving #3013, #2981 and #2934.

I'm open to receiving new test cases to help improve the robustness of both word_tokenize and sent_tokenize. The goal of this PR was to make both of these functions run in linear time; future PRs could then try to improve the efficiency of certain sections of code to decrease the "slope" of that linear increase.

Thanks to @12mohaned for pointing me to #2934, although I was too busy to work on it at the time. I also want to thank @radcheb, @juhoinkinen and @ViktorDrugge for raising their respective issues, and those who contributed in them.

  • Tom Aarsen

@tomaarsen tomaarsen self-assigned this Jun 21, 2022
@osma

osma commented Jun 22, 2022

Whoa! This must be the most awesome PR description I've ever seen. Not only is this excellent and very timely work in itself, it's also extremely well described and analyzed from many angles. Kudos!

@stevenbird stevenbird merged commit 86b11fb into nltk:develop Jul 4, 2022
@stevenbird
Member

Wonderful work @tomaarsen!

@tomaarsen tomaarsen deleted the perf/tokenize branch July 4, 2022 09:39