Tackle performance and accuracy regression of sentence tokenizer since NLTK 3.6.6 #3014
Resolves #3013, resolves #2981 and resolves #2934.
Hello!
Pull request overview

- Fix the accuracy of `sent_tokenize`, i.e. resolve the issues from "Potential bug in sentence tokenizer since 3.6.6" (#2934).
- Fix the performance of `sent_tokenize` (and the underlying PunktTokenizer), as reported in "Performance regression in sent_tokenize" (#3013).
- Fix the performance of `word_tokenize`, which relies on `sent_tokenize` whenever `preserve_line=False`. Issues with this performance were reported in "TreebankWordTokenizer potentially slower tokenizing text going from 3.6 to 3.7" (#2981).
- Add tests for `punkt.py` methods.

Changes
The primary change is a complete overhaul of the `_match_potential_end_contexts` method introduced in #2869. That method was introduced to combat a nasty ReDoS for long words, but it unknowingly introduced some other issues. For example, the `rsplit` in the following line removed the whitespace that it was splitting on, which caused `. .` to be parsed equivalently to `..` in some cases:

nltk/nltk/tokenize/punkt.py, line 1380 at 6de8254
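The failure mode is easy to reproduce in isolation. The following is a minimal illustration, not the actual punkt code: `str.rsplit` discards the whitespace it splits on, so once the pieces are examined without their separators, `. .` and `..` collapse into the same thing.

```python
import re

# Minimal illustration of the bug (not the actual punkt code):
# str.rsplit discards the whitespace it splits on.
spaced = "Hello. . World"
tight = "Hello.. World"

assert spaced.rsplit(" ") == ["Hello.", ".", "World"]

# Without the lost separators, the two inputs become indistinguishable:
assert "".join(spaced.rsplit(" ")) == "Hello..World"
assert "".join(tight.rsplit(" ")) == "Hello..World"

# Keeping the separator, e.g. via a capturing group in re.split,
# preserves the distinction:
assert re.split(r"(\s)", spaced) == ["Hello.", " ", ".", " ", "World"]
```

The actual fix in this PR restructures `_match_potential_end_contexts` more thoroughly; this snippet only demonstrates why splitting away the whitespace was lossy.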
Furthermore, the `text[: match.start()]` slice ends up being quite expensive with enormous strings. The new implementation also removes the need to fully compute this list of matches just to reverse it:

nltk/nltk/tokenize/punkt.py, line 1375 at 6de8254
Now, we can simply iterate over the iterator like normal, which should also be more memory-efficient.
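As a rough sketch of the difference (simplified, hypothetical code, not the real `_match_potential_end_contexts`):

```python
import re

text = "abc. " * 1_000

# Old approach (simplified): materialize every match so the list can be
# reversed, and slice off the whole prefix before each match. Each
# text[: match.start()] copies O(n) characters, quadratic overall.
matches = list(re.finditer(r"\S+\.", text))
for match in reversed(matches):
    prefix = text[: match.start()]

# New approach (simplified): consume the match iterator lazily, front to
# back, looking only at a bounded window around each match.
for match in re.finditer(r"\S+\.", text):
    window = text[max(0, match.start() - 20) : match.end()]
```

Besides avoiding the huge prefix copies, the lazy version never holds the full match list in memory at once.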
Experiments
Accuracy
In the end, we're most interested in what effects these changes have. First of all, let's talk accuracy. The changes from NLTK 3.6.6 introduced some small issues, for which tests have now been developed:
Thank you @radcheb, @griverorz and @davidmezzetti for helping provide some of these previously broken test cases. As you can see via the CI, all of these tests now pass. These results correspond exactly with the results from NLTK 3.6.5 and before. Note that I'm open to receiving more hand-crafted test cases!
I generated approximately 20k test cases combining different types of punctuation; for some of these, the new results still differ from NLTK 3.6.5 and before. However, these were all cases with two different types of sequential punctuation, like so:
In 3.6.5, for some reason, adding a space before `a` causes the other punctuation marks to group together? The new behaviour is:
This seems to be more consistent, but it's difficult to determine what the correct result should even be, in odd cases like these. In short, I have no issues with these tiny changes relative to NLTK 3.6.5 and before.
Efficiency
As we recognize that NLTK is used for large-scale text processing, efficiency is a high priority. I've run some experiments to verify the new efficiency.
word_tokenize efficiency summary
To generate the results, we use the following simple script:
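(The script itself didn't survive this capture of the PR; a rough reconstruction of its shape might look like the following, where the stand-in `tokenize` would really be `nltk.word_tokenize` — swapped out here so the sketch runs without NLTK or its punkt models installed.)

```python
import time

def tokenize(text):
    # Stand-in for nltk.word_tokenize, so this sketch runs without NLTK
    # or the punkt models installed.
    return text.split()

element = "abc. "  # the repeated unit; other units like "a" and "a." were also benchmarked
for n in (10_000, 100_000, 1_000_000):
    text = element * n
    start = time.perf_counter()
    tokens = tokenize(text)
    print(f"{n:>9} repetitions: {time.perf_counter() - start:.4f}s")
```

Plotting (or eyeballing) the reported times against `n` is what the "linear or not" comments below refer to.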
This creates an input string for `word_tokenize` by repeating `element` some number of times. It is run for NLTK 3.6.5, NLTK 3.7, and after this PR. Note that some variation is normal, as I'm only running it once.

Baseline efficiency (NLTK 3.6.5)
The comments were added manually, based on whether the performance appears to be linear or not.
As described in #2869, a really long single word caused a serious ReDoS, so the first experiment with just repeated `a` had to be terminated manually. Furthermore, `a.` and `abc.` also seem to cause worse-than-linear time complexities. Lastly, `a.` and `abc.` seem to be linear again, but just slow.

Baseline efficiency (NLTK 3.7)
As can be seen here, `a.` and `abc.` seem to have been solved since NLTK 3.6.5. The tests for `a.` and `abc.` again show O(n), but slightly worse than 10x as slow for 10x as much data.

New efficiency
All of these computation times are linear in the size of the input, which is satisfactory. The first experiment now works, unlike for NLTK 3.6.5, thanks to #2869. Beyond that, there is perhaps a small speedup for some (e.g. 16s -> 14s for `abc`), and perhaps some slowdown for others (e.g. 14s -> 16s for `a.`), but I can't conclusively say that without some (overkill) significance testing.

To conclude, there is no longer any non-linear time complexity in these tests.
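The "linear or not" judgment above can also be made less eyeball-based with a quick scaling check: time the tokenizer at n and 10n repetitions and compare the ratio. A sketch, again with a hypothetical stand-in tokenizer rather than NLTK itself:

```python
import time

def timed(tokenize, text):
    # Return the wall-clock time taken to tokenize `text`.
    start = time.perf_counter()
    tokenize(text)
    return time.perf_counter() - start

def linear_tokenize(text):
    # Stand-in O(n) tokenizer; the real benchmark would time
    # nltk.word_tokenize instead.
    return text.split()

small = "abc. " * 100_000
large = "abc. " * 1_000_000

# For 10x the input, an O(n) tokenizer should take roughly 10x as long;
# a ratio near 100x would instead point at O(n^2) behaviour.
ratio = timed(linear_tokenize, large) / timed(linear_tokenize, small)
print(f"time(10n) / time(n) = {ratio:.1f}")
```

Timing noise makes single-run ratios fuzzy (hence the significance-testing caveat above), but a ~10x vs ~100x gap is usually unambiguous.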
sent_tokenize efficiency summary
To generate the results, we use a script very similar to the one for `word_tokenize`:
This creates an input string for `sent_tokenize` by repeating `element` some number of times. It is run for NLTK 3.6.5, NLTK 3.7, and after this PR. Note that some variation is normal, as I'm only running it once.

Baseline efficiency (NLTK 3.6.5)
All of `a.`, `abc.` and `a` had poor performance for this version.

Baseline efficiency (NLTK 3.7)
Here, `a.` and `abc.` are non-linear, which is indicative of the performance issues people have been reporting.

New efficiency
Again, all the performances seem to be linear now. The performance seems to be as good as in NLTK 3.6.5 and before, or even better (e.g. for `a`).

To conclude, the performance in terms of time efficiency is as good as, or better than, it has ever been.
In short
It seems like this PR should:
I'm open to receiving new test cases to help improve the robustness of both `word_tokenize` and `sent_tokenize`. The goal of this PR was to make both of these functions linear in time complexity. Future PRs could then try to improve the efficiency of certain sections of code to decrease the "slope" of the linear increase.

Thanks to @12mohaned for pointing me to #2934, although I was too busy to work on it at the time. I also want to thank @radcheb, @juhoinkinen and @ViktorDrugge for raising their respective issues, and those who contributed to them.