Resolved serious ReDoS in PunktSentenceTokenizer #2869
Merged
Resolves #2866
Hello!
## Pull request overview
- Resolved a serious ReDoS (regular expression denial of service) in `PunktSentenceTokenizer`, which affects both `sent_tokenize` and `word_tokenize`.

## The consequences of the ReDoS
The ReDoS is described in #2866, where an example of a malicious string is given. The consequences can be seen with a small timing script.
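A minimal sketch of such a script, assuming (as explained under "Potential malicious inputs" below) that the malicious input is simply a long run of a single character:

```python
import time

from nltk.tokenize import word_tokenize

# On the vulnerable version, each doubling of the input length
# roughly quadruples the runtime.
for n in (5_000, 10_000, 20_000):
    malicious = "a" * n  # contains no '.', '?' or '!' at all
    start = time.perf_counter()
    word_tokenize(malicious)
    print(f"{n:>6} characters: {time.perf_counter() - start:.2f} seconds")
```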
The timings grow quadratically with the input length; I canceled the execution of the program after it had been running for several hours.
## Potential malicious inputs
Any long sequence of non-whitespace characters other than `.`, `?` or `!` is malicious input for this regex. A quick example is simply a long repetition of a single character: it takes `word_tokenize` 56.3322 seconds to tokenize one such input.

## The details of the ReDoS
The ReDoS is caused by this regex (nltk/nltk/tokenize/punkt.py, lines 268 to 278 at ad3c84c):
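This is the `_period_context_fmt` format string, quoted here for convenience:

```python
_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        \s+(?P<next_tok>\S+)     # or whitespace and some other token
    ))"""
```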
After being formatted with the default values, this becomes roughly the following (whitespace and comments stripped, as the pattern is compiled in verbose mode):
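```
\S*[.?!](?=(?P<after_tok>(?:[?!)\";}\]\*:@\'\({\[])|\s+(?P<next_tok>\S+)))
```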
The vulnerability is already at the very start of the regex. Even this minimal form, keeping only the default sentence-ending characters, is vulnerable:
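```
\S*[.?!]
```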
The issue is that the Python regex engine will start matching tokens with `\S*` (i.e. any amount of non-whitespace), and only recognizes that it failed to find a `.`, `?` or `!` when it reaches a whitespace character or the end of the input. So, with an input of a thousand `"a"` characters, the regex engine will start matching from the very first `"a"`, and only recognize that it cannot match this way after consuming all 1000 `"a"`'s. Then, it will backtrack all the way back to the first `"a"`, skip it, and try the entire thing again from the second `"a"`, now matching 999 characters before realising this won't work either. This continues until the engine finally concludes that the regex does not match the string at all.

Given an input string of length `n`, this takes `(n^2 + n) / 2` steps. In other words, `word_tokenize` becomes at least O(n^2) in time complexity, while O(n) should be reachable for most regex problems.
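The quadratic blow-up is easy to observe in isolation with just the reduced pattern from above (a sketch; absolute timings will vary by machine):

```python
import re
import time

vulnerable = re.compile(r"\S*[.?!]")

for n in (2_000, 4_000, 8_000):
    start = time.perf_counter()
    assert vulnerable.search("a" * n) is None  # never matches: worst case
    print(f"n={n}: {time.perf_counter() - start:.3f} seconds")
# Each doubling of n roughly quadruples the time, matching (n^2 + n) / 2.
```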
## The fix
The fix seems elementary:

- Remove `\S*` from the regex; then the regex engine will only start trying to match when it actually finds a potential end-of-sentence token (i.e. `.`, `!` or `?` by default).
- For each match, apply `\S*$` to the text before the match to recover the final word before the end-of-sentence token.

And this works very well, with one small but crucial detail:
`re.finditer` will find all non-overlapping matches, and we've now reduced the size of each match, so positions that used to be swallowed by one large match can now yield several smaller ones. See the consequences in the example below:
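A sketch of the difference, using simplified stand-ins for the pattern (the exact lookahead in nltk differs slightly; the test sentence is just an illustration):

```python
import re

# OLD: the pattern on the develop branch (word material included).
OLD = re.compile(r"\S*[.?!](?=(?:[?!)\";}\]\*:@\'\({\[])|\s+\S+)")
# NEW: the pattern after removing \S* (only the end-of-sentence
# character itself is matched).
NEW = re.compile(r"[.?!](?=(?:[?!)\";}\]\*:@\'\({\[])|\s+\S+)")

text = "Very bad acting!!! I promise."
print([m.group() for m in OLD.finditer(text)])  # ['acting!!!']
print([m.group() for m in NEW.finditer(text)])  # ['!', '!', '!']
```

Re-attaching the preceding word with `\S*$` would turn those three small matches into the overlapping contexts `acting!`, `acting!!` and `acting!!!`.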
In short, we need to manually remove these overlapping cases. This is what the new `_match_potential_end_contexts` method is for; I can't come up with a regex-only solution for this problem.
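The idea behind that method can be sketched as follows (a standalone approximation, not the exact implementation):

```python
import re

def match_potential_end_contexts(period_context_re, text):
    """Pair each match of the reduced regex with the word before it,
    dropping matches that fall inside an already-kept context."""
    before_start = len(text)  # start of the most recently kept context
    kept = []
    # Walk the matches right to left, so the start of the previously
    # kept context is always known.
    for match in reversed(list(period_context_re.finditer(text))):
        if match.end() > before_start:
            continue  # overlaps the context we already kept
        # Re-attach the word material directly before the match.
        word = re.search(r"\S*$", text[: match.start()]).group()
        before_start = match.start() - len(word)
        kept.append((match, word + match.group()))
    return kept[::-1]
```

Walking right to left means each match only has to be compared against the single context kept immediately to its right, which keeps the bookkeeping simple.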
## Results
Upon running the script again, but now with the fix in place, the performance is significantly better, and much more in tune with the expected O(n).
Beyond that, the performance on normal sentences (i.e. with periods) is in line with the original performance. If all is well, only the performance is (positively) impacted.
## Note
I believe that this PR does not require re-training of the Punkt models.