Resolve ReDoS opportunity by fixing incorrectly specified regex #2906
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Hello!
Pull request overview
The regex
The original regex was
^-?[0-9]+(.[0-9]+)?$
, which is vulnerable to a ReDoS due to[0-9]+
followed by any character (due to.
), and then again[0-9]+
. However, this original regex does not seem to match what it was intended for: Matching cardinal numbers. The regex seems to want to match:-
.
followed by a non-zero amount of numbers.This would match e.g.
3
,-4
and3.12
. However, the regex actually does not match a dot directly, it matches any number. This means that-3a12
is also allowed, which we would rather not have. I believe the original designer of the regex meant to use\.
, but used.
instead, and did not realise the small mistake.ReDoS reproduction:
outputs
After several minutes I stopped the execution.
The fix
Simply adding a
\
before the.
should fix the problem.The consequences
The regex will no longer be vulnerable to a ReDoS. Beyond that,
-3a12
will no longer be considered a cardinal number in the RegexpTagger examples throughout NLTK. This is desirable behaviour, with one small subjective exception:3e12
will no longer be considered a cardinal number, even though one could argue that it is.If we want to still consider it one, then we can of course go for
^-?[0-9]+([e\.][0-9]+)?$
, but I really don't think it matters.ReDoS reproduction
now outputs
As is correct for a simple regex, the execution time of the regex is linear in the length of the input. This was not the case before.
Thanks to @Scara31 for making us aware of this vulnerability through proper means.