Regular Expression Denial of Service (ReDoS) from RegexpTagger #2929

sarathsund · 2022-01-13T16:33:53Z

Hi NLTK team,

Recently, when I ran security scans on the nltk package (3.6.7), Twistlock reported the following finding:

nltk package from all versions is vulnerable to Regular Expression Denial of Service (ReDoS). ^-?[0-9]+(.[0-9]+)?$ groups [0-9]+(.[0-9]+) match each other, which causes a nasty backtracking in case of failure. If the attacker succeeds to use a malicious payload against RegexpTagger used in function get_pos_tagger and malt_regex_tagger, it will cause a nasty DoS.

Files involved in the vulnerability:
glue.py
malt.py
sequential.py

Could I get some support on this issue and more details? I am also not sure how to reproduce it.

tomaarsen · 2022-01-13T16:42:41Z

@sarathsund Hello!

I believe this has been covered and solved in #2906. This PR has been included in NLTK 3.6.7, so I'm unsure how you were able to still find the vulnerability in this issue. Are you certain that you're using NLTK 3.6.7 for this?

Obviously, it is possible that the fix was not sufficient, or that only a part of the ReDoS was fixed.

Tom Aarsen

sarathsund · 2022-01-13T17:17:53Z

Hi @tomaarsen

Thanks for the quick check.

Yes, I was previously using 3.6.5, and I upgraded to 3.6.7 after seeing that #2906 was resolved.

These are the latest scan results from today. I'm not sure why the PRISMA error (PRISMA-2021-0204) still shows.

Sarath

tomaarsen · 2022-01-13T17:37:11Z

I'm not quite sure either. Perhaps the tool fails to take into account the r prefix on the string in

r"^-?[0-9]+(\.[0-9]+)?$"

If this is the case, then the tool does not treat the string as "raw", and reads it as:

^-?[0-9]+(.[0-9]+)?$

(Note, no \ before the dot)
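To make the difference concrete, here is a minimal sketch of the two strings. The `.replace` call only models the guess above that the scanner loses the backslash; it is an assumption about the tool, not something confirmed in this thread, and the names `RAW`, `EXPLICIT`, and `STRIPPED` are just illustrative.

```python
# The pattern as quoted above: the r prefix keeps the backslash in the string.
RAW = r"^-?[0-9]+(\.[0-9]+)?$"

# The same pattern written without the r prefix needs a doubled backslash.
EXPLICIT = "^-?[0-9]+(\\.[0-9]+)?$"

# Assumption: model a scanner that drops the backslash when reading the literal.
STRIPPED = RAW.replace("\\", "")

print(RAW == EXPLICIT)  # True
print(STRIPPED)         # ^-?[0-9]+(.[0-9]+)?$
```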

However, as far as I understand regexes, I don't see a vulnerability in the current code. [0-9]+ and (\.[0-9]+) do not overlap. Overlap is what causes this kind of vulnerability: it creates two ways to match, e.g., a 3, which can lead to polynomial or exponential backtracking. The \. ensures that a literal . is matched, which keeps these groups from overlapping.

It seems that your tool does not see \., but . instead, which means "any character". If that were the case, then there would be a vulnerability.
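For anyone who wants to reproduce this locally, here is a rough sketch that compares the two variants using only the standard `re` and `time` modules. The helper `seconds_to_reject` and the payload (a run of digits followed by `!`) are hypothetical choices for illustration, not taken from the scanner report. Timings depend on the machine, so none are shown here.

```python
import re
import time

# Pattern as quoted in this thread (literal dot).
ESCAPED = re.compile(r"^-?[0-9]+(\.[0-9]+)?$")
# Pattern as the scanner appears to report it (dot matches any character).
UNESCAPED = re.compile(r"^-?[0-9]+(.[0-9]+)?$")


def seconds_to_reject(pattern, n_digits):
    """Time how long the pattern takes to reject n_digits digits followed by '!'."""
    payload = "1" * n_digits + "!"
    start = time.perf_counter()
    result = pattern.match(payload)
    elapsed = time.perf_counter() - start
    assert result is None  # both variants reject this payload
    return elapsed


# With a literal dot, "x" cannot act as the separator; with a bare dot it can.
print(ESCAPED.match("1x5"))                # None
print(UNESCAPED.match("1x5") is not None)  # True

# Compare how rejection time grows as the payload gets longer.
for n in (500, 1000, 2000):
    print(n, seconds_to_reject(ESCAPED, n), seconds_to_reject(UNESCAPED, n))
```

With the unescaped dot, a digit can be consumed by the first [0-9]+, by the dot, or by the second [0-9]+, so the rejection time for the second variant should grow noticeably faster with payload length than for the first.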

sarathsund · 2022-01-13T20:51:45Z

Hi @tomaarsen

Once again, thanks for the quick support. I will check with the Twistlock Prisma team on this.

Thanks
Sarath

sjurgis · 2022-01-19T22:32:03Z

It seems this fix might have caused issue #2931.

sarathsund closed this as completed Jan 13, 2022

sjurgis mentioned this issue Jan 19, 2022

nltk-3.5 build fails in docker and python 2.7.18 due to regex-2022.1.18 #2931

Closed
