Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regular Expression Denial of Service (ReDoS) from RegexpTagger #2929

Closed
sarathsund opened this issue Jan 13, 2022 · 5 comments
Closed

Regular Expression Denial of Service (ReDoS) from RegexpTagger #2929

sarathsund opened this issue Jan 13, 2022 · 5 comments

Comments

@sarathsund
Copy link

Hi NLTK team,

Recently when I try to run security scans for nltk package (3.6.7), below is the exception posted by twistlock

nltk package from all versions is vulnerable to Regular Expression Denial of Service (ReDoS). ^-?[0-9]+(.[0-9]+)?$ groups [0-9]+(.[0-9]+) match each other, which causes a nasty backtracking in case of failure. If the attacker succeeds to use a malicious payload against RegexpTagger used in function get_pos_tagger and malt_regex_tagger, it will cause a nasty DoS.

Files involving the vulnerability.
glue.py
malt.py
sequential.py

could I get some support on this issue and more details ? Also I am not very sure how to reproduce this issue.

@tomaarsen
Copy link
Member

@sarathsund Hello!

I believe this has been covered and solved in #2906. This PR has been included in NLTK 3.6.7, so I'm unsure how you were able to still find the vulnerability in this issue. Are you certain that you're using NLTK 3.6.7 for this?

Obviously, it is possible that the fix was not sufficient, or that only a part of the ReDoS was fixed.

Tom Aarsen

@sarathsund
Copy link
Author

Hi @tomaarsen

Thanks for the quick check..

Yes I was using the 3.6.5 previously and I upgraded to 3.6.7 after checking the solved #2906

This is the latest scan results from today. Not very sure why PRISMA error still shows. (PRISMA-2021-0204)
image

Sarath

@tomaarsen
Copy link
Member

tomaarsen commented Jan 13, 2022

I'm not quite sure either. Perhaps the tool fails to take into account the r before the string in

r"^-?[0-9]+(\.[0-9]+)?$"

If this is the case, then the string is not considered "raw", and is converted to:

^-?[0-9]+(.[0-9]+)?$

(Note, no \ before the dot)

However, with my knowledge of regexes I don't see a vulnerability in the current code. [0-9]+ and (\.[0-9]+) do not overlap (which is what causes the vulnerability, as it creates 2 ways to match e.g. a 3, which can cause polynomial or exponential backtracking). The \. ensures that a literal . is matched, which causes these groups to not overlap.

It seems that your tool does not see \., but . instead, which means "any character". If that was the case, then there would be a vulnerability.

@sarathsund
Copy link
Author

Hi @tomaarsen

once again thanks for the quick support. I will check with twistlock prisma team on this..

Thanks
Sarath

@sjurgis
Copy link

sjurgis commented Jan 19, 2022

It seems this fix might have caused this issue #2931

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants