Denial of service with malformed file #1586

Closed
Google-Autofuzz opened this issue Oct 26, 2020 · 2 comments · Fixed by #1594

@Google-Autofuzz

Running the following code with the latest git version of Pygments on the attached input results in 100% CPU consumption for an arbitrarily long time:

import sys

import pygments
import pygments.formatters
import pygments.lexers

with open(sys.argv[1], 'rb') as f:
    data = f.read()
    lexer = pygments.lexers.guess_lexer(str(data))
    pygments.highlight(str(data), lexer, pygments.formatters.HtmlFormatter())

timeout-9a00111e78b5cd0979a370fc9a5cd22e39a249e4.txt
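
One detail of the reproduction script that is easy to miss: calling str() on a bytes object without an encoding does not decode it. It returns the bytes repr, so the lexer receives text that starts with b' and contains escape sequences such as \n as two literal characters. The sketch below only illustrates that Python behavior; the sample bytes are made up and are not taken from the attached file.

```python
# Made-up sample input, standing in for the contents of a file read in 'rb' mode.
data = b'print "a"\nprint "b"\n'

# str() on bytes yields the repr: a single line with literal backslash-n pairs.
as_repr = str(data)

# Decoding yields the actual text, with real newline characters.
as_text = data.decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(as_repr)
    print(as_text)
```

Whether the hang also reproduces with decoded input is not stated in this thread, so the reproduction above is best run exactly as posted.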

@Anteru Anteru self-assigned this Oct 26, 2020
@Anteru Anteru added this to the 2.7.3 milestone Oct 26, 2020
@kurtmckee
Contributor

The sample input file causes Pygments to guess that this should be parsed by the SspLexer.

The SspLexer is a delegating lexer that combines two lexers: XmlLexer (which does not choke on the input file) and JspRootLexer. JspRootLexer includes regex patterns from the JavaLexer (which, on its own, also does not choke on the input file). However, when the JspRootLexer hands text off to the JavaLexer, the quotes appear to be mismatched, and the JavaLexer runs into catastrophic backtracking in its string literal regex.
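
For readers unfamiliar with catastrophic backtracking, here is a minimal sketch using only the standard library. Both patterns are illustrative assumptions and are not claimed to be the regex used by the JavaLexer. The first lets a backslash be consumed either as part of an escape or as an ordinary character, so an unterminated string full of backslashes can be split in exponentially many ways before the match fails. The second gives every character exactly one way to match.

```python
import re
import time

# Illustrative patterns only; not claimed to be the actual JavaLexer regex.
# AMBIGUOUS: a backslash can match via an escape alternative or via [^"].
AMBIGUOUS = re.compile(r'"(\\\\|\\"|[^"])*"')
# UNAMBIGUOUS: a backslash can only start an escape; [^"\\] excludes it.
UNAMBIGUOUS = re.compile(r'"(\\.|[^"\\])*"')


def unterminated(pairs):
    """Return an opening quote followed by `pairs` backslash pairs, never closed."""
    return '"' + '\\\\' * pairs


def time_match(pattern, text):
    """Return (match result, elapsed seconds) for pattern.match(text)."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Keep the sizes small: the ambiguous pattern grows exponentially.
    for pairs in (6, 8, 10, 12):
        _, slow = time_match(AMBIGUOUS, unterminated(pairs))
        _, fast = time_match(UNAMBIGUOUS, unterminated(pairs))
        print(f"{pairs} pairs: ambiguous {slow:.5f}s, unambiguous {fast:.5f}s")
```

On well-formed string literals both patterns match the same text; they differ only in how much work a failed match costs.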

I used the code below to determine where in the file the JspRootLexer is stalling. It's happening at line 115, right after these tokens:

(3690, Token.Name, 'o')
(3691, Token.Literal.String, '"; print $$0 "')
(3705, Token.Name, 'c')

The code I used was:

import pygments.lexers.templates

with open('timeout-9a00111e78b5cd0979a370fc9a5cd22e39a249e4.txt', 'rb') as f:
    data = f.read()

lexer = pygments.lexers.templates.JspRootLexer()

for i, t, v in lexer.get_tokens_unprocessed(str(data)):
    print((i, t, v))
    if i == 3705:
        breakpoint()

After stepping forward in the code for a while, I discovered that everything was hanging at pygments.lexer.RegexLexer.get_tokens_unprocessed():625. I added a print() statement just before that line and re-ran the code above, which helped me identify that it's the regex for string literals in the JavaLexer.

I've expanded that single-line regex into a new lexer state named "string", which resolves the catastrophic backtracking and allows the code provided by the reporter to run without hanging.
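
The idea of replacing one large string regex with a dedicated state can be sketched without Pygments. The rule table and the tokenize helper below are hypothetical and only imitate the general shape of a state-based regex lexer; this is not the code from the actual fix. Each rule in the "string" state consumes one simple piece of input, so no single regex has to find the closing quote by itself.

```python
import re

# Hypothetical rule table: state name -> list of (pattern, token name, state change).
# A state change of None keeps the current state; "#pop" returns to the previous one.
RULES = {
    "root": [
        (re.compile(r'"'), "String", "string"),    # opening quote: enter "string"
        (re.compile(r'[^"]+'), "Text", None),      # everything else
    ],
    "string": [
        (re.compile(r'[^\\"]+'), "String", None),  # run of ordinary characters
        (re.compile(r'\\.'), "String", None),      # backslash plus one character
        (re.compile(r'\\'), "String", None),       # backslash with nothing to escape
        (re.compile(r'"'), "String", "#pop"),      # closing quote: leave "string"
    ],
}


def tokenize(text):
    """Yield (index, token name, value) tuples using the rule table above."""
    stack = ["root"]
    pos = 0
    while pos < len(text):
        for pattern, token, change in RULES[stack[-1]]:
            match = pattern.match(text, pos)
            if match:
                yield pos, token, match.group()
                pos = match.end()
                if change == "#pop":
                    stack.pop()
                elif change is not None:
                    stack.append(change)
                break
        else:
            # No rule matched: emit one character as an error and move on.
            yield pos, "Error", text[pos]
            pos += 1


if __name__ == "__main__":
    for item in tokenize('x = "a\\"b" + y'):
        print(item)
```

An unterminated string simply leaves the tokenizer in the "string" state at the end of the input, which costs linear time instead of triggering a backtracking search.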

I'm working on a unit test for this and then I can submit a PR to close this issue.

kurtmckee added a commit to kurtmckee/pr-pygments that referenced this issue Nov 9, 2020
Anteru pushed a commit that referenced this issue Nov 9, 2020
* JavaLexer: Demonstrate a catastrophic backtracking bug

* JavaLexer: Fix a catastrophic backtracking bug

Closes #1586
@Anteru
Collaborator

Anteru commented Nov 9, 2020

Thanks a lot for the fix!

@Anteru Anteru added the changelog-update Items which need to get mentioned in the changelog label Nov 9, 2020
@Anteru Anteru removed the changelog-update Items which need to get mentioned in the changelog label Dec 5, 2020