Unclosed script block causes extremely poor HTML lexing performance #1614

Closed
gerner opened this issue Nov 27, 2020 · 4 comments

Comments


gerner commented Nov 27, 2020

First: Thanks to the maintainers and community for all the hard work on Pygments!

An HTML file with a large, unclosed script block on a single line causes very poor lexing performance, only to spit out a stream of Error tokens. This isn't too bad if the line is short (less than a few KB), but for long lines (dozens of KB or more) it becomes extremely slow; I believe the cost is O(n^2) in the line length.

For example, this snippet:

<script>hi

is pretty fast when running cat example | pygmentize -l html

but an example like

<script>hihihihi... (continuing like this for ~30KB)

is extremely slow. I think there's an O(n^2) thing going on:

# little test rig: cat spits out "<script>" followed by "hi" repeated some number of times
$ time cat <(echo -n "<script>") <(for x in $(seq 1 2000); do echo -n hi; done) | pygmentize -l html > /dev/null

real    0m0.302s
user    0m0.286s
sys     0m0.040s
$ time cat <(echo -n "<script>") <(for x in $(seq 1 4000); do echo -n hi; done) | pygmentize -l html > /dev/null

real    0m0.709s
user    0m0.683s
sys     0m0.052s
$ time cat <(echo -n "<script>") <(for x in $(seq 1 8000); do echo -n hi; done) | pygmentize -l html > /dev/null

real    0m2.368s
user    0m2.365s
sys     0m0.037s
$ time cat <(echo -n "<script>") <(for x in $(seq 1 16000); do echo -n hi; done) | pygmentize -l html > /dev/null

real    0m8.922s
user    0m8.934s
sys     0m0.042s

$ time cat <(echo -n "<script>") <(for x in $(seq 1 32000); do echo -n hi; done) | pygmentize -l html > /dev/null

real    0m36.664s
user    0m36.661s
sys     0m0.077s

Notice we double the length each time, but the time to run goes up by a factor of nearly 4.
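
The same measurement in Python, as a minimal sketch (HtmlLexer and get_tokens are the standard Pygments API; the list() call just forces the whole token stream). With the current rules this should show the same roughly 4x growth per doubling:

import time
from pygments.lexers import HtmlLexer

lexer = HtmlLexer()
for n in (2000, 4000, 8000, 16000, 32000):
    source = "<script>" + "hi" * n  # one long line, no closing tag
    start = time.perf_counter()
    list(lexer.get_tokens(source))  # force the whole token stream
    print("n=%5d: %.3fs" % (n, time.perf_counter() - start))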

I think this is a legitimate issue that needs addressing because:

  • it's not uncommon to have large, obfuscated Javascript snippets on a single line
  • it's quite possible for an incomplete snippet to show up in, e.g., a diff hunk or some other truncated source

In such cases I think we at least want to fail quickly (36 seconds of runtime for a 64KB input seems way too long just to spit out an Error token for each character), but ideally we'd parse as much as we can as Javascript.

I found this in a diff hunk with a ~400KB line with no closing </script>. My app timed out after 10 minutes.

I'll follow up with some more thoughts on why this happens and ideas for a fix.


gerner commented Nov 27, 2020

Also, happy Thanksgiving everyone.


gerner commented Nov 27, 2020

I think this regex, which matches all characters before a closing </script> tag, is moderately expensive:

https://github.com/pygments/pygments/blob/master/pygments/lexers/html.py#L79

That's not a problem if it's just eating up the body of the script block. But in the case where there is no closing </script> tag, this will interact with the lexer's loop here:

https://github.com/pygments/pygments/blob/master/pygments/lexer.py#L624-L672

and cause this moderately expensive regex to run once for every character after the opening <script> tag, around here:

https://github.com/pygments/pygments/blob/master/pygments/lexer.py#L669-L670

Also, I believe in this case there are only two possibilities:

  1. we hit a newline and go back into the root state (here: https://github.com/pygments/pygments/blob/master/pygments/lexer.py#L662-L668 )
  2. there is no newline and we consume all remaining characters in the file

So we're guaranteed to spit out a bunch of Error tokens until a newline or the end of the file (whichever comes first). I think we can do this much more efficiently, or perhaps even do better, by adding rules like so:

        'script-content': [
            (r'(<)(\s*)(/)(\s*)(script)(\s*)(>)',
             bygroups(Punctuation, Text, Punctuation, Text, Name.Tag, Text,
                      Punctuation), '#pop'),
            (r'.+?(?=<\s*/\s*script\s*>)', using(JavascriptLexer)),
            (r'.+?\n', using(JavascriptLexer), '#pop'),
            (r'.+', using(JavascriptLexer), '#pop'),
        ],

Note the last two new rules, which parse the rest of the line as Javascript or, failing that (there is no newline), parse the rest of the file as Javascript.

Alternatively, this variant might be nearly equivalent to what currently happens, except it emits a single Error token for the rest of the line (or the rest of the file, if there is no newline) and does so much faster, not in O(n^2) time. I also believe a single Error token would be preferable to an Error token for every character.

        'script-content': [
            (r'(<)(\s*)(/)(\s*)(script)(\s*)(>)',
             bygroups(Punctuation, Text, Punctuation, Text, Name.Tag, Text,
                      Punctuation), '#pop'),
            (r'.+?(?=<\s*/\s*script\s*>)', using(JavascriptLexer)),
            (r'.+?\n', Error, '#pop'),
            (r'.+', Error, '#pop'),
        ],

I haven't really tested either of these :) I wanted to get some discussion before I send a PR.
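
For whichever variant lands, a quick sanity check might look something like this (a hypothetical test, not from any existing suite; it just asserts that lexing finishes quickly and, for the using(JavascriptLexer) variant, that no Error tokens come out):

import time
from pygments.lexers import HtmlLexer
from pygments.token import Error

source = "<script>" + "hi" * 16000  # long single line, no closing tag
start = time.perf_counter()
tokens = list(HtmlLexer().get_tokens(source))
elapsed = time.perf_counter() - start

assert elapsed < 1.0, "lexing took %.1fs" % elapsed
# For the using(JavascriptLexer) variant we would also expect:
assert not any(ttype is Error for ttype, value in tokens)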


gerner commented Nov 27, 2020

There's probably something wrong with my escaping of the newline in those raw strings; maybe it should just be a "$"? Or perhaps I shouldn't use a raw string at all? (Although I believe the regex engine interprets \n in a raw string as a newline escape, so r'.+?\n' may work as written.)


gerner commented Nov 27, 2020

I believe the same thing happens for the <style> case, and I think we can solve it the same way.
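
For example, the analogous rules might look something like this (untested sketch, mirroring the script-content proposal above and assuming the style block is handled via CssLexer):

        'style-content': [
            (r'(<)(\s*)(/)(\s*)(style)(\s*)(>)',
             bygroups(Punctuation, Text, Punctuation, Text, Name.Tag, Text,
                      Punctuation), '#pop'),
            (r'.+?(?=<\s*/\s*style\s*>)', using(CssLexer)),
            (r'.+?\n', using(CssLexer), '#pop'),
            (r'.+', using(CssLexer), '#pop'),
        ],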

This suggests a broader issue: the error handler advances one character at a time and then retries all patterns in the current state. I get why that is, minimal progress marked as Error in an effort to get back to parsing, but it also seems very vulnerable to this kind of O(n^2) issue, depending on how lexers write their patterns.
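
To illustrate the general failure mode outside of any lexer (a toy sketch, not the actual lexer loop): an O(n) regex retried at each of n error positions does O(n^2) total work:

import re
import time

# Same shape as the script-content rule: lazy scan up to a closing tag.
pattern = re.compile(r'.+?(?=<\s*/\s*script\s*>)')
text = "<script>" + "hi" * 8000  # no closing tag, no newline

pos = len("<script>")
start = time.perf_counter()
while pos < len(text):
    if pattern.match(text, pos):  # never matches: scans ahead to end of text each time
        break
    pos += 1  # mimic the error handler: advance one character, then retry
print("%.2fs for %d characters" % (time.perf_counter() - start, len(text)))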

gerner added a commit to gerner/pygments that referenced this issue Nov 27, 2020
Explicitly handle unclosed <script> and <style> tags, which previously
resulted in O(n^2) work lexing an Error token per character up to the
end of the line or end of file (whichever comes first).

Now we try lexing the rest of the line as Javascript/CSS if there's no
closing script/style tag. We recover on the next line in the root state
if there is a newline; otherwise we just keep parsing as Javascript/CSS.

This is similar to how the error handling in lexer.py works, except we
get Javascript or CSS tokens instead of Error tokens. And we get to the
end of the line much faster since we don't apply an O(n) regex for every
character in the line.

I added a new test suite for the HTML lexer (there wasn't one except for
coverage in test_examplefiles.py), including a trivial happy-path case
and several cases around <script> and <style> fragments, including
regression coverage that fails on the old logic.
gerner added a commit to gerner/pygments that referenced this issue Nov 27, 2020 (same commit message as above)
gerner added a commit to gerner/pygments that referenced this issue Dec 3, 2020 (same commit message as above)
Anteru closed this as completed in 78665a4 on Dec 5, 2020