
Unclosed script/style tag handling Fixes #1614 #1615

Merged
merged 1 commit into pygments:master on Dec 5, 2020

Conversation

@gerner (Contributor) commented Nov 27, 2020

Explicitly handle unclosed <script> and <style> tags, which previously
resulted in O(n^2) work: each remaining character, up to the end of the line
or the end of the file (whichever comes first), was lexed as an Error token.

Now we try lexing the rest of the line as Javascript/CSS if there's no
closing script/style tag. We recover on the next line in the root state
if there is a newline, otherwise just keep parsing as Javascript/CSS.

This is similar to how the error handling in lexer.py works, except that we
get Javascript or CSS tokens instead of Error tokens, and we reach the end of
the line much faster because we no longer apply an O(n) regex at every
character in the line.
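
Roughly, the shape of the change looks like the following toy lexer (a sketch for illustration only, not the exact diff to pygments/lexers/html.py; the <style> handling mirrors this with CssLexer):

```python
# Toy lexer illustrating the fallback approach: if there is no closing
# </script> tag, lex to the end of the line as Javascript and pop back to
# the root state so the next line is handled normally; if there is no
# newline either, lex the rest of the input as Javascript.
import re

from pygments.lexer import RegexLexer, using
from pygments.lexers.javascript import JavascriptLexer
from pygments.token import Name, Text


class UnclosedScriptDemoLexer(RegexLexer):
    name = 'UnclosedScriptDemo'
    flags = re.IGNORECASE | re.DOTALL

    tokens = {
        'root': [
            (r'<\s*script\s*>', Name.Tag, 'script-content'),
            (r'[^<]+', Text),
            (r'<', Text),
        ],
        'script-content': [
            # normal case: everything up to a closing </script> tag
            (r'.+?(?=<\s*/\s*script\s*>)', using(JavascriptLexer)),
            (r'<\s*/\s*script\s*>', Name.Tag, '#pop'),
            # fallback: no closing tag; lex to the end of the line as
            # Javascript, then recover in the root state on the next line
            (r'.+?\n', using(JavascriptLexer), '#pop'),
            # no newline left either: lex the rest of the input as Javascript
            (r'.+', using(JavascriptLexer), '#pop'),
        ],
    }
```

Feeding it `'<script>var x = 1;\n<p>hi</p>\n'` yields Javascript tokens for the first line and plain Text for the second, with no Error tokens.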

I added a new test suite for the HTML lexer (previously there was none beyond
the coverage in test_examplefiles.py). It includes a trivial happy-path case
and several cases around <script> and <style> fragments, with regression
coverage that fails under the old logic.
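
One of the regression cases boils down to roughly this (a hypothetical sketch, not the literal test code):

```python
from pygments.lexers.html import HtmlLexer
from pygments.token import Token

# Unclosed <script> tag: the old code emitted one Error token per character
# of the tail; the new code lexes the tail of the line as Javascript instead.
fragment = '<script type="text/javascript">alert("unclosed");\n<div>after</div>\n'
tokens = list(HtmlLexer().get_tokens(fragment))
assert all(ttype is not Token.Error for ttype, value in tokens)
```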

@gerner (Contributor, Author) commented Nov 27, 2020

Also, I ran a real-world webpage (https://www.atlassian.com/git/tutorials/rewriting-history) through pygmentize with HTML formatting on both the old code and the new code and got identical output, so the happy path appears to be unaffected.
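
For anyone who wants to repeat that comparison, the run is essentially the following through the library API (an illustrative sketch; page.html is assumed to be a locally saved copy of the page linked above):

```python
# Rough library-API equivalent of `pygmentize -l html -f html page.html`,
# run once against the old code and once against the new code, then diffed.
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import HtmlLexer

with open('page.html', encoding='utf-8') as f:
    source = f.read()

with open('out.html', 'w', encoding='utf-8') as f:
    f.write(highlight(source, HtmlLexer(), HtmlFormatter()))
```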

@Anteru added this to the 2.7.3 milestone on Dec 1, 2020
Review thread on pygments/lexers/html.py (outdated, resolved)
@Anteru merged commit 78665a4 into pygments:master on Dec 5, 2020