
Unclosed script/style tag handling Fixes #1614 #1615

Merged
merged 1 commit into pygments:master on Dec 5, 2020

Conversation

@gerner (Contributor) commented Nov 27, 2020

Explicitly handle unclosed <script> and <style> tags, which previously
resulted in O(n^2) work: each remaining character, up to the end of the line
or the end of the file (whichever comes first), was lexed as an Error token.

Now we try lexing the rest of the line as Javascript/CSS if there's no
closing script/style tag. We recover on the next line in the root state
if there is a newline, otherwise just keep parsing as Javascript/CSS.

This is similar to how the error handling in lexer.py works, except that we
get Javascript or CSS tokens instead of Error tokens, and we reach the end of
the line much faster because we no longer apply an O(n) regex at every
character in the line.
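
Roughly, the shape of the change looks like the following toy lexer (a sketch for illustration only, not the exact diff to pygments/lexers/html.py; the <style> handling mirrors this with CssLexer):

```python
# Toy lexer illustrating the fallback approach: if there is no closing
# </script> tag, lex to the end of the line as Javascript and pop back to
# the root state so the next line is handled normally; if there is no
# newline either, lex the rest of the input as Javascript.
import re

from pygments.lexer import RegexLexer, using
from pygments.lexers.javascript import JavascriptLexer
from pygments.token import Name, Text


class UnclosedScriptDemoLexer(RegexLexer):
    name = 'UnclosedScriptDemo'
    flags = re.IGNORECASE | re.DOTALL

    tokens = {
        'root': [
            (r'<\s*script\s*>', Name.Tag, 'script-content'),
            (r'[^<]+', Text),
            (r'<', Text),
        ],
        'script-content': [
            # normal case: everything up to a closing </script> tag
            (r'.+?(?=<\s*/\s*script\s*>)', using(JavascriptLexer)),
            (r'<\s*/\s*script\s*>', Name.Tag, '#pop'),
            # fallback: no closing tag; lex to the end of the line as
            # Javascript, then recover in the root state on the next line
            (r'.+?\n', using(JavascriptLexer), '#pop'),
            # no newline left either: lex the rest of the input as Javascript
            (r'.+', using(JavascriptLexer), '#pop'),
        ],
    }
```

Feeding it `'<script>var x = 1;\n<p>hi</p>\n'` yields Javascript tokens for the first line and plain Text for the second, with no Error tokens.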

I added a new test suite for the HTML lexer (previously there was none beyond
the coverage in test_examplefiles.py). It includes a trivial happy-path case
and several cases around <script> and <style> fragments, with regression
coverage that fails under the old logic.
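
One of the regression cases boils down to roughly this (a hypothetical sketch, not the literal test code):

```python
from pygments.lexers.html import HtmlLexer
from pygments.token import Token

# Unclosed <script> tag: the old code emitted one Error token per character
# of the tail; the new code lexes the tail of the line as Javascript instead.
fragment = '<script type="text/javascript">alert("unclosed");\n<div>after</div>\n'
tokens = list(HtmlLexer().get_tokens(fragment))
assert all(ttype is not Token.Error for ttype, value in tokens)
```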

@gerner (Contributor, Author) commented Nov 27, 2020

Also, I ran a real-world webpage (https://www.atlassian.com/git/tutorials/rewriting-history) through pygmentize with HTML formatting on both the old code and the new code and got identical output, so the happy path appears to be unaffected.
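
For anyone who wants to repeat that comparison, the run is essentially the following through the library API (an illustrative sketch; page.html is assumed to be a locally saved copy of the page linked above):

```python
# Rough library-API equivalent of `pygmentize -l html -f html page.html`,
# run once against the old code and once against the new code, then diffed.
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import HtmlLexer

with open('page.html', encoding='utf-8') as f:
    source = f.read()

with open('out.html', 'w', encoding='utf-8') as f:
    f.write(highlight(source, HtmlLexer(), HtmlFormatter()))
```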

@Anteru added this to the 2.7.3 milestone on Dec 1, 2020
Review thread on pygments/lexers/html.py (outdated, resolved)
@Anteru merged commit 78665a4 into pygments:master on Dec 5, 2020