String matching regex in JsonLexer causes catastrophic backtracking #1065
I ran into a problem with the same regex. A repro case can be found in this file: https://github.com/simdjson/simdjson/blob/master/jsonchecker/adversarial/issue150/1005.json; e.g. lexing it will hang. I pulled the regex and position by asking RegexLexer to log each regex as it applies it. This snippet illustrates the issue in isolation:
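The snippet itself isn't preserved in this copy of the thread. A minimal reconstruction, assuming the JsonLexer string pattern of the era was r'"(\\\\|\\"|[^"])*"' (the historical form of the regex discussed in this issue):

```python
import re

# Reconstructed JsonLexer string pattern (assumption, not quoted verbatim
# from pygments): the alternatives \\\\, \\" and [^"] are NOT mutually
# exclusive, because [^"] also matches a lone backslash.
OLD_STRING_RE = re.compile(r'"(\\\\|\\"|[^"])*"')

# 1005.json-style adversarial input: an unterminated string that is just a
# long run of backslashes.  A failing match must try every way of splitting
# the run between \\\\ (two chars) and [^"] (one char), and the number of
# splits grows ~1.6x per character; at 50+ backslashes this effectively hangs.
adversarial = '"' + '\\' * 24   # short enough to still fail in well under a second

assert OLD_STRING_RE.match('"hello \\"world\\""')   # fine on valid input
assert OLD_STRING_RE.match(adversarial) is None     # exponential-time failure
```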
and this snippet illustrates the fix:
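The fix snippet is likewise missing from this copy; based on the original report quoted at the bottom of this issue, the proposed change excludes the backslash from the catch-all group. A sketch:

```python
import re

# Proposed fix (reconstruction): make the alternatives mutually exclusive by
# excluding the backslash from the final group.
PROPOSED_STRING_RE = re.compile(r'"(\\\\|\\"|[^\\"])*"')

# Now a run of backslashes can only be consumed by \\\\, two at a time, so a
# failing match backtracks linearly instead of exponentially.
adversarial = '"' + '\\' * 1000   # unterminated string; fails instantly now
assert PROPOSED_STRING_RE.match(adversarial) is None
assert PROPOSED_STRING_RE.match('"hello \\"world\\""')
```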
Also, I think it's funny that the pygments lexer hangs while trying to process an adversarial example from another parser.
Not so fast: the suggested fix doesn't work. It fails a test; specifically, there's an embedded JSON document in that test that fails to parse, and it also fails to parse correctly when using pygmentize on the command line.
The specific issue is that this string isn't matched properly by the proposed fix:
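The exact failing string isn't preserved in this copy of the thread, but any JSON string containing an escape other than \\ or \" reproduces the failure, since the proposed pattern has no alternative that can consume the backslash. For example (illustrative input, not the string from the test):

```python
import re

proposed = re.compile(r'"(\\\\|\\"|[^\\"])*"')

# A JSON string containing \n (or \t, \/, \uXXXX, ...): the backslash is
# excluded from [^\\"], and neither \\\\ nor \\" matches backslash-then-n,
# so the match fails on perfectly legal JSON.
assert proposed.fullmatch(r'"line1\nline2"') is None

# The original pattern (the one with the backtracking bug) did match it:
original = re.compile(r'"(\\\\|\\"|[^"])*"')
assert original.fullmatch(r'"line1\nline2"')
```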
I fixed a very similar issue here: ab0537f, but the solution was to go to a different state so as not to need the problematic regex. I'm obviously curious to see whether the regex can be massaged into shape, but I'm not super optimistic it can.
You're suggesting changing the parser so that, instead of trying to parse the quoted string in one shot, it recognizes the start of the string, moves to a state for parsing the inside of the string, and then moves back to the outer state? This would be simpler than for Ruby, which supports two kinds of strings (single- and double-quoted), whereas JSON only supports one. But it would still be a somewhat complex change, because both object names and "simplevalue" would need to dive into new states for string parsing, whereas your fix for Ruby was on top of code that was already diving into a string-parsing state. Does that sound correct? Sorry, I'm not super familiar with the pygments model of lexing (or lexing in general), but I am motivated to fix this.
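To make the multi-state idea concrete, here is a hand-rolled sketch (plain Python, deliberately not the pygments RegexLexer API) of lexing a string via a dedicated state with simple, non-overlapping patterns:

```python
import re

# Hypothetical sketch of the state-based approach: on '"', enter a 'string'
# state whose patterns each consume a definite chunk, so no backtracking
# spans the whole string.  The downside noted below: the string comes out
# as several tokens instead of one.
ESCAPE = re.compile(r'\\(["\\/bfnrt]|u[0-9a-fA-F]{4})')
PLAIN = re.compile(r'[^\\"]+')

def tokenize_string(text, pos):
    """Lex one JSON string starting at text[pos] == '"'.

    Returns (tokens, new_pos).
    """
    assert text[pos] == '"'
    tokens = [('StringStart', '"')]
    pos += 1
    while pos < len(text) and text[pos] != '"':
        m = ESCAPE.match(text, pos) or PLAIN.match(text, pos)
        if not m:
            raise ValueError('invalid escape at position %d' % pos)
        tokens.append(('Escape' if m.re is ESCAPE else 'Plain', m.group()))
        pos = m.end()
    if pos == len(text):
        raise ValueError('unterminated string')
    tokens.append(('StringEnd', '"'))
    return tokens, pos + 1

toks, end = tokenize_string(r'"a\nb"', 0)
```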
Rather than adding new states, what about this? r'"(\\(["\\/bfnrt]|u[a-fA-F0-9]{4})|[^\\"])+"' I'm about 95% sure what the original regex was trying to accomplish, and I think this one should capture the same things. Here's what this one does:
The whole thing turns into Name.Tag in the case of an object attribute name, or String.Double in the case of a value. Tests all pass with this, and the 1005.json adversarial string is left completely unmatched and doesn't hang from catastrophic backtracking. I did code up a solution with more states, but I don't like that it splits the strings into many tokens, and I needed one state for the attribute name and another for the string value, which seemed clunky. I'll submit a PR unless there is some objection.
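A quick sketch of why this pattern is safe: an escape is a backslash followed by exactly one legal escape character (or a \uXXXX sequence), and everything else must be a non-backslash, non-quote character, so the two alternatives can never both match at the same position:

```python
import re

# Explicit-escapes pattern from the comment above: the alternatives are
# mutually exclusive on their first character, so backtracking is linear.
explicit = re.compile(r'"(\\(["\\/bfnrt]|u[a-fA-F0-9]{4})|[^\\"])+"')

assert explicit.fullmatch(r'"a\nb"')            # simple escape
assert explicit.fullmatch(r'"\u00e9"')          # unicode escape
assert explicit.fullmatch(r'"say \"hi\"!"')     # escaped quotes
assert explicit.fullmatch(r'"\x41"') is None    # \x is not legal JSON
# The adversarial unterminated backslash run now fails immediately:
assert explicit.match('"' + '\\' * 10000) is None
```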
(pygments#1528): * more explicitly define escape sequences in JsonLexer (fix pygments#1065) * add test coverage for pygments#1065
(Original issue 1361 created by howardchris on 2017-07-11T08:45:49.801114+00:00)
The attached text file contains JSON data, downloaded from a request to google.co.uk (the text isn't actually legal JSON, but that's beside the point).
When using the JsonLexer to parse this file, the process hangs for a very long time, and the CPU of one core is maxed out.
The reason is catastrophic backtracking in the regex engine, caused by the regex used to match strings:
I believe the problem is that the groups in the regex are not mutually exclusive, as the last group can match a backslash character.
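To see why the overlap is catastrophic: a failing match must try every way of splitting a run of n backslashes between the two-character escaped-backslash alternative and the one-character catch-all, and the number of such splits is a Fibonacci number, so it grows exponentially. A small illustration:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def partitions(n):
    """Ways to consume a run of n backslashes using pieces of length 2
    (the escaped-backslash alternative) and length 1 (the catch-all group)."""
    if n <= 1:
        return 1
    return partitions(n - 1) + partitions(n - 2)

# A failing overall match forces the engine to try roughly all of them:
print(partitions(10))   # 89
print(partitions(50))   # 20365011074 -- why a 50-backslash run hangs
```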
The solution I have found is to make each group in the regex mutually exclusive, by adding a backslash character to the final group:
Note that this particular regex actually appears in two places (in both the simplevalue and objectvalue sections), so both occurrences should be updated.
A good description of the problem can be found here:
http://www.regular-expressions.info/catastrophic.html
After making these changes, the data in the attached file can be formatted in ~1s on my machine.