more explicitly define escape sequencies in JsonLexer (fix #1065) #1528
I should emphasize that this change is less forgiving about syntax errors. For instance, this example parses as one might expect (a pair of key/value attributes) on 2.6.1 but gets a little confused on this branch:
I'm not sure what the project's philosophy is on being forgiving when parsing technically invalid syntax.
Thanks for the investigation! We do want to be unforgiving. I'd rather highlight anything slightly off as invalid syntax than have the parser hang indefinitely :) Would you mind adding a test case as part of this commit as well? Just to make sure we never regress here.
Also looking at this change:
A test case would help. I manually tested on various files and got the expected results, at least as far as Pygments console coloring goes. I'll add a test case or two or three. The whole string is: r'"(\\(["\\/bfnrt]|u[a-fA-F0-9]{4})|[^\\"])*"' Breaking down what that is meant to accomplish:
It does strike me that the r-string might be leaving some extra backslashes where I do not mean them, but that would make my manual tests invalid, I think? I'll verify with more testing. In step (2), all the cases are mutually exclusive (as per the point @Anteru makes in #1065), avoiding any backtracking. Step (3) above (the trailing quote) is protected from escaping because it cannot have a prior backslash: the preceding regex would either consume that backslash to escape some other character, or fail to match and cause the whole match to fail.
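To make the breakdown above concrete, here is a minimal sketch that runs the quoted pattern through Python's standard `re` module. It assumes the intended quantifier is `[a-fA-F0-9]{4}` (four hex digits), and `match_string` is a hypothetical helper used only for illustration; it is not part of Pygments.

```python
import re

# Pattern from the comment above, assuming the hex class is followed by {4}.
#   "                      opening quote
#   \\(["\\/bfnrt]|u....)  a backslash followed by a valid escape
#   [^\\"]                 any character that is neither backslash nor quote
#   "                      closing quote
JSON_STRING = re.compile(r'"(\\(["\\/bfnrt]|u[a-fA-F0-9]{4})|[^\\"])*"')


def match_string(text):
    """Return the string literal at the start of `text`, or None (hypothetical helper)."""
    m = JSON_STRING.match(text)
    return m.group(0) if m else None


for sample in ['"abc"', r'"a\"b" rest', r'"\u00e9"', r'"\x"', r'"abc\"']:
    print(repr(sample), '->', repr(match_string(sample)))
```

The last two samples are rejected: `\x` is not a listed escape, and in `"abc\"` the only remaining quote is consumed by the escape branch, so no closing quote is left.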
Thanks for the explanation. My point is that you're using
Gah, GFM ate my non-escaped backslash in my comment in (2.2), sorry. I think I need to exclude the backslash to make that case mutually exclusive with the escape sequence case; otherwise I end up with catastrophic backtracking again. In fact, I just tried without excluding the backslash and it hangs on my 1005.json case. Is your concern that this is redundant, or that it's incorrect? It seems logically redundant, but it avoids the backtracking.
I was concerned it's incorrect, but that seems to clear it up, thanks for checking :) I gotta say when it comes to catastrophic backtracking I'd rather double check everything before ending up with problems down the line -- thanks for your help here.
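One way to see why excluding the backslash matters is to count how many ways a run of backslashes can be split into the two alternatives. The sketch below is only an illustration, not code from the PR: `count_parses` is a hypothetical helper, and the "overlapping" variant is assumed to be the same pattern with `[^"]` in place of `[^\\"]`. Roughly speaking, when the closing quote is missing, a backtracking engine may have to try each of these splits before giving up.

```python
import re

# Alternatives for the string body. ESCAPE comes from the pattern discussed
# above; the two PLAIN variants differ only in whether backslash is excluded.
ESCAPE = re.compile(r'\\(["\\/bfnrt]|u[a-fA-F0-9]{4})')
PLAIN_EXCLUSIVE = re.compile(r'[^\\"]')   # never matches a backslash
PLAIN_OVERLAPPING = re.compile(r'[^"]')   # also matches a lone backslash


def count_parses(body, plain):
    """Count the ways `body` splits into ESCAPE / `plain` tokens (hypothetical helper)."""
    ways = [0] * (len(body) + 1)
    ways[0] = 1  # one way to parse the empty prefix
    for i in range(len(body)):
        if ways[i] == 0:
            continue
        for pattern in (ESCAPE, plain):
            m = pattern.match(body, i)
            if m:
                ways[m.end()] += ways[i]
    return ways[len(body)]


body = "\\" * 20  # twenty backslash characters, no quotes
exclusive = count_parses(body, PLAIN_EXCLUSIVE)
overlapping = count_parses(body, PLAIN_OVERLAPPING)
print(exclusive, overlapping)
```

With the exclusive class there is a single way to split the input, while the overlapping class admits a number of splits that grows like the Fibonacci sequence in the input length.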
Are you still up for adding a test case? If so, I'll hold off on the merge.
As I point out in the test case, I don't think the test case would fail on the old code, per se. Instead it would just hang, which is unfortunate. Also, I'm not thrilled that it's returning a long sequence of single-character Error tokens. I'm not sure if that's the preferred way for the lexers to work. I experimented with explicitly capturing at least some error cases (tagging them as errors, but as single-token errors). Might be preferable? That does improve the highlighting of later (valid) parts of the object.
Also, I wanted to thank you for working on this library and for engaging with me so helpfully.
I'm not planning any more test coverage than the current test I added that covers the backtracking case. I'm not sure if you're hoping for more than that.
Nope, that's all I was hoping for. Thanks again for your help!
) (pygments#1528)
* more explicitly define escape sequencies in JsonLexer (fix pygments#1065)
* adding test coverage for pygments#1065
As discussed in #1065 the existing regex for matching strings suffers from catastrophic backtracking on some strings. This updated regex is more explicit (and less forgiving) about handling escape sequences.
Tests pass, and the https://github.com/simdjson/simdjson/blob/master/jsonchecker/adversarial/issue150/1005.json example no longer causes a hang.