Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

more explicitly define escape sequencies in JsonLexer (fix #1065) #1528

Merged
merged 2 commits into from Aug 31, 2020

Conversation

gerner
Copy link
Contributor

@gerner gerner commented Aug 29, 2020

As discussed in #1065 the existing regex for matching strings suffers from catastrophic backtracking on some strings. This updated regex is more explicit (and less forgiving) about handling escape sequences.

Tests pass and the https://github.com/simdjson/simdjson/blob/master/jsonchecker/adversarial/issue150/1005.json example doesn't cause a hang.

@gerner
Copy link
Contributor Author

gerner commented Aug 29, 2020

I should emphasize that this change is less forgiving about syntax errors. For instance, this example parses as one might expect (a pair of key/value attributes) on 2.6.1 but gets a little confused on this branch:

{"\lsomething bad": 200, "something good": "something \l \\\ not good"}

I'm not sure what the philosophy on being forgiving in parsing technically invalid syntax is.

@Anteru
Copy link
Collaborator

Anteru commented Aug 30, 2020

Thanks for the investigation! We do want to be unforgiving. I'd rather highlight anything slightly off as invalid syntax than have the parser hang indefinitely :)

Would you mind adding a test case as part of this commit as well? Just to make sure we never regress here.

@Anteru Anteru self-assigned this Aug 30, 2020
@Anteru
Copy link
Collaborator

Anteru commented Aug 30, 2020

Also looking at this change: [^\\"], that should check for \\ or ", but not the sequence. Are you sure that's what you want (or am I missing one level of indirection here?)

@gerner
Copy link
Contributor Author

gerner commented Aug 30, 2020

A test case would help. I manually tested on various files and get expected results, at least from pygments console coloring point of view expected results. I'll add a test case or two or three.

the whole string is:

r'"(\\(["\\/bfnrt]|u[a-fA-F0-9]]{4})|[^\\"])*"'

breaking that down what that is meant to accomplish:

  1. match a leading quote.
  2. match zero or more of:
    • an escape character followed by one of:
      • ",,/,b,f,n,r,t (the valid json escape characters as per the json.org spec)
      • "u" followed by 4 hex digits (json escaped unicode character)
    • any character except \, " (the characters valid inside a quoted string without escaping as per json.org spec)
  3. match a trailing quote

it does strike me that the r-string might be leaving some extra backslashes where I do not mean them. But that would make my manual tests invalid I think? I'll verify with more testing.

in step (2) all the cases are mutually exclusive (as per the point @Anteru makes in #1065), avoiding any backtracking

step (3) above (trailing quote) is protected from escaping because it cannot have a prior backslash by virtue of the preceding regex which would either consume it and escape some other character or wouldn't match the regex and cause the whole match to fail.

@Anteru
Copy link
Collaborator

Anteru commented Aug 30, 2020

Thanks for the explanation. My point is that you're using ["] (without backslash) but you're using [^\\"], and while I do think that works because of the grouping, it would be good to have a test for it. 2.2 should be "any character except \\, "" based on your regex, not any character except ". I think [^"] will work just fine.

@gerner
Copy link
Contributor Author

gerner commented Aug 30, 2020

gah, GFM ate my non-escaped backslash in my comment in (2.2), sorry.

I think I need to exclude the backslash to make that case mutually exclusive with the escape sequence case. Otherwise I end up with catastrophic backtracking again.

In fact, I just tried without excluding backslash and it hangs on my 1005.json case.

Is your concern that this is redundant? Or that it's incorrect? It seems like it is logically redundant, but avoids the backtracking.

@Anteru
Copy link
Collaborator

Anteru commented Aug 30, 2020

I was concerned it's incorrect, but that seems to clear it up, thanks for checking :) I gotta say when it comes to catastrophic backtracking I'd rather double check everything before ending up with problems down the line -- thanks for your help here.

@Anteru
Copy link
Collaborator

Anteru commented Aug 30, 2020

Are you still up for adding a testcase? If so I'll wait with the merge.

@gerner
Copy link
Contributor Author

gerner commented Aug 30, 2020

As I point out in the test case, I don't think the test case would fail on the old code, per se. Instead it would just hang, which is unfortunate.

Also, I'm not thrilled that it's returning a long sequence of single character Error tokens. I'm not sure if that's the preferred way for the lexers to work. I experimented with trying to explicitly capture at least some error cases (and tagging them as erorrs, but as single token errors). Might be preferrable? That does improve the highlighting of later (valid) parts of the object.

@gerner
Copy link
Contributor Author

gerner commented Aug 30, 2020

Also, I wanted to thank you for working on this library and for engaging with me so helpfully.

@gerner
Copy link
Contributor Author

gerner commented Aug 31, 2020

I'm not planning any more test coverage than the current test I added that covers the backtracking case. I'm not sure if you're hoping for more than that.

@Anteru
Copy link
Collaborator

Anteru commented Aug 31, 2020

Nope, that's all I was hoping for. Thanks again for your help!

@Anteru Anteru added this to the 2.7 milestone Aug 31, 2020
@Anteru Anteru merged commit 9514e79 into pygments:master Aug 31, 2020
Kenny2github pushed a commit to Kenny2github/pygments that referenced this pull request Sep 22, 2020
) (pygments#1528)

* more explicitly define escape sequencies in JsonLexer (fix pygments#1065)

* adding test coverage for pygments#1065
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants