Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String matching regex in JsonLexer causes catastrophic backtracking #1065

Closed
Anteru opened this issue Aug 31, 2019 · 6 comments
Closed

String matching regex in JsonLexer causes catastrophic backtracking #1065

Anteru opened this issue Aug 31, 2019 · 6 comments
Assignees
Labels
S-major severity: major T-bug type: a bug X-imported imported from Bitbucket
Milestone

Comments

@Anteru
Copy link
Collaborator

Anteru commented Aug 31, 2019

(Original issue 1361 created by howardchris on 2017-07-11T08:45:49.801114+00:00)

The attached text file contains JSON data, downloaded from a request to google.co.uk (the text isn't actually legal JSON, but that's besides the point).

When using the JsonLexer to parse this file, the process hangs for a very long time, and the CPU of one core is maxed out.

The reason is catastrophic backtracking in the regex engine, caused by the regex used to match strings:

#!python
r'"(\\\\|\\"|[^"])*"'

I believe the problem is that the groups in the regex are not mutually exclusive, as the last group can match a backslash character.

The solution I have found is to make each group in the regex mutually exclusive, by adding a backslash character to the final group:

#!python

r'"(\\\\|\\"|[^"\\])*"'

Note that this particular regex is actually in two places (in both the simplevalue and objectvalue sections), and so both regexes should be updated.

A good description of the problem can be found here:

http://www.regular-expressions.info/catastrophic.html

After making these changes, the data in the attached file can be formatted in ~1s on my machine.

@Anteru Anteru added T-bug type: a bug X-imported imported from Bitbucket S-major severity: major labels Aug 31, 2019
@gerner
Copy link
Contributor

gerner commented Aug 29, 2020

I ran into a problem with the same regex. a repro case can be found on this file:

https://github.com/simdjson/simdjson/blob/master/jsonchecker/adversarial/issue150/1005.json

e.g. the following will hang: pygmentize 1005.json

I pulled the regex and pos by asking RegexLexer to log as it applies regexes. This snippet illustrates the issue in isolation.

import re

regex = re.compile(r'"(\\\\|\\"|[^"])*"', re.DOTALL)
example = r'{"\u00D0000\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\63CD'
m = regex.match(example, 1)

print(str(m))

and this snippet illustrates the fix:

import re

regex = re.compile(r'"(\\\\|\\"|[^"\\])*"', re.DOTALL)
example = r'{"\u00D0000\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\63CD'
m = regex.match(example, 1)

print(str(m))

@gerner
Copy link
Contributor

gerner commented Aug 29, 2020

Also, I think it's funny that the pygments lexer hangs while trying to process an adversarial example from another parser.

@Anteru Anteru self-assigned this Aug 29, 2020
@gerner
Copy link
Contributor

gerner commented Aug 29, 2020

not, so fast. the suggested fix doesn't work. it'll fail test: tests/examplefiles/http_response_example

specifically there's an embedded json document in this test that fails to parse, and also fails to parse correctly when using pygmentize on the command line:

[{"contributors_enabled":false,"profile_background_tile":true,"followers_count":644,"protected":false,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/69064242\/gb_normal.jpg","screen_name":"birkenfeld","default_profile_image":false,"following":null,"friends_count":88,"profile_sidebar_fill_color":"7AC3EE","url":"http:\/\/pythonic.pocoo.org\/","name":"Georg Brandl","default_profile":false,"is_translator":false,"utc_offset":3600,"profile_sidebar_border_color":"65B0DA","description":"","profile_background_image_url_https":"https:\/\/si0.twimg.com\/images\/themes\/theme10\/bg.gif","favourites_count":0,"profile_use_background_image":true,"created_at":"Tue Dec 30 22:25:11 +0000 2008","status":{"retweet_count":10,"favorited":false,"geo":null,"possibly_sensitive":false,"coordinates":null,"in_reply_to_screen_name":null,"in_reply_to_status_id_str":null,"retweeted":false,"in_reply_to_status_id":null,"in_reply_to_user_id_str":null,"created_at":"Sat Jul 09 13:42:35 +0000 2011","truncated":false,"id_str":"89690914515206144","contributors":null,"place":null,"source":"web","in_reply_to_user_id":null,"id":89690914515206144,"retweeted_status":{"retweet_count":10,"favorited":false,"geo":null,"possibly_sensitive":false,"coordinates":null,"in_reply_to_screen_name":null,"in_reply_to_status_id_str":null,"retweeted":false,"in_reply_to_status_id":null,"in_reply_to_user_id_str":null,"created_at":"Sat Jul 09 13:07:04 +0000 2011","truncated":false,"id_str":"89681976755372032","contributors":null,"place":null,"source":"web","in_reply_to_user_id":null,"id":89681976755372032,"text":"Excellent Python posts from @mitsuhiko - http:\/\/t.co\/k1wt6e4 and @ncoghlan_dev - http:\/\/t.co\/eTxacgZ (links fixed)"},"text":"RT @jessenoller: Excellent Python posts from @mitsuhiko - http:\/\/t.co\/k1wt6e4 and @ncoghlan_dev - http:\/\/t.co\/eTxacgZ (links fixed)"},"follow_request_sent":null,"statuses_count":553,"geo_enabled":false,"notifications":null,"profile_text_color":"3D1957","id_str":"18490730","lang":"en","profile_background_image_url":"http:\/\/a1.twimg.com\/images\/themes\/theme10\/bg.gif","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/69064242\/gb_normal.jpg","show_all_inline_media":true,"listed_count":65,"profile_link_color":"FF0000","verified":false,"id":18490730,"time_zone":"Berlin","profile_background_color":"642D8B","location":"Bavaria, Germany"}]

The specific issue is that this string isn't matched properly in the proposed fix:

"http:\/\/a0.twimg.com\/profile_images\/69064242\/gb_normal.jpg"

@Anteru
Copy link
Collaborator Author

Anteru commented Aug 29, 2020

I fixed a very similar issue here: ab0537f, but the solution was to go to a different state to not have the problematic regex. I'm obviously curious to see if the regex can be massaged into shape, but I'm not super optimistic it can.

@gerner
Copy link
Contributor

gerner commented Aug 29, 2020

You're suggesting changing the parser to, instead of trying to parse the quoted string in one shot, recognize the start of the string and then move to parsing the inside of the string and then move back to the outer state?

This would be simpler than for ruby which supports two kinds of strings (single and double), whereas json only supports one. But this would still be a somewhat complex change because both object names and "simplevalue" will both need to dive into new states for string parsing, whereas your fix for ruby was on top of code already diving into a string parsing state. Does that sound correct?

Sorry, I'm not super familiar with the pygments model of lexing (or lexing in general). But I am motivated to fix this.

@gerner
Copy link
Contributor

gerner commented Aug 29, 2020

rather than adding new states what about this?

r'"(\\(["\\/bfnrt]|u[a-fA-F0-9]]{4})|[^\\"])+"'

I'm about 95% sure what the original regex was trying to accomplish and I think this one should capture the same things. Here's what this one does:

  1. match an opening quote
  2. match either an escaped character ("/bfnrt as per json spec) or a non escape or quote character
  3. match a closing quote

the whole thing turns into Name.Tag in the case of object attribute name or String.Double in the case of value. Tests all pass with this and the 1005.json adversarial string is totally unmatched and doesn't hang because of catastrophic backtracking.

I did code up a solution with more states, but I don't like that it splits the strings into many tokens and I needed one state for the attribute name and one for the string value which seemed clunky.

I'll submit a PR, unless there is some objection.

gerner added a commit to gerner/pygments that referenced this issue Aug 29, 2020
gerner added a commit to gerner/pygments that referenced this issue Aug 30, 2020
@Anteru Anteru closed this as completed in 9514e79 Aug 31, 2020
@Anteru Anteru added this to the 2.7 milestone Aug 31, 2020
@Anteru Anteru added changelog-update Items which need to get mentioned in the changelog and removed changelog-update Items which need to get mentioned in the changelog labels Aug 31, 2020
Kenny2github pushed a commit to Kenny2github/pygments that referenced this issue Sep 22, 2020
) (pygments#1528)

* more explicitly define escape sequencies in JsonLexer (fix pygments#1065)

* adding test coverage for pygments#1065
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-major severity: major T-bug type: a bug X-imported imported from Bitbucket
Projects
None yet
Development

No branches or pull requests

2 participants