String matching regex in JsonLexer causes catastrophic backtracking #1065

Anteru · 2019-08-31T17:51:43Z

(Original issue 1361 created by howardchris on 2017-07-11T08:45:49.801114+00:00)

The attached text file contains JSON data, downloaded from a request to google.co.uk (the text isn't actually legal JSON, but that's besides the point).

When using the JsonLexer to parse this file, the process hangs for a very long time, and the CPU of one core is maxed out.

The reason is catastrophic backtracking in the regex engine, caused by the regex used to match strings:

#!python
r'"(\\\\|\\"|[^"])*"'

I believe the problem is that the groups in the regex are not mutually exclusive, as the last group can match a backslash character.

The solution I have found is to make each group in the regex mutually exclusive, by adding a backslash character to the final group:

#!python

r'"(\\\\|\\"|[^"\\])*"'

Note that this particular regex is actually in two places (in both the simplevalue and objectvalue sections), and so both regexes should be updated.

A good description of the problem can be found here:

http://www.regular-expressions.info/catastrophic.html

After making these changes, the data in the attached file can be formatted in ~1s on my machine.

The text was updated successfully, but these errors were encountered:

gerner · 2020-08-29T17:25:11Z

I ran into a problem with the same regex. a repro case can be found on this file:

https://github.com/simdjson/simdjson/blob/master/jsonchecker/adversarial/issue150/1005.json

e.g. the following will hang: pygmentize 1005.json

I pulled the regex and pos by asking RegexLexer to log as it applies regexes. This snippet illustrates the issue in isolation.

import re

regex = re.compile(r'"(\\\\|\\"|[^"])*"', re.DOTALL)
example = r'{"\u00D0000\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\63CD'
m = regex.match(example, 1)

print(str(m))

and this snippet illustrates the fix:

import re

regex = re.compile(r'"(\\\\|\\"|[^"\\])*"', re.DOTALL)
example = r'{"\u00D0000\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\63CD'
m = regex.match(example, 1)

print(str(m))

gerner · 2020-08-29T17:26:16Z

Also, I think it's funny that the pygments lexer hangs while trying to process an adversarial example from another parser.

gerner · 2020-08-29T17:50:37Z

not, so fast. the suggested fix doesn't work. it'll fail test: tests/examplefiles/http_response_example

specifically there's an embedded json document in this test that fails to parse, and also fails to parse correctly when using pygmentize on the command line:

[{"contributors_enabled":false,"profile_background_tile":true,"followers_count":644,"protected":false,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/69064242\/gb_normal.jpg","screen_name":"birkenfeld","default_profile_image":false,"following":null,"friends_count":88,"profile_sidebar_fill_color":"7AC3EE","url":"http:\/\/pythonic.pocoo.org\/","name":"Georg Brandl","default_profile":false,"is_translator":false,"utc_offset":3600,"profile_sidebar_border_color":"65B0DA","description":"","profile_background_image_url_https":"https:\/\/si0.twimg.com\/images\/themes\/theme10\/bg.gif","favourites_count":0,"profile_use_background_image":true,"created_at":"Tue Dec 30 22:25:11 +0000 2008","status":{"retweet_count":10,"favorited":false,"geo":null,"possibly_sensitive":false,"coordinates":null,"in_reply_to_screen_name":null,"in_reply_to_status_id_str":null,"retweeted":false,"in_reply_to_status_id":null,"in_reply_to_user_id_str":null,"created_at":"Sat Jul 09 13:42:35 +0000 2011","truncated":false,"id_str":"89690914515206144","contributors":null,"place":null,"source":"web","in_reply_to_user_id":null,"id":89690914515206144,"retweeted_status":{"retweet_count":10,"favorited":false,"geo":null,"possibly_sensitive":false,"coordinates":null,"in_reply_to_screen_name":null,"in_reply_to_status_id_str":null,"retweeted":false,"in_reply_to_status_id":null,"in_reply_to_user_id_str":null,"created_at":"Sat Jul 09 13:07:04 +0000 2011","truncated":false,"id_str":"89681976755372032","contributors":null,"place":null,"source":"web","in_reply_to_user_id":null,"id":89681976755372032,"text":"Excellent Python posts from @mitsuhiko - http:\/\/t.co\/k1wt6e4 and @ncoghlan_dev - http:\/\/t.co\/eTxacgZ (links fixed)"},"text":"RT @jessenoller: Excellent Python posts from @mitsuhiko - http:\/\/t.co\/k1wt6e4 and @ncoghlan_dev - http:\/\/t.co\/eTxacgZ (links fixed)"},"follow_request_sent":null,"statuses_count":553,"geo_enabled":false,"notifications":null,"profile_text_color":"3D1957","id_str":"18490730","lang":"en","profile_background_image_url":"http:\/\/a1.twimg.com\/images\/themes\/theme10\/bg.gif","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/69064242\/gb_normal.jpg","show_all_inline_media":true,"listed_count":65,"profile_link_color":"FF0000","verified":false,"id":18490730,"time_zone":"Berlin","profile_background_color":"642D8B","location":"Bavaria, Germany"}]

The specific issue is that this string isn't matched properly in the proposed fix:

"http:\/\/a0.twimg.com\/profile_images\/69064242\/gb_normal.jpg"

Anteru · 2020-08-29T18:14:38Z

I fixed a very similar issue here: ab0537f, but the solution was to go to a different state to not have the problematic regex. I'm obviously curious to see if the regex can be massaged into shape, but I'm not super optimistic it can.

gerner · 2020-08-29T18:26:22Z

You're suggesting changing the parser to, instead of trying to parse the quoted string in one shot, recognize the start of the string and then move to parsing the inside of the string and then move back to the outer state?

This would be simpler than for ruby which supports two kinds of strings (single and double), whereas json only supports one. But this would still be a somewhat complex change because both object names and "simplevalue" will both need to dive into new states for string parsing, whereas your fix for ruby was on top of code already diving into a string parsing state. Does that sound correct?

Sorry, I'm not super familiar with the pygments model of lexing (or lexing in general). But I am motivated to fix this.

gerner · 2020-08-29T19:29:58Z

rather than adding new states what about this?

r'"(\\(["\\/bfnrt]|u[a-fA-F0-9]]{4})|[^\\"])+"'

I'm about 95% sure what the original regex was trying to accomplish and I think this one should capture the same things. Here's what this one does:

match an opening quote
match either an escaped character ("/bfnrt as per json spec) or a non escape or quote character
match a closing quote

the whole thing turns into Name.Tag in the case of object attribute name or String.Double in the case of value. Tests all pass with this and the 1005.json adversarial string is totally unmatched and doesn't hang because of catastrophic backtracking.

I did code up a solution with more states, but I don't like that it splits the strings into many tokens and I needed one state for the attribute name and one for the string value which seemed clunky.

I'll submit a PR, unless there is some objection.

)

) (pygments#1528) * more explicitly define escape sequencies in JsonLexer (fix pygments#1065) * adding test coverage for pygments#1065

Anteru added T-bug type: a bug X-imported imported from Bitbucket S-major severity: major labels Aug 31, 2019

Anteru self-assigned this Aug 29, 2020

gerner added a commit to gerner/pygments that referenced this issue Aug 29, 2020

more explicitly define escape sequencies in JsonLexer (fix pygments#1065

1cb2d0a

)

gerner mentioned this issue Aug 29, 2020

more explicitly define escape sequencies in JsonLexer (fix #1065) #1528

Merged

gerner added a commit to gerner/pygments that referenced this issue Aug 30, 2020

adding test coverage for pygments#1065

ec77778

Anteru closed this as completed in 9514e79 Aug 31, 2020

Anteru added this to the 2.7 milestone Aug 31, 2020

Anteru added changelog-update Items which need to get mentioned in the changelog and removed changelog-update Items which need to get mentioned in the changelog labels Aug 31, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

String matching regex in JsonLexer causes catastrophic backtracking #1065

String matching regex in JsonLexer causes catastrophic backtracking #1065

Anteru commented Aug 31, 2019

gerner commented Aug 29, 2020

gerner commented Aug 29, 2020

gerner commented Aug 29, 2020 •

edited

Anteru commented Aug 29, 2020

gerner commented Aug 29, 2020

gerner commented Aug 29, 2020

String matching regex in JsonLexer causes catastrophic backtracking #1065

String matching regex in JsonLexer causes catastrophic backtracking #1065

Comments

Anteru commented Aug 31, 2019

gerner commented Aug 29, 2020

gerner commented Aug 29, 2020

gerner commented Aug 29, 2020 • edited

Anteru commented Aug 29, 2020

gerner commented Aug 29, 2020

gerner commented Aug 29, 2020

gerner commented Aug 29, 2020 •

edited