
String literals with many escapes take a long time to tokenize #61

Closed
gmlevin opened this issue Nov 11, 2014 · 4 comments · Fixed by #347

Comments

@gmlevin

gmlevin commented Nov 11, 2014

Running the following program on the following data (saved as x.c) takes a very long time to parse:

import pycparser

if __name__ == '__main__':
    ast = pycparser.parse_file("x.c")
    ast.show()

The contents of x.c:

int main(int argc, char** argv) {
    "\123\123\123\123\123\123\123\123\123\123\123\123\123\123\123";
}

I think the problem is that the BAD_STRING_LITERAL regular expression is doing backtracking, resulting in exponential running time. Add another '\123' to the string and it will take roughly 3 times as long.

If I break the BAD_STRING_LITERAL pattern by requiring it to start with some unexpected string such as 'xyzzy and then some', so that it never starts to match, the program runs quickly.
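
For illustration, here is a rough standalone reproduction of the slowdown at the pure regex level. The pattern definitions below are simplified approximations of the rules in pycparser's c_lexer.py, assumed for this sketch rather than copied from the source:

```python
# Approximate reconstruction of the string-literal patterns (assumed, simplified).
import re
import time

simple_escape = r"""([a-zA-Z._~!=&\^\-\\?'"])"""
decimal_escape = r"""(\d+)"""
hex_escape = r"""(x[0-9a-fA-F]+)"""
bad_escape = r"""([\\][^a-zA-Z._~^!=&\^\-\\?'"x0-7])"""

escape_sequence = r'(\\(' + simple_escape + '|' + decimal_escape + '|' + hex_escape + '))'
string_char = r'([^"\\\n]|' + escape_sequence + ')'
bad_string_literal = '"' + string_char + '*' + bad_escape + string_char + '*"'

# Every '\123' can be consumed as '\123', '\12'+'3' or '\1'+'2'+'3', so a
# failing match explores roughly 3**n combinations; each extra escape
# roughly triples the time.
for n in (8, 10, 12):
    s = '"' + r'\123' * n + '";'
    start = time.time()
    re.match(bad_string_literal, s)          # no bad escape present, so this fails slowly
    print(n, round(time.time() - start, 3))
```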

@eliben
Owner

eliben commented Nov 11, 2014

Interesting find. If you have a fix, please submit a pull request.

eliben added the bug label Apr 20, 2015
eliben added a commit that referenced this issue Oct 10, 2016
The test shows that on a simple lexer level the issue doesn't manifest. It does,
however, manifest if parsing a file.
eliben added a commit that referenced this issue Oct 10, 2016
By making the first * non-greedy, performance is ~10-15% better; it still
demonstrates pathological backtracking slowness (issue #61).
@TysonAndre
Contributor

The regex module (not re) has functionality that seems like it would help with that: atomic grouping, which prevents backtracking into a group once it has matched. I'd recommend adding a dependency on regex and using it here.

These commands now complete in milliseconds (both matches and failures) instead of minutes or longer.

>>> bad_string_literal = '"(?>'+string_char+'*)'+bad_escape+string_char+'*"'
>>> string_to_match='"0\\xE0\\xE1\\xE2\\xE3\\xE4\\xE5\\xE6\\xE7\\xE8\\xE9\\xEB\\xEC\\xEE\\xF0\\xF1\\xF2\\xF4\\xF6\\xF7\\xF8\\xF9\`\\xFA";\n\n\n\n\n\nin'
>>> regex.match(bad_string_literal, string_to_match)
<regex.Match object; span=(0, 93), match='"0\\xE0\\xE1\\xE2\\xE3\\xE4\\xE5\\xE6\\xE7\\xE8\\xE9\\xEB\\xEC\\xEE\\xF0\\xF1\\xF2\\xF4\\xF6\\xF7\\xF8\\xF9\\`\\xFA"'>
>>> string_to_match='"0\\xE0\\xE1\\xE2\\xE3\\xE4\\xE5\\xE6\\xE7\\xE8\\xE9\\xEB\\xEC\\xEE\\xF0\\xF1\\xF2\\xF4\\xF6\\xF7\\xF8\\xF9\\xFA";\n\n\n\n\n\nin'
>>> regex.match(bad_string_literal, r'"\123\123\123\123\123\123\123\123\123\123\123\123\123\123\123";')
>>> regex.match(bad_string_literal, r'"\123\123\123\123\123\123\`\123\123\123\123\123\123\123\123";')
<regex.Match object; span=(0, 60), match='"\\123\\123\\123\\123\\123\\123\\`\\123\\123\\123\\123\\123\\123\\123\\123"'>
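
For reference, the session above relies on string_char and bad_escape definitions that aren't shown. A self-contained version, using assumed, simplified approximations of those patterns rather than the exact c_lexer.py source, would look like this:

```python
# Self-contained sketch of the atomic-group idea with the third-party `regex`
# module (pip install regex). The sub-patterns are assumed approximations.
import time
import regex

simple_escape = r"""([a-zA-Z._~!=&\^\-\\?'"])"""
decimal_escape = r"""(\d+)"""
hex_escape = r"""(x[0-9a-fA-F]+)"""
bad_escape = r"""([\\][^a-zA-Z._~^!=&\^\-\\?'"x0-7])"""

escape_sequence = r'(\\(' + simple_escape + '|' + decimal_escape + '|' + hex_escape + '))'
string_char = r'([^"\\\n]|' + escape_sequence + ')'

# (?>...) is an atomic group: once string_char* has consumed the valid prefix,
# the engine is not allowed to go back and re-split it on failure.
bad_string_literal = '"(?>' + string_char + '*)' + bad_escape + string_char + '*"'

s = '"' + r'\123' * 30 + '";'                 # only valid escapes, so no match expected
start = time.time()
print(regex.match(bad_string_literal, s))     # None, returned almost instantly
print(round(time.time() - start, 4))
```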

Aside: I'm not familiar with the C standard. The regex looks like it allows exactly one invalid escape per string. Should it allow one or more invalid escapes instead (combine the valid and invalid escape patterns and repeat)?

@TysonAndre
Contributor

Also, it looks like my patch to use regex will cause the test to fail because it no longer rejects "jx\9", but the test seems to have relied on a bug in the old re implementation.

I assume that it's testing that invalid octal escapes are rejected, but the implementation in c_lexer.py is permissive and tries to allow decimal escapes (which would permit \99, \9, etc.?)

        self.assertLexerError(r'"jx\9"', ERR_STRING_ESCAPE)
    # character constants (K&R2: A.2.5.2)
    # Note: a-zA-Z and '.-~^_!=&;,' are allowed as escape chars to support #line
    # directives with Windows paths as filenames (..\..\dir\file)
    # For the same reason, decimal_escape allows all digit sequences. We want to
    # parse all correct code, even if it means to sometimes parse incorrect
    # code.
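
A small check of that reading, using assumed, simplified patterns rather than the real c_lexer.py definitions: with a permissive decimal escape of `\d+`, the input `"jx\9"` matches the valid string-literal pattern, consistent with the documented intent to accept decimal escapes permissively.

```python
# With a permissive decimal escape (\d+), "jx\9" is a well-formed string
# literal under these (assumed, simplified) patterns.
import re

simple_escape = r"""([a-zA-Z._~!=&\^\-\\?'"])"""
decimal_escape = r"""(\d+)"""
hex_escape = r"""(x[0-9a-fA-F]+)"""
escape_sequence = r'(\\(' + simple_escape + '|' + decimal_escape + '|' + hex_escape + '))'
string_char = r'([^"\\\n]|' + escape_sequence + ')'
string_literal = '"' + string_char + '*"'

print(re.fullmatch(string_literal, r'"jx\9"'))   # matches: \9 is taken as a decimal escape
```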

TysonAndre added a commit to TysonAndre/pycparser that referenced this issue Aug 19, 2019
Use `regex` instead of `re`, and use atomic grouping.
To avoid surprises, use it everywhere.

https://pypi.org/project/regex/

> This regex implementation is backwards-compatible with the standard
> ‘re’ module, but offers additional functionality.

Fixes eliben#61

This uses atomic grouping, which avoids the unnecessary backtracking.

> `(?>...)`
>
> If the following pattern subsequently fails, then the subpattern as a
> whole will fail.

Also fix a test that relied on the incorrect handling of regexes.
The implementation documentation says that it intends to allow
**decimal** escapes permissively.
TysonAndre added a commit to TysonAndre/pycparser that referenced this issue Aug 19, 2019
Install `regex` in Travis and Appveyor
@TysonAndre
Contributor

> Sorry, I'm not going to accept this. The bug isn't realistically important enough to add another dependency to pycparser. At this time pycparser has no external deps (PLY is vendored), and adding one is a big step function.

Would you accept PRs for any of these 3 options?

  1. Try to import regex. If it exists, use the new pattern with regex; if it doesn't, use the old pattern with re (a rough sketch of this follows below).

  2. Run 3 regex matches with re, and return a dummy re.Match emulating the correct behavior:

     1. for the valid string prefix,
     2. for a single invalid escape sequence,
     3. for a combination of valid and invalid sequences, followed by the closing quote.

  3. Try to fix the ambiguity with a negative lookahead.
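
A rough sketch of option 1 (a hypothetical layout, not something pycparser adopted): import regex if it is installed and build the atomic-group pattern, otherwise fall back to re and the existing pattern.

```python
# Hypothetical optional-dependency fallback; names and layout are illustrative.
try:
    import regex as _re_impl      # third-party, supports atomic groups (?>...)
    _HAVE_REGEX = True
except ImportError:
    import re as _re_impl
    _HAVE_REGEX = False

def build_bad_string_literal(string_char, bad_escape):
    """Build the bad-string-literal pattern from the given sub-patterns."""
    if _HAVE_REGEX:
        # Atomic group prevents re-splitting the valid prefix on failure.
        return _re_impl.compile('"(?>' + string_char + '*)' + bad_escape + string_char + '*"')
    # Old behavior: correct, but can backtrack pathologically on failure.
    return _re_impl.compile('"' + string_char + '*' + bad_escape + string_char + '*"')
```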


Also, the root cause is still there. Thinking about this again, I assume that \123 can be parsed as \1 then 2+3, or \12 then 3, or \123, which means there are 3 ways of parsing that escape sequence, and every single one will be tried if the regex fails somewhere (so each escape sequence makes worst-case matching 3 times longer).

> I think the problem is that the BAD_STRING_LITERAL regular expression is doing backtracking, resulting in exponential running time. Add another '\123' to the string and it will take roughly 3 times as long.

Adding a lookahead assertion that the character after \[0-9]+ is not another digit (0-9) might be a better approach to avoiding this exponential backtracking (and the same for anything else that accepts a variable number of characters). I don't know if Python has had any bugs related to lookahead that I need to know about.


https://docs.python.org/2/library/re.html#regular-expression-syntax

> `(?!...)`
>
> Matches if `...` doesn't match next. This is a negative lookahead assertion. For example, `Isaac (?!Asimov)` will match 'Isaac ' only if it's not followed by 'Asimov'.
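
A sketch of that idea under the plain re module, again with assumed, simplified patterns rather than the exact change that eventually landed in c_lexer.py: the lookaheads pin each digit or hex run to a single parse, so a failing match can no longer re-split the escapes.

```python
# Negative lookaheads force \123 (and \x12) to be matched in exactly one way.
import re
import time

simple_escape = r"""([a-zA-Z._~!=&\^\-\\?'"])"""
decimal_escape = r"""(\d+)(?!\d)"""                  # digits may not be split as \1+23 or \12+3
hex_escape = r"""(x[0-9a-fA-F]+)(?![0-9a-fA-F])"""   # same idea for hex escapes
bad_escape = r"""([\\][^a-zA-Z._~^!=&\^\-\\?'"x0-7])"""

escape_sequence = r'(\\(' + simple_escape + '|' + decimal_escape + '|' + hex_escape + '))'
string_char = r'([^"\\\n]|' + escape_sequence + ')'
bad_string_literal = '"' + string_char + '*' + bad_escape + string_char + '*"'

s = '"' + r'\123' * 30 + '";'                        # only valid escapes, so no match expected
start = time.time()
print(re.match(bad_string_literal, s))               # None, returned almost instantly
print(round(time.time() - start, 4))
```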

TysonAndre added a commit to TysonAndre/pycparser that referenced this issue Aug 21, 2019
Fixes eliben#61

This uses negative lookaheads to avoid ambiguity in how strings should be
parsed by the regex.
- https://docs.python.org/2/library/re.html#regular-expression-syntax

- Previously, if it didn't immediately succeed at parsing an escape
  sequence such as `\123`, it would have to try `\1`+`23`, `\12` + `3`,
  and `\123`, which multiplied the time taken by 3 per additional escape
  sequence. This solves that by only allowing `\123`
- The same fix was added for hex escapes.

Also fix a test that relied on the incorrect handling of regexes.
The implementation documentation says that it intends to allow
**decimal** escapes permissively.
eliben pushed a commit that referenced this issue Aug 26, 2019
* Fix slow backtracking when parsing strings (no external deps)

Fixes #61

This uses negative lookaheads to avoid ambiguity in how strings should be
parsed by the regex.
- https://docs.python.org/2/library/re.html#regular-expression-syntax

- Previously, if it didn't immediately succeed at parsing an escape
  sequence such as `\123`, it would have to try `\1`+`23`, `\12` + `3`,
  and `\123`, which multiplied the time taken by 3 per additional escape
  sequence. This solves that by only allowing `\123`
- The same fix was added for hex escapes.

Also fix a test that relied on the incorrect handling of regexes.
The implementation documentation says that it intends to allow
**decimal** escapes permissively.

* WIP debug

* Fix ambiguity caused by allowing #path directives

Solve this by allowing "\x" when not followed by hex,
in the regular string literal.

In the previous commits,
`\x12` could be parsed both as `\x`+`12` and `\x12`,
which caused exponential options for backtracking.

* Document changes to lexer, remove debug code

* Optimize this for strings