
String literals with many escapes take a long time to tokenize #61

Closed
gmlevin opened this issue Nov 11, 2014 · 4 comments · Fixed by #347

Comments

@gmlevin

gmlevin commented Nov 11, 2014

Running the following program on the following data (saved as x.c) takes a very long time to parse:

import pycparser

if __name__ == '__main__':
    ast = pycparser.parse_file("x.c")
    ast.show()

The contents of x.c:

int main(int argc, char** argv) {
    "\123\123\123\123\123\123\123\123\123\123\123\123\123\123\123";
}

I think the problem is that the BAD_STRING_LITERAL regular expression is doing backtracking, resulting in exponential running time. Add another '\123' to the string and it will take roughly 3 times as long.

If I break the BAD_STRING_LITERAL pattern by requiring it to start with some unexpected string such as 'xyzzy and then some', so that it never starts to match, the program runs quickly.
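
For illustration, here is a rough standalone reproduction of the slowdown at the pure regex level. The pattern definitions below are simplified approximations of the rules in pycparser's c_lexer.py, assumed for this sketch rather than copied from the source:

```python
# Approximate reconstruction of the string-literal patterns (assumed, simplified).
import re
import time

simple_escape = r"""([a-zA-Z._~!=&\^\-\\?'"])"""
decimal_escape = r"""(\d+)"""
hex_escape = r"""(x[0-9a-fA-F]+)"""
bad_escape = r"""([\\][^a-zA-Z._~^!=&\^\-\\?'"x0-7])"""

escape_sequence = r'(\\(' + simple_escape + '|' + decimal_escape + '|' + hex_escape + '))'
string_char = r'([^"\\\n]|' + escape_sequence + ')'
bad_string_literal = '"' + string_char + '*' + bad_escape + string_char + '*"'

# Every '\123' can be consumed as '\123', '\12'+'3' or '\1'+'2'+'3', so a
# failing match explores roughly 3**n combinations; each extra escape
# roughly triples the time.
for n in (8, 10, 12):
    s = '"' + r'\123' * n + '";'
    start = time.time()
    re.match(bad_string_literal, s)          # no bad escape present, so this fails slowly
    print(n, round(time.time() - start, 3))
```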

@eliben
Owner

eliben commented Nov 11, 2014

Interesting find. If you have a fix, please submit a pull request.

eliben added the bug label Apr 20, 2015
eliben added a commit that referenced this issue Oct 10, 2016
The test shows that on a simple lexer level the issue doesn't manifest. It does,
however, manifest if parsing a file.
eliben added a commit that referenced this issue Oct 10, 2016
By making the first * non-greedy, performance is ~10-15% better; it still
demonstrates pathological backtracking slowness (issue #61).
@TysonAndre
Contributor

The regex module (not re) has functionality that seems like it would help with that: atomic grouping, which prevents backtracking into a group once it has matched. I'd recommend adding a dependency on regex and using it here.

These commands now complete in milliseconds (both matches and failures) instead of minutes or longer.

>>> bad_string_literal = '"(?>'+string_char+'*)'+bad_escape+string_char+'*"'
>>> string_to_match='"0\\xE0\\xE1\\xE2\\xE3\\xE4\\xE5\\xE6\\xE7\\xE8\\xE9\\xEB\\xEC\\xEE\\xF0\\xF1\\xF2\\xF4\\xF6\\xF7\\xF8\\xF9\`\\xFA";\n\n\n\n\n\nin'
>>> regex.match(bad_string_literal, string_to_match)
<regex.Match object; span=(0, 93), match='"0\\xE0\\xE1\\xE2\\xE3\\xE4\\xE5\\xE6\\xE7\\xE8\\xE9\\xEB\\xEC\\xEE\\xF0\\xF1\\xF2\\xF4\\xF6\\xF7\\xF8\\xF9\\`\\xFA"'>
>>> string_to_match='"0\\xE0\\xE1\\xE2\\xE3\\xE4\\xE5\\xE6\\xE7\\xE8\\xE9\\xEB\\xEC\\xEE\\xF0\\xF1\\xF2\\xF4\\xF6\\xF7\\xF8\\xF9\\xFA";\n\n\n\n\n\nin'
>>> regex.match(bad_string_literal, r'"\123\123\123\123\123\123\123\123\123\123\123\123\123\123\123";')
>>> regex.match(bad_string_literal, r'"\123\123\123\123\123\123\`\123\123\123\123\123\123\123\123";')
<regex.Match object; span=(0, 60), match='"\\123\\123\\123\\123\\123\\123\\`\\123\\123\\123\\123\\123\\123\\123\\123"'>
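
For reference, the session above relies on string_char and bad_escape definitions that aren't shown. A self-contained version, using assumed, simplified approximations of those patterns rather than the exact c_lexer.py source, would look like this:

```python
# Self-contained sketch of the atomic-group idea with the third-party `regex`
# module (pip install regex). The sub-patterns are assumed approximations.
import time
import regex

simple_escape = r"""([a-zA-Z._~!=&\^\-\\?'"])"""
decimal_escape = r"""(\d+)"""
hex_escape = r"""(x[0-9a-fA-F]+)"""
bad_escape = r"""([\\][^a-zA-Z._~^!=&\^\-\\?'"x0-7])"""

escape_sequence = r'(\\(' + simple_escape + '|' + decimal_escape + '|' + hex_escape + '))'
string_char = r'([^"\\\n]|' + escape_sequence + ')'

# (?>...) is an atomic group: once string_char* has consumed the valid prefix,
# the engine is not allowed to go back and re-split it on failure.
bad_string_literal = '"(?>' + string_char + '*)' + bad_escape + string_char + '*"'

s = '"' + r'\123' * 30 + '";'                 # only valid escapes, so no match expected
start = time.time()
print(regex.match(bad_string_literal, s))     # None, returned almost instantly
print(round(time.time() - start, 4))
```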

Aside: I'm not familiar with the C standard. The regex looks like it allows exactly one invalid escape per string. Should it allow one or more invalid escapes instead (combine the valid and invalid escape patterns and repeat)?

@TysonAndre
Contributor

Also, it looks like my patch to use regex will cause the test to fail because it no longer rejects "jx\9", but the test seems to have relied on a bug in the old re implementation.

I assume that it's testing that invalid octal escapes are rejected, but the implementation in c_lexer.py is permissive and tries to allow decimal escapes (which would permit \99, \9, etc.?)

        self.assertLexerError(r'"jx\9"', ERR_STRING_ESCAPE)
    # character constants (K&R2: A.2.5.2)
    # Note: a-zA-Z and '.-~^_!=&;,' are allowed as escape chars to support #line
    # directives with Windows paths as filenames (..\..\dir\file)
    # For the same reason, decimal_escape allows all digit sequences. We want to
    # parse all correct code, even if it means to sometimes parse incorrect
    # code.
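
A small check of that reading, using assumed, simplified patterns rather than the real c_lexer.py definitions: with a permissive decimal escape of `\d+`, the input `"jx\9"` matches the valid string-literal pattern, consistent with the documented intent to accept decimal escapes permissively.

```python
# With a permissive decimal escape (\d+), "jx\9" is a well-formed string
# literal under these (assumed, simplified) patterns.
import re

simple_escape = r"""([a-zA-Z._~!=&\^\-\\?'"])"""
decimal_escape = r"""(\d+)"""
hex_escape = r"""(x[0-9a-fA-F]+)"""
escape_sequence = r'(\\(' + simple_escape + '|' + decimal_escape + '|' + hex_escape + '))'
string_char = r'([^"\\\n]|' + escape_sequence + ')'
string_literal = '"' + string_char + '*"'

print(re.fullmatch(string_literal, r'"jx\9"'))   # matches: \9 is taken as a decimal escape
```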

TysonAndre added a commit to TysonAndre/pycparser that referenced this issue Aug 19, 2019
Use `regex` instead of `re`, and use atomic grouping.
To avoid surprises, use it everywhere.

https://pypi.org/project/regex/

> This regex implementation is backwards-compatible with the standard
> ‘re’ module, but offers additional functionality.

Fixes eliben#61

This uses atomic grouping, which avoids the unnecessary backtracking.

> `(?>...)`
>
> If the following pattern subsequently fails, then the subpattern as a
> whole will fail.

Also fix a test that relied on the incorrect handling of regexes.
The implementation documentation says that it intends to allow
**decimal** escapes permissively.
TysonAndre added a commit to TysonAndre/pycparser that referenced this issue Aug 19, 2019
Install `regex` in Travis and Appveyor
@TysonAndre
Contributor

> Sorry, I'm not going to accept this. The bug isn't realistically important enough to add another dependency to pycparser. At this time pycparser has no external deps (PLY is vendored), and adding one is a big step function.

Would you accept PRs for any of these 3 options?

  1. Try to import regex. If it exists, use the new pattern with regex; if it doesn't, use the old pattern with re (a rough sketch of this follows below).

  2. Run 3 regex matches with re, and return a dummy re.Match emulating the correct behavior:

     1. for the valid string prefix,
     2. for a single invalid escape sequence,
     3. for a combination of valid and invalid sequences, followed by the closing quote.

  3. Try to fix the ambiguity with a negative lookahead.
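
A rough sketch of option 1 (a hypothetical layout, not something pycparser adopted): import regex if it is installed and build the atomic-group pattern, otherwise fall back to re and the existing pattern.

```python
# Hypothetical optional-dependency fallback; names and layout are illustrative.
try:
    import regex as _re_impl      # third-party, supports atomic groups (?>...)
    _HAVE_REGEX = True
except ImportError:
    import re as _re_impl
    _HAVE_REGEX = False

def build_bad_string_literal(string_char, bad_escape):
    """Build the bad-string-literal pattern from the given sub-patterns."""
    if _HAVE_REGEX:
        # Atomic group prevents re-splitting the valid prefix on failure.
        return _re_impl.compile('"(?>' + string_char + '*)' + bad_escape + string_char + '*"')
    # Old behavior: correct, but can backtrack pathologically on failure.
    return _re_impl.compile('"' + string_char + '*' + bad_escape + string_char + '*"')
```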


Also, the root cause is still there. Thinking about this again, I assume that \123 can be parsed as \1 then 2+3, or \12 then 3, or \123, which means there are 3 ways of parsing that escape sequence, and every single one will be tried if the regex fails somewhere (so each escape sequence makes worst-case matching 3 times longer).

> I think the problem is that the BAD_STRING_LITERAL regular expression is doing backtracking, resulting in exponential running time. Add another '\123' to the string and it will take roughly 3 times as long.

Adding a lookahead assertion that the character after \[0-9]+ is not another digit (0-9) might be a better approach to avoiding this exponential backtracking (and the same for anything else that accepts a variable number of characters). I don't know if Python has had any bugs related to lookahead that I need to know about.


https://docs.python.org/2/library/re.html#regular-expression-syntax

> `(?!...)`
>
> Matches if `...` doesn't match next. This is a negative lookahead assertion. For example, `Isaac (?!Asimov)` will match 'Isaac ' only if it's not followed by 'Asimov'.
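
A sketch of that idea under the plain re module, again with assumed, simplified patterns rather than the exact change that eventually landed in c_lexer.py: the lookaheads pin each digit or hex run to a single parse, so a failing match can no longer re-split the escapes.

```python
# Negative lookaheads force \123 (and \x12) to be matched in exactly one way.
import re
import time

simple_escape = r"""([a-zA-Z._~!=&\^\-\\?'"])"""
decimal_escape = r"""(\d+)(?!\d)"""                  # digits may not be split as \1+23 or \12+3
hex_escape = r"""(x[0-9a-fA-F]+)(?![0-9a-fA-F])"""   # same idea for hex escapes
bad_escape = r"""([\\][^a-zA-Z._~^!=&\^\-\\?'"x0-7])"""

escape_sequence = r'(\\(' + simple_escape + '|' + decimal_escape + '|' + hex_escape + '))'
string_char = r'([^"\\\n]|' + escape_sequence + ')'
bad_string_literal = '"' + string_char + '*' + bad_escape + string_char + '*"'

s = '"' + r'\123' * 30 + '";'                        # only valid escapes, so no match expected
start = time.time()
print(re.match(bad_string_literal, s))               # None, returned almost instantly
print(round(time.time() - start, 4))
```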

TysonAndre added a commit to TysonAndre/pycparser that referenced this issue Aug 21, 2019
Fixes eliben#61

This uses negative lookaheads to avoid ambiguity in how strings should be
parsed by the regex.
- https://docs.python.org/2/library/re.html#regular-expression-syntax

- Previously, if it didn't immediately succeed at parsing an escape
  sequence such as `\123`, it would have to try `\1`+`23`, `\12` + `3`,
  and `\123`, which multiplied the time taken by 3 per additional escape
  sequence. This solves that by only allowing `\123`
- The same fix was added for hex escapes.

Also fix a test that relied on the incorrect handling of regexes.
The implementation documentation says that it intends to allow
**decimal** escapes permissively.
eliben pushed a commit that referenced this issue Aug 26, 2019
* Fix slow backtracking when parsing strings (no external deps)

Fixes #61

This uses negative lookaheads to avoid ambiguity in how strings should be
parsed by the regex.
- https://docs.python.org/2/library/re.html#regular-expression-syntax

- Previously, if it didn't immediately succeed at parsing an escape
  sequence such as `\123`, it would have to try `\1`+`23`, `\12` + `3`,
  and `\123`, which multiplied the time taken by 3 per additional escape
  sequence. This solves that by only allowing `\123`
- The same fix was added for hex escapes.

Also fix a test that relied on the incorrect handling of regexes.
The implementation documentation says that it intends to allow
**decimal** escapes permissively.

* WIP debug

* Fix ambiguity caused by allowing #path directives

Solve this by allowing "\x" when not followed by hex,
in the regular string literal.

In the previous commits,
`\x12` could be parsed both as `\x`+`12` and `\x12`,
which caused exponential options for backtracking.

* Document changes to lexer, remove debug code

* Optimize this for strings