String literals with many escapes take a long time to tokenize #61
Interesting find. If you have a fix, please submit a pull request.
The test shows that at the pure-lexer level the issue doesn't manifest. It does, however, manifest when parsing a file.
By making the first * non-greedy, performance is ~10-15% better; it still demonstrates pathological backtracking slowness (issue #61).
These commands now complete in milliseconds (both matches and failures) instead of minutes or longer.
Aside: I'm not familiar with the C standard, but the regex looks like it allows exactly one invalid escape per string. Should it allow one or more invalid escapes instead (combine the valid and invalid escape patterns and repeat)?
I assume that it's testing that invalid octal escapes are rejected, but the implementation in c_lexer.py is permissive and tries to allow decimal escapes.
Use `regex` instead of `re`, and use atomic grouping. To avoid surprises, use it everywhere. https://pypi.org/project/regex/

> This regex implementation is backwards-compatible with the standard 're' module, but offers additional functionality.

Fixes eliben#61. This uses atomic grouping, which avoids the unnecessary backtracking.

> `(?>...)`
>
> If the following pattern subsequently fails, then the subpattern as a whole will fail.

Also fix a test that relied on the incorrect handling of regexes. The implementation documentation says that it intends to allow **decimal** escapes permissively.

Install `regex` in Travis and Appveyor.
Would you accept PRs for any of these three options:
Also, the root cause is still there. Thinking about this again, I assume that
Adding a lookahead assertion that the character after
Fixes eliben#61. This uses negative lookaheads to avoid ambiguity in how a string should be parsed by the regex.

- https://docs.python.org/2/library/re.html#regular-expression-syntax
- Previously, if it didn't immediately succeed at parsing an escape sequence such as `\123`, it would have to try `\1` + `23`, `\12` + `3`, and `\123`, which multiplied the time taken by 3 per additional escape sequence. This solves that by only allowing `\123`.
- The same fix was applied to hex escapes.

Also fix a test that relied on the incorrect handling of regexes. The implementation documentation says that it intends to allow **decimal** escapes permissively.
* Fix slow backtracking when parsing strings (no external deps)

  Fixes #61. Uses negative lookaheads to avoid ambiguity in how a string should be parsed by the regex (details in the commit message above). Also fixes a test that relied on the incorrect handling of regexes.

* WIP debug

* Fix ambiguity caused by allowing #path directives

  Solve this by allowing `\x` when not followed by hex digits in the regular string literal. In the previous commits, `\x12` could be parsed both as `\x` + `12` and as `\x12`, which caused exponential options for backtracking.

* Document changes to lexer, remove debug code

* Optimize this for strings
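The lookahead fix can be sketched on a simplified pattern modeled on the decimal-escape ambiguity described in this thread (a model of the approach, not pycparser's exact regex): requiring that a decimal escape not be followed by another digit leaves exactly one way to consume `\123`, and plain `re` suffices with no external dependency.

```python
import re

# \\\d+(?!\d): the escape must swallow every following digit, so \1 and
# \12 are no longer valid parses when another digit follows.  With the
# ambiguity gone, a failing match backs off linearly instead of
# exploring ~3^n decompositions.
unambiguous = re.compile(r'"([^"\\\n]|\\\d+(?!\d))*"')

bad = '"' + r'\123' * 1000          # unterminated string
assert unambiguous.match(bad) is None          # fails quickly
assert unambiguous.match(bad + '"') is not None
```

The same trick carries over to hex escapes by forbidding a trailing hex digit after the escape.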
Running the following program on the following data takes a very long time; the parse itself is what is slow.
I think the problem is that the BAD_STRING_LITERAL regular expression is doing backtracking, resulting in exponential running time. Add another '\123' to the string and it will take roughly 3 times as long.
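The blowup can be sketched with a pattern modeled on the lexer's decimal-escape rule (the name and exact pattern here are simplifications, not pycparser's actual code):

```python
import re
import time

# Simplified model of the problem: an escape is a backslash followed by
# one or more digits (\d+).  "\123" can then be consumed as \123,
# \12+"3", or \1+"2"+"3" -- three decompositions per escape, so a
# failing match explores roughly 3^n paths before giving up.
bad_string = re.compile(r'"([^"\\\n]|\\\d+)*"')

for n in (6, 8, 10):
    text = '"' + r'\123' * n        # unterminated: the match must fail
    t0 = time.perf_counter()
    assert bad_string.match(text) is None
    print(n, f"{time.perf_counter() - t0:.4f}s")
```

Each extra escape roughly triples the failure time, matching the observation above; the terminated string still matches instantly because the engine succeeds on its first decomposition.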
If I break the BAD_STRING_LITERAL pattern by requiring that it start with some unexpected string
'xyzzy and then some'
so that it never starts to match, the program runs quickly.
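That diagnostic can be sketched as follows, using a simplified pattern modeled on the decimal-escape ambiguity rather than pycparser's real one:

```python
import re

# Prefixing a literal that never occurs in the input makes the engine
# reject at the first mismatching character, so the exponential
# backtracking search space is never entered.
ambiguous_body = r'([^"\\\n]|\\\d+)*"'
broken_bad_string = re.compile('"xyzzy and then some' + ambiguous_body)

text = '"' + r'\123' * 1000   # would take ages against the unprefixed pattern
assert broken_bad_string.match(text) is None
```

This confirms the slowness lives in the repeated, ambiguous escape alternation and not elsewhere in the lexer.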