compiler: fix lazy DFA false quits on ASCII text #768

Merged
merged 1 commit into from May 1, 2021

Commits on May 1, 2021

  1. compiler: fix lazy DFA false quits on ASCII text

    One of the things the lazy DFA can't handle is Unicode word boundaries,
    since they require multi-byte look-around. However, it turns out that on
    pure ASCII text, Unicode word boundaries are equivalent to ASCII word
    boundaries. So the DFA has a heuristic: it treats Unicode word
    boundaries as ASCII boundaries until it sees a non-ASCII byte. When it
    does, it quits, and some other (slower) regex engine needs to take over.
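
The heuristic described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the lazy DFA itself: `Scan`, `is_ascii_word_byte` and `count_word_boundaries` are hypothetical names, and the scan simply counts ASCII word boundaries until it meets a non-ASCII byte, at which point it gives up and reports where.

```rust
/// Outcome of a hypothetical DFA-style scan (names are illustrative only).
#[derive(Debug, PartialEq)]
enum Scan {
    /// The whole haystack was ASCII; holds the number of word boundaries.
    Done(usize),
    /// A non-ASCII byte was seen at this offset; another engine must take over.
    Quit(usize),
}

/// ASCII definition of a word byte: [0-9A-Za-z_].
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Treat Unicode word boundaries as ASCII word boundaries for as long as the
/// input stays ASCII, and quit on the first non-ASCII byte.
fn count_word_boundaries(haystack: &[u8]) -> Scan {
    let mut count = 0;
    let mut prev_is_word = false;
    for (i, &b) in haystack.iter().enumerate() {
        if !b.is_ascii() {
            // The ASCII shortcut is no longer valid from here on.
            return Scan::Quit(i);
        }
        let cur_is_word = is_ascii_word_byte(b);
        if cur_is_word != prev_is_word {
            count += 1;
        }
        prev_is_word = cur_is_word;
    }
    // A trailing word byte is followed by a boundary at end of input.
    if prev_is_word {
        count += 1;
    }
    Scan::Done(count)
}
```

In this sketch a pure ASCII haystack such as `b"a{b}"` always runs to completion, which is the behavior the real DFA was expected to have.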
    
    In a bug report against ripgrep[1], it was discovered that the lazy DFA
    was quitting and falling back to a slower engine even though the
    haystack was pure ASCII.
    
    It turned out that our equivalence byte class optimization was at fault.
    Namely, a '{' (which appears very frequently in the input) was being
    grouped in with non-ASCII bytes. So whenever the DFA saw it, it
    treated it as a non-ASCII byte and thus quit.
    
    The fix for this is simple: when we see a Unicode word boundary in the
    compiler, we set a boundary on our byte classes such that ASCII bytes
    are guaranteed to be in a different class from non-ASCII bytes. And
    indeed, this fixes the performance problem reported in [1].
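
To make the failure and the fix concrete, here is a simplified sketch of a byte class set. It is an assumption-laden model, not the code from this commit: `ByteClassSet`, `set_range`, `byte_classes` and `build` are written here for illustration, and the only pattern range assumed is `a-z`. Since '{' (0x7B) immediately follows 'z' (0x7A), it falls into the same class as every byte up to 0xFF unless a boundary is placed at 0x7F.

```rust
/// Simplified model of an equivalence byte class set (illustrative only).
/// `boundaries[b]` is true when a class ends at byte `b`.
struct ByteClassSet {
    boundaries: [bool; 256],
}

impl ByteClassSet {
    fn new() -> Self {
        ByteClassSet { boundaries: [false; 256] }
    }

    /// Ensure bytes in `start..=end` are never in the same class as bytes
    /// outside that range.
    fn set_range(&mut self, start: u8, end: u8) {
        if start > 0 {
            self.boundaries[start as usize - 1] = true;
        }
        self.boundaries[end as usize] = true;
    }

    /// Map every byte to its class number.
    fn byte_classes(&self) -> [u8; 256] {
        let mut classes = [0u8; 256];
        let mut class = 0u8;
        for b in 0..256usize {
            classes[b] = class;
            if self.boundaries[b] && b < 255 {
                class += 1;
            }
        }
        classes
    }
}

/// Build a class map from the given inclusive byte ranges.
fn build(ranges: &[(u8, u8)]) -> [u8; 256] {
    let mut set = ByteClassSet::new();
    for &(start, end) in ranges {
        set.set_range(start, end);
    }
    set.byte_classes()
}
```

With only `(b'a', b'z')` registered, `build` puts '{' and 0xFF in the same class, which mirrors the false quit. Adding `(0x00, 0x7F)` as an extra range separates every ASCII byte from every non-ASCII byte, at the cost of at most one additional class.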
    
    [1] - BurntSushi/ripgrep#1860
    BurntSushi committed May 1, 2021
    dd7e6ac