Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected behavior of ungreedy ?? operator #862

Closed
vthib opened this issue May 20, 2022 · 6 comments · Fixed by #863
Closed

Unexpected behavior of ungreedy ?? operator #862

vthib opened this issue May 20, 2022 · 6 comments · Fixed by #863
Labels

Comments

@vthib
Copy link

vthib commented May 20, 2022

What version of regex are you using?

1.5.5

Describe the bug at a high level.

Running the regex ab?? on the input ab returns the match ab instead of a.

What are the steps to reproduce the behavior?

fn main() {
    let rx = regex::Regex::new("ab??").unwrap();
    let input = "ab";
    let mat = rx.find(input).unwrap();
    println!("match: {}", &input[mat.range()]);
}

What is the actual behavior?

This program returns: ab

What is the expected behavior?

I expect the output to be a, since ?? is non greedy, it should favor not matching the second letter in the input.

All other implementations i could find matches on a only.

@BurntSushi BurntSushi added the bug label May 20, 2022
@BurntSushi
Copy link
Member

Wow. Yes, this is a bug. I am amazed that this isn't captured by the test suite. Non-greediness itself works (both ab*? and ab+? work as expected), but it looks like ab?? doesn't specifically:

fn main() {
    let rx = regex::Regex::new("ab??").unwrap();
    let input = "abb";
    let mat = rx.find(input).unwrap();
    println!("match: {}", &input[mat.range()]);
    
    let rx = regex::Regex::new("ab+?").unwrap();
    let input = "abb";
    let mat = rx.find(input).unwrap();
    println!("match: {}", &input[mat.range()]);
    
    let rx = regex::Regex::new("ab*?").unwrap();
    let input = "abb";
    let mat = rx.find(input).unwrap();
    println!("match: {}", &input[mat.range()]);
}

Outputs:

match: ab
match: ab
match: a

Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=da2560ea6ce7655338ef34101cf5eab3

@BurntSushi
Copy link
Member

Gah. This is another literal optimization bug. They are going to be the death of me. The regex \wab?? correctly reports [0, 2) as a match against Zab. That \w defeats literal optimizations and just runs the regex engine.

@BurntSushi
Copy link
Member

This actually a regression that was introduced in regex 1.1.8 in f43beb3. regex 1.1.7 handles this case correctly.

@BurntSushi
Copy link
Member

OK, so I at least understand the root cause. The root issue seems to be the order of literals extracted from the regex. For example:

$ regex-debug prefixes -a 'ab??'
Complete(ab)
Complete(a)

Since both literals are marked as "complete" (which means that if a match is found we can report a match for the overall regex and avoid using the regex engine to confirm the match), which is correct in this case, the regex crate will divert to using Aho-Corasick for this. It correctly uses "leftmost first" match semantics, which matches how the regex crate does things. But since ab comes before a, it means that a match of ab will take priority over a. That's correct for the regex ab? but wrong for ab??.

So why does ab*? and ab+? work as expected? Well, let's take a look at the prefix literals extracted for them:

$ regex-debug prefixes -a 'ab*?'
Cut(ab)
Complete(a)

$ regex-debug prefixes -a 'ab+?'
Cut(ab)

The latter is correct. The former generates literals in the wrong order, but since once of them is Cut (meaning we need the regex engine to confirm the match), we can't just divert to Aho-Corasick complete. Aho-Corasick will still prefer ab matches, but then it will ask the regex engine to confirm the match which will correctly report a, thus masking the bug.

Interestingly, despite the regex ab{0,1}? being equivalent to ab??, it does not exhibit this bug and correctly matches a in ab. The key is that the prefixes extracted from it actually appear to be different than what is extracted from ab??:

$ regex-debug prefixes -a 'ab{0,1}?'
Cut(ab)
Complete(a)

This is actually because in the code, the literal extraction isn't perfect and we treat bounded repetitions slightly differently. In this case, any bounded repetition e{minimum,maximum} with minimum == 0 is interpreted as e*:

fn repeat_range_literals<F: FnMut(&Hir, &mut Literals)>(
    e: &Hir,
    min: u32,
    max: Option<u32>,
    greedy: bool,
    lits: &mut Literals,
    mut f: F,
) {
    if min == 0 {
        // This is a bit conservative. If `max` is set, then we could
        // treat this as a finite set of alternations. For now, we
        // just treat it as `e*`.
        f(
            &Hir::repetition(hir::Repetition {
                kind: hir::RepetitionKind::ZeroOrMore,
                greedy: greedy,
                hir: Box::new(e.clone()),
            }),
            lits,
        );
    } else {

Since literal extraction is on my list of things to rewrite (the current implementation is quite bad and inscrutable), I'm going to just make an easy fix: treat e?? as e{0,n}, which in turn means we treat e?? as e*? or just e*. This will result in a list of literals where at least one is cut, and thus inhibit the Aho-Corasick optimization (sadly).

BurntSushi added a commit that referenced this issue May 20, 2022
Previously, 'ab??' returned [Complete(ab), Complete(a)], but the order
matters here because of greediness. The correct result is [Complete(a),
Complete(ab)].

Instead of trying to actually fix literal extraction (which is a mess),
we just rewrite 'ab?' (and 'ab??') as 'ab*'. 'ab*' still produces
literals in the incorrect order, i.e., [Cut(ab), Complete(a)], but since
one is cut we are guaranteed that the regex engine will be called to
confirm the match. In so doing, it will correctly report 'a' as a match
for 'ab??' in 'ab'.

Fixes #862
BurntSushi added a commit that referenced this issue May 20, 2022
Previously, 'ab??' returned [Complete(ab), Complete(a)], but the order
matters here because of greediness. The correct result is [Complete(a),
Complete(ab)].

Instead of trying to actually fix literal extraction (which is a mess),
we just rewrite 'ab?' (and 'ab??') as 'ab*'. 'ab*' still produces
literals in the incorrect order, i.e., [Cut(ab), Complete(a)], but since
one is cut we are guaranteed that the regex engine will be called to
confirm the match. In so doing, it will correctly report 'a' as a match
for 'ab??' in 'ab'.

Fixes #862
BurntSushi added a commit that referenced this issue May 20, 2022
Previously, 'ab??' returned [Complete(ab), Complete(a)], but the order
matters here because of greediness. The correct result is [Complete(a),
Complete(ab)].

Instead of trying to actually fix literal extraction (which is a mess),
we just rewrite 'ab?' (and 'ab??') as 'ab*'. 'ab*' still produces
literals in the incorrect order, i.e., [Cut(ab), Complete(a)], but since
one is cut we are guaranteed that the regex engine will be called to
confirm the match. In so doing, it will correctly report 'a' as a match
for 'ab??' in 'ab'.

Fixes #862
@BurntSushi
Copy link
Member

This bug is fixed in regex 1.5.6 and regex-syntax 0.6.26.

@vthib
Copy link
Author

vthib commented May 20, 2022

That was fast! Thanks for the fix and the very detailed explanation!

otc-zuul bot pushed a commit to opentelekomcloud-infra/cloudmon-plugin-smtp that referenced this issue Jun 7, 2022
Bump regex from 1.5.4 to 1.5.6

Bumps regex from 1.5.4 to 1.5.6.

Changelog
Sourced from regex's changelog.

1.5.6 (2022-05-20)
This release includes a few bug fixes, including a bug that produced incorrect
matches when a non-greedy ? operator was used.

[BUG #680](rust-lang/regex#680):
Fixes a bug where [[:alnum:][:^ascii:]] dropped [:alnum:] from the class.
[BUG #859](rust-lang/regex#859):
Fixes a bug where Hir::is_match_empty returned false for \b.
[BUG #862](rust-lang/regex#862):
Fixes a bug where 'ab??' matches 'ab' instead of 'a' in 'ab'.

1.5.5 (2022-03-08)
This releases fixes a security bug in the regex compiler. This bug permits a
vector for a denial-of-service attack in cases where the regex being compiled
is untrusted. There are no known problems where the regex is itself trusted,
including in cases of untrusted haystacks.

SECURITY #GHSA-m5pq-gvj9-9vr8:
Fixes a bug in the regex compiler where empty sub-expressions subverted the
existing mitigations in place to enforce a size limit on compiled regexes.
The Rust Security Response WG published an advisory about this:
https://groups.google.com/g/rustlang-security-announcements/c/NcNNL1Jq7Yw




Commits

9aef5b1 1.5.6
2931b07 syntax: bump minimum regex-syntax version to 0.6.26
b41bde0 regex-syntax-0.6.26
d98da65 changelog: 1.5.6
1c19619 syntax: fix literal extraction for 'ab??'
88a2a62 syntax: fix 'is_match_empty' predicate
72f09f1 syntax: fix ascii class union bug
b537286 doc: fix some typos
258bdf7 changelog: 1.5.5
d130381 1.5.5
Additional commits viewable in compare view




Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR
@dependabot recreate will recreate this PR, overwriting any edits that have been made to it
@dependabot merge will merge this PR after your CI passes on it
@dependabot squash and merge will squash and merge this PR after your CI passes on it
@dependabot cancel merge will cancel a previously requested merge and block automerging
@dependabot reopen will reopen this PR if it is closed
@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
@dependabot use these labels will set the current labels as the default for future PRs for this repo and language
@dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
@dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
@dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the Security Alerts page.

Reviewed-by: Artem Goncharov <Artem.goncharov@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants