Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect match behavior on $ #557

Closed
davisjam opened this issue Jan 31, 2019 · 5 comments
Closed

Incorrect match behavior on $ #557

davisjam opened this issue Jan 31, 2019 · 5 comments
Labels

Comments

@davisjam
Copy link
Contributor

davisjam commented Jan 31, 2019

Rust version:

(15:14:21) jamie@woody $ rustc --version
rustc 1.32.0-nightly (00e03ee57 2018-11-22)

(15:15:24) jamie@woody $ cargo --version
cargo 1.32.0-nightly (b3d0b2e54 2018-11-15)

This might be a duplicate of some other already-open bugs, but it's hard to guess at root causes from symptoms.

Consider the regex /(aa$)?/ and the input "aaz", using a partial match.

The docs say that $ should match $ the end of text (or end-of-line with multi-line mode).

The final ? means that the regex engine can partial-match any string --- either strings that end in "aa" (capturing "aa") or any string (capturing nothing).

Since the input "aaz" does not end in "aa":

  • I expect this regex to match and capture nothing.
  • The regex actually matches and captures "aa".

Here's the kernel of my test program:

  match Regex::new(&query.pattern) {
    Ok(re) => {
      queryResult.validPattern = true;

      for i in 0..query.inputs.len() {
        let input = query.inputs.get(i).unwrap();
        eprintln!("Input: {}", input);

        let mut matched = false;
        let mut matchedString = "".to_string();
        let mut captureGroups: Vec<String> = Vec::new();

        // Partial-match semantics
        match re.captures(&input) {
          Some(caps) => {
            matched = true;

            matchedString = caps.get(0).unwrap().as_str().to_string();
            captureGroups = Vec::new();
            for i in 1..caps.len() {
              match caps.get(i) {
                Some(m) => {
                  captureGroups.push(m.as_str().to_string());
                },
                None => {
                  captureGroups.push("".to_string()); // Interpret unused capture group as ""
                }
              }
            }
          },
          None => {
            matched = false;
          }
        }

        let mr: MatchResult = MatchResult{
          input: input.to_string(),
          matched: matched,
          matchContents: MatchContents{
            matchedString: matchedString,
            captureGroups: captureGroups,
          },
        };

        queryResult.results.push(mr);
      }
    },
    Err(error) => {
      // Could not build.
      queryResult.validPattern = false;
    }
  };

This is the behavior on the regex and input described above:

{"pattern": "(aa$)?", "inputs": ["aaz"]}

The pattern is: (aa$)?
Input: aaz
{
  "pattern": "(aa$)?",
  "inputs": [
    "aaz"
  ],
  "validPattern": true,
  "results": [
    {
      "input": "aaz",
      "matched": true,
      "matchContents": {
        "matchedString": "aa",
        "captureGroups": [
          "aa"
        ]
      }
    }
  ]
}

In this case, Rust is unique among the 8 languages I tried. Perl, PHP, Java, Ruby, Go, JavaScript (Node-V8), and Python all match with an empty/null capture.

@BurntSushi
Copy link
Member

I believe your analysis on what's expected is correct. Here's a much smaller reproduction: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=30dfc1d0a4d9c158c2dfe55fed32331b

The first and third results are correct, but the second one is not. I do believe there is a duplicate bug for this, but I'd want to confirm the root cause first. It could be a while before this gets fixed.

@BurntSushi BurntSushi added the bug label Jan 31, 2019
@davisjam
Copy link
Contributor Author

To your test cases I would add:

   let re = Regex::new(r"(aa$)").unwrap();
   println!("{:?}", re.captures("aaz"));

which does not match. So the presence of the trailing ? appears to be important.

@hikotq
Copy link
Contributor

hikotq commented Feb 13, 2019

Hi.
I have investigated about this issue.
The problem seems to come from the following code:

regex/src/exec.rs

Lines 873 to 887 in 60d087a

fn captures_nfa_with_match(
&self,
slots: &mut [Slot],
text: &[u8],
match_start: usize,
match_end: usize,
) -> Option<(usize, usize)> {
// We can't use match_end directly, because we may need to examine one
// "character" after the end of a match for lookahead operators. We
// need to move two characters beyond the end, since some look-around
// operations may falsely assume a premature end of text otherwise.
let e = cmp::min(
next_utf8(text, next_utf8(text, match_end)), text.len());
self.captures_nfa(slots, &text[..e], match_start)
}

which in exec.rs

As an example, Execute the following sample code:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=1fef3de82985c493655ba57ffa6a7ba0

In case of example, capture of the submatch is performed for the range of the matching result after performing matching in the DFA.
In this case submatches are captured by NFA after matching by DFA. The match result by DFA matches the null character so submatch capture range is (s, e) = (0, 0).

regex/src/exec.rs

Lines 563 to 575 in 60d087a

MatchType::Dfa => {
if self.ro.nfa.is_anchored_start {
self.captures_nfa(slots, text, start)
} else {
match self.find_dfa_forward(text, start) {
dfa::Result::Match((s, e)) => {
self.captures_nfa_with_match(slots, text, s, e)
}
dfa::Result::NoMatch(_) => None,
dfa::Result::Quit => self.captures_nfa(slots, text, start),
}
}
}

captures_nfa_with_match seems to be a function to get the start position and end position of the matching result and to get a submatch within that range, but there are cases that the text two characters added to mached text specified as the submatch capture range.
In the code shown in the example, submatch capture range is (0, 0 + 2) = (0, 2) range, so "aa" is the submatch capture target.

Matching against $ is done when capture the submatches, but this process is to judge whether it is the end of the character string passed to captures_nfa in the current code. And the position at the time of finishing reading "aa" (at.pos() == 2) and the length of aa (self.len() == 2) are compared and matched. This makes /aa$/ match for "aa". But the condition of a match by EndText should be when at.pos() == 3 in case of example.

EndText => at.pos() == self.len(),

Also, #334 has been reported as an issue related to the relevant part. This issue is already closed, but it can be confirmed that even if the regular expression is slightly changed like this:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=fc92a5a10c24a89c300c974ae9296373
the problem is reproduced even now.

This seems to be the same cause as the issue of this issue.

To solve this problem, I feel that it is necessary to use the length of the original text when processing EndLine and EndText

hikotq pushed a commit to hikotq/regex that referenced this issue Feb 20, 2019
hikotq pushed a commit to hikotq/regex that referenced this issue Mar 10, 2019
When performing "EndText" matching, it is necessary to check whether
the current position matches the input text length.
However, when capturing a submatch using the matching result of DFA, "EndText" matching
wasn't actually performed correctly because the input character string sliced.

This patch resolve this problem by specifying the end position of the capture target
match by the argument "end", not using slice when performing capture with the matching result of DFA.

Fixes rust-lang#557
hikotq pushed a commit to hikotq/regex that referenced this issue Mar 10, 2019
When performing "EndText" matching, it is necessary to check whether
the current position matches the input text length.
However, when capturing a submatch using the matching result of DFA, "EndText" matching
wasn't actually performed correctly because the input character string sliced.

This patch resolve this problem by specifying the match end position
by the argument "end", not using slice when performing capture with the matching result of DFA.

Fixes rust-lang#557
hikotq pushed a commit to hikotq/regex that referenced this issue Mar 10, 2019
When performing "EndText" matching, it is necessary to check whether
the current position matches the input text length.
However, when capturing a submatch using the matching result of DFA, "EndText" matching
wasn't actually performed correctly because the input text sliced.

By applying this patch we specify the match end position by the argument "end",
not using slice when performing capture with the matching result of DFA.

Fixes rust-lang#557
hikotq pushed a commit to hikotq/regex that referenced this issue Mar 10, 2019
When performing "EndText" matching, it is necessary to check whether
the current position matches the input text length.However, when
capturing a submatch using the matching result of DFA, "EndText" matching
wasn't actually performed correctly because the input text is sliced.

By applying this patch we specify the match end position by the argument "end",
not using slice when performing capture with the matching result of DFA.

Fixes rust-lang#557
BurntSushi pushed a commit that referenced this issue Mar 30, 2019
When performing "EndText" matching, it is necessary to check whether
the current position matches the input text length. However, when
capturing a submatch using the matching result of DFA, "EndText"
matching wasn't actually performed correctly because the input text is
sliced.

By applying this patch we specify the match end position by the
argument "end", not using slice when performing capture with the
matching result of DFA.

Fixes #557, Closes #561
@BurntSushi
Copy link
Member

@Pipopa Thanks so much for your investigation into this issue and subsequent fix. I've merged it in #567. :-)

@hikotq
Copy link
Contributor

hikotq commented Mar 30, 2019

@BurntSushi Thank you for reviewing and merging!

BurntSushi added a commit that referenced this issue Apr 16, 2019
This fixes a bug introduced by a bug fix for #557. In particular, the
termination condition wasn't exactly right, and this appears to have
slipped through the test suite. This probably reveals a hole in our test
suite, which is specifically the testing of Unicode regexes with
bytes::Regex on invalid UTF-8.

This bug was originally reported against ripgrep:
BurntSushi/ripgrep#1247
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants