Incorrect match behavior on $ #557

davisjam · 2019-01-31T20:17:17Z

Rust version:

(15:14:21) jamie@woody $ rustc --version
rustc 1.32.0-nightly (00e03ee57 2018-11-22)

(15:15:24) jamie@woody $ cargo --version
cargo 1.32.0-nightly (b3d0b2e54 2018-11-15)

This might be a duplicate of some other already-open bugs, but it's hard to guess at root causes from symptoms.

Consider the regex /(aa$)?/ and the input "aaz", using a partial match.

The docs say that $ should match $ the end of text (or end-of-line with multi-line mode).

The final ? means that the regex engine can partial-match any string --- either strings that end in "aa" (capturing "aa") or any string (capturing nothing).

Since the input "aaz" does not end in "aa":

I expect this regex to match and capture nothing.
The regex actually matches and captures "aa".

Here's the kernel of my test program:

  match Regex::new(&query.pattern) {
    Ok(re) => {
      queryResult.validPattern = true;

      for i in 0..query.inputs.len() {
        let input = query.inputs.get(i).unwrap();
        eprintln!("Input: {}", input);

        let mut matched = false;
        let mut matchedString = "".to_string();
        let mut captureGroups: Vec<String> = Vec::new();

        // Partial-match semantics
        match re.captures(&input) {
          Some(caps) => {
            matched = true;

            matchedString = caps.get(0).unwrap().as_str().to_string();
            captureGroups = Vec::new();
            for i in 1..caps.len() {
              match caps.get(i) {
                Some(m) => {
                  captureGroups.push(m.as_str().to_string());
                },
                None => {
                  captureGroups.push("".to_string()); // Interpret unused capture group as ""
                }
              }
            }
          },
          None => {
            matched = false;
          }
        }

        let mr: MatchResult = MatchResult{
          input: input.to_string(),
          matched: matched,
          matchContents: MatchContents{
            matchedString: matchedString,
            captureGroups: captureGroups,
          },
        };

        queryResult.results.push(mr);
      }
    },
    Err(error) => {
      // Could not build.
      queryResult.validPattern = false;
    }
  };

This is the behavior on the regex and input described above:

{"pattern": "(aa$)?", "inputs": ["aaz"]}

The pattern is: (aa$)?
Input: aaz
{
  "pattern": "(aa$)?",
  "inputs": [
    "aaz"
  ],
  "validPattern": true,
  "results": [
    {
      "input": "aaz",
      "matched": true,
      "matchContents": {
        "matchedString": "aa",
        "captureGroups": [
          "aa"
        ]
      }
    }
  ]
}

In this case, Rust is unique among the 8 languages I tried. Perl, PHP, Java, Ruby, Go, JavaScript (Node-V8), and Python all match with an empty/null capture.

The text was updated successfully, but these errors were encountered:

BurntSushi · 2019-01-31T21:02:13Z

I believe your analysis on what's expected is correct. Here's a much smaller reproduction: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=30dfc1d0a4d9c158c2dfe55fed32331b

The first and third results are correct, but the second one is not. I do believe there is a duplicate bug for this, but I'd want to confirm the root cause first. It could be a while before this gets fixed.

davisjam · 2019-01-31T21:04:18Z

To your test cases I would add:

   let re = Regex::new(r"(aa$)").unwrap();
   println!("{:?}", re.captures("aaz"));

which does not match. So the presence of the trailing ? appears to be important.

hikotq · 2019-02-13T15:37:59Z

Hi.
I have investigated about this issue.
The problem seems to come from the following code:

regex/src/exec.rs

Lines 873 to 887 in 60d087a

    
           fn captures_nfa_with_match( 
        
               &self, 
        
               slots: &mut [Slot], 
        
               text: &[u8], 
        
               match_start: usize, 
        
               match_end: usize, 
        
           ) -> Option<(usize, usize)> { 
        
               // We can't use match_end directly, because we may need to examine one 
        
               // "character" after the end of a match for lookahead operators. We 
        
               // need to move two characters beyond the end, since some look-around 
        
               // operations may falsely assume a premature end of text otherwise. 
        
               let e = cmp::min( 
        
                   next_utf8(text, next_utf8(text, match_end)), text.len()); 
        
               self.captures_nfa(slots, &text[..e], match_start) 
        
           }

which in exec.rs

As an example, Execute the following sample code:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=1fef3de82985c493655ba57ffa6a7ba0

In case of example, capture of the submatch is performed for the range of the matching result after performing matching in the DFA.
In this case submatches are captured by NFA after matching by DFA. The match result by DFA matches the null character so submatch capture range is (s, e) = (0, 0).

regex/src/exec.rs

Lines 563 to 575 in 60d087a

    
           MatchType::Dfa => { 
        
               if self.ro.nfa.is_anchored_start { 
        
                   self.captures_nfa(slots, text, start) 
        
               } else { 
        
                   match self.find_dfa_forward(text, start) { 
        
                       dfa::Result::Match((s, e)) => { 
        
                           self.captures_nfa_with_match(slots, text, s, e) 
        
                       } 
        
                       dfa::Result::NoMatch(_) => None, 
        
                       dfa::Result::Quit => self.captures_nfa(slots, text, start), 
        
                   } 
        
               } 
        
           }

captures_nfa_with_match seems to be a function to get the start position and end position of the matching result and to get a submatch within that range, but there are cases that the text two characters added to mached text specified as the submatch capture range.
In the code shown in the example, submatch capture range is (0, 0 + 2) = (0, 2) range, so "aa" is the submatch capture target.

Matching against $ is done when capture the submatches, but this process is to judge whether it is the end of the character string passed to captures_nfa in the current code. And the position at the time of finishing reading "aa" (at.pos() == 2) and the length of aa (self.len() == 2) are compared and matched. This makes /aa$/ match for "aa". But the condition of a match by EndText should be when at.pos() == 3 in case of example.

regex/src/input.rs

Line 188 in 60d087a

EndText => at.pos() == self.len(),

Also, #334 has been reported as an issue related to the relevant part. This issue is already closed, but it can be confirmed that even if the regular expression is slightly changed like this:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=fc92a5a10c24a89c300c974ae9296373
the problem is reproduced even now.

This seems to be the same cause as the issue of this issue.

To solve this problem, I feel that it is necessary to use the length of the original text when processing EndLine and EndText

Fixes rust-lang#557

When performing "EndText" matching, it is necessary to check whether the current position matches the input text length. However, when capturing a submatch using the matching result of DFA, "EndText" matching wasn't actually performed correctly because the input character string sliced. This patch resolve this problem by specifying the end position of the capture target match by the argument "end", not using slice when performing capture with the matching result of DFA. Fixes rust-lang#557

When performing "EndText" matching, it is necessary to check whether the current position matches the input text length. However, when capturing a submatch using the matching result of DFA, "EndText" matching wasn't actually performed correctly because the input character string sliced. This patch resolve this problem by specifying the match end position by the argument "end", not using slice when performing capture with the matching result of DFA. Fixes rust-lang#557

When performing "EndText" matching, it is necessary to check whether the current position matches the input text length. However, when capturing a submatch using the matching result of DFA, "EndText" matching wasn't actually performed correctly because the input text sliced. By applying this patch we specify the match end position by the argument "end", not using slice when performing capture with the matching result of DFA. Fixes rust-lang#557

When performing "EndText" matching, it is necessary to check whether the current position matches the input text length.However, when capturing a submatch using the matching result of DFA, "EndText" matching wasn't actually performed correctly because the input text is sliced. By applying this patch we specify the match end position by the argument "end", not using slice when performing capture with the matching result of DFA. Fixes rust-lang#557

When performing "EndText" matching, it is necessary to check whether the current position matches the input text length. However, when capturing a submatch using the matching result of DFA, "EndText" matching wasn't actually performed correctly because the input text is sliced. By applying this patch we specify the match end position by the argument "end", not using slice when performing capture with the matching result of DFA. Fixes #557, Closes #561

BurntSushi · 2019-03-30T14:50:38Z

@Pipopa Thanks so much for your investigation into this issue and subsequent fix. I've merged it in #567. :-)

hikotq · 2019-03-30T15:26:21Z

@BurntSushi Thank you for reviewing and merging!

This fixes a bug introduced by a bug fix for #557. In particular, the termination condition wasn't exactly right, and this appears to have slipped through the test suite. This probably reveals a hole in our test suite, which is specifically the testing of Unicode regexes with bytes::Regex on invalid UTF-8. This bug was originally reported against ripgrep: BurntSushi/ripgrep#1247

BurntSushi added the bug label Jan 31, 2019

hikotq pushed a commit to hikotq/regex that referenced this issue Feb 20, 2019

Fix a bug in capture with match

25eece1

Fixes rust-lang#557

hikotq mentioned this issue Mar 10, 2019

Fix a bug in capture with match #561

Closed

BurntSushi closed this as completed in 3563d73 Mar 30, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect match behavior on $ #557

Incorrect match behavior on $ #557

davisjam commented Jan 31, 2019 •

edited

BurntSushi commented Jan 31, 2019

davisjam commented Jan 31, 2019

hikotq commented Feb 13, 2019 •

edited

BurntSushi commented Mar 30, 2019

hikotq commented Mar 30, 2019

Incorrect match behavior on $ #557

Incorrect match behavior on $ #557

Comments

davisjam commented Jan 31, 2019 • edited

BurntSushi commented Jan 31, 2019

davisjam commented Jan 31, 2019

hikotq commented Feb 13, 2019 • edited

BurntSushi commented Mar 30, 2019

hikotq commented Mar 30, 2019

davisjam commented Jan 31, 2019 •

edited

hikotq commented Feb 13, 2019 •

edited