Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Regexp needles to String#index and String#rindex #2362

Open
lopopolo opened this issue Jan 28, 2023 · 7 comments
Open

Add support for Regexp needles to String#index and String#rindex #2362

lopopolo opened this issue Jan 28, 2023 · 7 comments
Labels
A-ruby-core Area: Ruby Core types. A-spec Area: ruby/spec infrastructure and completeness. C-enhancement Category: New feature or request. E-easy Call for participation: Experience needed to fix: Easy / not much.

Comments

@lopopolo
Copy link
Member

String#index and String#rindex both support being called with a Regexp pattern, like this:

'foo'.index(/o./) # => 1
'foo'.index(/.o/) # => 0

'foo'.index(/o./, -2) # => 1
'foo'.index(/.o/, -2) # => 1

'foo'.rindex(/f/) # => 0
'foo'.rindex(/o/) # => 2
'foo'.rindex(/oo/) # => 1
'foo'.rindex(/ooo/) # => nil

Don't worry about semantics of negative lookbehind and negative lookahead; Artichoke's Regexp engine does not support these features.

The implicated code is here:

#[cfg(feature = "core-regexp")]
if let Ok(_pattern) = unsafe { Regexp::unbox_from_value(&mut needle, interp) } {
return Err(NotImplementedError::from("String#index with Regexp pattern").into());
}

#[cfg(feature = "core-regexp")]
if let Ok(_pattern) = unsafe { Regexp::unbox_from_value(&mut needle, interp) } {
return Err(NotImplementedError::from("String#index with Regexp pattern").into());
}

See also:

@lopopolo lopopolo added C-enhancement Category: New feature or request. A-ruby-core Area: Ruby Core types. A-spec Area: ruby/spec infrastructure and completeness. E-easy Call for participation: Experience needed to fix: Easy / not much. labels Jan 28, 2023
@edmondop
Copy link

Is this still useful/actual?

@lopopolo
Copy link
Member Author

yup these cases are still hardcoded to NotImplementedError on trunk. If you'd like to take a stab at this, go for it.

@edmondop
Copy link

edmondop commented Sep 10, 2023

I looked into https://github.com/artichoke/artichoke/blob/trunk/artichoke-backend/src/extn/core/regexp/backend/mod.rs#L83 and it appears that the only functions that expose positions are for capture groups and not for matches. I just wanted to confirm the right approach is to extend the trait, its implementations and the wrapper to expose similar functions for matches and not capture groups before moving forward

LMK what you think

@lopopolo
Copy link
Member Author

lopopolo commented Oct 4, 2023

dumping some raw notes here from DM with @edmondop:

all capture group APIs have to go through MatchData. For example:

#[cfg(feature = "core-regexp")]
if let Ok(regexp) = unsafe { Regexp::unbox_from_value(&mut first, interp) } {
let match_data = regexp.match_(interp, Some(s.as_slice()), None, None)?;
if match_data.is_nil() {
return Ok(Value::nil());
}
return matchdata::trampoline::element_reference(interp, match_data, second, None);
}

@edmondop
Copy link

edmondop commented Oct 8, 2023

Not sure I fully understod, can you provide some additional context?

In my original question I was trying to explain that it seems to me that it is not possible to implement this feature by using underlying methods that already exists in RegexpType. The more promising I have found is pos

    fn pos(&self, haystack: &[u8], at: usize) -> Result<Option<(usize, usize)>, Error> {
        let haystack = str::from_utf8(haystack).map_err(|_| {
            ArgumentError::with_message("regex crate utf8 backend for Regexp only supports UTF-8 haystacks")
        })?;
        let pos = self
            .regex
            .captures(haystack)
            .and_then(|captures| captures.get(at))
            .map(|match_pos| (match_pos.start(), match_pos.end()));
        Ok(pos)
    }

but that provides start and ending position of the capture group, not of the full match. So I was seeking confirmation that somehow we need to create a method that returns either Matches by position, or a collection/iterator of matches. Maybe what I should have understood from your comment is that in reality it is possible to implement support for Regexp needless inString#index and String#rindex on top of match_?

@lopopolo
Copy link
Member Author

lopopolo commented Oct 8, 2023

@edmondop you seem to be hung up on using the Regexp methods to implement this method. I've linked to some representative code that uses MatchData to do this and I'd love to maintain consistency there.

You've found RegexpType::pos. I believe what you're asking can be achieved by passing 0 for the at parameter. The 0th capture group is the whole match.

Generally when implementing APIs like this I've found it useful to play around in the REPL to see how they behave:

[3.2.2] > s = "abcdefg"
=> "abcdefg"
[3.2.2] > s.index /.c/
=> 1
[3.2.2] > s = "你好!"
=> "你好!"
[3.2.2] > s.index /.!/
=> 1
[3.2.2] > s.index(/.f/, 0)
=> 4
[3.2.2] > s.index(/.f/, 1)
=> 4
[3.2.2] > s.index(/.f/, 5)
=> nil

All of this behavior for offset and such is already implemented in the methods on MatchData. MRI also implements these methods by going through MatchData since MatchData is actually an implementation detail of the underlying Onigmo regexp engine that leaks out into the Ruby Core lib.

I hope this additional context is enough to steer you in the right direction.

@edmondop
Copy link

edmondop commented Oct 14, 2023

I do have something that works for index, but not for rindex.

artichoke 0.1.0-pre.0 (2023-09-10 revision 7031) [aarch64-apple-darwin]
[rustc 1.72.0 (5680fa18f 2023-08-23) on aarch64-apple-darwin]
>>> s = "abcdefgh"
=> "abcdefgh"
>>> s.index /.efgh/
=> 3
>>> s.index /.h/
=> 6
>>> s.rindex /./
=> 1
>>> 


irb(main):001:0> s = "abcdefgh"
=> "abcdefgh"
irb(main):002:0> s.index /.efgh/
=> 3
irb(main):003:0> s.index /.h/
=> 6
irb(main):005:0> s.rindex /./
=> 7

I think the problem might be that the match_data is the first match probably, and so my end function return the end of the first match rather than the last match. I have tried in onig.rs and utf.rs to understand how to use the additional parameters of match_ without great success

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-ruby-core Area: Ruby Core types. A-spec Area: ruby/spec infrastructure and completeness. C-enhancement Category: New feature or request. E-easy Call for participation: Experience needed to fix: Easy / not much.
Development

No branches or pull requests

2 participants