Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rethink fallbacks for formally incorrect grammar #63

Closed
zverok opened this issue Jun 3, 2020 · 4 comments · Fixed by #64
Closed

Rethink fallbacks for formally incorrect grammar #63

zverok opened this issue Jun 3, 2020 · 4 comments · Fixed by #64

Comments

@zverok
Copy link

zverok commented Jun 3, 2020

Hi, and thanks for the awesome gem!

Recently regexp_parser started to be used in Rubocop to check regexp redundancy.

That led to uncovering of what can be considered as a bug (rubocop bug: rubocop/rubocop#8083). When parsing regexps like this, for example: /{.+}/ (which is valid Ruby regexp), regexp_parser fails (thinking that {} is incorrect quantifier). Same is related to some other forms, like /]\[/

I found out that the matter was discussed at #15, with verdict being, that it

...is an implementation quirk of the regex engine. In other words, it's not a documented feature.

...Hence I propose to never even try to implement "Ruby" but implement a sane subset, explicitly not supporting stuff that does not make sense outside MRI implementation quirks.

...It raises exceptions now, keep it like this. But document the fact that regexp_parser does not support each MRI quirk.

Actually, I believe that it is not "MRI quirk", but sane behavior of the Regexp parser, that some characters have special meaning only in context. The behavior about parsing {something that is not a quantifier}, and ] is consistent through:

  • Ruby
  • Python
  • JS
  • Perl
  • PHP
  • (probably most of the rest of the implementations, at this point I stopped checking)

So, it seems that parser that fails on those cases becomes less useful than it might be.

@jaynetics
Copy link
Collaborator

the argument about usability and prevalence in other engines looks really convincing to me.

regexp_parser seems to be used more and more on regular expressions encountered "in the wild", and there is no good workaround for these cases there.

as far as rubocop goes, you might catch any PrematureEndError and then skip the affected cop, but this is obviously not very appealing.

generally speaking, the current limitation might push people to either not use regexp_parser for such cases, or to attempt to pre-escape their input with their own regexp-scanning implementation, or catch the error and unexpectedly do nothing.

so i'm all on board. any opposition, @ammar?

if there is no documentation it might be best to look at the onigmo code for details about the behavior.

some things to keep in mind / investigate:

  • cases like /\#{}/ (currently a NoMethodError)
  • all unbalanced cases /{/, /]/ etc.
  • lone ( and ) always seem to be an error
  • possible differences between ruby versions supported by regexp_parser

@ammar
Copy link
Owner

ammar commented Jun 3, 2020

Thanks for the mention @jaynetics.

I also think that is convincing for some cases. Also, for ruby at least, the supported variants of regexp have stabilized (for example: https://bugs.ruby-lang.org/issues/8133#note-5)

I think the following cases are reasonable:

  • {
  • }
  • ] (leading)
  • {} (empty)

I have doubts about other empty or unbalanced cases, like ().

I can take a stab at it this coming weekend.

@jaynetics
Copy link
Collaborator

balanced curlies with content that doesn't match \d+(,\d*)? are also consistently treated as literals across rubies and other languages:

/a{2, 3}/ =~ 'a{2, 3}' # => 0
/a{2,3,4}/ =~ 'a{2,3,4}' # => 0

empty () also seem to be widely supported and treated as group that always matches. there are some use cases for that, too - breaking runs of other elements such as backref numbers, or achieving a desired numbering for following captures.

@ammar
Copy link
Owner

ammar commented Jun 3, 2020

Treating a {...} that don't match \d+(,\d*)? as literals makes sense.

I haven't encountered that usage of (). If it emits an empty group, instead of a literal, then that makes sense.

My understanding of the concerns of the parser has evolved over time. I no longer see "validation" as one of them. These cases enforce that understanding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants