Rethink fallbacks for formally incorrect grammar #63

zverok · 2020-06-03T09:09:35Z

Hi, and thanks for the awesome gem!

Recently regexp_parser started to be used in Rubocop to check regexp redundancy.

That uncovered what can be considered a bug (rubocop bug: rubocop/rubocop#8083). When parsing a regexp such as /{.+}/ (which is a valid Ruby regexp), regexp_parser fails, treating {} as an incorrect quantifier. The same applies to some other forms, like /]\[/.

I found that the matter was discussed in #15, the verdict being that it

...is an implementation quirk of the regex engine. In other words, it's not a documented feature.

...Hence I propose to never even try to implement "Ruby" but implement a sane subset, explicitly not supporting stuff that does not make sense outside MRI implementation quirks.

...It raises exceptions now, keep it like this. But document the fact that regexp_parser does not support each MRI quirk.

Actually, I believe it is not an "MRI quirk" but sane behavior for a regexp parser: some characters have special meaning only in context. The handling of {something that is not a quantifier} and of ] is consistent across:

Ruby
Python
JS
Perl
PHP
(probably most of the rest of the implementations, at this point I stopped checking)

So it seems that a parser that fails on those cases is less useful than it could be.
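For reference, a minimal sketch of how plain Ruby (no gems involved) handles the two patterns mentioned above. The sample strings and variable names are made up for illustration.

```ruby
# Braces whose content is not a quantifier are matched literally.
brace_re = /{.+}/
# A "]" outside a character class is a literal as well.
bracket_re = /]\[/

brace_match   = brace_re.match('x{abc}y')
bracket_match = bracket_re.match('a][b')

puts brace_match[0]   # prints {abc}
puts bracket_match[0] # prints ][
```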

jaynetics · 2020-06-03T10:22:14Z

the argument about usability and prevalence in other engines looks really convincing to me.

regexp_parser seems to be used more and more on regular expressions encountered "in the wild", and there is no good workaround for these cases there.

as far as rubocop goes, you might catch any PrematureEndError and then skip the affected cop, but this is obviously not very appealing.

generally speaking, the current limitation might push people to either not use regexp_parser for such cases, or to attempt to pre-escape their input with their own regexp-scanning implementation, or catch the error and unexpectedly do nothing.
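The "catch the error and skip" workaround could look roughly like the sketch below. `parse_or_skip` is a hypothetical helper, not part of regexp_parser or rubocop; the parser is passed in as a callable so the sketch does not depend on a particular error class. With regexp_parser loaded, the callable could presumably be `Regexp::Parser.method(:parse)`.

```ruby
# Hypothetical wrapper: returns the parse result, or nil when the parser
# raises. Callers must treat nil as "skip this regexp".
def parse_or_skip(source, parser)
  parser.call(source)
rescue StandardError => e
  warn "skipping #{source.inspect}: #{e.class}"
  nil
end

# Stand-in parser for demonstration only: rejects any pattern containing "{".
fake_parser = lambda do |source|
  raise ArgumentError, 'premature end of pattern' if source.include?('{')
  [:parsed, source]
end

parse_or_skip('abc', fake_parser)   # returns [:parsed, "abc"]
parse_or_skip('{.+}', fake_parser)  # returns nil after warning on stderr
```

The downside is the one described above: the check silently does nothing for the affected regexps.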

so i'm all on board. any opposition, @ammar?

if there is no documentation it might be best to look at the onigmo code for details about the behavior.

some things to keep in mind / investigate:

cases like /\#{}/ (currently a NoMethodError)
all unbalanced cases /{/, /]/ etc.
lone ( and ) always seem to be an error
possible differences between ruby versions supported by regexp_parser

ammar · 2020-06-03T13:02:27Z

Thanks for the mention @jaynetics.

I also think that is convincing for some cases. Also, for ruby at least, the supported variants of regexp have stabilized (for example: https://bugs.ruby-lang.org/issues/8133#note-5)

I think the following cases are reasonable:

{
}
] (leading)
{} (empty)

I have doubts about other empty or unbalanced cases, like ().

I can take a stab at it this coming weekend.

jaynetics · 2020-06-03T13:21:57Z

balanced curlies with content that doesn't match \d+(,\d*)? are also consistently treated as literals across rubies and other languages:

/a{2, 3}/ =~ 'a{2, 3}' # => 0
/a{2,3,4}/ =~ 'a{2,3,4}' # => 0
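A scanner could make this decision with a check along the following lines. This is a sketch under assumptions, not regexp_parser's implementation: `interval_quantifier?` and `INTERVAL_BODY` are hypothetical names, and the pattern is taken verbatim from the comment above. Ruby's documentation also lists the `{,m}` form as a quantifier, which this pattern does not cover.

```ruby
# Body of an interval quantifier, per the pattern quoted in the thread.
# Assumption: the {,m} form is deliberately left out here.
INTERVAL_BODY = /\A\d+(,\d*)?\z/

# Hypothetical helper: true if the token looks like {n}, {n,} or {n,m};
# anything else in curlies would be emitted as a literal.
def interval_quantifier?(token)
  return false unless token.start_with?('{') && token.end_with?('}')

  INTERVAL_BODY.match?(token[1..-2])
end

interval_quantifier?('{2,3}')   # true
interval_quantifier?('{2, 3}')  # false
interval_quantifier?('{2,3,4}') # false
interval_quantifier?('{}')      # false
```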

empty () also seem to be widely supported and treated as group that always matches. there are some use cases for that, too - breaking runs of other elements such as backref numbers, or achieving a desired numbering for following captures.
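A small illustration of those uses in plain Ruby; the patterns and strings are made up for the example.

```ruby
# An empty group matches the empty string and still counts as a capture.
empty = /()/.match('abc')
puts empty[1].inspect # prints ""

# An empty group separates backreference \1 from a following literal "1".
separated = /(a)\1()1/.match('aa1')
puts separated[0] # prints aa1

# A leading empty group shifts the numbering of the captures after it.
padded = /()(b)/.match('ab')
puts padded[2] # prints b
```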

ammar · 2020-06-03T13:59:01Z

Treating a {...} that doesn't match \d+(,\d*)? as a literal makes sense.

I haven't encountered that usage of (). If it emits an empty group, instead of a literal, then that makes sense.

My understanding of the concerns of the parser has evolved over time. I no longer see "validation" as one of them. These cases reinforce that understanding.

zverok mentioned this issue Jun 3, 2020

Lint/MixedRegexpCaptureTypes and "premature end of pattern" rubocop/rubocop#8083

Closed

ammar mentioned this issue Jun 6, 2020

Support informal delimiter literals #64

Merged

jaynetics mentioned this issue Jun 6, 2020

Nested repetitions parsed potentially incorrectly #3

Closed

ammar closed this as completed in #64 Jun 7, 2020

marcandre mentioned this issue Sep 5, 2020

Failure to parse \g #65

Closed

dgollahon mentioned this issue Dec 20, 2020

regexp_parser rejects /\xA/ but MRI accepts it #75

Closed
