Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix multibyte support in the regexp node handler #8989

Merged
merged 1 commit into from Nov 3, 2020

Conversation

knu
Copy link
Contributor

@knu knu commented Nov 2, 2020

Auto-correction by Style/RedundantRegexpEscape can corrupt a program by removing a character at the wrong index when it deals with a regexp pattern containing multibyte characters.

% cat a.rb
x = s[/[一二\.]+/]
p x
% be rubocop -a a.rb
Inspecting 1 file
F

Offenses:

a.rb:1:7: F: Lint/Syntax: unterminated string meets end of file
(Using Ruby 2.4 parser; configure using TargetRubyVersion parameter, under AllCops)
x = s[/[一二\.]+]
      ^
a.rb:1:15: C: [Corrected] Style/RedundantRegexpEscape: Redundant escape inside regexp literal
x = s[/[一二\.]+/]
              ^^

1 file inspected, 2 offenses detected, 1 offense corrected
% cat a.rb
x = s[/[一二\.]+]
p x

Note that the closing slash is removed instead of the redundant backslash.

The problem is that the cop mixes byte index with character index and results in manipulating source code with the wrong index values. Regexp::Parser stores byte index in Regexp::Expression#ts and that is the source of confusion.

It can be fixed by converting index values like this, but ideally it should be done in Regexp::Parser.


Before submitting the PR make sure the following are checked:

  • Wrote good commit messages.
  • Commit message starts with [Fix #issue-number] (if the related issue exists).
  • Feature branch is up-to-date with master (if not - rebase it).
  • Squashed related commits together.
  • Added tests.
  • Added an entry (file) to the changelog folder named {change_type}_{change_description}.md if the new code introduces user-observable changes. See changelog entry format for details.
  • The PR relates to only one subject with a clear title and description in grammatically correct, complete sentences.
  • Run bundle exec rake default. It executes all tests and RuboCop for itself, and generates the documentation.

@knu
Copy link
Contributor Author

knu commented Nov 3, 2020

#8948 is probably related to this issue.

CHANGELOG.md Outdated Show resolved Hide resolved
Auto-correction by Style/RedundantRegexpEscape can corrupt a program
by removing a character at the wrong index when it deals with a regexp
pattern containing multibyte characters.

```console
% cat a.rb
x = s[/[一二\.]+/]
p x
% be rubocop -a a.rb
Inspecting 1 file
F

Offenses:

a.rb:1:7: F: Lint/Syntax: unterminated string meets end of file
(Using Ruby 2.4 parser; configure using TargetRubyVersion parameter, under AllCops)
x = s[/[一二\.]+]
      ^
a.rb:1:15: C: [Corrected] Style/RedundantRegexpEscape: Redundant escape inside regexp literal
x = s[/[一二\.]+/]
              ^^

1 file inspected, 2 offenses detected, 1 offense corrected
% cat a.rb
x = s[/[一二\.]+]
p x
```

Note that the closing slash is removed instead of the redundant
backslash.

The problem is that the cop mixes byte index with character index and
results in manipulating source code with the wrong index values.
Regexp::Parser stores byte index in Regexp::Expression#ts and that is
the source of confusion.

It can be fixed by converting index values like this, but ideally it
should be done in Regexp::Parser.
@knu knu force-pushed the fix_multibyte_support_in_regexp_node branch from 2a043d4 to b87785e Compare November 3, 2020 09:10
@marcandre marcandre merged commit b8f59b6 into rubocop:master Nov 3, 2020
@marcandre
Copy link
Contributor

Looks great, thanks! 👍

@knu knu deleted the fix_multibyte_support_in_regexp_node branch November 4, 2020 01:27
@knu knu mentioned this pull request Nov 26, 2020
8 tasks
knu added a commit to knu/rubocop that referenced this pull request Nov 26, 2020
Methods we use that used to return byte-based indices now return
char-based indices, so we can revert the changes made in rubocop#8989.

https://github.com/ammar/regexp_parser/blob/master/CHANGELOG.md#200---2020-11-25---janosch-m%C3%BCller
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants