NodePattern: Support regexp literal #112

Merged
merged 6 commits into from Sep 27, 2020
4 changes: 3 additions & 1 deletion .github/workflows/rubocop.yml
Expand Up @@ -74,7 +74,7 @@ jobs:
run: bundle exec rake spec
- name: internal investigation
if: matrix.internal_investigation
run: bundle exec rake internal_investigation
run: bundle exec rake generate internal_investigation
rubocop_specs:
name: >-
Main Gem Specs | RuboCop: ${{ matrix.rubocop }} | ${{ matrix.ruby }} (${{ matrix.os }})
Expand All @@ -98,6 +98,8 @@ jobs:
ruby-version: ${{ matrix.ruby }}
- name: install dependencies
run: bundle install --jobs 3 --retry 3
- name: generate lexer and parser
run: bundle exec rake generate
- name: clone rubocop from source for full specs -- master
if: matrix.rubocop == 'master'
run: git clone --branch ${{ matrix.rubocop }} https://github.com/rubocop-hq/rubocop.git ../rubocop
5 changes: 5 additions & 0 deletions .gitignore
@@ -1,3 +1,8 @@
# generated parser / lexer
/lib/rubocop/ast/node_pattern/parser.racc.rb
/lib/rubocop/ast/node_pattern/parser.output
/lib/rubocop/ast/node_pattern/lexer.rex.rb

# rcov generated
coverage
coverage.data
3 changes: 3 additions & 0 deletions .rubocop.yml
Expand Up @@ -13,6 +13,9 @@ AllCops:
- 'spec/fixtures/**/*'
- 'tmp/**/*'
- '.git/**/*'
- 'lib/rubocop/ast/node_pattern/parser.racc.rb'
- 'lib/rubocop/ast/node_pattern/lexer.rex.rb'
- 'spec/rubocop/ast/node_pattern/parse_helper.rb'
TargetRubyVersion: 2.4

Naming/PredicateName:
4 changes: 3 additions & 1 deletion .rubocop_todo.yml
Expand Up @@ -32,7 +32,7 @@ Metrics/MethodLength:
# Offense count: 1
# Configuration parameters: CountComments.
Metrics/ModuleLength:
Max: 101
Max: 108

# Offense count: 1
# Configuration parameters: ExpectMatchingDefinition, Regex, IgnoreExecutableScripts, AllowedAcronyms.
Expand Down Expand Up @@ -65,6 +65,7 @@ RSpec/ContextWording:
- 'spec/rubocop/ast/resbody_node_spec.rb'
- 'spec/rubocop/ast/token_spec.rb'
- 'spec/spec_helper.rb'
- 'spec/rubocop/ast/node_pattern/helper.rb'

# Offense count: 6
# Configuration parameters: Max.
Expand All @@ -73,6 +74,7 @@ RSpec/ExampleLength:
- 'spec/rubocop/ast/node_pattern_spec.rb'
- 'spec/rubocop/ast/processed_source_spec.rb'
- 'spec/rubocop/ast/send_node_spec.rb'
- 'spec/rubocop/ast/node_pattern/parser_spec.rb'

# Offense count: 6
RSpec/LeakyConstantDeclaration:
8 changes: 8 additions & 0 deletions CHANGELOG.md
Expand Up @@ -2,6 +2,14 @@

## master (unreleased)

### New features

* [#105](https://github.com/rubocop-hq/rubocop-ast/pull/105): `NodePattern` compiler [complete rewrite](https://docs.rubocop.org/rubocop-ast/node_pattern_compiler.html). Add support for multiple variadic terms. ([@marcandre][])
* [#109](https://github.com/rubocop-hq/rubocop-ast/pull/109): Add `NodePattern` debugging rake tasks: `test_pattern`, `compile`, `parse`. See also [this app](https://nodepattern.herokuapp.com) ([@marcandre][])
* [#110](https://github.com/rubocop-hq/rubocop-ast/pull/110): Add `NodePattern` support for multiple terms unions. ([@marcandre][])
* [#111](https://github.com/rubocop-hq/rubocop-ast/pull/111): Optimize some `NodePattern`s by using `Set`s. ([@marcandre][])
* [#112](https://github.com/rubocop-hq/rubocop-ast/pull/112): Add `NodePattern` support for Regexp literals. ([@marcandre][])

## 0.6.0 (2020-09-26)

### New features
4 changes: 3 additions & 1 deletion Gemfile
Expand Up @@ -5,8 +5,10 @@ source 'https://rubygems.org'
gemspec

gem 'bump', require: false
gem 'oedipus_lex', require: false
gem 'pry'
gem 'rake', '~> 12.0'
gem 'racc'
gem 'rake', '~> 13.0'
gem 'rspec', '~> 3.7'
local_ast = File.expand_path('../rubocop', __dir__)
if Dir.exist? local_ast
2 changes: 1 addition & 1 deletion Rakefile
Expand Up @@ -15,7 +15,7 @@ end

require 'rspec/core/rake_task'

RSpec::Core::RakeTask.new(:spec) do |spec|
RSpec::Core::RakeTask.new(spec: :generate) do |spec|
spec.pattern = FileList['spec/**/*_spec.rb']
end

1 change: 1 addition & 0 deletions docs/modules/ROOT/nav.adoc
Expand Up @@ -2,3 +2,4 @@
* xref:installation.adoc[Installation]
* xref:node_types.adoc[Node Types]
* xref:node_pattern.adoc[Node Pattern]
* xref:node_pattern_compiler.adoc[Node Pattern Compiler]
18 changes: 16 additions & 2 deletions docs/modules/ROOT/pages/node_pattern.adoc
Expand Up @@ -173,7 +173,7 @@ You can add `+...+` before the closing bracket to allow for additional parameter
This will match both our examples, but not `sum(1.0, 2)` or `sum(2)`,
since the first node in the brackets is found, but not the second (`int`).

== `{}` for "OR"
== `{}` for "OR" (union)

Let's make it a bit more complex and introduce floats:

Expand All @@ -185,7 +185,21 @@ $ ruby-parse -e '1.0'
(float 1.0)
----

* `({int float} _)` - int or float types, no matter the value
* `({int | float} _)` - int or float types, no matter the value

Branches of the union can contain more than one term:

* `(array {int int | range})` - matches an array with two integers or a single range element

If all the branches have a single term, you can omit the `|`, so `{int | float}` can be
simplified to `{int float}`.

When checking for symbols or strings, you can use regexp literals for a similar effect:

[source,sh]
----
(send _ /to_s|inspect/) # => matches calls to `to_s` or `inspect`
----
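
As a rough illustration of why a regexp literal can stand in for a union of names: the compiler guide describes atoms as being compared with `===`, and Ruby's `Regexp#===` accepts symbols as well as strings. The snippet below is plain Ruby rather than `NodePattern` itself, and `names_matching` is a hypothetical helper introduced only for this sketch.

```ruby
# Plain-Ruby sketch (not NodePattern): Regexp#=== accepts symbols and strings.
# `names_matching` is a hypothetical helper used only for illustration.
def names_matching(pattern, names)
  names.select { |name| pattern === name }
end

NAMES = [:to_s, :inspect, :puts, 'inspect', :to_sym].freeze

# Unanchored: any name *containing* `to_s` or `inspect` is selected.
UNANCHORED = names_matching(/to_s|inspect/, NAMES)

# Anchored: only names that are exactly `to_s` or `inspect` are selected.
ANCHORED = names_matching(/\A(?:to_s|inspect)\z/, NAMES)
```

If the comparison is indeed an unanchored `===`, a name such as `to_sym` would also be selected, so anchoring the regexp with `\A` and `\z` may be worth considering when an exact match is intended.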

== `[]` for "AND"

252 changes: 252 additions & 0 deletions docs/modules/ROOT/pages/node_pattern_compiler.adoc
@@ -0,0 +1,252 @@
= Hacker's guide to the `NodePattern` compiler

This documentation is aimed at anyone wanting to understand / modify the `NodePattern` compiler.
It assumes some familiarity with the syntax of https://github.com/rubocop-hq/rubocop-ast/blob/master/doc/modules/ROOT/pages/node_pattern.md[`NodePattern`], as well as the AST produced by the `parser` gem.

== High level view

The `NodePattern` compiler uses the same techniques as the `parser` gem:

* a `Lexer` that breaks source into tokens
* a `Parser` that uses tokens and a `Builder` to emit an AST
* a `Compiler` that converts this AST into Ruby code

Example:

* Pattern: `+(send nil? {:puts :p} $...)+`
* Tokens: `+'(', [:tNODE_TYPE, :send], [:tPREDICATE, :nil?], '{', ...+`
* AST: `+s(:sequence, s(:node_type, :send), s(:predicate, :nil?), s(:union, ...+`
* Ruby code:
+
[source,ruby]
----
node.is_a?(::RuboCop::AST::Node) && node.children.size >= 2 &&
node.send_type? &&
node.children[0].nil?() &&
(union2 = node.children[1]; ...
----

The different parts are described below.

== Vocabulary

*"node pattern"*: something that can be matched against a single `AST::Node`.
While `(int 42)` and `#is_fun?` both correspond to node patterns, `+...+` (without the parentheses) is not a node pattern.

*"sequence"*: a node pattern that describes the sequence of children of a node (and its type): `+(type first_child second_child ...)+`

*"variadic"*: element of a sequence that can match a variable number of children.
`+(send _ int* ...)+` has two variadic elements (`int*` and `+...+`).
`(send _ :name)` contains no variadic element.
Note that a sequence is itself never variadic.

*"atom"*: element of a pattern that corresponds with a simple Ruby object.
`(send nil? :puts (str 'hello'))` has two atoms: `:puts` and `'hello'`.

== Lexer

The `lexer.rb` defines `Lexer` and has the few definitions needed for the lexer to work.
The bulk of the processing is in the inherited class that is generated by https://github.com/seattlerb/oedipus_lex[`oedipus_lex`]

[discrete]
==== Rules

https://github.com/seattlerb/oedipus_lex[`oedipus_lex`] generates the Ruby file `lexer.rex.rb` from the rules defined in `lexer.rex`.

These rules map a Regexp to code that emits a token.

`oedipus_lex` aims to be simple and the generated file is readable.
It uses https://ruby-doc.org/stdlib-2.7.1/libdoc/strscan/rdoc/StringScanner.html[`StringScanner`] behind the scenes.
It selects the first rule that matches, contrary to many lexing tools that prioritize longest match.

[discrete]
==== Tokens

The `Lexer` emits tokens with types that are:

* strings for the syntactic symbols (e.g. `'('`, `'$'`, `+'...'+`)
* symbols of the form `:tTOKEN_TYPE` for the rest (e.g. `:tPREDICATE`)

Tokens are stored as `[type, value]`, or `[type, [value, location]]` if locations are emitted.
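
To make the first-match behaviour and the `[type, value]` token shape concrete, here is a minimal hand-written lexer built on `StringScanner`. It is only a sketch: the rules, the `TOY_RULES` constant, the `toy_lex` method and the `:tSYMBOL` token name are assumptions made for this example, not the contents of `lexer.rex`.

```ruby
require 'strscan'

# Each rule pairs a Regexp with an action building a [type, value] token.
# A nil action means "skip the match". Order matters: the first rule that
# matches wins, even if a later rule would match a longer string.
# These rules are illustrative and are NOT those of lexer.rex.
TOY_RULES = [
  [/\s+/,           nil],
  [/\.\.\./,        ->(text) { [text, text] }],
  [/[(){}$]/,       ->(text) { [text, text] }],
  [/:[a-z_]+[?!]?/, ->(text) { [:tSYMBOL, text[1..-1].to_sym] }],
  [/[a-z_]+\?/,     ->(text) { [:tPREDICATE, text.to_sym] }],
  [/[a-z_]+/,       ->(text) { [:tNODE_TYPE, text.to_sym] }]
].freeze

def toy_lex(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    # StringScanner#scan only matches at the current position and advances on success.
    rule = TOY_RULES.find { |regexp, _action| scanner.scan(regexp) }
    raise ArgumentError, "unexpected input at position #{scanner.pos}" unless rule

    action = rule.last
    tokens << action.call(scanner.matched) if action
  end
  tokens
end
```

Calling `toy_lex('(send nil? {:puts :p} $...)')` yields string-typed tokens for the syntactic symbols and `:tSOMETHING` tokens for the rest. In this sketch the predicate rule has to come before the node-type rule; otherwise `nil?` would be cut short as a node type `nil`, which is the kind of ordering concern a first-match lexer introduces.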

[discrete]
==== Generation

Use `rake generate:lexer` to generate `lexer.rex.rb` from the `lexer.rex` file.
This is done automatically by `rake spec`.

NOTE: the `lexer.rex.rb` is not under source control, but is included in the gem.

== Parser

Similarly to the `Lexer`, the `parser.rb` defines `Parser` and has the few definitions needed for the parser to work.
The bulk of the processing is in the inherited class `parser.racc.rb` that is generated by https://ruby-doc.org/stdlib-2.7.0/libdoc/racc/parser/rdoc/Racc.html#module-Racc-label-Writing+A+Racc+Grammar+File[`racc`] from the rules in `parser.y`.

[discrete]
==== Nodes

The `Parser` emits `NodePattern::Node` objects, which are similar to RuboCop's nodes.
They both inherit from ``parser``'s `Parser::AST::Node`, and share additional methods too.

As with RuboCop's nodes, some nodes have specialized classes (e.g. `Sequence`) while other nodes use the base class directly (e.g. `s(:number, 42)`).

[discrete]
==== Rules

The rules closely follow the definitions above.
In particular, they distinguish between `node_pattern_list`, a list of node patterns (each term matches a single node), and the more generic `variadic_pattern_list`, a list of elements, some of which may be variadic and others simple node patterns.

[discrete]
==== Generation

Similarly to the lexer, use `rake generate:parser` to generate `parser.racc.rb` from the `parser.y` file.
This is done automatically by `rake spec`.

NOTE: the `parser.racc.rb` is not under source control, but is included in the gem.

== Compiler

The compiler's core is the `Compiler` class.
It holds the global state (e.g.
references to named arguments).
The goal of the compiler is to produce `matching_code`, Ruby code that can be run against an `AST::Node`, or any Ruby object for that matter.

Packaging of that `matching_code` into code for a `lambda`, or method `def` is handled separately by the `MethodDefiner` module.

The compilation itself is handled by three subcompilers:

* `NodePatternSubcompiler`
* `AtomSubcompiler`
* `SequenceSubcompiler`

=== Visitors

The subcompilers use the https://en.wikipedia.org/wiki/Visitor_pattern[visitor pattern].

The methods starting with `visit_` are used to process the different types of nodes.
For a node of type `:capture`, the method `visit_capture` will be called, or if none is defined then `visit_other_type` will be called.

No argument is passed, as the visited node is accessible with the `node` attribute reader.
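
The dispatch described above can be sketched in a few lines. This is a simplified model, not the actual subcompiler code: `ToyNode` and `ToySubcompiler` are hypothetical names, and the returned strings merely show which visitor ran.

```ruby
# Simplified model of visitor dispatch; ToyNode and ToySubcompiler are
# hypothetical names, not classes from rubocop-ast.
ToyNode = Struct.new(:type, :children)

class ToySubcompiler
  # The visited node is exposed through a reader instead of being
  # passed to each visit_ method.
  attr_reader :node

  def compile(node)
    @node = node
    visitor = :"visit_#{node.type}"
    # Fall back to visit_other_type when no dedicated visitor exists.
    respond_to?(visitor, true) ? send(visitor) : visit_other_type
  end

  private

  def visit_capture
    "capture(#{node.children.first})"
  end

  def visit_other_type
    "other(#{node.type})"
  end
end
```

With this sketch, a `:capture` node is routed to `visit_capture`, while a node of any type without a dedicated method ends up in `visit_other_type`.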

=== NodePatternSubcompiler

Given any `NodePattern::Node`, it generates the Ruby code that can return `true` or `false` for the given node, or node type for sequence head.

==== `var` vs `access`

The subcompiler can be called with the current node stored either in a variable (provided with the `var:` keyword argument) or via a Ruby expression (e.g.
`access: 'current_node.children[2]'`).

The subcompiler will not generate code that executes this `access` expression more than once or twice.
If it might access the node more than that, `multiple_access` will store the result in a temporary variable (e.g.
`union`).
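
A possible shape for this idea is shown below, modelled on the `(union2 = node.children[1]; ...` fragment from the example at the top of this guide. The `multiple_access` signature and the `union_code` helper here are assumptions made for the sketch and may differ from the real method.

```ruby
# Hypothetical sketch: evaluate the `access` expression once, store it in a
# temporary variable, and let the block generate code that reuses that variable.
def multiple_access(access, temp_var)
  "(#{temp_var} = #{access}; #{yield(temp_var)})"
end

# Example use: a union that needs to inspect the same child twice.
def union_code(access)
  multiple_access(access, 'union2') do |var|
    "#{var}.int_type? || #{var}.float_type?"
  end
end
```

Here `union_code('node.children[1]')` produces code in which `node.children[1]` appears only once, however many branches the union has.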

==== Sequences

Sequences are the most difficult elements to handle and are deferred to the `SequenceSubcompiler`.

==== Atoms

Atoms are handled with `visit_other_type`, which defers to the `AtomSubcompiler` and converts that result to a node pattern by appending `=== cur_node` (or `=== cur_node.type` if in sequence head).

This way, the two arguments in `(_ #func?(%1) %2)` would be compiled differently;
`%1` would be compiled as `param1`, while `%2` gets compiled as `param2 === node.children[1]`.
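
The conversion from atom code to node pattern code amounts to string concatenation, which the following sketch tries to capture. `atom_to_node_pattern` and its keyword arguments are hypothetical names chosen for this example.

```ruby
# Hypothetical helper: turn atom code into node pattern code by appending
# a `===` comparison against the current node (or its type in sequence head).
def atom_to_node_pattern(atom_code, access:, in_sequence_head: false)
  target = in_sequence_head ? "#{access}.type" : access
  "#{atom_code} === #{target}"
end
```

Under these assumptions, `atom_to_node_pattern('param2', access: 'node.children[1]')` gives `param2 === node.children[1]`, while the same atom used as a function argument would stay as plain `param2`.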

==== Precedence

The code generated has higher or equal precedence to `&&`, so as to make chaining convenient.
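
The reason precedence matters can be seen with plain Ruby booleans. In the sketch below, `chain` is a hypothetical helper that joins code fragments with `&&`; a fragment containing a bare `||` changes meaning once chained, whereas a parenthesized one does not.

```ruby
# Hypothetical helper: join generated code fragments with &&.
def chain(*fragments)
  fragments.join(' && ')
end

def precedence_demo
  bare    = chain('true || false', 'false')   # "true || false && false"
  wrapped = chain('(true || false)', 'false') # "(true || false) && false"
  # && binds tighter than ||, so the bare version parses as
  # true || (false && false).
  { bare: eval(bare), wrapped: eval(wrapped) }
end
```

The bare fragment evaluates to `true` even though the chained condition was meant to fail, which is why generated code with lower precedence than `&&` would need wrapping.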

=== AtomSubcompiler

This subcompiler produces Ruby code that gets evaluated to a Ruby object.
E.g.
`"42"`, `:a_symbol`, `param1`.

A good way to think about it is as the code you would pass as an argument to a function call.
For example:

[source,ruby]
----
# Pattern '#func(42, %1)' compiles to
func(node, 42, param1)
----

Note that any node pattern can be output by this subcompiler, but those that don't correspond to a Ruby literal will be output as a lambda so they can be combined.
For example:

[source,ruby]
----
# Pattern '#func(int)' compiles to
func(node, ->(compare) { compare.is_a?(::RuboCop::AST::Node) && compare.int_type? })
----
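
One way to see why a lambda combines well with literal atoms is that `Proc#===` invokes the proc, so literals and lambdas can be compared uniformly with `===`. The sketch below is plain Ruby; `INT_MATCHER`, `TOY_ATOMS` and `any_atom_matches?` are hypothetical names, and the integer check stands in for the node type check shown above.

```ruby
# Proc#=== calls the lambda, so it can be used wherever a literal
# would be compared with ===. All names here are hypothetical.
INT_MATCHER = ->(compare) { compare.is_a?(Integer) }

TOY_ATOMS = [:a_symbol, '42', INT_MATCHER].freeze

def any_atom_matches?(atoms, value)
  atoms.any? { |atom| atom === value }
end
```

In this sketch `any_atom_matches?(TOY_ATOMS, 7)` succeeds through the lambda, while `:a_symbol` and `'42'` succeed through ordinary equality.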

=== SequenceSubcompiler

The subcompiler compiles the sequences' terms in turn, keeping track of which children of the `AST::Node` are being matched.

==== Variadic terms

The complexity comes from variadic elements, which have complex processing _and_ may make it impossible to know at compile time which children are matched by the subsequent terms.

*Example* (no variadic terms)

----
(_type int _ str)
----

First child must match `int`, third child must match `str`.
The subcompiler will use `children[0]` and `children[2]`.

*Example* (one variadic term)

----
(_type int _* str)
----

First child must match `int` and _last_ child must match `str`.
The subcompiler will use `children[0]` and `children[-1]`.

*Example* (multiple variadic terms)

----
(_type int+ sym str+)
----

The subcompiler cannot use a fixed integer index into `children[]` to match `sym`.
This must be tracked at runtime in a variable (`cur_index`).

The subcompiler will use fixed indices before the first variadic element and after the last one.
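
The three examples can be reproduced with a small index calculation. This is a sketch of the idea only: terms are modelled as hypothetical `[name, variadic]` pairs, and `fixed_indices` is not a method of the real subcompiler.

```ruby
# Hypothetical model: each term is a [name, variadic?] pair describing one
# child term of a sequence (the sequence head is left out).
def fixed_indices(terms)
  first_variadic = terms.index { |_name, variadic| variadic }
  last_variadic = terms.rindex { |_name, variadic| variadic }

  terms.each_with_index.map do |(name, variadic), i|
    index =
      if variadic
        :variadic
      elsif first_variadic.nil? || i < first_variadic
        i                 # counted from the start: children[i]
      elsif i > last_variadic
        i - terms.size    # counted from the end: children[-1], children[-2], ...
      else
        :runtime          # between two variadic terms: needs a runtime cursor
      end
    [name, index]
  end
end
```

For `(_type int _* str)` this gives index `0` for `int` and `-1` for `str`; for `(_type int+ sym str+)` the `sym` term comes out as `:runtime`, matching the case where a variable such as `cur_index` is needed.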

==== Node pattern terms

The node pattern terms are delegated to the `NodePatternSubcompiler`.

In the pattern `(:sym :sym)`, both `:sym` will be compiled differently because the first `:sym` is in "sequence head": `:sym === node.type` and `:sym == node.children[0]` respectively.
The subcompiler indicates if the pattern is in "sequence head" or not, so the `NodePatternSubcompiler` can produce the right code.

Variadic elements may not (currently) cover the sequence head.
As a convenience, `+(...)+` is understood as `+(_ ...)+`.
Other types of nodes will raise an error (e.g.
`(<will not compile>)`;
see `Node#in_sequence_head`)

==== Precedence

Like the node pattern subcompiler, it generates code that has higher or equal precedence to `&&`, so as to make chaining convenient.

== Variant: WithMeta

These variants of the Parser / Builder / Lexer generate `location` information (exactly like the `parser` gem) for AST nodes as well as comments with their locations (like the `parser` gem).

Since this information is not typically used when one only wants to define methods, it is not loaded by default.

== Variant: Debug

These variants of the Compiler / Subcompilers work by adding tracing code before and after each compilation of `NodePatternSubcompiler` and `SequenceSubcompiler`.
A unique ID is assigned to each node and the tracing code flips a corresponding switch when the expression is about to be evaluated, and after (joined with `&&` so it only flips the switch if the node was a match).
Atoms are not compiled differently as they are not really matchable (when not compiled as a node pattern).
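
The tracing idea can be approximated as follows. This is a sketch under assumptions: `ToyTrace`, `with_tracing` and `traced_match` are hypothetical names, and the real debug compiler may structure its tracing calls differently.

```ruby
# Hypothetical tracer: records which node IDs were entered and which matched.
class ToyTrace
  attr_reader :entered, :matched

  def initialize
    @entered = []
    @matched = []
  end

  # Both methods return true so they can be chained with &&.
  def enter(id)
    @entered << id
    true
  end

  def success(id)
    @matched << id
    true
  end
end

# Wrap matching code with tracing calls joined by &&: `success` is only
# reached when the wrapped code evaluated to a truthy value.
def with_tracing(code, id)
  "trace.enter(#{id}) && #{code} && trace.success(#{id})"
end

def traced_match(value)
  trace = ToyTrace.new
  results = {
    int: eval(with_tracing('value.is_a?(Integer)', 1)),
    str: eval(with_tracing('value.is_a?(String)', 2))
  }
  [results, trace]
end
```

Running `traced_match(42)` records both IDs as entered but only ID `1` as matched, which is the information a debugging tool would need to highlight the parts of a pattern that matched.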