Improve lexing of ternaries that include symbols in Ruby lexer (#1476)
Ruby's rules for parsing ternaries are complicated, particularly when
the ternary contains symbols. The current lexer uses a simple test to
determine whether a colon demarcates the branches of the ternary: is
the colon immediately followed by another colon? If not, the colon is
treated as the separator.

While this rule suffices for many cases, it causes ternaries including
symbols to be lexed incorrectly. This commit replaces the simple rule
with a rule for each of the following cases:

- **Simple case**: The colon being matched is followed by whitespace.

- **Complex case**: No additional colons follow the colon being
  matched on that line (colons in trailing comments are excluded).
  The rule added in this commit also rejects the match if a backslash
  follows on that line.

If either of the above cases applies, the colon is tokenised as
`Punctuation` and the lexer moves to the `:expr_start` state.
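
The two cases can be tried outside the lexer with plain regular expressions. The sketch below is illustrative only: the patterns are copied from the rules in this commit with a `\A` anchor added to stand in for matching at the scanner position, and `ternary_separator?` is a hypothetical helper, not part of Rouge.

```ruby
# Stand-ins for the two rules in the :ternary state, anchored at the
# start of the remaining input (an assumption made for this sketch).
SIMPLE_RULE  = /\A(:)(\s+)/
COMPLEX_RULE = /\A:(?![^#\n]*?[:\\])/

# Hypothetical helper: given the input starting at a colon, report
# whether either rule would treat that colon as the ternary separator.
def ternary_separator?(rest)
  rest.match?(SIMPLE_RULE) || rest.match?(COMPLEX_RULE)
end

ternary_separator?(": :d")      # => true  (simple case: whitespace follows)
ternary_separator?(":c")        # => true  (complex case: no later colon)
ternary_separator?("::c : :d")  # => false (another colon follows on the line)
ternary_separator?(":c # a:b")  # => true  (colon in trailing comment is ignored)
ternary_separator?(":b \\\n: :c") # => false (backslash follows on the line)
```

When neither pattern matches, the real lexer falls through to the rules mixed in from `:root`, which is how a leading colon ends up lexed as part of a symbol.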

These rules have been tested with a number of complex ternaries
involving colons, which are now lexed in a manner consistent with
Ruby's parser. These test cases have been added to the visual sample.
pyrmont committed Apr 14, 2020
1 parent 652a622 commit 7bf2159
Showing 2 changed files with 36 additions and 8 deletions.
10 changes: 9 additions & 1 deletion lib/rouge/lexers/ruby.rb
@@ -329,7 +329,15 @@ def self.detect?(text)
       end

       state :ternary do
-        rule(/:(?!:)/) { token Punctuation; goto :expr_start }
+        rule %r/(:)(\s+)/ do
+          groups Punctuation, Text
+          goto :expr_start
+        end
+
+        rule %r/:(?![^#\n]*?[:\\])/ do
+          token Punctuation
+          goto :expr_start
+        end

         mixin :root
       end
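
To see what the two new rules emit, here is a minimal sketch using `StringScanner` from Ruby's standard library rather than Rouge itself. The method `scan_ternary_separator` and the symbol token names are hypothetical stand-ins for the `groups` and `token` calls in the `:ternary` state; the state transition to `:expr_start` is not modelled.

```ruby
require 'strscan'

# Hypothetical stand-in for the :ternary state. Returns the tokens the
# rules would emit at the scanner's position, or nil when neither rule
# matches (the real lexer would then try the rules mixed in from :root).
def scan_ternary_separator(scanner)
  if scanner.scan(/(:)(\s+)/)
    # Mirrors `groups Punctuation, Text`: one token per capture group.
    [[:Punctuation, scanner[1]], [:Text, scanner[2]]]
  elsif scanner.scan(/:(?![^#\n]*?[:\\])/)
    # Mirrors `token Punctuation`: the colon on its own.
    [[:Punctuation, scanner.matched]]
  end
end

spaced = scan_ternary_separator(StringScanner.new(": :d"))
# => [[:Punctuation, ":"], [:Text, " "]]
tight = scan_ternary_separator(StringScanner.new(":c"))
# => [[:Punctuation, ":"]]
symbol = scan_ternary_separator(StringScanner.new(":b :c"))
# => nil (a later colon follows, so this colon is left for the symbol rule)
```

The whitespace rule comes first, so a colon followed by a space is accepted as the separator even when further colons appear later on the line, as in `a ? b::c : :d`.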
34 changes: 27 additions & 7 deletions spec/visual/samples/ruby
@@ -26,21 +26,41 @@ end
hash = { answer: 42, special?: true }
link_to 'new', new_article_path, class: 'btn'

#####
# ternaries
########
# Ternaries
########

# NB [jneen]: MRI ruby actually has different parsing behavior depending on
# ~what variables are defined~, which we can't know in a highlighting context.
# So... we're going to be wrong here, sometimes. Whatever. These cases look
# okay though, but if they break I don't care a whole lot, because Ruby itself
# doesn't parse them consistently.
a ? b::c : :d
a ? b:c

a ? b::c : :d # parsed as (a) ? (b::c) : (:d)
a ? b : :c # parsed as (a) ? (b) : (:c)
a ? b:c # parsed as (a) ? (b) : (c)
a ? :b : :c # parsed as (a) ? (:b) : (:c)
a ? :b :c # parsed as (a) ? (:b) : (c)
a ?b:c # parsed as (a) ? (b) : (c)
a ?b :c # parsed as (a) ? (b) : (c)
(a) ? b : c # parsed as (a) ? (b) : (c)
(a) \ # parsed as (a) ? (:b) : (:c)
? :b \
: :c
(a) ? # parsed as (a) ? (:b) : (:c)
:b :
:c
a # parsed as (a) ? (b) : (c)
? b
: c
?????:?? # parsed as (??) ? (??) : (??)
//?//:// # parsed as (//) ? (//) : (//)

method_that_takes_a_char ?b
cond??b::c : d
a?b :c # parsed as a? (b) (:c), syntax error for missing comma
a?b:c # parsed as a?(b: c), b is a symbol

# lol
?????:??
//?//://

a/1 # comment
a / b # comment