fix(elixir) fix regular expression detection #3207

Merged
merged 8 commits into from Jun 2, 2021
Changes from 6 commits
239 changes: 130 additions & 109 deletions src/languages/elixir.js
@@ -6,115 +6,167 @@ Category: functional
Website: https://elixir-lang.org
*/

import * as regex from '../lib/regex.js';

/** @type LanguageFn */
export default function(hljs) {
const ELIXIR_IDENT_RE = '[a-zA-Z_][a-zA-Z0-9_.]*(!|\\?)?';
const ELIXIR_METHOD_RE = '[a-zA-Z_]\\w*[!?=]?|[-+~]@|<<|>>|=~|===?|<=>|[<>]=?|\\*\\*|[-/+%^&*~`|]|\\[\\]=?';
- const ELIXIR_KEYWORDS = {
+ const KEYWORDS = [
"alias",
"alias",
"and",
"begin",
"break",
"case",
"cond",
"defined",
"do",
"end",
"ensure",
"false",
"fn",
"for",
"import",
"in",
"include",
"module",
"next",
"nil",
"not",
"or",
"quote",
"redo",
"require",
"retry",
"return",
"self",
"then",
"true",
"unless",
"until",
"use",
"when",
"while",
"with|0"
];
Contributor

I know that this PR merely reformats the list, but it caught my attention because it contains a lot of keywords that I have never seen in Elixir. Also, let me start by saying that I am not sure whether there is a formal definition of a "keyword"; I'm using the term rather intuitively.

As far as I know, these are not Elixir keywords at all:

  • begin
  • break (Inspect.Algebra.break/1 is a function in the standard lib, but it has nothing to do with breaking out of loops like it might do in other languages)
  • defined
  • ensure
  • include
  • module
  • next (OptionParser.next/2 is a function in the standard lib, but it has nothing to do with skipping to another iteration in a loop like it might do in other languages)
  • redo
  • retry
  • return
  • until
  • while

I'm unsure about:

  • self. It exists, but IMO it is a normal macro, not a keyword. I wouldn't expect it to be colored differently than other normal macros (like rem, round, trunc etc).
  • then. It's a new macro added in Elixir 1.12 together with tap, I don't think those are keywords.
  • with|0. I'm not sure what the pipe and zero mean here. with on its own is definitely a keyword.

Missing IMO:

Copy link

alias is also listed twice in the list.
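
On the `with|0` question above: in highlight.js keyword lists, a `|N` suffix overrides that keyword's relevance score, so `with|0` should still highlight `with` while giving it no weight in language auto-detection. The sketch below illustrates the idea and also shows one way to drop the duplicate `alias`. The helper names `parseKeyword` and `uniqueKeywords` are hypothetical, and the default relevance of 1 is an assumption; this is not highlight.js's internal code.

```javascript
// Hypothetical helper, not highlight.js's internal API: split "name|score"
// into a keyword name and an optional relevance override.
function parseKeyword(entry) {
  const [name, score] = entry.split("|");
  return {
    name,
    // No suffix: fall back to an assumed default relevance of 1.
    relevance: score === undefined ? 1 : Number(score)
  };
}

// Deduplicate a keyword list, keeping first occurrences in order.
function uniqueKeywords(list) {
  return [...new Set(list)];
}

const sample = ["alias", "alias", "and", "with|0"];
const parsed = uniqueKeywords(sample).map(parseKeyword);
console.log(parsed);
```

Running the list through a `Set` before handing it to the grammar would make an accidental duplicate such as `alias` harmless.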

+ const KWS = {
  $pattern: ELIXIR_IDENT_RE,
- keyword: 'and false then defined module in return redo retry end for true self when ' +
- 'next until do begin unless nil break not case cond alias while ensure or ' +
- 'include use alias fn quote require import with|0'
+ keyword: KEYWORDS
  };
const SUBST = {
className: 'subst',
begin: /#\{/,
end: /\}/,
- keywords: ELIXIR_KEYWORDS
+ keywords: KWS
};
const NUMBER = {
className: 'number',
begin: '(\\b0o[0-7_]+)|(\\b0b[01_]+)|(\\b0x[0-9a-fA-F_]+)|(-?\\b[1-9][0-9_]*(\\.[0-9_]+([eE][-+]?[0-9]+)?)?)',
relevance: 0
};
// TODO: could be tightened
// https://elixir-lang.readthedocs.io/en/latest/intro/18.html
// but you also need to include closing delemeters in the escape list per
// individual sigil mode from what I can tell,
// ie: \} might or might not be an escape depending on the sigil used
const ESCAPES_RE = /\\[\s\S]/;
// const ESCAPES_RE = /\\["'\\abdefnrstv0]/;
const BACKSLASH_ESCAPE = {
match: ESCAPES_RE,
scope: "char.escape",
relevance: 0
};
const SIGIL_DELIMITERS = '[/|([{<"\']';
const SIGIL_DELIMITER_MODES = [
{
begin: /"/,
end: /"/
},
{
begin: /'/,
end: /'/
},
{
begin: /\//,
end: /\//
},
{
begin: /\|/,
end: /\|/
},
{
begin: /\(/,
end: /\)/
},
{
begin: /\[/,
end: /\]/
},
{
begin: /\{/,
end: /\}/
},
{
begin: /</,
end: />/
}
];
const escapeSigilEnd = (end) => {
return {
scope: "char.escape",
begin: regex.concat(/\\/, end),
relevance: 0
};
};
const LOWERCASE_SIGIL = {
className: 'string',
begin: '~[a-z]' + '(?=' + SIGIL_DELIMITERS + ')',
contains: [
contains: SIGIL_DELIMITER_MODES.map(x => hljs.inherit(x,
{
endsParent: true,
contains: [
{
contains: [
hljs.BACKSLASH_ESCAPE,
SUBST
],
variants: [
{
begin: /"/,
end: /"/
},
{
begin: /'/,
end: /'/
},
{
begin: /\//,
end: /\//
},
{
begin: /\|/,
end: /\|/
},
{
begin: /\(/,
end: /\)/
},
{
begin: /\[/,
end: /\]/
},
{
begin: /\{/,
end: /\}/
},
{
begin: /</,
end: />/
}
]
}
escapeSigilEnd(x.end),
BACKSLASH_ESCAPE,
SUBST
]
}
]
))
};

const UPCASE_SIGIL = {
className: 'string',
begin: '~[A-Z]' + '(?=' + SIGIL_DELIMITERS + ')',
contains: [
contains: SIGIL_DELIMITER_MODES.map(x => hljs.inherit(x,
{
begin: /"/,
end: /"/
},
{
begin: /'/,
end: /'/
},
{
begin: /\//,
end: /\//
},
{
begin: /\|/,
end: /\|/
},
{
begin: /\(/,
end: /\)/
},
{
begin: /\[/,
end: /\]/
},
contains: [ escapeSigilEnd(x.end) ]
}
))
};

const REGEX_SIGIL = {
className: 'regex',
variants: [
{
begin: /\{/,
end: /\}/
begin: '~r' + '(?=' + SIGIL_DELIMITERS + ')',
Contributor

Will this also mark regex modifiers as part of the regex? E.g.

~r(hello)
~r(hello)i
~r(hello)ui
~r(hello)U

Member Author

No, but we should add that. What are all the valid modifiers in Elixir?

Contributor

For sigils r/R, combinations of these letters:

uismxfU

For sigils w/W, combinations of these letters:

sac

Treating any group of lower and/or uppercase letters immediately after the closing delimiter as a modifier and thus part of the sigil could also be enough.
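
The two options mentioned here (an explicit modifier set versus any run of letters after the closing delimiter) can be sketched as plain regular expressions. This is only an illustration: `escapeForRegExp`, `sigilEndPattern`, and `matchEnd` are hypothetical names, and nothing here is taken from the highlight.js grammar itself.

```javascript
// Escape a literal delimiter for use inside a RegExp source string.
function escapeForRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: pattern for a closing delimiter plus optional modifiers.
// `strict` limits modifiers to the ~r/~R letters listed in this thread;
// otherwise any run of ASCII letters is accepted.
function sigilEndPattern(close, strict) {
  const modifiers = strict ? "[uismxfU]*" : "[a-zA-Z]*";
  // The sticky flag makes the match start exactly at lastIndex.
  return new RegExp(escapeForRegExp(close) + modifiers, "y");
}

// Return the text matched at position `index`, or null if there is no match.
function matchEnd(source, index, close, strict) {
  const re = sigilEndPattern(close, strict);
  re.lastIndex = index;
  const m = re.exec(source);
  return m ? m[0] : null;
}

console.log(matchEnd("~r(hello)ui", 8, ")", true));
console.log(matchEnd("~w(a b)a", 6, ")", false));
```

The permissive variant would also cover custom sigils with their own modifiers, at the cost of swallowing letters that directly follow a sigil for other reasons.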

Member Author

I added modifiers to ~r/~R. I'll save w for another day/PR since it's not already broken out. Really, if we're going to customize these any further, we probably need a simple data table from which all the sigil rules are generated programmatically. Are R and W really the only ones with modifiers?
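
A rough sketch of what such a table-driven approach could look like follows. The table shape, the `interpolates` field, and the `buildSigilRules` name are assumptions made for illustration; only the sigil letters, delimiters, and modifier sets come from this thread, and the output is plain objects rather than real highlight.js modes.

```javascript
// Delimiter pairs as [open, close], matching the eight pairs in the grammar.
const DELIMITERS = [
  ['"', '"'], ["'", "'"], ["/", "/"], ["|", "|"],
  ["(", ")"], ["[", "]"], ["{", "}"], ["<", ">"]
];

// Hypothetical data table: one row per sigil family.
const SIGILS = [
  { letters: "rR", scope: "regex", modifiers: "uismxfU" },
  { letters: "wW", scope: "string", modifiers: "sac" }
];

const escapeForRegExp = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Build one plain rule object per (sigil letter, delimiter pair).
function buildSigilRules(sigils, delimiters) {
  const rules = [];
  for (const { letters, scope, modifiers } of sigils) {
    for (const letter of letters) {
      for (const [open, close] of delimiters) {
        rules.push({
          scope,
          begin: new RegExp("~" + letter + escapeForRegExp(open)),
          end: new RegExp(escapeForRegExp(close) + "[" + modifiers + "]*"),
          // Lowercase sigils interpolate and process escapes; uppercase do not.
          interpolates: letter === letter.toLowerCase()
        });
      }
    }
  }
  return rules;
}

const rules = buildSigilRules(SIGILS, DELIMITERS);
console.log(rules.length);
```

Adding a new sigil or modifier set would then mean adding one row to `SIGILS` instead of another hand-written block of variants.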

Contributor

Are R and W really the only ones with modifiers?

At the moment yes, but the language syntax already allows for any sigil to have any modifier that it wants. That actually makes me realize that anyone can define their own custom sigil with their own modifiers 😅.

Member Author

Problems for another day. :) Does the PR look workable for now, do you think?

Contributor

I find it hard to judge the code, but from playing around with tools/developer.html I discovered that there is a problem with the uppercase R sigil. It still needs to support escaping the closing delimiter.

For example, for each of these two pairs the output HTML should differ by only a single letter (r <-> R), but right now it ends the R sigil too early:

Regex.match?(~r|foo\|bar|, "foo")
Regex.match?(~R|foo\|bar|, "foo")


Regex.match?(~r(hello( there\)*!)u, "hello!")
Regex.match?(~R(hello( there\)*!)u, "hello!")

Member Author

Because I thought uppercase sigils didn't allow escaping. I guess escaping the end sigil character is the exception to the rule?

Contributor

Yes, that is the exception. Quoting the docs about the sigil R:

It returns a regular expression pattern without interpolations and without escape characters. Note it still supports escape of Regex tokens (such as escaping + or ?) and it also requires you to escape the closing sigil character itself if it appears on the Regex.
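
The rule described in that quote (a backslash before the closing delimiter keeps the sigil open, even in uppercase sigils) can be sketched as a small scanner. `findSigilEnd` is a hypothetical helper written for this illustration and is not how highlight.js implements it; the inputs are the examples from this thread.

```javascript
// Hypothetical scanner: return the index of the first unescaped `close`
// character at or after `start`, or -1 if the sigil is never closed.
function findSigilEnd(source, start, close) {
  for (let i = start; i < source.length; i++) {
    if (source[i] === "\\") {
      i++; // skip the escaped character, including an escaped closing delimiter
    } else if (source[i] === close) {
      return i;
    }
  }
  return -1;
}

// String.raw keeps the backslashes exactly as they appear in the Elixir source.
const lower = String.raw`~r|foo\|bar|`;
const upper = String.raw`~R|foo\|bar|`;
const nested = String.raw`~R(hello( there\)*!)u`;

console.log(findSigilEnd(lower, 3, "|"), findSigilEnd(upper, 3, "|"));
console.log(findSigilEnd(nested, 3, ")"));
```

Because the scanner never looks at the sigil letter, both the `~r` and `~R` inputs close at the same position, which is the behavior requested above.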

contains: SIGIL_DELIMITER_MODES.map(x => hljs.inherit(x,
{
end: regex.concat(x.end, /[uismxfU]{0,7}/),
contains: [
escapeSigilEnd(x.end),
BACKSLASH_ESCAPE,
SUBST
]
}
))
},
{
begin: /</,
end: />/
begin: '~R' + '(?=' + SIGIL_DELIMITERS + ')',
contains: SIGIL_DELIMITER_MODES.map(x => hljs.inherit(x,
{
end: regex.concat(x.end, /[uismxfU]{0,7}/),
contains: [ escapeSigilEnd(x.end) ]
})
)
}
]
};
@@ -182,6 +234,7 @@ export default function(hljs) {
});
const ELIXIR_DEFAULT_CONTAINS = [
STRING,
REGEX_SIGIL,
UPCASE_SIGIL,
LOWERCASE_SIGIL,
hljs.HASH_COMMENT_MODE,
@@ -213,45 +266,13 @@ export default function(hljs) {
},
{
begin: '->'
},
{ // regexp container
begin: '(' + hljs.RE_STARTERS_RE + ')\\s*',
contains: [
hljs.HASH_COMMENT_MODE,
{
// to prevent false regex triggers for the division function:
// /:
begin: /\/: (?=\d+\s*[,\]])/,
relevance: 0,
contains: [NUMBER]
},
{
className: 'regexp',
illegal: '\\n',
contains: [
hljs.BACKSLASH_ESCAPE,
SUBST
],
variants: [
{
begin: '/',
end: '/[a-z]*'
},
{
begin: '%r\\[',
end: '\\][a-z]*'
}
]
}
],
relevance: 0
}
];
SUBST.contains = ELIXIR_DEFAULT_CONTAINS;

return {
name: 'Elixir',
- keywords: ELIXIR_KEYWORDS,
+ keywords: KWS,
contains: ELIXIR_DEFAULT_CONTAINS
};
}
30 changes: 28 additions & 2 deletions test/markup/elixir/sigils.expect.txt
@@ -1,9 +1,12 @@
<span class="hljs-string">~R&#x27;this + i\s &quot;a&quot; regex too&#x27;</span>
<span class="hljs-regex">~R&#x27;this + i\s &quot;a&quot; regex too&#x27;</span>
<span class="hljs-string">~w(hello <span class="hljs-subst">#{ [<span class="hljs-string">&quot;has&quot;</span> &lt;&gt; <span class="hljs-string">&quot;123&quot;</span>, <span class="hljs-string">&#x27;\c\d&#x27;</span>, <span class="hljs-string">&quot;\123 interpol&quot;</span> | []] }</span> world)</span>s
<span class="hljs-string">~W(hello #{no &quot;123&quot; \c\d \123 interpol} world)</span>s
<span class="hljs-string">~s{Escapes terminators \{ and \}, but no {balancing}</span> <span class="hljs-comment"># outside of sigil here }</span>
<span class="hljs-string">~s{Escapes terminators <span class="hljs-char escape_">\{</span> and <span class="hljs-char escape_">\}</span>, but no {balancing}</span> <span class="hljs-comment"># outside of sigil here }</span>
<span class="hljs-string">~S&quot;No escapes \s\t\n and no #{interpolation}&quot;</span>

<span class="hljs-string">~S(No escapes \&quot; \&#x27; \\ \a \b \d \e \f \n \r \s \t \v \0)</span>
<span class="hljs-string">~s(Plenty of escapes <span class="hljs-char escape_">\&quot;</span> <span class="hljs-char escape_">\&#x27;</span> <span class="hljs-char escape_">\\</span> <span class="hljs-char escape_">\a</span> <span class="hljs-char escape_">\b</span> <span class="hljs-char escape_">\d</span> <span class="hljs-char escape_">\e</span> <span class="hljs-char escape_">\f</span> <span class="hljs-char escape_">\n</span> <span class="hljs-char escape_">\r</span> <span class="hljs-char escape_">\s</span> <span class="hljs-char escape_">\t</span> <span class="hljs-char escape_">\v</span> <span class="hljs-char escape_">\0</span>)</span>

<span class="hljs-string">~S/hello/</span>
<span class="hljs-string">~S|hello|</span>
<span class="hljs-string">~S&quot;hello&quot;</span>
@@ -21,3 +24,26 @@
<span class="hljs-string">~s[hello <span class="hljs-subst">#{name}</span>]</span>
<span class="hljs-string">~s{hello <span class="hljs-subst">#{name}</span>}</span>
<span class="hljs-string">~s&lt;hello <span class="hljs-subst">#{name}</span>&gt;</span>

<span class="hljs-regex">~r/hello/</span>
<span class="hljs-regex">~r|hello|u</span>
<span class="hljs-regex">~r&quot;hello&quot;i</span>
<span class="hljs-regex">~r&#x27;hello&#x27;m</span>
<span class="hljs-regex">~r(hello)x</span>
<span class="hljs-regex">~r[hello]f</span>
<span class="hljs-regex">~r{hello}U</span>
<span class="hljs-regex">~r&lt;hello&gt;</span>

<span class="hljs-regex">~r&lt;regex here&gt;uismxfU</span>
<span class="hljs-regex">~r/regex here/uismxfU</span>
<span class="hljs-regex">~R&lt;regex here&gt;uismxfU</span>
<span class="hljs-regex">~R/regex here/uismxfU</span>

<span class="hljs-regex">~r|foo<span class="hljs-char escape_">\|</span>bar|</span>
<span class="hljs-regex">~R|foo<span class="hljs-char escape_">\|</span>bar|</span>

<span class="hljs-regex">~r(hello( there<span class="hljs-char escape_">\)</span>*!)u</span>
<span class="hljs-regex">~R(hello( there<span class="hljs-char escape_">\)</span>*!)u</span>

<span class="hljs-string">~s|foo<span class="hljs-char escape_">\|</span>bar|</span>
<span class="hljs-string">~S|foo<span class="hljs-char escape_">\|</span>bar|</span>
26 changes: 26 additions & 0 deletions test/markup/elixir/sigils.txt
@@ -4,6 +4,9 @@
~s{Escapes terminators \{ and \}, but no {balancing} # outside of sigil here }
~S"No escapes \s\t\n and no #{interpolation}"

~S(No escapes \" \' \\ \a \b \d \e \f \n \r \s \t \v \0)
~s(Plenty of escapes \" \' \\ \a \b \d \e \f \n \r \s \t \v \0)

~S/hello/
~S|hello|
~S"hello"
@@ -21,3 +24,26 @@
~s[hello #{name}]
~s{hello #{name}}
~s<hello #{name}>

~r/hello/
~r|hello|u
~r"hello"i
~r'hello'm
~r(hello)x
~r[hello]f
~r{hello}U
~r<hello>

~r<regex here>uismxfU
~r/regex here/uismxfU
~R<regex here>uismxfU
~R/regex here/uismxfU

~r|foo\|bar|
~R|foo\|bar|

~r(hello( there\)*!)u
~R(hello( there\)*!)u

~s|foo\|bar|
~S|foo\|bar|