Failing to parse some sublime-syntax files #156

sharkdp · 2018-05-02T21:11:47Z

I'm currently using syntect with a custom SyntaxSet which is compiled from several sources.

I have experienced a few problems with some syntax files due to errors in the regex parsing (unfortunately resulting in a panic!, see #98).

I'm not sure if there is something that can be done about this here, but I thought it would be good to share (will #34 change something in this respect?).

Invalid pattern in look-behind

I downloaded an advanced Markdown syntax definition from https://github.com/jonschlinkert/sublime-markdown-extended and it results in a thread 'main' panicked at 'called Result::unwrap() on an Err value: Error(-122, invalid pattern in look-behind)', libcore/result.rs:945:5 when using it with an input like this:

Test

<div>
</div>

I tracked the error down to this section

    - match: '^(?=<(p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del)\b)(?!.*?</\1>)'
      comment: Markdown formatting is disabled inside block-level tags.
      push:
        - meta_scope: meta.disable-markdown
        - match: (?<=^</\1>$\n)
          pop: true
        - include: scope:text.html.basic
        - include: scope:text.html.handlebars

and this regular expression in particular: (?<=^</\1>$\n). I was able to fix the parsing-error by removing the trailing \n in the parens, but I'm not sure if this is the right way to go.

Target of repeat operator is invalid

The current version of the Javascript.sublime-syntax in https://github.com/sublimehq/Packages fails with target of repeat operator is invalid on any JavaScript file, for example:

var x = 2;

I tracked the error down to this section

variables:
  identifier_start: '[_$\p{L}\p{Nl}]'
  identifier_part: '[_$\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\x{200C}\x{200D}]'
  identifier_break: (?!{{identifier_part}})

  identifier: '{{identifier_start}}{{identifier_part}}*{{identifier_break}}'
  constant_identifier: '[[:upper:]]{{identifier_part}}*{{identifier_break}}'
  dollar_only_identifier: '\${{identifier_break}}'
  dollar_identifier: '(\$){{identifier_part}}*{{identifier_break}}+'
  func_lookahead: '\s*(async\s+)?function{{identifier_break}}'
  arrow_func_lookahead: '\s*(async\s*)?({{identifier}}|\(([^()]|\([^()]*\))*\))\s*=>'
  either_func_lookahead: (?:{{func_lookahead}}|{{arrow_func_lookahead}})
  binding_pattern_lookahead: (?:{{identifier}}|\[|\{)
  left_expression_end_lookahead: '(?!\s*[.\[\(])'

and the dollar_identifier regex in particular. The last part of this regex is {{identifier_break}}+ which expands to (?!...)+. The + here is superfluous and I have opened a pull request to remove it, but it would be great if syntect would not panic in such a case.

The text was updated successfully, but these errors were encountered:

trishume · 2018-05-03T00:23:34Z

Yah the fact that syntect panics on invalid regexes is unfortunate, and should probably be fixed for 3.0 when we next make breaking API changes. It may require a lot of bubbling though.

As for the specific errors, I wonder if we're using an older version of Oniguruma than Sublime or a newer one. Because Sublime should be parsing these using the same regex parser we are.

robinst · 2018-05-03T00:32:06Z

Regarding the first problem, "Invalid pattern in look-behind":

The regex (?<=^</\1>$\n) as it is in the syntax definition works fine with Oniguruma. The problem is that we rewrite it to (?<=^</\1>$\z) for the "no-newlines" mode, which is invalid. That's a bug in syntect that we can fix (here somewhere).

But, I would recommend that you switch to "newlines" mode. To do that, do the following changes:

Change the false to true where you load the syntax definitions. That means the regexes will not get rewritten.
Where you pass the lines to the highlighter, make sure the lines include the original \n of the line. You can do that by using read_line instead of lines.

Some background: The regexes in Sublime Text are all written assuming that the lines that they match on include the trailing \n, because Sublime Text itself does that. Sometimes that's not simple to do and would mean you'd have to copy every line. That's why syntect offers a "no newlines" mode, where it tries to rewrite regexes so that they match on lines that don't contain the trailing \n. But that's not possible to do correctly in all cases, and as you can see with this example, it contains bugs too.

robinst · 2018-05-03T00:37:05Z

@trishume: With regards to panicking, can we make it so that we try to compile the regexes when loading from YAML (just for checking them)? So loading them would fail, instead of only later when using them for parsing. I think that would reduce the bubbling a bit.

trishume · 2018-05-03T02:02:11Z

@robinst yah that would help. But the trick there is regexes with capture interpolations. There I guess you have to compile them with some placeholder such that if that compiles then any validly escaped interpolation will compile.

robinst · 2018-05-03T02:09:14Z

Yeah, replacing them with placeholders sounds good.

sharkdp · 2018-05-03T06:04:41Z

The regex (?<=^</\1>$\n) as it is in the syntax definition works fine with Oniguruma. The problem is that we rewrite it to (?<=^</\1>$\z) for the "no-newlines" mode, which is invalid. That's a bug in syntect that we can fix (here somewhere).

Oh, that's why I had problems finding that regular expression initially! I didn't remember that "no-newlines" option, thank you very much for pointing this out (and also for the detailed instructions - I will try this!).

@trishume: With regards to panicking, can we make it so that we try to compile the regexes when loading from YAML (just for checking them)?

Two things to (possibly) be aware of

This will probably have a significant impact on SyntaxSet loading performance and also overall performance (if someone only uses a subset of the SyntaxSet there is no need to check all language definitions)?
If the load_syntaxes function (with the current function signature) would fail with Err(..) if any regex is invalid, you would get either all or nothing. Right now, you get a partially-working SyntaxSet (not sure what is to be preferred).

robinst · 2018-05-03T07:12:21Z

The idea is that it would only impact loading from YAML files, which I'd recommend to do infrequently anyway.

After that, you can serialize the syntax definition (using bincode) as a cache. Then later you load the cached version which does not do the checks.

keith-hall · 2018-05-03T07:13:04Z

I downloaded an advanced Markdown syntax definition from https://github.com/jonschlinkert/sublime-markdown-extended

You'll probably find that the default Markdown syntax definition is actually much better than all the old third party ones these days, and will likely have none of these compatibility problems. From a quick glance at the readme of Markdown Extended, the only feature you'd not get is the YAML "front matter" (and only the official SublimeHQ syntaxes would be highlighted in fenced code blocks).

For single lines without newlines, the end result is the same. The advantage of `$` is that it can be used in lookbehinds. `\z` in a lookbehind results in a regex compilation error, so rewriting would need to be more complicated. See #156.

robinst · 2018-05-06T10:52:49Z

Created a PR so that the regex (?<=^</\1>$\n) is rewritten to (?<=^</\1>$$) which Oniguruma is fine with: #157

sharkdp · 2018-05-06T13:24:32Z

You'll probably find that the default Markdown syntax definition is actually much better than all the old third party ones these days, and will likely have none of these compatibility problems. From a quick glance at the readme of Markdown Extended, the only feature you'd not get is the YAML "front matter" (and only the official SublimeHQ syntaxes would be highlighted in fenced code blocks).

@keith-hall Thank you very much for the feedback. I tried the default Markdown syntax definition but some of the highlighting is done with background colors (which I don't want to use in the terminal / my application) and I didn't get the syntax highlighting in code blocks to work(?).

With #157 merged, I think this can be closed. My second problem ("Target of repeat operator is invalid") is already addressed somewhere else and the panic! discussion should probably be moved to #98.

sharkdp mentioned this issue May 5, 2018

thread main panicked on Result::unwrap() with simple js file sharkdp/bat#60

Closed

robinst mentioned this issue May 6, 2018

Rewrite "no newlines" regexes using $ instead of \z #157

Merged

sharkdp mentioned this issue May 6, 2018

Use syntect newlines mode (instead of nonewlines) sharkdp/bat#71

Merged

trishume closed this as completed May 6, 2018

robinst mentioned this issue May 7, 2018

Replace calls to panic! with Result types #98

Closed

sharkdp mentioned this issue May 25, 2018

Use the default Markdown Sublime syntax sharkdp/bat#157

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failing to parse some sublime-syntax files #156

Failing to parse some sublime-syntax files #156

sharkdp commented May 2, 2018 •

edited

trishume commented May 3, 2018

robinst commented May 3, 2018 •

edited

robinst commented May 3, 2018

trishume commented May 3, 2018

robinst commented May 3, 2018

sharkdp commented May 3, 2018

robinst commented May 3, 2018

keith-hall commented May 3, 2018

robinst commented May 6, 2018

sharkdp commented May 6, 2018 •

edited

Failing to parse some sublime-syntax files #156

Failing to parse some sublime-syntax files #156

Comments

sharkdp commented May 2, 2018 • edited

Invalid pattern in look-behind

Target of repeat operator is invalid

trishume commented May 3, 2018

robinst commented May 3, 2018 • edited

robinst commented May 3, 2018

trishume commented May 3, 2018

robinst commented May 3, 2018

sharkdp commented May 3, 2018

robinst commented May 3, 2018

keith-hall commented May 3, 2018

robinst commented May 6, 2018

sharkdp commented May 6, 2018 • edited

sharkdp commented May 2, 2018 •

edited

robinst commented May 3, 2018 •

edited

sharkdp commented May 6, 2018 •

edited