(parser) Improper tokenization with variable names in languages that use Unicode / UTF-8 #2756

EmNudge · 2020-10-15T23:04:30Z

Taking over this issue to track the overall process on this. Original issue is below.

Prework:

(chore) Clean up all regexs to be UTF-8 compliant/ready #2759 Update all our regexe to be UTF8 compliant (10.4)
Add Grammar#unicodeRegex to allow per-language UTF8 regex (11.x)
Performance testing - /u seems MUCH slower in my initial testing. (10.4)

Individual languages may now add support if they don't trigger performance regressions:

Perhaps one day:

Flip the parser to always using u by default (10.4) [needs resolution of performance slowdowns]
Ensure this change results in no overall regressions (10.4)
Consider MODES IDENT_RE and UNDERSCORE_IDENT_RE, although I think they should not change.

Before any of this is possible we'll need to start building regexs with u mode. After that it should simply be a matter of altering the appropriate IDENT_REs for the language in question. Actually upon second though we do have IDENT_RE but I'm not confident that it should change (vs this being a per language opt-in)... perhaps an IDENT_RE_UTF8?

A few markup tests should be added to 2 or 3 of the languages (or perhaps unit tests if we add a new mode) to make sure the behavior is as expected.

Original issue

Describe the issue
Most modern languages use the Unicode spec to decide what constitutes a proper variable name. Thus, hangul filler characters, even though they seem to be whitespace, are valid variable names.
The variable name thisﾠisﾠaﾠvariableﾠname is a valid variable, although hl.js tokenizes the first word or so as separate from the rest.
In highlighting systems that are aware of this, like Chrome's dev tools, it is parsed correctly. The first snippet uses hangul fillers while the second one uses whitespace (which is why it throws an error):

Which language seems to have the issue?
Every language that follows the Unicode spec for variable names theoretically. I've only tested the ones mentioned in the title, but it might make more sense to just tokenize every language as if it were one of those that followed the unicode spec since this should be the default.

Are you using highlight or highlightAuto?
N/A

Sample Code to Reproduce
N/A

Expected behavior
hl.js should be using ID_Start and ID_Continue for the languages that use it, or at least use it as a default.
The regex to match this is rather large, but it can be shortened significantly in regex engines supporting the u flag (modern JavaScript)

const matchIDStart = char => /\p{ID_Start}/u.test(char);
const matchIDContinue = char => /\p{ID_Continue}/u.test(char);

\p{ID_Start} matches any valid starting character in a variabe name
\p{ID_Continue} matches any valid non-starting character in a variable name (such as numbers and some variation selectors)

The text was updated successfully, but these errors were encountered:

joshgoebel · 2020-10-16T01:37:47Z

hl.js should be using ID_Start and ID_Continue for the languages that use it, or at least use it as a default.

How might we know which languages those are? :-) We have almost 200 languages... Right now our keyword default is \w and we don't really have an "identifier" default... each language kind of provides their own identifier regex... (and many provide their own keyword regex also). I'm thinking this might be something we have to do one language at a time as it comes up... flipping just that language into "utf8 mode" and then making sure it's regex are all still valid, etc...

This might be quite an effort for the whole project. Just a quick attempt to build all our regex with /u and then run just our markup test suite:

  108 passing (716ms)
  173 failing

Evidently the /u flag changes how regex strings are parsed and it's a LOT more strict about a lot of things that work just fine without it. One example:

Valid before:

/[$][\w-:]+/

Now with the u flag it wants the - to be escaped also... are you aware of an "upgrading" guide that covers most of the common pitfalls here (or any pointers)? I've seen just two cases already but I'm sure there are more... Seems a lot of issues about what should be escaped and what should not { always should now (even when sourced from a string)... < and > should NOT despite working just fine escaped prior.

EmNudge · 2020-10-16T02:17:11Z

How might we know which languages those are? :-) We have almost 200 languages... Right now our keyword default is \w and we don't really have an "identifier" default... each language kind of provides their own identifier regex... (and many provide their own keyword regex also). I'm thinking this might be something we have to do one language at a time as it comes up... flipping just that language into "utf8 mode" and then making sure it's regex are all still valid, etc...

To be honest, I'm not entirely sure. I'd have to do more research on the topic, but most modern imperative/OOP languages use ID_Start and ID_Continue these days as those categories seem to exist explicitly for that purpose.
Some languages, such as Rust, use XID_Start and XID_Continue which are slightly more strict subsets. Rust also has a linter option which turns it off mostly, so that's a weird case. XID vs ID excludes very few characters. You can play around with seeing how they differentiate with https://unicode-lookup.netlify.app by the category search.

To tell you which ones to target specifically, I'd have to check every language that hl.js supports, but I can tell you languages where I've confirmed it to the best of my knowledge (through use and/or the language spec):
JavaScript, Java, C/C++, Ruby, Python, Go, Rust*, Kotlin, C#, and Swift

I haven't used many more languages than that, so that's all I can back up. In my experience, ID_Start is the default, not the anomaly. I'd assume it's different for languages like SQL which likely have stricter rules for identifiers.

Now with the u flag it wants the - to be escaped also... are you aware of an "upgrading" guide that covers most of the common pitfalls here? I've seen just two cases already but I'm sure there are more...

I am not, unfortunately. I understand that fixing it, in general, is frustrating, so it's understandable if it goes unfixed. I just recognized an issue and thought it would be convenient to have a GitHub issue dedicated to it for future reference.

As an aside, if there were some convenient way to switch all regex to use ID_Start and ID_Continue, I'm not sure it would break much, even in languages which require only ASCII-idents, ID_Start still doesn't match for spaces or equals signs, so it's rare that it would match too much.

joshgoebel · 2020-10-16T02:23:19Z

As an aside, if there were some convenient way to switch all regex to use ID_Start and ID_Continue, I'm not sure it would break much, even in languages which require only ASCII-idents, ID_Start still doesn't match for spaces or equals signs, so it's rare that it would match too much.

Mostly it would all have to be done by hand... so if no one is having issues with a given language then I don't see any harm in languages keeping their older [a-z] style rules... and then when (or if) it DOES provide to be a problem and we have someone on the horn who knows what the rules are for language xyz, then we can deal with them on a one off basis.

I think the first step here would be to slowly clean up our regexs to be UTF8 compatible if possible (avoiding the need for per language flags)... and then at that point it could be decide which languages needed new regex rules to take advantage of what the u flag has to offer.

joshgoebel · 2020-10-16T16:12:31Z

@klmr Any thoughts on the C/C++ side of this?

joshgoebel · 2020-10-16T16:13:32Z

@EmNudge I turned this issue into the tracking issue for this whole initiative. Turns out I forgot we have IDENT_RE in our mode library, but still I don't think we should opt all languages into this by default - although I could perhaps be persuaded. I think the question is going to be "What is correct" vs "what does no harm".

joshgoebel · 2020-10-16T17:23:21Z

This has implications for auto-detect also for some languages... if a language does not declare itself UTF8 aware then it shouldn't potentially claim credit for UTF8 identifiers, it'd be better if those fall to a language that knows how to identify them... although as things stand now I wonder if this is stacked against us since it'd be likely a non-UTF8 language would see MULTIPLE identifiers here (would the utf8s count as word boundaries (\b)?), which would definitely have some bearing on the discussion.

var ident_ascii[utf8]ident_ascii[utf8]ascii = 39;

Grrrrr. :-)

> "testȠtestȠbooger".match(/\b[a-z]+\b/g)
[ 'test', 'test', 'booger' ]

Though thankfully most languages don't match (and give relevance to) random identifiers... though some do (looking at Swift, etc)...

joshgoebel · 2020-10-16T17:41:30Z

Dumping resources:

https://mathiasbynens.be/notes/es6-unicode-regex

klmr · 2020-10-16T17:56:09Z

@klmr Any thoughts on the C/C++ side of this?

Do I ever!

In fact, it’s backslashes again. The following is valid (if the compiler’s source character set supports it, I don’t think this is required):

#include <stdio.h>

int main(void) {
    int \u00e4 = 1;
    printf("%d\n", ä);
}

Oh, and we can also add R to the list of languages that definitely support Unicode identifiers.

joshgoebel · 2020-10-16T18:07:03Z

I'm slating this for around 10.5+... perhaps switching to /u in 10.4 and seeing if that causes any problems in an of itself before looking into individual language (or whole mode) upgrades for the identifiers.

@EmNudge @klmr However if someone was interested in floating a "beta" version of this (to land in 10.4) for one or two of these languages I'd be open to that with a short-term _utf8_enable flag at the language level that would be used to turn on u mode (mode_compiler.js, line ~24-28). Again the idea being to flush out any issues we aren't thinking of before exposing the whole library.

That would allow us a "soft" pilot.

Removing bug as this is a feature request, the current behavior is intentional as UTF8 has not really been a big consideration.

Work toward #2756. Cleans up a lot of incorrect (unnecessary escaped) regex that would not compile with the `u` flag. After that makes some rather large performance improvements (with utf8 turned on at least) to `yaml` and `mipsasm`. It looks like the mipasm rules have been wrong all along... as far as I can determine they are intended to match a literal `.` (otherwise they are far too broad) but were matching any character - which seems to terribly slow down the whole grammar in `u` mode. The changes consisted largely of: - Most unescaped { and } - Lots of unneeded escapes for -, <, >, and others. - Converting strings to regex if it made them simpler, easier to read (editor syntax coloring)

joshgoebel · 2020-11-10T23:48:41Z

In just running our full test suite on my dev machine I'm seeing quite measurable differences:

Current master: ~8.2s
Simply add /u to all built regex's: ~9.6s (mode_compiler.js, line 27)

Running ~17% slower for a feature no one is really demanding overall doesn't seem to be a huge win. I suppose we could try to auto-detect regex type, but I'm not sure yet how we'd even go about that with strings (much of our regex is encoded as strings) unless we just try non-UTF8 and then if that blows up, try UTF-8?

So I'm probably pausing the plan to enable this globally in 10.4 unless someone can persuade me otherwise. If anyone has any thoughts or wants to step up and champion this issue, let me know. Not that we have any mandate to be the "fastest" highlighter, but we do try to keep an eye on our overall speed and 17% slower seems like a big loss.

Also this is Node of course... I know browsers can sometimes have VERY different regex performance so it's possible this might be faster (or slower) in Safari or Chrome.

joshgoebel · 2020-11-10T23:57:01Z

It seems auto-detect by error isn't workable:

> "p{ID_Start}".match(/\p{ID_Start}/) // yes, non-UTF8
[ 'p{ID_Start}', index: 0, input: 'p{ID_Start}', groups: undefined ]

Since this \p stuff seems to be valid non-UTF8 regex. :-) So it may not even be possible to discern the intention with 100% certainty given a regex in a string with no mode mapped.

Feels like perhaps we're back to turning in on one-at-a-time for languages that truly NEED it... or showing I'm wrong about the performance.

klmr · 2020-11-11T17:31:55Z

Isn’t there a way for language definitions to opt in to this feature? I don’t know how the hljs engine currently handles this but in my naïve understanding, the following should “just work”:

…
return {
  language: 'My beautiful Uni Code',
  contains: [
    {
      className: 'identifier',
      begin: /\p{ID_Start}\p{ID_Continue}*/u;
    }
  ]
};

The resulting regex has the unicode property set, after all, so hljs should be able to use this information when internally constructing other expressions from it.

joshgoebel · 2020-11-11T19:58:25Z

Problems with per-regex:

Often they are strings (so there is a HUGE legacy cost to upgrading). Though maybe this could be done by machine?
I suppose you could argue "only change ones that are going utf-8", which makes sense but you still have "waterfalls" very a string (like IDENT) is used everywhere... so now you have to rewrite many places to use regex.concat vs +... so a one line change ends up touching the entire file.
Even when not strings they often get converted several times (to concat, add, etc) since JS has no native way to combine regex... so we'd have to keep track of that /u flag this whole time...
Not to mention you technically can't mix them (so you can't add a /u to non /u...
...and parser combines regex into MultiMatchers, so now we have a big problem since a regex could mean entirely different things based on whether it's UTF-8 or not - making it impossible to mix UTF-8 and non UTF-8 in a single language anyways (or at least very problematic).

So right now it looks like it would need toggled at the language level, and simply apply to all regex inside the language... if any of those points seem wrong/off please let me know.

There are also sublanguages to consider but I'm not sure that's an issue since they are running into their own completely isolated highlighting context.

I'm really hoping someone can just show I'm wrong about performance, but I don't have my hopes up.

joshgoebel · 2021-10-07T15:47:33Z

Update: This can now be taken up one language at a time - now that we allow turning on UTF8 regex per languages - (so we can test for performance regressions) if anyone is interested in PRs. That probably also means we won't be changing the existing MODES, though perhaps we could add new UTF8_IDENT etc... worth discussing in whatever PRs get filed.

alkisg · 2022-02-21T13:03:12Z

Hi, I've implemented GeSHi highlighters for two Greek educational programming languages for my SMF forum, and I'm looking to migrate them to a javascript-based solution. I.e. support for Unicode keywords and identifiers is needed.

Is this something that can be implemented currently with highlightjs? I tried starting from delphi.js, adding unicodeRegex=true, and I changed e.g. the end keyword to τέλος, but it's not getting highlighted.

joshgoebel · 2022-02-21T20:13:01Z

You'll likely need to edit keywords/$pattern also. (See docs). The default pattern only matches ASCII keywords.

alkisg · 2022-02-22T11:26:19Z

Thank you very much @joshgoebel, indeed setting $pattern to the Greek unicode range solved that issue.

Another unicode-related issue concerns the language name and aliases; that is, my highlighter gets activated when I use an ASCII name like "glossa" (<code class="language-glossa">) but not when I use its proper Greek name which is "ΓΛΩΣΣΑ" (<code class="language-ΓΛΩΣΣΑ">).

I've uploaded two examples in this forum post, as you can see the second example with language-ΓΛΩΣΣΑ isn't highlighted. Is there a way to make that work?

joshgoebel · 2022-02-22T13:17:28Z

See aliases and registerAliases.

alkisg · 2022-02-22T15:12:47Z

Yeah I've used aliases and I think the problem is that there's some place in the code that's not unicode enabled.
I've tested both of the following language definitions:

return {
    name: 'ΓΛΩΣΣΑ',
    aliases: [
        'glossa',
...

return {
    name: 'glossa',
    aliases: [
        'ΓΛΩΣΣΑ',
...

...and in both cases, <code class="language-glossa"> works but <code class="language-ΓΛΩΣΣΑ"> doesn't work.

Note though that if I omit the language-xxx attribute AND the code sample is big enough, then highlightjs correctly identifies the language and adds the language=ΓΛΩΣΣΑ attribute dynamically.

So I think that the "autodetection and language tagging" part is unicode enabled,
but the "parsing the language tag" part is not unicode enabled.

/me checks if he needs to modify languageDetectRe...

joshgoebel · 2022-02-22T15:23:56Z

languageDetectRe

Yep, I'd say that would fix your issue.

alkisg · 2022-02-22T15:32:19Z

Indeed, setting languageDetectRe:/\blang(?:uage)?-([\u0386-\u03ce\w-]+)/i fixed the issue.
Thank you very much all for your code, and especially @joshgoebel for your help on this. :)

EmNudge added bug help welcome Could use help from community language labels Oct 15, 2020

joshgoebel mentioned this issue Oct 16, 2020

(chore) Clean up all regexs to be UTF-8 compliant/ready #2759

Merged

joshgoebel changed the title ~~(JS, C++, C#, python, etc) Improper tokenization with variable names in languages that use Unicode~~ (parser) Improper tokenization with variable names in languages that use Unicode Oct 16, 2020

joshgoebel added parser enhancement An enhancement or new feature labels Oct 16, 2020

joshgoebel added this to the 10.4 milestone Oct 16, 2020

joshgoebel removed the bug label Oct 16, 2020

joshgoebel changed the title ~~(parser) Improper tokenization with variable names in languages that use Unicode~~ (parser) Improper tokenization with variable names in languages that use Unicode / UTF-8 Oct 31, 2020

joshgoebel mentioned this issue Oct 31, 2020

WIP: Proper highlighting for the Wolfram Language (Mathematica) #2706

Merged

joshgoebel mentioned this issue Nov 6, 2020

[Discussion] Is an ES5 build/distribution necessary? #2642

Closed

joshgoebel removed this from the 10.4 milestone Nov 10, 2020

joshgoebel mentioned this issue Nov 18, 2020

enh(asciidoc) support unconstrained syntax #2870

Merged

3 tasks

joshgoebel mentioned this issue Dec 4, 2020

(Swift) Incorrect highlighting in operator functions #2895

Closed

svanimpe mentioned this issue Dec 5, 2020

(swift) Revamped keywords and added support for operators #2908

Merged

joshgoebel mentioned this issue Dec 26, 2020

(swift) Improved highlighting for functions, initializers, and subscripts #2930

Merged

joshgoebel added the on hold label Jan 7, 2021

joshgoebel mentioned this issue Feb 20, 2021

(md/html/etc) Parsing of spacing characters #3017

Closed

joshgoebel mentioned this issue Jun 2, 2021

fix(python) identifiers starting with underscore not highlighted #3221

Merged

joshgoebel mentioned this issue Jun 29, 2021

enh(xml) Allow Unicode attribute and tag names #3257

Closed

2 tasks

austin-schick mentioned this issue Jul 20, 2021

enh(python) Add support for unicode identifiers #3280

Merged

2 tasks

joshgoebel removed the on hold label Oct 7, 2021

joshgoebel mentioned this issue Jan 18, 2022

Syntax error in regular expression (in v11.3.1) #3387

Closed

alkisg mentioned this issue Feb 21, 2022

Use the same font for line numbers dragomano/PrismJS-for-SMF#1

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(parser) Improper tokenization with variable names in languages that use Unicode / UTF-8 #2756

(parser) Improper tokenization with variable names in languages that use Unicode / UTF-8 #2756

EmNudge commented Oct 15, 2020 •

edited by joshgoebel

joshgoebel commented Oct 16, 2020 •

edited

EmNudge commented Oct 16, 2020

joshgoebel commented Oct 16, 2020

joshgoebel commented Oct 16, 2020

joshgoebel commented Oct 16, 2020

joshgoebel commented Oct 16, 2020 •

edited

joshgoebel commented Oct 16, 2020

klmr commented Oct 16, 2020

joshgoebel commented Oct 16, 2020 •

edited

joshgoebel commented Nov 10, 2020 •

edited

joshgoebel commented Nov 10, 2020 •

edited

klmr commented Nov 11, 2020 •

edited

joshgoebel commented Nov 11, 2020 •

edited

joshgoebel commented Oct 7, 2021 •

edited

alkisg commented Feb 21, 2022

joshgoebel commented Feb 21, 2022

alkisg commented Feb 22, 2022

joshgoebel commented Feb 22, 2022

alkisg commented Feb 22, 2022 •

edited

joshgoebel commented Feb 22, 2022

alkisg commented Feb 22, 2022

(parser) Improper tokenization with variable names in languages that use Unicode / UTF-8 #2756

(parser) Improper tokenization with variable names in languages that use Unicode / UTF-8 #2756

Comments

EmNudge commented Oct 15, 2020 • edited by joshgoebel

Original issue

joshgoebel commented Oct 16, 2020 • edited

EmNudge commented Oct 16, 2020

joshgoebel commented Oct 16, 2020

joshgoebel commented Oct 16, 2020

joshgoebel commented Oct 16, 2020

joshgoebel commented Oct 16, 2020 • edited

joshgoebel commented Oct 16, 2020

klmr commented Oct 16, 2020

joshgoebel commented Oct 16, 2020 • edited

joshgoebel commented Nov 10, 2020 • edited

joshgoebel commented Nov 10, 2020 • edited

klmr commented Nov 11, 2020 • edited

joshgoebel commented Nov 11, 2020 • edited

joshgoebel commented Oct 7, 2021 • edited

alkisg commented Feb 21, 2022

joshgoebel commented Feb 21, 2022

alkisg commented Feb 22, 2022

joshgoebel commented Feb 22, 2022

alkisg commented Feb 22, 2022 • edited

joshgoebel commented Feb 22, 2022

alkisg commented Feb 22, 2022

EmNudge commented Oct 15, 2020 •

edited by joshgoebel

joshgoebel commented Oct 16, 2020 •

edited

joshgoebel commented Oct 16, 2020 •

edited

joshgoebel commented Oct 16, 2020 •

edited

joshgoebel commented Nov 10, 2020 •

edited

joshgoebel commented Nov 10, 2020 •

edited

klmr commented Nov 11, 2020 •

edited

joshgoebel commented Nov 11, 2020 •

edited

joshgoebel commented Oct 7, 2021 •

edited

alkisg commented Feb 22, 2022 •

edited