enh(python) Add support for unicode identifiers #3280

austin-schick · 2021-07-19T23:45:27Z

Changes

Adds support for unicode Python identifiers. As discussed in #2756, Python identifiers have supported a wider range of characters than the Javascript regex \w since Python 3.

This PR starts using the /u flag in regexes for Python only, and introduces a mechanism for converting one language to unicode at a time. It also allows hljs to fall back to regexes without unicode property escapes if the user's browser does not yet support them.

I did a bit of performance testing in node against this file from the bs4 library. Highlighting the file 1500 times was marginally faster with unicode mode enabled -- 18278ms vs 18615ms with unicode mode disabled.

Checklist

Added markup tests
Updated the changelog at CHANGES.md

…ssions

test/regex/lib/util.js

austin-schick · 2021-07-20T17:16:30Z

For a bit of context here, I'm working on internationalizing an interactive online textbook. Code like

def enRatónPesionado(x, y):
    Círculo(x, y, 50, relleno='azulMarino', opacidad=x)

isn't highlighting properly (and as you can see here, good support for Spanish-language programmers is rare, which is a shame)

austin-schick · 2021-07-20T17:18:34Z

It looks like I can't request a review, but I'd love to get some eyes on this @joshgoebel

joshgoebel · 2021-07-20T22:47:01Z

Some very high level thoughts:

We're not interested in conditional behavior across different browsers. If all browsers we support do not support /u yet then it's simply off the table until they do (like negative lookbehind, because of Safari lagging). At a glance though I don't think this is an issue, so we shouldn't need a conditional...
It's not just Python you'd have to test (for performance) but EVERY potential input source language. Auto-detect is a big feature which means that any grammars end up being run against code from many different languages. A change to Python that makes auto-detection of C++ much slower is a problem, even if it has 0 ill effects for Python highlighting.
It's also possible Unicode is faster (or similar) at run-time yet slower at startup. Start-up concerns me less than run-time but if it's considerably slower it's still relevant since in-browser usage is a large portion of our use case - and there start-up time matters.

Previously I changed the entirely library to /u and saw 10-20% slower runs of the test suite (Node)... that's not am exact benchmark, but it's something. I feel like the performance tests we really need here are:

benchmarked versions of our auto-detection tests (it's possible ./tools/CheckAutoDetect.js could also serve this purpose for Node, though it would also include start-up time)
benchmark highlighting all our markup test content with individual grammars... (comparing UTF mode to non-UTF) (the markup tests are a larger sample size than the auto-detect tests, and targeted at a single grammar, ie: C code is only run against the C grammar)
possibly benchmark some larger files (with samples for our common subset of languages)

So if you're interesting in pursuing this further I think the best bang for the buck course to get some useful info would be:

Add some actual benchmarking to CheckAutoDetect.js... or it's possible just running in 5-6 times in a Bash script might be roughly informative.
Benchmark current HEAD vs HEAD + /u (for all regex). (see mode_compiler.js line ~33 os so)

Doing this quickly right now (with just a single run) I see ~7.4s HEAD vs ~8.5s with /u... (that's without using any complex UTF-8 matchers, which I assume would be slower, though perhaps I'm mistaken). That's a 15% slow-down (akin to what I was seeing before), which is not great - and why #2756 got stuck in the first place.

So I'd be curious to know:

Is this start-up cost or runtime cost [or both] (modifying the script itself to run the tests say 5-10 times vs 1 should prove educational here)
Is there some way that those users wishing to highlight UTF-8 identifiers could pay this performance penalty? I'm afraid the answer here will be no since once UTF-8 starts sneaking into all our regex then we're forced to use /u everywhere... and having two entirely different regex that are conditional isn't a route we'd want to go either.
We could also support /u on a per-grammar basis (like we currently do with case-sensitivity), adding some complexity to try and avoid paying the 15% performance penalty everywhere.

@austin-schick For starters perhaps see if you can reproduce my results with /u and checkAutoDetect?

joshgoebel · 2021-07-20T22:52:25Z

The parsing engine often combines 100s of regex into a huge single "either" regex... so just saying "this single regex is /u not others" isn't easily possible. Also, many of our regex are still in string form giving us no place to even store this the /u information. IDENTs especially are commonly string form for easy appending. So switching to regex everywhere just to know which are /u and which aren't is no small task. Even then we'd be forced to trigger /u if a single regex in a contains group needed it... so 1 regex out of 100 might force all 100 to use /u...

Hence the easiest way to split /u vs not is at the grammar level (as we already do with case).

austin-schick · 2021-07-21T00:36:48Z

Thanks @joshgoebel! I'm responding to a subset of your comments to start out.

For starters perhaps see if you can reproduce my results with /u and checkAutoDetect?

I ran time (for i in {1..10}; do node checkAutoDetect.js; done) on this branch and main. This branch took 70.78s to run, main took 73.69s to run. So we actually got a small speedup, but there seems to be a few seconds of variation in the tests.

We're not interested in conditional behavior across different browsers

Makes sense! What's your list of supported browsers / versions? I'm not concerned about the browser support of \u, but rather the support of \p{} escapes (here), which weren't supported in Firefox until mid last year.

Hence the easiest way to split /u vs not is at the grammar level (as we already do with case).

I believe that's what I'm doing with this PR. Are we on the same page?

joshgoebel · 2021-07-21T00:59:12Z

. So we actually got a small speedup, but there seems to be a few seconds of variation in the tests.

I'm pretty sure you aren't testing the FULL change until you do it globally. You're only touching Python in this PR but the tests are running thousands of highlights, of which only a tiny percent (1/190 or so) are Python. It's still possible Python is slower, but it's getting lost with all the other tests being the same speed (because they aren't using /u).

In my testing I turned on /u GLOBALLY (one line patch - though I had to fix a few broken regex)... and saw the 15% speed decrease - which to me means that (until we can prove otherwise) that we're looking at a measurable slowdown if this support is added to all languages.

Makes sense! What's your list of supported browsers / versions? I'm not concerned about the browser support of \u, but rather the support of \p{} escapes (here), which weren't supported in Firefox until mid last year.

Well, that Firefox issue is perhaps more problematic than the speed. One could make an argument that we should trade performance for accuracy (though I'm not sure all users would agree). Generally we support "all modern green-field browsers, going a few years back"... which until now has meant great browser coverage, but also no conditional support or polyfills... saying we only support Firefox from mid-2020 forward would be a change.

@highlightjs/core Though, since most browsers have auto-update now (and Firefox generally isn't tied to OS versions) am I worrying too much about older Firefox versions? I wonder.

I believe that's what I'm doing with this PR. Are we on the same page?

Yes, indeed, it'd be a per grammar flag as you're doing here.

joshgoebel · 2021-07-21T01:02:57Z

Here is my test case:

https://github.com/joshgoebel/highlight.js/tree/slower_utf8

I think we need to somehow show that the 15% loss is incorrect, or switch to a discussion I proposed above:

One could make an argument that we should trade performance for accuracy (though I'm not sure all users would agree).

And yes I do realize IF we went per-grammar (which has it's own small costs, I'd prefer if it could be done globally) we'd only pay that cost on "modern" languages that support UTF-8 identifiers, but I'd also guess that would be most modern languages - so if our common bundle was say 70% modern languages then I assume we'd still be paying a measurable cost even if we only turn on support for those.

We must of course also keep in mind that performance "leaks" over into all other grammars during auto-detection... so someone highlighting Lua (via auto-detect) with the common build would still pay that 70% of 15% price, etc.

CC @highlightjs/core

allejo · 2021-07-21T03:44:36Z

Though, since most browsers have auto-update now (and Firefox generally isn't tied to OS versions) am I worrying too much about older Firefox versions? I wonder.

Browsers do usually have an auto-update feature on personal computers but I've been in enterprise environments where that is controlled by the IT department and they may not always be on top of updates.

I think if we want to adopt a policy, we can do what similar to what autoprefixer, Create React App, and friends do. They use something like browserslist and support "last 2 versions" (or X in our case) of each browser.

austin-schick · 2021-07-21T04:04:54Z

In my testing I turned on /u GLOBALLY (one line patch - though I had to fix a few broken regex)... and saw the 15% speed decrease

I also see this 15% to 20% speed decrease running on your branch, so I agree that it may not be good to make this a global change.

However, if we just focus on this PR, which is just targeting Python, we get:

Better accuracy in the highlighting (with real world use cases)
No performance difference in the Python markup tests (58.81s here vs 63.12s main, run 100x)
No performance difference in global checkAutoDetect (70.78s here vs 73.69s main, run 10x)
Minimal performance difference in Python-only checkAutoDetect (58.70s here vs 56.89s main, run 100x)
No performance difference on a large real-world file (18.28s here vs 18.61s main, run 1500x)

This may not be the way forward for every language, or individual languages may need optimization, but I don't see any reason not to take this approach for Python in particular

joshgoebel · 2021-07-21T04:58:55Z

Browsers do usually have an auto-update feature on personal computers but I've been in enterprise environments where that is controlled by the IT department and they may not always be on top of updates.

I'm much less worried about true enterprise environments. If an enterprise wants to purposely use an older browser then they need to deal with the problems that causes. If Highlight.js is "enterprise critical" then such an enterprise may need to maintain a compatible version just for their use... we shouldn't punish the rest of the world just for the sake of the enterprise. I know at least one person running us thru Babel to produce an ES5 build... which we've long since moved on from (thankfully). That's fine for them if it works, but not something the core team needs to deal with.

support "last 2 versions" (or X in our case) of each browser.

I'm not sure how we'd enforce this exactly (does Babel or something have some sort of "check only" tooling?), but I think I like the spirit of it... If by "version" you mean "release" here... Firefox seems to issue releases every 1 - 1.5 months (recently)... if you extrapolate back a year that's ~9-12 releases... I think I'm semi-comfortable saying "last few" or "past 6 months" etc... I'm slightly more sympathetic for those on older Mac OS using a Safari with no upgrade path, but Safari has supported this stuff since v11.1 in early 2018.

I'm only slightly bothered because it feels like this is the first time we're drawing a line in the sand... otherwise I think we support the past several years worth of modern browsers work just fine... but I suppose I can get over it.

joshgoebel · 2021-07-21T05:10:15Z

However, if we just focus on this PR, which is just targeting Python, we get:

That's a fair redirect... and thanks for that. :-) Can you provide a benchmark script/scaffold or something I could use to quickly confirm the performance numbers you are seeing on this side - do you have something semi-automated for all those 10x, 100x, 1500x you mention? Perhaps there is indeed something else entirely at work here - some other grammar - or a particular regex that is somehow ill-suited for UTF-8 mode that is skewing the results when the change is global.

... I don't see any reason not to take this approach for Python in particular.

If I can confirm those performance numbers then I believe I would concur. This PR would still need some small changes, but I could work with you on those once I double check the numbers.

austin-schick · 2021-07-21T18:45:25Z

Awesome, thank you! I'll put together a script for easy performance testing and send that over Thursday or Friday.

austin-schick · 2021-07-27T03:31:57Z

Here you go, @joshgoebel! You can run node perf.js from the tools directory to get four different performance metrics. I'm hoping I wrote this in a way that's easily extensible so you can apply it to #3257 as well.

I think the easiest way to run the tests on main is to cherry pick 62e64d5 onto main and then run the script exactly the same way

austin-schick · 2021-07-27T03:40:08Z

Whoops, you'll want to cherry pick this commit too (or just uncomment the line it uncomments) ^

joshgoebel · 2021-08-02T01:57:25Z

Sorry I've been busy, I should get to this later this week.

austin-schick · 2021-08-02T05:40:47Z

Awesome, looking forward to your review :)

austin-schick · 2021-08-18T16:17:07Z

@joshgoebel Any update on this? With the fall semester starting, it'd be great to get this working now

joshgoebel · 2021-08-19T22:52:38Z

Ok, I don't see any performance regression here... let me drop a few comments on the PR itself so we van move forward!

Thanks for the ping.

joshgoebel · 2021-08-19T22:53:42Z

src/languages/python.js

@@ -5,10 +5,14 @@ Website: https://www.python.org
 Category: common
 */

-import { UNDERSCORE_IDENT_RE } from '../lib/modes.js';
+import { UNICODE_SUPPORTED, UNDERSCORE_IDENT_RE } from '../lib/modes.js';


We don't need a check for this (we can assume we only support browsers with unicode regex support), and we don't need the import at all anymore I don't think.

Suggested change

import { UNICODE_SUPPORTED, UNDERSCORE_IDENT_RE } from '../lib/modes.js';

joshgoebel · 2021-08-19T22:54:18Z

src/languages/python.js

 import * as regex from '../lib/regex.js';

 export default function(hljs) {
+  const PY_IDENT_RE = UNICODE_SUPPORTED ?


Please remove conditional and just write out the regex as a regex with /u.

joshgoebel · 2021-08-19T22:55:47Z

src/languages/python.js

@@ -358,6 +362,7 @@ export default function(hljs) {
      'gyp',
      'ipython'
    ],
+    unicode: UNICODE_SUPPORTED,


Lets go with unicodeRegex: true...

Suggested change

unicode: UNICODE_SUPPORTED,

unicodeRegex: true,

And of course compiler will need the name updated there.

joshgoebel · 2021-08-19T22:56:35Z

src/lib/modes.js

@@ -4,6 +4,16 @@ import * as regex from './regex.js';
 /** @typedef {import('highlight.js').Mode} Mode */
 /** @typedef {import('highlight.js').ModeCallback} ModeCallback */

+// Determine browser support for RegExp unicode property escapes


Remove, etc.

joshgoebel · 2021-08-19T22:57:10Z

You can just add new commits on top... I can clean up before merging... and figure out whether I want the perf stuff in this PR or make a sep one for that...

joshgoebel · 2021-08-28T17:28:43Z

I know I took a while to review this - if you're too busy to finish it up I may pick it up in the next few days and push it to completion myself, since now the XML stuff is potentially waiting on this.

austin-schick · 2021-08-30T21:39:41Z

Thanks for helping to push this to the finish line @joshgoebel. I had a busy two weeks, but I'm a little more open now. Let me know if there's anything else I can do

joshgoebel · 2021-09-01T00:20:36Z

@austin-schick Thanks for much for the PR!

Add support for diacritics in Python identifiers

4eb3aab

austin-schick changed the title ~~Add support for diacritics in Python identifiers~~ fix(python) Add support for diacritics in Python identifiers Jul 20, 2021

austin-schick marked this pull request as draft July 20, 2021 00:29

austin-schick added 2 commits July 20, 2021 12:35

Update Python diacritic support approach based on other unicode discu…

009ffd8

…ssions

Update changelog

b3587eb

austin-schick commented Jul 20, 2021

View reviewed changes

test/regex/lib/util.js Show resolved Hide resolved

austin-schick changed the title ~~fix(python) Add support for diacritics in Python identifiers~~ fix(python) Add support for unicode Python identifiers Jul 20, 2021

austin-schick marked this pull request as ready for review July 20, 2021 17:16

joshgoebel mentioned this pull request Jul 26, 2021

enh(xml) Allow Unicode attribute and tag names #3257

Closed

2 tasks

Add performance testing scripts

62e64d5

Rebuild hljs before each perf test

9e5cc01

austin-schick changed the title ~~fix(python) Add support for unicode Python identifiers~~ enh(python) Add support for unicode identifiers Jul 27, 2021

run from root dir, be just a bit faster

e7735b8

joshgoebel reviewed Aug 19, 2021

View reviewed changes

joshgoebel added 3 commits August 30, 2021 02:24

fixup basic unicode support for Python

f11d3e2

Merge branch 'main' into add-python-diacritic-support

4ebeefd

simplify ONLY_LANGUAGES

12e382b

clean up perf tests

228e67a

joshgoebel approved these changes Sep 1, 2021

View reviewed changes

joshgoebel merged commit bb4fb22 into highlightjs:main Sep 1, 2021

richardlau mentioned this pull request Dec 4, 2021

doc build problem in withoutintl builds nodejs/node#41077

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

enh(python) Add support for unicode identifiers #3280

enh(python) Add support for unicode identifiers #3280

austin-schick commented Jul 19, 2021 •

edited

austin-schick commented Jul 20, 2021

austin-schick commented Jul 20, 2021

joshgoebel commented Jul 20, 2021 •

edited

joshgoebel commented Jul 20, 2021 •

edited

austin-schick commented Jul 21, 2021 •

edited

joshgoebel commented Jul 21, 2021

joshgoebel commented Jul 21, 2021 •

edited

allejo commented Jul 21, 2021

austin-schick commented Jul 21, 2021 •

edited

joshgoebel commented Jul 21, 2021 •

edited

joshgoebel commented Jul 21, 2021

austin-schick commented Jul 21, 2021

austin-schick commented Jul 27, 2021

austin-schick commented Jul 27, 2021

joshgoebel commented Aug 2, 2021

austin-schick commented Aug 2, 2021

austin-schick commented Aug 18, 2021

joshgoebel commented Aug 19, 2021

joshgoebel Aug 19, 2021

joshgoebel Aug 19, 2021

joshgoebel Aug 19, 2021

joshgoebel Aug 19, 2021

joshgoebel Aug 19, 2021

joshgoebel commented Aug 19, 2021

joshgoebel commented Aug 28, 2021

austin-schick commented Aug 30, 2021

joshgoebel commented Sep 1, 2021

enh(python) Add support for unicode identifiers #3280

enh(python) Add support for unicode identifiers #3280

Conversation

austin-schick commented Jul 19, 2021 • edited

Changes

Checklist

austin-schick commented Jul 20, 2021

austin-schick commented Jul 20, 2021

joshgoebel commented Jul 20, 2021 • edited

joshgoebel commented Jul 20, 2021 • edited

austin-schick commented Jul 21, 2021 • edited

joshgoebel commented Jul 21, 2021

joshgoebel commented Jul 21, 2021 • edited

allejo commented Jul 21, 2021

austin-schick commented Jul 21, 2021 • edited

joshgoebel commented Jul 21, 2021 • edited

joshgoebel commented Jul 21, 2021

austin-schick commented Jul 21, 2021

austin-schick commented Jul 27, 2021

austin-schick commented Jul 27, 2021

joshgoebel commented Aug 2, 2021

austin-schick commented Aug 2, 2021

austin-schick commented Aug 18, 2021

joshgoebel commented Aug 19, 2021

joshgoebel Aug 19, 2021

Choose a reason for hiding this comment

joshgoebel Aug 19, 2021

Choose a reason for hiding this comment

joshgoebel Aug 19, 2021

Choose a reason for hiding this comment

joshgoebel Aug 19, 2021

Choose a reason for hiding this comment

joshgoebel Aug 19, 2021

Choose a reason for hiding this comment

joshgoebel commented Aug 19, 2021

joshgoebel commented Aug 28, 2021

austin-schick commented Aug 30, 2021

joshgoebel commented Sep 1, 2021

austin-schick commented Jul 19, 2021 •

edited

joshgoebel commented Jul 20, 2021 •

edited

joshgoebel commented Jul 20, 2021 •

edited

austin-schick commented Jul 21, 2021 •

edited

joshgoebel commented Jul 21, 2021 •

edited

austin-schick commented Jul 21, 2021 •

edited

joshgoebel commented Jul 21, 2021 •

edited