Support ES2015 Regex `u` flag (Unicode) #958

p-bakker · 2021-06-25T09:49:21Z

See https://mathiasbynens.be/notes/es6-unicode-regex

p-bakker · 2021-06-25T14:15:56Z

The harness code used by the tests in test262 make use of unicode regexes, so tests are failing not so much because what is being tested fails, but because the code for executing the tests fails.

tuchida · 2022-01-28T20:40:15Z

In order to run Function.prototype.toString tests, assertNativeFunction needs to work.
https://github.com/tc39/test262/blob/f94fc660cc3c59b1f2f9f122fc4d44b4434b935c/harness/nativeFunctionMatcher.js#L26

p-bakker · 2022-01-28T20:54:49Z

Will try next week to resurrect my attempt to implement this and share my WIP

p-bakker · 2022-02-18T22:07:57Z

Did make progress, but not yet as far as I hoped. Still at it though

p-bakker · 2022-03-02T15:11:11Z

Got some Java implementation-related questions I'm hoping to get some answers on before going down the wrong rabbits-hole.

Right now, inside the regex implementation there are 2 things going on that I'm contemplating on changing:

all logic operates on the chars. Chars however can only contain chars with a codePoint up to 65xxx. Unicode however is about also supporting codePoints beyond that limit. So I'm contemplating changing all the internal regex code from operating on chars to operating on ints
regexes are compiled to some 'bytecode', which is stored in a byte[], with some logic to store char values in two separate bytes. Unicode codePoints beyond 65xxx don't fit into two separate bytes anymore, so I either adjust the logic to support splitting higher codePoints into 3 separates bytes in the byte[] or I forego the byte[] completely and switch it to an int[], which allows any unicode codePoint to be stored in a single position

Anyone got insight what the changes from char > int and byte[] to int[] would do memory and performance-wise?

p-bakker · 2022-03-02T15:40:38Z

And another questionThe logic needs the mappings defined in this file: https://www.unicode.org/Public/14.0.0/ucd/CaseFolding.txt

What's the better approach:

include this file in the Rhino source and have a class read it at runtime and create a mapping object
write a build script that takes this file and generates a class file from it

FYI: Graal did the later: https://github.com/oracle/graal/blob/89e4cfc7aeea69970b60c64cd075ceb2a104e864/regex/docs/UpdatingUnicodeFiles.md

tuchida · 2022-03-02T19:23:58Z

Anyone got insight what the changes from char > int and byte[] to int[] would do memory and performance-wise?

I am not familiar with Java performance, but when I implemented BigInt I used ./gradlew testBenchmark to check it.

Improve toNumeric speed #885

Perhaps you can check with SunSpiderBenchmark.regexpDna.

However, the benchmarks used by Rhino are quite old. When I looked at what benchmarks other JavaScript engines use, I couldn't find any that were compatible with the latest ECMA262, but instead found this article.
https://v8.dev/blog/retiring-octane

tonygermano · 2022-03-02T20:16:17Z

I think you want to look at UTF-16 code units/surrogate pairs. Java Strings that require more than a single byte per character are stored as UTF-16, which is a variable encoding, and the higher code points are stored in 2 chars (or two pairs of bytes.)

p-bakker · 2022-03-02T20:22:31Z

yeah, already got to the bottom of surrogate pairs and such. Thing is that if the code is to operate on chars in the string, then on each and every operation on the char, you'd need to check whether its a high/leading surrogate and if so, if its followed by a low/trailing surrogate, which messed up the code big time. If instead operating on ints (unicode codepoints), the code is much cleaner.

The EcmaScript spec also mentions this in the note in https://tc39.es/ecma262/#sec-regexpidentifiercodepoint: from a spec pov the operations are on codePoints. Its up to the implementation to choose whether internally operating on codePoints or on chars (and thus deal with the high/low surrogate pairs)

From an easy-of-coding perspective, I'd prefer codePoints, just not sure about the ramifications memory/performance-wise

p-bakker · 2022-03-02T20:52:44Z

Another interesting one:
EcmaScript nowadays says a \u is to be followed by 4 digits or else throw a SyntaxException
However: '\u' is allowed now in Rhino, been like that since the the first (public) commit in Rhino, got a testcase referring to section 7.7.4 of EcmaScript v1

However: imho that v1 spec doesn't say what should happen if \u is NOT followed by 4 digits. Current code in Rhino just takes it as a string literal and continues, however the latest EcmaScript spec says to throw a SyntaxException (although I cannot decypher exactly where it says that), FF and GC do so and Test262 checks for this scenario and whether a SyntaxException is thrown

So, how to proceed? Break backwards compatibility completely? Tie the spec-compliant behavior to a specific language version? (if so, which one?). Remain incompatible? By default become spec-compliant, but through a feature flag allow enabling the old behavior (I can see a lot of such flags coming)?

I think my preference would be to tie the proper, spec-compliant behavior to a new EcmaScript language version and then have one generic feature flag to enable ALL non-standard behavior (as far as it can be made optional in the codebase)

p-bakker changed the title ~~Support ES2015 Regex u flag~~ Support ES2015 Regex u flag (Unicode) Jun 25, 2021

p-bakker self-assigned this Jun 25, 2021

p-bakker added this to the ES2015 milestone Jun 29, 2021

p-bakker mentioned this issue Jun 29, 2021

Add support for ES6 “u” flag for regular expressions #291

Closed

p-bakker added the test262 harness Issues related to failing test262 tests, due to Rhino not yet support certain JS features label Jun 29, 2021

p-bakker pinned this issue Jul 1, 2021

p-bakker mentioned this issue Jul 2, 2021

Fix some NPEs in tests #987

Merged

p-bakker unpinned this issue Jul 2, 2021

p-bakker added this to In progress in v1.7.xx Oct 14, 2021

p-bakker mentioned this issue Jan 27, 2022

add Object.hasOwn (#1052) #1157

Merged

p-bakker added the feature Issues considered a new feature label Jan 28, 2022

tuchida mentioned this issue Jan 28, 2022

Resolve structural issue with exception handling in IRFactory #967

Open

p-bakker pinned this issue Mar 3, 2022

p-bakker mentioned this issue Mar 12, 2022

Support ES2015 unicode escape sequence additions #917

Closed

rbri mentioned this issue Dec 21, 2022

RegExpJsToJavaConverter \u{12345} to \x{12345} HtmlUnit/htmlunit#535

Closed

p-bakker mentioned this issue Jul 1, 2023

Support ES2024 Regex v flag #1350

Open

tuchida mentioned this issue Aug 20, 2023

Refactor IRFactory #1188

Open

p-bakker mentioned this issue Oct 23, 2023

Support ES2023 Change Array by copy #1307

Open

p-bakker mentioned this issue Nov 12, 2023

Update NativeArray for toReversed() #1414

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support ES2015 Regex `u` flag (Unicode) #958

Support ES2015 Regex `u` flag (Unicode) #958

p-bakker commented Jun 25, 2021

p-bakker commented Jun 25, 2021 •

edited

tuchida commented Jan 28, 2022

p-bakker commented Jan 28, 2022

p-bakker commented Feb 18, 2022

p-bakker commented Mar 2, 2022 •

edited

p-bakker commented Mar 2, 2022 •

edited

tuchida commented Mar 2, 2022

tonygermano commented Mar 2, 2022

p-bakker commented Mar 2, 2022

p-bakker commented Mar 2, 2022 •

edited

Support ES2015 Regex u flag (Unicode) #958

Support ES2015 Regex u flag (Unicode) #958

Comments

p-bakker commented Jun 25, 2021

p-bakker commented Jun 25, 2021 • edited

tuchida commented Jan 28, 2022

p-bakker commented Jan 28, 2022

p-bakker commented Feb 18, 2022

p-bakker commented Mar 2, 2022 • edited

p-bakker commented Mar 2, 2022 • edited

tuchida commented Mar 2, 2022

tonygermano commented Mar 2, 2022

p-bakker commented Mar 2, 2022

p-bakker commented Mar 2, 2022 • edited

Support ES2015 Regex `u` flag (Unicode) #958

Support ES2015 Regex `u` flag (Unicode) #958

p-bakker commented Jun 25, 2021 •

edited

p-bakker commented Mar 2, 2022 •

edited

p-bakker commented Mar 2, 2022 •

edited

p-bakker commented Mar 2, 2022 •

edited