Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support ES2015 Regex u flag (Unicode) #958

Open
p-bakker opened this issue Jun 25, 2021 · 10 comments
Open

Support ES2015 Regex u flag (Unicode) #958

p-bakker opened this issue Jun 25, 2021 · 10 comments
Assignees
Labels
feature Issues considered a new feature test262 harness Issues related to failing test262 tests, due to Rhino not yet support certain JS features
Projects
Milestone

Comments

@p-bakker
Copy link
Collaborator

See https://mathiasbynens.be/notes/es6-unicode-regex

@p-bakker p-bakker changed the title Support ES2015 Regex u flag Support ES2015 Regex u flag (Unicode) Jun 25, 2021
@p-bakker p-bakker self-assigned this Jun 25, 2021
@p-bakker
Copy link
Collaborator Author

p-bakker commented Jun 25, 2021

The harness code used by the tests in test262 make use of unicode regexes, so tests are failing not so much because what is being tested fails, but because the code for executing the tests fails.

@p-bakker p-bakker added this to the ES2015 milestone Jun 29, 2021
@p-bakker p-bakker added the test262 harness Issues related to failing test262 tests, due to Rhino not yet support certain JS features label Jun 29, 2021
@p-bakker p-bakker pinned this issue Jul 1, 2021
@p-bakker p-bakker unpinned this issue Jul 2, 2021
@p-bakker p-bakker added this to In progress in v1.7.xx Oct 14, 2021
@p-bakker p-bakker added the feature Issues considered a new feature label Jan 28, 2022
@tuchida
Copy link
Contributor

tuchida commented Jan 28, 2022

In order to run Function.prototype.toString tests, assertNativeFunction needs to work.
https://github.com/tc39/test262/blob/f94fc660cc3c59b1f2f9f122fc4d44b4434b935c/harness/nativeFunctionMatcher.js#L26

@p-bakker
Copy link
Collaborator Author

Will try next week to resurrect my attempt to implement this and share my WIP

@p-bakker
Copy link
Collaborator Author

Did make progress, but not yet as far as I hoped. Still at it though

@p-bakker
Copy link
Collaborator Author

p-bakker commented Mar 2, 2022

Got some Java implementation-related questions I'm hoping to get some answers on before going down the wrong rabbits-hole.

Right now, inside the regex implementation there are 2 things going on that I'm contemplating on changing:

  1. all logic operates on the chars. Chars however can only contain chars with a codePoint up to 65xxx. Unicode however is about also supporting codePoints beyond that limit. So I'm contemplating changing all the internal regex code from operating on chars to operating on ints
  2. regexes are compiled to some 'bytecode', which is stored in a byte[], with some logic to store char values in two separate bytes. Unicode codePoints beyond 65xxx don't fit into two separate bytes anymore, so I either adjust the logic to support splitting higher codePoints into 3 separates bytes in the byte[] or I forego the byte[] completely and switch it to an int[], which allows any unicode codePoint to be stored in a single position

Anyone got insight what the changes from char > int and byte[] to int[] would do memory and performance-wise?

@p-bakker
Copy link
Collaborator Author

p-bakker commented Mar 2, 2022

And another questionThe logic needs the mappings defined in this file: https://www.unicode.org/Public/14.0.0/ucd/CaseFolding.txt

What's the better approach:

  1. include this file in the Rhino source and have a class read it at runtime and create a mapping object
  2. write a build script that takes this file and generates a class file from it

FYI: Graal did the later: https://github.com/oracle/graal/blob/89e4cfc7aeea69970b60c64cd075ceb2a104e864/regex/docs/UpdatingUnicodeFiles.md

@tuchida
Copy link
Contributor

tuchida commented Mar 2, 2022

Anyone got insight what the changes from char > int and byte[] to int[] would do memory and performance-wise?

I am not familiar with Java performance, but when I implemented BigInt I used ./gradlew testBenchmark to check it.

Perhaps you can check with SunSpiderBenchmark.regexpDna.

However, the benchmarks used by Rhino are quite old. When I looked at what benchmarks other JavaScript engines use, I couldn't find any that were compatible with the latest ECMA262, but instead found this article.
https://v8.dev/blog/retiring-octane

@tonygermano
Copy link
Contributor

I think you want to look at UTF-16 code units/surrogate pairs. Java Strings that require more than a single byte per character are stored as UTF-16, which is a variable encoding, and the higher code points are stored in 2 chars (or two pairs of bytes.)

@p-bakker
Copy link
Collaborator Author

p-bakker commented Mar 2, 2022

yeah, already got to the bottom of surrogate pairs and such. Thing is that if the code is to operate on chars in the string, then on each and every operation on the char, you'd need to check whether its a high/leading surrogate and if so, if its followed by a low/trailing surrogate, which messed up the code big time. If instead operating on ints (unicode codepoints), the code is much cleaner.

The EcmaScript spec also mentions this in the note in https://tc39.es/ecma262/#sec-regexpidentifiercodepoint: from a spec pov the operations are on codePoints. Its up to the implementation to choose whether internally operating on codePoints or on chars (and thus deal with the high/low surrogate pairs)

From an easy-of-coding perspective, I'd prefer codePoints, just not sure about the ramifications memory/performance-wise

@p-bakker
Copy link
Collaborator Author

p-bakker commented Mar 2, 2022

Another interesting one:
EcmaScript nowadays says a \u is to be followed by 4 digits or else throw a SyntaxException
However: '\u' is allowed now in Rhino, been like that since the the first (public) commit in Rhino, got a testcase referring to section 7.7.4 of EcmaScript v1

However: imho that v1 spec doesn't say what should happen if \u is NOT followed by 4 digits. Current code in Rhino just takes it as a string literal and continues, however the latest EcmaScript spec says to throw a SyntaxException (although I cannot decypher exactly where it says that), FF and GC do so and Test262 checks for this scenario and whether a SyntaxException is thrown

So, how to proceed? Break backwards compatibility completely? Tie the spec-compliant behavior to a specific language version? (if so, which one?). Remain incompatible? By default become spec-compliant, but through a feature flag allow enabling the old behavior (I can see a lot of such flags coming)?

I think my preference would be to tie the proper, spec-compliant behavior to a new EcmaScript language version and then have one generic feature flag to enable ALL non-standard behavior (as far as it can be made optional in the codebase)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Issues considered a new feature test262 harness Issues related to failing test262 tests, due to Rhino not yet support certain JS features
Projects
v1.7.xx
In progress
Development

No branches or pull requests

3 participants