New: add no-misleading-character-class (fixes #10049) #10511
General thoughts:
- In the rule docs, it would be good to have a section with correct code. (In particular, if the only possible correct code is surrogate pairs with the `u` flag, it would be good to emphasize that there is no good way to match for the other characters in character classes.)
- Would it be possible to write some sort of test for the tool which updates the combining character file?
Everything else LGTM. Thanks!
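To illustrate the "correct code" case mentioned above, here is a minimal sketch of how a surrogate pair in a character class behaves with and without the `u` flag. The variable names are made up for this example and are not part of the PR.

```javascript
// U+1F44D is a single code point, but two UTF-16 code units (a surrogate pair).
const thumbsUp = "\u{1F44D}";

// Without the `u` flag, the class contains the two surrogate halves separately.
const withoutU = new RegExp("^[" + thumbsUp + "]$");

// With the `u` flag, the class contains the whole code point.
const withU = new RegExp("^[" + thumbsUp + "]$", "u");

const results = {
    withoutUMatchesWhole: withoutU.test(thumbsUp), // false: the class matches only one code unit
    withoutUMatchesHalf: withoutU.test("\uD83D"),  // true: a lone high surrogate is in the class
    withUMatchesWhole: withU.test(thumbsUp),       // true
    withUMatchesHalf: withU.test("\uD83D")         // false
};

console.log(results);
```

This only helps for surrogate pairs. Characters built from several code points are still split apart inside a class, even with the `u` flag.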
# Disallow characters which are made with multiple code points in character class syntax (no-dismantled-character-class)

Unicode includes the characters which are made with multiple code points.
RegExp character class syntax (`/[abc]/`) cannot such a character as a character. For example, `❇️` is made by `❇` (`U+2747`) and VARIATION SELECTOR-16 (`U+FE0F`). If this character is in RegExp character class, it will match to either `❇` (`U+2747`) or VARIATION SELECTOR-16 (`U+FE0F`) rather than `❇️`.
The first sentence here is a bit confusing: "...cannot such a character as a character." Maybe this should say, "...cannot directly match such a character"?
Thank you for the correction.
"...cannot directly match such a character" sounds good. I wanted to say... "RegExp character class syntax cannot handle [characters which are made by multiple code points] as a character; those characters will be dissolved to each code point."
I like what you've suggested better 😄
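The behavior discussed in this thread can be sketched as follows, using the `❇️` example from the docs text. Escapes are used instead of the literal character so the two code points stay visible. The non-class pattern at the end is only one possible workaround, shown here as an assumption and not something the PR prescribes.

```javascript
// "❇️" is U+2747 followed by VARIATION SELECTOR-16 (U+FE0F): two code points.
const sparkle = "\u2747\uFE0F";

// Inside a character class, the two code points become two separate alternatives.
const classPattern = new RegExp("^[" + sparkle + "]$", "u");

const matchesBase = classPattern.test("\u2747");     // true: the class matches U+2747 alone
const matchesSelector = classPattern.test("\uFE0F"); // true: the class matches U+FE0F alone
const matchesWhole = classPattern.test(sparkle);     // false: the class matches exactly one code point

// Outside a character class, the sequence is matched as a whole.
const sequencePattern = new RegExp("^(?:" + sparkle + ")$", "u");
const sequenceMatchesWhole = sequencePattern.test(sparkle); // true

console.log({ matchesBase, matchesSelector, matchesWhole, sequenceMatchesWhole });
```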
**A character with combining characters:**

The combining characters are characters which belong to one of `Mc`, `Me`, and `Mn` categories ([Unicode general categories](http://www.unicode.org/L2/L1999/UnicodeData.html#General%20Category)).
I think this would read slightly better if the Unicode general categories link were inline:
The combining characters are characters which belong to one of `Mc`, `Me`, and `Mn` [Unicode general categories](http://www.unicode.org/L2/L1999/UnicodeData.html#General%20Category).
What do you think?
*/
module.exports = function isCombiningCharacter(c) {
    return (
        (c >= 0x300 && c <= 0x36f) ||
The check seems time-consuming. Can we create a lookup table here? (It takes a little more memory, but it can be reused.)
I'm not sure if the test of the tool is valuable or not.
Indeed, it was slow.
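For illustration, the lookup-table idea could look roughly like the sketch below. The range list is a tiny illustrative subset: only `0x300`-`0x36f` comes from the diff, the second range is added here as an assumption, and the function names are hypothetical. The real list of combining ranges is much longer.

```javascript
// Illustrative subset only; the actual generated list is far longer.
const RANGES = [
    [0x300, 0x36f], // taken from the diff above
    [0x483, 0x489]  // added for illustration (Cyrillic combining marks)
];

// Range-based check: walks the list on every call.
function isCombiningByRanges(c) {
    return RANGES.some(([lo, hi]) => c >= lo && c <= hi);
}

// Lookup table: built once, then each check is a single Set lookup.
const TABLE = new Set();

for (const [lo, hi] of RANGES) {
    for (let c = lo; c <= hi; c++) {
        TABLE.add(c);
    }
}

function isCombiningByTable(c) {
    return TABLE.has(c);
}

console.log(isCombiningByRanges(0x301), isCombiningByTable(0x301));
console.log(isCombiningByRanges(0x41), isCombiningByTable(0x41));
```

The table trades memory for speed, as the comment above notes. Both functions give the same answers for every code point.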
@eslint/eslint-team I'd be happy to get advice on the rule name.
LGTM, thanks!
What would you think about renaming this to something like
Sounds good to me. I renamed it.
*/
"use strict";

const fs = require("fs");
Could this also be done with unicode property escapes?
    for (let charCode = 0; charCode < 2 ** 20; charCode++) {
        if (/^\p{Mn}|\p{Mc}|\p{Me}$/u.test(String.fromCodePoint(charCode))) {
            combiningChars.add(charCode);
        }
    }
It might be simpler than downloading a file from a server, although it would prevent people from running the script unless they use Node 10.
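A self-contained variant of the snippet above might look like the following sketch. It assumes a runtime with Unicode property escapes (Node 10 or later, as noted), puts the three categories in one character class so the anchors apply to all of them, and scans the full code point range. The `toRanges` helper is hypothetical, added here only to show how the result could be collapsed into the kind of range checks seen in the generated function.

```javascript
// Collect every code point whose general category is Mn, Mc, or Me.
const combiningChars = new Set();
const isCombining = /^[\p{Mn}\p{Mc}\p{Me}]$/u;

for (let charCode = 0; charCode <= 0x10ffff; charCode++) {
    if (isCombining.test(String.fromCodePoint(charCode))) {
        combiningChars.add(charCode);
    }
}

// Hypothetical helper: collapse ascending code points into [start, end] ranges.
function toRanges(codePoints) {
    const ranges = [];

    for (const cp of codePoints) {
        const last = ranges[ranges.length - 1];

        if (last && cp === last[1] + 1) {
            last[1] = cp; // extend the current range
        } else {
            ranges.push([cp, cp]); // start a new range
        }
    }
    return ranges;
}

// The Set iterates in insertion order, which is ascending here.
const ranges = toRanges(combiningChars);

console.log(ranges.length, "ranges,", combiningChars.size, "code points");
```

The exact counts depend on the Unicode version bundled with the runtime, so they are not listed here.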
Wow, good idea!
I updated
LGTM, thanks! Sorry for the delay in reviewing again.
LGTM, thanks!
What is the purpose of this pull request? (put an "X" next to item)
[X] New rule: fixes #10049, closes #10620.
What changes did you make? (Give an overview)
- `no-dismantled-character-class` rule.
- `lib/util/unicode/*`.
- `tools/update-unicode-utils.js` to generate the utility. This generates the `isCombiningCharacter` function from https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt. It will be useful for updating the function for future Unicode versions.

Is there anything you'd like reviewers to focus on?