Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New: add no-misleading-character-class (fixes #10049) #10511

Merged
merged 8 commits into from
Jul 30, 2018

Conversation

mysticatea
Copy link
Member

@mysticatea mysticatea commented Jun 23, 2018

What is the purpose of this pull request? (put an "X" next to item)

[X] New rule: fixes #10049, closes #10620.

What changes did you make? (Give an overview)

  • Add a new rule no-dismantled-character-class rule.
  • Add utilities for Unicode (lib/util/unicode/*).
  • Add a script tools/update-unicode-utils.js to generate the utility. This generates isCombiningCharacter function from https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt . It will be useful to update the function for the future Unicode versions.

Is there anything you'd like reviewers to focus on?

  • Please correct documentation.
  • Are there other patterns of multi-code-point characters?

@mysticatea mysticatea added rule Relates to ESLint's core rules accepted There is consensus among the team that this change meets the criteria for inclusion feature This change adds a new feature to ESLint labels Jun 23, 2018
Copy link
Member

@platinumazure platinumazure left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General thoughts:

  • In the rule docs, it would be good to have a section with correct code. (In particular, if the only possible correct code is surrogate pairs with the u flag, it would be good to emphasize that there is no good way to match for the other characters in character classes.)

Would it be possible to write some sort of test for the tool which updates the combining character file?

Everything else LGTM. Thanks!

# Disallow characters which are made with multiple code points in character class syntax (no-dismantled-character-class)

Unicode includes the characters which are made with multiple code points.
RegExp character class syntax (`/[abc]/`) cannot such a character as a character. For example, `❇️` is made by `❇` (`U+2747`) and VARIATION SELECTOR-16 (`U+FE0F`). If this character is in RegExp character class, it will match to either `❇` (`U+2747`) or VARIATION SELECTOR-16 (`U+FE0F`) rather than `❇️`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first sentence here is a bit confusing: "...cannot such a character as a character." Maybe this should say, "...cannot directly match such a character"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for correction.

"...cannot directly match such a character" sounds good. I wanted to say... "RegExp character class syntax cannot handle [characters which are made by multiple code points] as a character; those characters will be dissolved to each code point."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like what you've suggested better 😄


**A character with combining characters:**

The combining characters are characters which belong to one of `Mc`, `Me`, and `Mn` categories ([Unicode general categories](http://www.unicode.org/L2/L1999/UnicodeData.html#General%20Category)).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would read slightly better if the Unicode general categories link were inline:

The combining characters are characters which belong to one of `Mc`, `Me`, and `Mn` [Unicode general categories](http://www.unicode.org/L2/L1999/UnicodeData.html#General%20Category).

What do you think?

*/
module.exports = function isCombiningCharacter(c) {
return (
(c >= 0x300 && c <= 0x36f) ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the checking seems time-consuming, can we create a lookup table here? (it takes a little more memory, but can be reused.)

@mysticatea
Copy link
Member Author

Would it be possible to write some sort of test for the tool which updates the combining character file?

I'm not sure if the test of the tool is valuable or not.
I think that working of no-dismantled-character-class is the evidense of the tool. Probably more test cases about combining characters in no-dismantled-character-class is useful.

the checking seems time-consuming, can we create a lookup table here? (it takes a little more memory, but can be reused.)

Indeed, it was slow.
I updated it by using Set, then it got x8.5 faster.

---- before ----
Times:  1024
Median: 0.003924
Mean:   0.0072428291015623515
Min:    0.000302
Max:    3.091807

---- after ----
Times:  1024
Median: 0.000603
Mean:   0.0004560087890625067
Min:    0.000301
Max:    0.068217

https://gist.github.com/mysticatea/836702d259bc6e3650ac3f8c46b64183


@eslint/eslint-team I'm happy if I get advice about the rule name.

Copy link
Member

@aladdin-add aladdin-add left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@not-an-aardvark
Copy link
Member

What would you think about renaming this to something like no-misleading-character-class? I feel like the word "dismantled" might be hard to understand in this context.

@mysticatea
Copy link
Member Author

Sounds good to me. I renamed it.

@not-an-aardvark not-an-aardvark changed the title New: add no-dismantled-character-class (fixes #10049) New: add no-misleading-character-class (fixes #10049) Jul 28, 2018
*/
"use strict";

const fs = require("fs");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this also be done with unicode property escapes?

for (let charCode = 0; charCode < 2 ** 20; charCode++) {
    if (/^\p{Mn}|\p{Mc}|\p{Me}$/u.test(String.fromCodePoint(charCode))) {
        combiningChars.add(charCode);
    }
}

It might be simpler than downloading a file from a server, although it would prevent people from running the script unless they use Node 10.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, good idea!

@mysticatea
Copy link
Member Author

I updated update-unicode-utils.js. The Unicode version of Node.js doesn't seem the latest version, so the character set was different a bit.

Copy link
Member

@platinumazure platinumazure left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! Sorry for the delay in reviewing again.

Copy link
Member

@not-an-aardvark not-an-aardvark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@aladdin-add aladdin-add merged commit 2cc3240 into master Jul 30, 2018
@aladdin-add aladdin-add deleted the no-dismantled-character-class branch July 30, 2018 03:24
@eslint-deprecated eslint-deprecated bot locked and limited conversation to collaborators Jan 27, 2019
@eslint-deprecated eslint-deprecated bot added the archived due to age This issue has been archived; please open a new issue for any further discussion label Jan 27, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
accepted There is consensus among the team that this change meets the criteria for inclusion archived due to age This issue has been archived; please open a new issue for any further discussion feature This change adds a new feature to ESLint rule Relates to ESLint's core rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rule Proposal: no-multi-code-point-character-in-character-class
4 participants