Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enh(php) Left and right-side of double colon #3422

Merged

Conversation

wkania
Copy link
Contributor

@wkania wkania commented Dec 6, 2021

Double colon aka Scope Resolution Operator is used in a few contexts:

  • access static variable or method of the class: Example::$uuid; self::$a; Example::test(); $a::test();
  • constant reference: Foo::Test;
  • access overridden properties or methods of a class: parent::$a; parent::__construct();
  • enum value reference: Foo::Test;
  • class name reference: self::class; static::class; parent::class; Foo::class; $a::class

Changes

  • declare and use IDENT_RE for valid labels instead of w+
  • declaration of PHP constants were not highlighted. Now they use the same class as variables.
  • enum and constant reference were not highlighted. Now they use the same class as variables.
  • Class name references are highlighted as variable.language.
    class is a special case of the constant, it's also a reserved keyword.
  • left-side class name highlighting

Info about rules for valid labels (variables, constants, and other names).

The name of a constant follows the same rules as any label in PHP. A valid constant name starts with a letter or underscore, followed by any number of letters, numbers, or underscores. As a regular expression, it would be expressed thusly: ^[a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]*$

Checklist

  • Added markup tests
  • Updated the changelog at CHANGES.md

@wkania wkania force-pushed the php/constants-and-php-8-1-keywords branch from 8354aa3 to d55e79e Compare December 6, 2021 20:59
src/languages/php.js Outdated Show resolved Hide resolved
src/languages/php.js Outdated Show resolved Hide resolved
src/languages/php.js Outdated Show resolved Hide resolved
@wkania wkania force-pushed the php/constants-and-php-8-1-keywords branch from d55e79e to d7f714b Compare December 7, 2021 23:08
src/languages/php.js Outdated Show resolved Hide resolved
src/languages/php.js Outdated Show resolved Hide resolved
},
{
begin: [
/[A-Z]\w*/,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets wrap this in a constant since it's re-used multiple times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because now I have an error, I found the root of my regression. The match is case-insensitive because it uses i by default. It totally ignores the uppercase rule. Can I disable it or is it some kind of bug?

Error: php/[9]/begin: Polynomial backtracking. By repeating any character that matches /\xa0/i, an attack string can be created.

    \s+)([a-zA-Z_-ÿ][a-zA-Z0-9_-ÿ]*
    ^~~              ^~~~~~~~~~~~~~~~

Full pattern:
/(const)(\s+)([a-zA-Z_-ÿ][a-zA-Z0-9_-ÿ]*(?![A-Za-z0-9])(?![$]))(\s*=)/im
  1. I also added a new const for name and in MR description added info about names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joshgoebel Do you know anything about this case-insensitive matching?

Copy link
Member

@joshgoebel joshgoebel Dec 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PHP is case_insensitive: true, so EVERY regex is /i, there i no need to specify and it's impossible to have ANY regex that is case-dependent. I forgot this or I'd have mentioned it earlier.

So forget about using CamelCase rules... I've read not ALL of PHP is case-sensitive though... perhaps we could switch it to "case-sensitive" and try and deal with the exceptions by hand? How familiar are you with PHP's case sensitivity overall?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PHP is case_insensitive: true, so EVERY regex is /i, there i no need to specify and it's impossible to have ANY regex that is case-dependent. I forgot this or I'd have mentioned it earlier.

Now that's news. How could I miss this case_insensitive: true, :)

I checked the current implementation of the PHP highlighter and part of camel case keywords are affected by this. I would separate keywords into case-insensitive (false, if) and case-sensitive (ArithmeticError ArrayIterator). Case-insensitive function and class names are not affected by this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would separate keywords into case-insensitive (false, if) and case-sensitive

Except we have no current way to do that [case insensitive in a case sensitive grammar] - the setting is at the grammar level (because we concat every regex into a single huge one)... I could imagine a possible enhancement to JUST keywords to allow "as given" or "all lowercase" (would allow the two most common variants, which might in many cases be sufficient)... but that doesn't necessarily help with any custom rules, which would need every letter individually split out... of course we could potentially write a helper for that...

What do you think of the "as given" + all lowercase... ie, supporting "false" and "FALSE" but not "fAlSe"?

I'm trying to figure out what could be done here to wrap this PR up and then circle back to the larger discussion of casing as a sep concern.

src/languages/php.js Outdated Show resolved Hide resolved
Copy link
Member

@joshgoebel joshgoebel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking pretty good. I dropped a few notes.

@wkania wkania force-pushed the php/constants-and-php-8-1-keywords branch from d7f714b to e6b4237 Compare December 8, 2021 20:21
@wkania
Copy link
Contributor Author

wkania commented Dec 8, 2021

Looking pretty good. I dropped a few notes.

I fixed issues and added some questions.

src/languages/php.js Outdated Show resolved Hide resolved
src/languages/php.js Outdated Show resolved Hide resolved
src/languages/php.js Outdated Show resolved Hide resolved
src/languages/php.js Outdated Show resolved Hide resolved
@joshgoebel
Copy link
Member

joshgoebel commented Dec 9, 2021

If you could please open a new PR from wkania instead of wkania-forks so that I can push changes also that would be greatly appreciated. GitHub is very picky about such things. Or else allow pushing, but it's likely the fact that your using an alternative account is the issue.

It's costing more time to explain some of the tiny things than to just fix them myself.

@joshgoebel
Copy link
Member

Alternatively you could perhaps add me with repo write permission against your fork wkania-forks/highlight.js....

@wkania wkania changed the title enh(php) Add support for constants and php 8.1 keywords enh(php) Add support for constants Dec 16, 2021
@wkania wkania mentioned this pull request Dec 16, 2021
2 tasks
@wkania wkania changed the title enh(php) Add support for constants (WIP) enh(php) Add support for constants Dec 16, 2021
@wkania wkania changed the title (WIP) enh(php) Add support for constants (WIP) enh(php) Left and right-side of double colon Dec 19, 2021
@wkania wkania force-pushed the php/constants-and-php-8-1-keywords branch from e6b4237 to e929ec6 Compare December 19, 2021 22:32
src/languages/php.js Outdated Show resolved Hide resolved
// This will not detect camelCase classes
const PASCAL_CASE_CLASS_NAME = regex.concat("[A-Z]", IDENT_RE);

const CONSTANT_REFERENCE = regex.concat(IDENT_RE, "\\b(?!\\()");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this a constant reference? In other languages our constant rule is ALL_CAPS_CASE. Is anything following a :: just a constant in php?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if it's truly a constant we'd use variable.constant...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this a constant reference? In other languages our constant rule is ALL_CAPS_CASE. Is anything following a :: just a constant in PHP?

Language allows also lower case or any other, but probably almost all use ALL_CAPS_CASE.
It is just IDENT_RE that can't be followed by (. This distinguishes it from the static method call. I left examples in the PR description. In this context, it's just a helper that is part of a few matching rules.

And if it's truly a constant we'd use variable.constant...

Ok, I guess accessing Enum value is also variable.constant, because we can't distinguish those.

Constant declaration and Enum values also should be marked with variable.constant?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is just IDENT_RE that can't be followed by (.

That's that's a "function dispatch" or something, not a "constant"... constant to me means const or read only or UPPERCASE by convention in may languages...

Constant declaration and Enum values also should be marked with variable.constant?

Perhaps if you could detect they are enum values... I dunno what the PHP syntax is... for most languages it's impossible to do this 100% correctly so we usually have a UPPERCASE constant rule for languages where it's a string pattern and then only catch others in very explicit cases like class name, etc... where it's clean name is constant despite the weird casing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I just looked above, yes the enum declarations could be marked that way I suppose...

src/languages/php.js Outdated Show resolved Hide resolved
@wkania wkania force-pushed the php/constants-and-php-8-1-keywords branch from e929ec6 to 2e93647 Compare January 1, 2022 15:46
@wkania wkania changed the title (WIP) enh(php) Left and right-side of double colon enh(php) Left and right-side of double colon Jan 1, 2022
@wkania wkania force-pushed the php/constants-and-php-8-1-keywords branch from 2e93647 to 4f02ee1 Compare January 1, 2022 15:53
@wkania
Copy link
Contributor Author

wkania commented Jan 1, 2022

@joshgoebel Pls review.
I added all use cases in the description of the PR. Tests use variations of those cases.

@wkania wkania requested a review from joshgoebel January 1, 2022 18:55
@wkania wkania force-pushed the php/constants-and-php-8-1-keywords branch from 4f02ee1 to 1d9cdf2 Compare January 2, 2022 19:40
/const/,
/\s/,
IDENT_RE,
/\s*=/,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be a look-ahead, I'm not sure why we need to consume it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't know enough about the highlight js to know what's the difference.

  1. Why can't we consume it?
  2. Why is look-ahead, is a better solution?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we don't just consume things arbitrarily... we aren't highlighting it here, we only need to guarantee it's presence, not consume it. In general we don't consume things we don't need to.

Later someone may want to add operator matching where we highlight all = and consuming it here would make that impossible... it should be left int he input stream so another rule might match it later.

/(?![a-zA-Z0-9_\x7f-\xff])/
/(::|->)+/,
IDENT_RE,
regex.concat("(?!", WHITESPACE, "*\\()")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there an issue with line 387?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We define again the same regex pattern instead of using an already defined one.

Copy link
Member

@joshgoebel joshgoebel Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You changed the regex pattern... did you not? It was previously ~:

-> ident [ NOT "\s(" AND NOT IDENT ]

You removed the "AND NOT IDENT", portion did you not? Was there a reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, lines 385 and 387 are the same as the regex rule of IDENT_RE. IDENT_RE only has the suffix (?![$])), which was missing here (in my opinion). Documentation states that all labels follow the same pattern:

Variable names follow the same rules as other labels in PHP. A valid variable name starts with a letter or underscore, followed by any number of letters, numbers, or underscores. As a regular expression, it would be expressed thus: ^[a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]*$

I also changed the position of the negative look-ahead from the middle to the end. It didn't break any test and according to my knowledge about regex, it should work the same. If you could provide me with a sample that passes the old code and fail in the new one, then I would fix it.

Copy link
Member

@joshgoebel joshgoebel Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It didn't break any test...

Sadly our tests are far from comprehensive, hence the careful eye I have with reviewing PRs.

This was added with: d49e9fb

The test code added was , 3 => \Location\Web\URI::class)); so I imagine this was to protect the ::class... but now you're explicitly handling that... so if we remove this guard entirely, does it break anything?

If not, lets throw it out completely. And if it does, then we can look at what broken to help clarify it's purpose of existing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Reverting back to, didn't break anything.

src/languages/php.js Outdated Show resolved Hide resolved
src/languages/php.js Outdated Show resolved Hide resolved
Co-authored-by: Josh Goebel <me@joshgoebel.com>
src/languages/php.js Outdated Show resolved Hide resolved
@joshgoebel
Copy link
Member

@wkania I removed the rule, seems fine. I pretty sure it's redundant now. Thanks for all work on this!

@joshgoebel joshgoebel merged commit e828d69 into highlightjs:main Jan 4, 2022
brainlid added a commit to brainlid/highlight.js that referenced this pull request Feb 11, 2022
* master: (816 commits)
  (chore) add sideEffects: false to enable tree-shaking in esbuild & others
  fix(markdown) Handle `***Hello world***` without breaking (highlightjs#3457)
  (chore) DRY up php grammar just a little
  enh(php) support CSSCase attribute naming
  refactor, security issues
  enh(php) Add support for Attributes
  fix(java) prevent false variable init on `else` (highlightjs#3455)
  (ci) min change threshold for size report (highlightjs#3401)
  (themes) Add `tokyo-night-dark` (highlightjs#3467)
  enh(llvm) Improve number support, add `char.escape` (highlightjs#3471)
  (chore) simplify brainfuck grammar
  fix(brainfuck) fix highlighting of initial ++/--
  Minor change to TypeScript types and TypeScript-specific keywords (highlightjs#3466)
  fix(angelscript) Fix highlighting of int8, int16, int32, int64 (highlightjs#3464)
  enh(php) named arguments and fix php constants (highlightjs#3459)
  themes: add new felipec theme (highlightjs#3441)
  (chore) release 10.4.0
  enh(arcade) Add missing keywords for Arcade v1.16
  chore(arcade) eslint --fix, explode keywords
  enh(php) Left and right-side of double colon (highlightjs#3422)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants