Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enh(php) Switch highlighter to partially case-insensitive #3426

Merged

Conversation

wkania
Copy link
Contributor

@wkania wkania commented Dec 9, 2021

Changes

After discussion related to PHP partial case-insensitive .

Checklist

  • Added markup tests
  • Updated the changelog at CHANGES.md

@joshgoebel
Copy link
Member

Actually let me rewrite the keyword lists to arrays first so the PR will be easier to follow... then you would just need a bothCases helper function that would take an array and spit out another array with both variants, ie:

  • ["then"] => ["then", "THEN"]
  • ["FILE"] => ["FILE", "file"]

Assuming the magic constants are case-insensitive.

@joshgoebel
Copy link
Member

#3427

@wkania
Copy link
Contributor Author

wkania commented Dec 10, 2021

Actually let me rewrite the keyword lists to arrays first so the PR will be easier to follow... then you would just need a bothCases helper function that would take an array and spit out another array with both variants, ie:

* ["then"]  => ["then", "THEN"]

* ["**FILE**"]  => ["**FILE**", "**file**"]

Assuming the magic constants are case-insensitive.

Great but there are more valid cases like:
If(true) { }; iF(true) { };

Is there an option to overwrite global case sensitivity with the match: /if/i ?

@joshgoebel
Copy link
Member

joshgoebel commented Dec 10, 2021

Great but there are more valid cases like: If(true) { }; iF(true) { };

But do people do that? That seems very UGLY/bad practice to me. (esp the latter)

Is there an option to overwrite global case sensitivity with the match: /if/i ?

We don't have any easy way to do that... all the grammar regex must be /i or not /i (because they are combined into a single HUGE regex), the two types cannot be mixed, the sensitivity is controlled by the setting as I mentioned earlier.

keywords is done by raw string matching so it's easy to hack for BLAH and blah. We could possibly also calculate a "Blah", but I probably would NOT do it by default... I'd wait and see how common that was in the wild. This also breaks our keyword stats/auto-detection though since if, IF, and If would be counted as 3 keywords not one... although perhaps code that's so badly inconsistent is an edge case.

@wkania
Copy link
Contributor Author

wkania commented Dec 10, 2021

But do people do that? That seems very UGLY/bad practice to me. (esp the latter)

People rather use lowercase. I saw some cases of uppercase: TRUE, FALSE, NULL. So we can ignore it.

@joshgoebel
Copy link
Member

Oh I wonder if it's more common for literals, we could just support those if those are common...

@joshgoebel
Copy link
Member

Can you give me commit access to your fork?

@wkania wkania force-pushed the php/partially-case-insensitive branch from 06493e5 to ba2b233 Compare December 10, 2021 18:12
@joshgoebel
Copy link
Member

Do you think we could match function dispatch similar to how we do in CPP so we just catch all invocations?

https://github.com/highlightjs/highlight.js/blob/main/src/languages/cpp.js#L401

@wkania
Copy link
Contributor Author

wkania commented Dec 10, 2021

Do you think we could match function dispatch similar to how we do in CPP so we just catch all invocations?

https://github.com/highlightjs/highlight.js/blob/main/src/languages/cpp.js#L401

Looks good. Invocations of?

@joshgoebel
Copy link
Member

Invocations of?

Any function... if it's a function we can highlight it as title.function...

@wkania
Copy link
Contributor Author

wkania commented Dec 10, 2021

Invocations of?

Any function... if it's a function we can highlight it as title.function...

That what I thought. Currently, the highlighter does not support function invocation, only declarations. Do you want this to be part of this MR?

@joshgoebel
Copy link
Member

joshgoebel commented Dec 10, 2021

Well you had "function names (case-insensitive)" on your list... so that's the first thing I jumped too... overall we're trying to move away from complex parsing an do simpler pattern matching... so for many languages the same rule can be used to detect function definition and function invocation.

function blah() ...
blah(32)

The same [ident]( rule hits both.

@joshgoebel
Copy link
Member

If 99% of the time people use lower case keywords I'm ok with just trying that and seeing how it works (with the exception of the literals)... would that complete keywords (both) then?

@wkania
Copy link
Contributor Author

wkania commented Dec 10, 2021

If 99% of the time people use lower case keywords I'm ok with just trying that and seeing how it works (with the exception of the literals)... would that complete keywords (both) then?

This should be fine. Will look at it tomorrow.

@wkania wkania changed the title (WIP) enh(php) Switch highlighter to partially case-insensitive enh(php) Switch highlighter to partially case-insensitive Dec 12, 2021
@wkania wkania force-pushed the php/partially-case-insensitive branch from 957e4b3 to 629912a Compare December 12, 2021 18:20
@wkania
Copy link
Contributor Author

wkania commented Dec 12, 2021

@joshgoebel Ready for review.
Keywords match|0 and Error|0 were problematic.

@joshgoebel
Copy link
Member

Keywords match|0 and Error|0 were problematic.

How so?

src/languages/php.js Outdated Show resolved Hide resolved
src/languages/php.js Outdated Show resolved Hide resolved
@joshgoebel
Copy link
Member

This was my first approach ^.

Ugh, ok - that's a lot... :-) We can keep it as-is and see if it works ok I suppose.

@wkania wkania force-pushed the php/partially-case-insensitive branch from 629912a to 6b39cb6 Compare December 13, 2021 19:26
Comment on lines 297 to 302
const result = [];
items.forEach(item => {
item.replace(/\|0$/, "");
result.push(item);
});
return result;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use map for transforms?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this needs to match \d+ not just 0... so it's future proof.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is all I see. I might step in after and make a negative look-ahead list helper... but this looks pretty clean right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joshgoebel done

I think this is all I see. I might step in after and make a negative look-ahead list helper... but this looks pretty clean right now.

Good idea

@wkania wkania force-pushed the php/partially-case-insensitive branch from 6b39cb6 to f5c885f Compare December 13, 2021 22:26
const FUNCTION_INVOKE = {
relevance: 0,
match: [
/(?:->|::|\s|\(|\\)/,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is just an attempt to eliminate false positives - if so what would an example or two be? Why do we need these at all vs just relying on [ident](...?

If this is critical please provide commented code examples of all the variants you're using here.

Copy link
Contributor Author

@wkania wkania Dec 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is just an attempt to eliminate false positives - if so what would an example or two be? Why do we need these at all vs just relying on [ident](...?

If this is critical please provide commented code examples of all the variants you're using here.

All the correct cases are in the Function invoke tests. I also added a comment and test for First-class Callable Syntax,
which was working correctly before this PR (because there was no function invoke). Now it would be a false positive, so I added all the rules.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following... Isn't this just an issue with the non-function guard? If I do this, all tests pass:

diff --git a/src/languages/php.js b/src/languages/php.js
index 9f45841a..297bd55b 100644
--- a/src/languages/php.js
+++ b/src/languages/php.js
@@ -321,7 +321,8 @@ export default function(hljs) {
     relevance: 0,
     match: [
       // to prevent First-class Callable Syntax from being confused as the function title
-      /(?:->|::|\s|\(|\\)/,
+      // /(?:->|::|\s|\(|\\)/,
+      /\b/,
       // to prevent keywords from being confused as the function title
       regex.concat("(?!fn\\b|function\\b|", normalizeKeywords(KWS).join("\\b|"), "|", normalizeKeywords(BUILT_INS).join("\\b|"), "\\b)"),
       /\w+/,
@@ -375,7 +376,8 @@ export default function(hljs) {
       FUNCTION_INVOKE,
       {
         // swallow composed identifiers to avoid parsing them as keywords
-        begin: /(::|->)+[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*/
+        begin: /(::|->)+[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*(?!\s*\()(?![a-zA-Z0-9_\x7f-\xff])/,
+        // scope:"wrong"
       },
       CONSTRUCTOR_CALL,
       {

IE, fixing the guard and using only a word boundary to match the start of function invokes... thoughts? This using multi-match as "look-behind" is dangerous and I really don't want to encourage it at all if it can be easily avoided.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joshgoebel Strange because I applied your fix and they fall.

1 failing

  1) php
       should not cause polynomial backtracking:
     Error: php/[10]/begin: Polynomial backtracking. By repeating any character that matches /\xa0/i, an attack string can be created.

    [a-zA-Z0-9_\x7f-\xff]*(?!\s*
    ^~~~~~~~~~~~~~~~~~~~~~   ^~~

Full pattern:
/(::|->)+[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*(?!\s*\()(?![a-zA-Z0-9_\x7f-\xff])/m

After changing \s* to \s I have a below error and a few offers.

-DateTimeImmutable::createFromMutable(<span class="hljs-keyword">new</span> <span class="hljs-title class_">\DateTime</span>(<span class="hljs-string">&#x27;now&#x27;</span>));
      +DateTimeImmutable::<span class="hljs-function title_ invoke__">createFromMutable</span>(<span class="hljs-keyword">new</span> <span class="hljs-title class_">\DateTime</span>(<span class="hljs-string">&#x27;now&#x27;</span>));

This regex do not detect static function invoke:
DateTimeImmutable::createFromMutable();
or method invoke
$date->format();
and we do not want to higlight First-class Callable Syntax as a function
$test();

P.S. I'm run tests inside Docker.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not cause polynomial backtracking:

That can be ignored [for now] and fixed. I didn't have any markup test errors. Your'e saying you're seeing the same result right - all tests pass with my patch , other than the security issue?

So if we can fix that, then can this be simplified as I'm suggesting?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can names TRULY contain non-breaking spaces (0xA0)? That sounds like a bug to me...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not cause polynomial backtracking:

That can be ignored [for now] and fixed. I didn't have any markup test errors. Your'e saying you're seeing the same result right - all tests pass with my patch , other than the security issue?

So if we can fix that, then can this be simplified as I'm suggesting?

It will be annoying

Can names TRULY contain non-breaking spaces (0xA0)? That sounds like a bug to me...

Yes, but I never saw anyone using it. Looks bad for me.

Copy link
Contributor Author

@wkania wkania Dec 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joshgoebel fixed it with * because only space is valid after the name of the function and before opening bracket.
Did the same in the constructor code.
I hope that's all, thx for this solution.

@wkania wkania force-pushed the php/partially-case-insensitive branch 3 times, most recently from 2eb5d1f to f9afb8f Compare December 14, 2021 19:03
@@ -374,7 +374,8 @@ export default function(hljs) {
FUNCTION_INVOKE,
{
// swallow composed identifiers to avoid parsing them as keywords
begin: /(::|->)+[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*/
begin: /(::|->)+[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*(?! *\()(?![a-zA-Z0-9_\x7f-\xff])/,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure they all needed to change, unless the [ ] is more precise? (it may likely be) IIRC the issue was with this line alone because \xA0* overlaps with \s*.... which is fixed easily enough by changing the look-ahead \s to a space, if that works for our use cases. :-) The character groups in the regex here also could have been fixed. (I might just do this)

So, is this more precise? (in which case I'll likely change them to [ ] for readability.

Any final thoughts? Thanks for all the hard work on this. I think the finish line is just ahead. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure they all needed to change, unless the [ ] is more precise? (it may likely be) IIRC the issue was with this line alone because \xA0* overlaps with \s*.... which is fixed easily enough by changing the look-ahead \s to a space, if that works for our use cases. :-) The character groups in the regex here also could have been fixed. (I might just do this)

So, is this more precise? (in which case I'll likely change them to [ ] for readability.

Any final thoughts? Thanks for all the hard work on this. I think the finish line is just ahead. :)

new    
\DateTime   
();

This is also allowed so I will introduce const SEPARATOR [ \t\n]. This should cover all the cases. Probably this will be better in all those places, if non-breaking space may be part of the name and \s select it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also allowed

I imagined it might be. :-)

so I will introduce const SEPARATOR

I think "WHITESPACE" might be better (and add a comment saying it purposely excludes 0xA0), etc...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "WHITESPACE" might be better (and add a comment saying it purposely excludes 0xA0), etc...

Done @joshgoebel

@wkania wkania force-pushed the php/partially-case-insensitive branch from a8c3799 to d6e220e Compare December 15, 2021 21:42
- also add `title.function.invoke` to the official list
  (Rust was already using)
@joshgoebel joshgoebel merged commit 248b109 into highlightjs:main Dec 15, 2021
@joshgoebel
Copy link
Member

@wkania Thanks for all the effort. I might release 11.4 soon just to see how this case change plays out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants