Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discuss: Keyword boundaries #2429

Open
joshgoebel opened this issue Feb 25, 2020 · 11 comments
Open

Discuss: Keyword boundaries #2429

joshgoebel opened this issue Feb 25, 2020 · 11 comments
Labels
autoclose Flag things to future autoclose. discuss/propose Proposal for a new feature/direction enhancement An enhancement or new feature parser

Comments

@joshgoebel
Copy link
Member

joshgoebel commented Feb 25, 2020

Is your request related to a specific problem you're having?

#2420
#2428

The solution you'd prefer / feature you'd like to see added...

I think perhaps languages need to be able to define their own keyword boundaries... we are considering adding \b|[space] to fix an issue with Clojure (because it allows - in keywords)... and then really (most of the time) keywords shouldn't be followed by . (then it's a method dispatch, not a keyword). Also, they really shouldn't be proceeded by . either (method name), but that would require negative look-behind [though eventually we'll get that.

So as I'm considering adding these things into the CORE highlighter as keyword boundaries it strikes me that I'm making some big assumptions and quite possibly adding grammar specific things to the core, which seems quite bad. If we support languages as diverse as Brainfuck then we should have as little "generic" knowledge of languages as possible baked into the core highlighter.

Even \b as a separator for ALL languages is wrong as we're starting to see cracks with that already with Clojure, etc.

As languages can already configure lexemes I think they also should be able to configure boundaries.

return {
  lexemes: /[a-z-]+/, // allow `-`
  lexemesBoundary: /\b/,
  // or
  lexemesBoundaryStart: /\b/,
  lexemesBoundaryEnd: /(\b|(?=\s))(?!\.)/,
};
@joshgoebel joshgoebel added enhancement An enhancement or new feature parser labels Feb 25, 2020
@joshgoebel
Copy link
Member Author

joshgoebel commented Feb 25, 2020

In the recent past the parser assumed any expression like .keyword or keyword. could not be a keyword... recent parser changes broke this behavior (see #2428) but I'm not sure that was correct core behavior to begin with (though it might be a reasonable default). Definitely there were issues with how it was implemented and the fact that it was completely undocumented.

@joshgoebel
Copy link
Member Author

joshgoebel commented Feb 25, 2020

Here is my issue with the prior behavior . keyword behavior:

You write:

contains: [
  {beginKeywords: "mud dirt gook"}
]

But what you really get [behavior] is essentially:

contains: [
  { begin: /(mud|dirt|gook)\./ }, // first eat up any keywords.
  { begin: /\.(mud|dirt|gook)/ }, // first eat up any .keywords
  {beginKeywords: "mud dirt gook"
]

That solves the problem but it's implicit vs explicit and I don't think the behavior is expected/predicable. There is no obvious reason why this shouldn't work:

contains: [
  { beginKeywords: "mud dirt gook" },
  { begin: ".dirt", className: meta" } 
]

But it will fail, because the first mode will eat the . and advance the cursor 1 forward when it fails to find a match. Trial and error might teach you to put it first, but it's not clear at all WHY that would be so from reading the grammar itself. Actually the behavior may well have worked with the prior engine because of how it had to retry every regex after a match was found.

So the first rule matches via the terminate regex (which includes the .), but when retrying the first rule fails because this time the regex contains no dot (that's how the trick worked). So it would move on to the next rule, which would result in a match. This same behavior isn't available with the current engine though.

Vs say:

contains: [
  hljs.METHOD_GUARD,
  { beginKeywords: "mud dirt gook" },
  { begin: ".dirt", className: meta" } // the .dirt macro has special meaning
]

Now at least the problem is visible when one looks at what METHOD_GUARD does. I think what I'm getting at here is that hidden rules should never eat up content. And this was the only place in the whole parser where that was the case.

@joshgoebel
Copy link
Member Author

joshgoebel commented Feb 25, 2020

I'm not suggesting METHOD_GUARD is how we should solve this. Both solutions are kludgy. Honestly you'd want this to 'just work'. But for the prefix case we really need negative look-behinds in order to do this all with regex. Otherwise we're left building our own custom negative look-behind stuff - and that makes letting users override it with regex (as the original idea suggests) much harder.

For example you might imagine:

contains: [
  {beginKeywords: "mud dirt gook", notAfterChar: "." }
]

And of course beginKeywords could default notAfterChar... but I hate to add more options to the grammars that we know are just going to be killed in a few years by Javascript moving forward. And actually this idea is broken too because of how we aggregate the rules into a single regex... there isn't a great way to dynamically filter out an given rule (to have it match but then not match and continue checking other rules)... the whole engine is based on the premise that regex is enough to get the job done.

@joshgoebel
Copy link
Member Author

joshgoebel commented Feb 26, 2020

@egor-rogov @allejo Any thoughts? CC @marcoscaceres

I'm surprised we have no tests for this behavior (when we sort it out we should add some)... but it wouldn't just be Java that is broken... any grammar relying on the previous magic . beginKeywords behavior could now be broken (to varying degrees). The fact that no one has reported it until now is either a really good or really bad sign. :-/

We can fix the . postfix behavior with a simple negative lookahead if we want, but the sticky problem is handling the prefix case (. out front of a keyword).

We could probably (short-term) add the previously problematic behavior (or some variant) back if we had to, but ugh. I'd love if there were some alternative suggestions.


MANY MANY grammars use beginKeywords, only 2 use METHOD_GUARD presently.

@joshgoebel
Copy link
Member Author

joshgoebel commented Feb 26, 2020

Another note, these boundaries (speaking of the magic . boundaries) have only ever been applied to beginKeywords... in fact I think that's a key differentiator from the normal keywords, but I wonder if that actually makes sense or if most people are aware of it. IE, keywords: "false" would highlight false in all the following lines:

.false.
.false
false.

From the docs:

Used instead of ``begin`` for modes starting with keywords to avoid needless repetition:

::
  {
    begin: '\\b(extends|implements) ',
    keywords: 'extends implements'
  }

Where-as I'd wager that 99%+ of the time what you really want in ALL cases is actually [boundary]keyword[boundary]. Where boundary included the "not dot" provision. The exception would be using keywords to match method dispatch, but you could do this with a custom boundary - and it would probably be better to specify which keywords were actual keywords and which were method names.


Slight change of topic. The existing keyword system already matches word boundaries since the default lexeme is \w+ and it's implied that the "character" immediately to either side of \w+ is going to be \b... When I made this explicit vs implicit (ie, applying it to ALL lexemes) only 5 tests broke.

So actually, with keywords you already have an opportunity to specify custom boundaries (by including them in the lexeme regex), so maybe that is a clue here... hmmm... Although that only works if other modes haven't eaten the before and after content. (since keywords are only processed inside of a rule's un-matched content - or between matches).

All of them weird exceptions that could be thought about and handled if we were going to much about with what keywords and boundaries meant on a deeper level.

  • .keyword
  • %keyword
  • %% (maxima)
  • @keyword
  • + (scheme, it's debatable if this should even be a keyword)

IE, cases where a custom boundary would need to be defined since a word boundary doesn't exist between these symbols and whitespace (or other symbols).

@joshgoebel
Copy link
Member Author

joshgoebel commented Feb 26, 2020

Further notes:

The original \b boundaries for beginKeywords were only added as a result of deficiencies in the original parsing engine (that we have since fixed): 9dca4cc

Wrap "begin" derived from "beginKeywords" in \b instead of doing it only for terminators. If "begin" doesn't have it testRe doesn't see it and can recognize a shorter "begin" (e.g. beginKeywords: 'def' passes testRe when beginKeywords: 'defmodule' is found by terminators).

And the original . fixes context:

Then a series of commits to fix:

The last one being when the \. pinning on either side of the regex was added. So the behavior was definitely intentional, but still very much undocumented anywhere (that I can see).

@egor-rogov
Copy link
Collaborator

I'm afraid it's all too deep for me to understand without spending days digging through the parser...
I just want to say that most of the languages use simple alphabetic keywords. Of course it would be great to have a solution that "just works" all the time, but on the other hand It doesn't look totally unfair if you have to use some "tricks" for several fancy languages.

@joshgoebel
Copy link
Member Author

joshgoebel commented Feb 27, 2020

In this case I'm not sure Java is a "fancy language"... I'll presume you're referring to the complexities of Clojure when you say "fancy".

So if we want to try and start breaking this down to simpler/smaller questions:

  • Should the prior behavior of beginKeywords excluding keywords starting with . and ending with . be preserved?
  • IF SO, isn't it strange that beginKeywords has entirely different behavior than keywords?
  • I think one problem is in many languages we use keywords specifically for method calls. But I'm not sure this is purely a docs issue. Though perhaps now this stuff is so "legacy" that it's too late to change it.

One proposal:

  • Make keywords and beginKeyword more similar, ie follow the same . rules.
  • Add a way to purposely say you're adding method keywords (don't get hung up on the syntax):
keywordRules: [
 { name: "built_ins", begin: '\.', excludeBegin: true, list: "toUpperCase toLowerCase map filter ..." }

This would allow keyword matchers to be mini-modes, rather than just a single regex, which is actually a useful idea in it's own right. This is also slightly annoying though in that it's ALMOST solvable with just lexemes if we had regex look-behind. IE, you could just do:

{ lexemes: '\.\w+` }

Though you still need a way to say "don't highlight the .".

This is probably a lot of effort though regarding legacy understanding and updating all existing grammars, etc.

Another idea:

  • deprecate beginKeyword and come up with a BETTER name and document how it's significantly different from keywords

This becomes a simple search and replace type operation, though I'm not 100% sure what to propose as better naming.

@joshgoebel
Copy link
Member Author

I think I'd first like to think about what makes the most sense (trying to ignore legacy) if we weren't limited technically - and then see what that type of solution might look like. To me that means keywords need to be somehow more configurable in general.

@joshgoebel
Copy link
Member Author

joshgoebel commented Feb 27, 2020

Actually this is a pretty great idea:

{ lexemes: '\.(\w+)` }

First capture group gets the highlight. Although most languages would need multiple lexemes expressions, so you could specify different matches for different types of keywords.

@joshgoebel
Copy link
Member Author

This thread might be one of the longer lived ones. I'm going to circle back and see about finding some temp fix to re-add the old . behavior to beginKeyword for now, but long-term still think there is a lot of room for improvement with regard to keywords.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
autoclose Flag things to future autoclose. discuss/propose Proposal for a new feature/direction enhancement An enhancement or new feature parser
Projects
None yet
Development

No branches or pull requests

2 participants