Discuss: Custom token callbacks, or mode value property? #1133

Closed
foo123 opened this issue Mar 27, 2016 · 27 comments
Labels
parser plugin Specific plugin or plugin discussion

Comments

@foo123

foo123 commented Mar 27, 2016

Hello, I'm playing around with highlight.js (9.2.0) and want to integrate it with a -grammar add-on (following previous work on syntax highlighting, for example here), which makes it possible to syntax-highlight code by defining a grammar specification for the language (e.g. in BNF form).

I have already written some integration code (to be uploaded here), but so far, in order to use the grammar parser and integrate with the hljs core highlighter, some mode boilerplate code is needed (for example multiple modes inside contains and dummy lexemesRe, beginRe, endRe functions). It would be easier (and more flexible) if there were some property that allowed a callback, or even a static value, with the lexeme (i.e. token) to be directly available from the mode itself (and passed directly to the highlighter to be wrapped in <span>[token value]</span> for highlighting).

To be more explicit, consider the fragment from hljs highlight method below:

// ..
    var mode_buffer = '';
    var relevance = 0;
    try {
      var match, count, index = 0;
      while (true) {
        top.terminators.lastIndex = index;
        // it seems the only way is to override mode.terminators.exec
        // in order to hook in here with the token;
        // maybe accepting an already parsed value/callback would make all this more flexible,
        // e.g. if (mode.value) count = processLexeme(value.substr(index, match.index - index), mode.value());
        // or something like that
        match = top.terminators.exec(value);
        if (!match)
          break;
        count = processLexeme(value.substr(index, match.index - index), match[0]);
        index = match.index + count;
      }
      processLexeme(value.substr(index));
      for(current = top; current.parent; current = current.parent) { // close dangling modes
        if (current.className) {
          result += '</span>';
        }
      }
      return {
        relevance: relevance,
        value: result,
        language: name,
        top: top
      };
    } catch (e) {

// ..

Nikos

@Sannis Sannis added the code label Mar 28, 2016
@foo123
Author

foo123 commented Mar 30, 2016

Anyway, I've managed to integrate the grammar add-on with highlight.js and am going to update a repo shortly with the code for anyone interested (see the screenshot below for XML parsing, including highlighting of errors, if defined, as unstyled code).

Not sure if and how an option for a mode to return a ready token to be highlighted directly has any merit, or how easy it would be to implement (judging from the hljs code, it would probably need changes in more than one place).

[Screenshot: snap-2016-03-30-at-23 13 26]

Also, I'm only able to make it work using the unpacked highlight.js; it breaks when I use the packed highlight.pack.js file from the download package (without any languages, just the highlighter).
Can you help me resolve this one?

@foo123
Author

foo123 commented Mar 30, 2016

The code is in this repository, https://github.com/foo123/highlightjs-grammar, for anyone interested.

@joshgoebel
Member

joshgoebel commented Oct 16, 2019

Wouldn't you have more luck hooking into the keywords pipeline? And just dynamically adding keywords? I'm not sure I 100% follow what you're trying to do here. A more general (or perhaps more specific) example would be great.

If you want more than just keywords though (which are highlighted in the "gaps") then yeah you get into the end expression compiling and having to recompile those expressions on the fly while parsing, etc... and it seems from the second snap that you want to selectively NOT highlight tags based on whether they've been seen before or not...

Not a lot of fun.

Really, don't rules/modes need some type of callback system to do this sort of thing well? I haven't looked at any of your code, just read this issue. I think that would work much better than CHANGING the rules on the fly... just let them continue to match and then dynamically decide what you want to do about it.

@joshgoebel joshgoebel added parser and removed code labels Oct 16, 2019
@foo123
Author

foo123 commented Oct 18, 2019

The whole point of the grammar add-on is that a user can specify a grammar for a language and then highlight based on that grammar. This is a very general approach. I have implemented the add-on for various syntax highlighters, including highlight.js. The problem (the issue is quite old and I haven't worked on it since then) is that I found no way of feeding the parsed tokens into the highlighter's flow so that they get highlighted. So I made something of a little hack. I don't know if a new version exists that eases the burden of this. It works as is, and you can see the online interactive example.

The problem is that highlight.js assumes from the start that some regexes will be used and is hardcoded into that mindset, which rules out more general parsers (ones that don't simply use a set of regular expressions), like the one the -grammar add-on uses. So the only way I found (for the version I am referring to) was to make a hack and override the regular expressions, while I think a modular system that accepts a general callback (as an option: regexes can still be used, but if a callback is passed, use the callback to retrieve the token) would be more general and modular.

@joshgoebel
Member

The problem is that highlight.js assumes from the start that some regexes will be used and is hardcoded into that mindset, which rules out more general parsers (ones that don't simply use a set of regular expressions), like the one the -grammar add-on uses

Well, that's because Highlight.js is an integrated unit. It's a parser-highlighter. It's not two separate pieces. You can't just highlight an arbitrary stream of things.

I'm interested in the grammars perhaps becoming more complex, with callbacks and such things. It'd be helpful to have a very specific example of what you're trying to do, how you hacked it, and how you imagined it'd work in a perfect world.

You mention tokens but I'm not sure what tokens you mean.. an actual concrete example might be helpful.

@joshgoebel
Member

Does your project already have its own parser that spits out tokens or an AST that you use with other things? I.e., if we exposed JUST the highlighter (and not the parser), would that be helpful?

@foo123
Author

foo123 commented Oct 19, 2019

Yes, what I mean (as in the code example posted in my first comment) is that highlight.js works only with a language that defines a set of regular expressions, which it then parses on its own and highlights. There is no way a custom parser (whatever that may be: a full-blown parser, other regexes, anything not following the rationale highlight.js assumes) can be used with hjs taking the result and highlighting it. So initially I suggested some way for the parser to be made autonomous, and maybe overridden as well. For example, by allowing a custom callback which takes the current state as an argument and returns the next token (string + class). That would be more modular and allow the add-on to work transparently instead of me making this ugly hack. The only way to understand is to check the code of the add-on which makes the hack and decide on the best way to make it modular.

https://github.com/foo123/highlightjs-grammar/blob/master/src/main.js#L58

You can see in the code of the add-on that, in order to make the integration and use the add-on's parser, I had to write custom functions that simulate regular expressions and pass this hacky object as the language definition to hjs (as that is what hjs assumes by default, with no way of changing it). That is my point. Maybe you can think of some way, for example a callback, to decouple the parsing from the highlighting and make it more general and modular. The default way can still be used (e.g. if a language defines a set of regular expressions), but allow tokens to be overridden via some extra parameter and, if it is present, delegate parsing to that method (a token is a unit of code that is highlighted by itself, e.g. an identifier). Hope this is clear.
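
As a rough illustration of the kind of shim described above (not the add-on's actual code), a custom tokenizer can be wrapped in an object that exposes the `lastIndex` / `exec` surface which the highlight loop calls on `terminators`. The names `makeRegexLikeShim` and `nextNumber`, and the token shape `{ text, start, type }`, are hypothetical and chosen only for this sketch.

```javascript
// Hypothetical sketch: wrap a custom tokenizer in an object that looks
// enough like a RegExp (lastIndex + exec) to be driven by a loop that
// expects one. None of these names come from highlight.js or the add-on.
function makeRegexLikeShim(nextToken) {
  return {
    lastIndex: 0,
    // Mimics RegExp.prototype.exec: returns an array-like match with
    // `index` and `input`, or null when no further token is found.
    exec(input) {
      const token = nextToken(input, this.lastIndex);
      if (!token) return null;
      const match = [token.text];
      match.index = token.start;
      match.input = input;
      match.type = token.type; // extra info carried along for the highlighter
      this.lastIndex = token.start + token.text.length;
      return match;
    }
  };
}

// Toy tokenizer (an assumption for the demo): finds the next run of digits.
function nextNumber(input, from) {
  let start = from;
  while (start < input.length && !/[0-9]/.test(input[start])) start++;
  if (start >= input.length) return null;
  let end = start;
  while (end < input.length && /[0-9]/.test(input[end])) end++;
  return { text: input.slice(start, end), start, type: "number" };
}

// Drive the shim exactly as one would drive a global regex.
const shim = makeRegexLikeShim(nextNumber);
const found = [];
let m;
while ((m = shim.exec("a1 bc 234 d")) !== null) {
  found.push([m[0], m.index, m.type]);
}
```

The drawback the thread points at is visible here: the host still believes it is running a regex, so anything beyond "text plus position" (such as the token type) has to be smuggled through extra properties.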

@joshgoebel
Member

joshgoebel commented Oct 19, 2019

Well, I don't think it's a high priority, but one could imagine separating the two pieces in the future... so you'd have the highlighter and the parser... so in your case you'd parse code HOWEVER you wanted and then you'd pass an AST and ask for it to be highlighted.

It sounds like you'd rather just take over the parsing process entirely.

Once you have an AST though is it really that hard to just generate the markup yourself? It seems that is actually the easy part... and you could just leverage our CSS themes for the "look"...

@foo123
Author

foo123 commented Oct 20, 2019

I have already solved the issue with a hack. It would be good (as some other highlighters do) to have a process that can be overridden in a modular way rather than being hardcoded.

My initial intention was to create add-ons for as many syntax highlighters as I was aware of. The benefit is that people create only a single grammar and can use it throughout all highlighters and editors of their choice without any modification (except the styling names, the rest remains the same). So that was the intention and that is why I added support for hjs. Doing so, I noticed this issue and suggested a workaround.

I leave it up to you if you want to close this issue.
Cheers!

@joshgoebel
Member

joshgoebel commented Oct 20, 2019

Also I'm not sure you really answered me... don't you really just want an AST from the parser? Or are you doing things the parser isn't technically capable of? In that case it sounds like you'd want your own parser and JUST our highlighting engine. I'm not sure I completely understand the desire to turn our parser into a more general purpose parser.

@joshgoebel
Member

joshgoebel commented Oct 20, 2019

I don't think I understand your goals. Let's start with the stated end goal:

The benefit is that people create only a single grammar and can use it throughout all highlighters and editors of their choice without any modification

Ok... but what you'd quickly find is that the parsers are NOT all equal at all. Our approach is VERY different from Prism.js's approach, for example. They're a bit more minimalistic in what they choose to parse, for one, which results in a very different parser design and limits what they can do vs. what we can do.

So I'd think it would be difficult to impossible to achieve "write once run anywhere" simply because all the parsers you work with are so very different (or not even full blown parsers at all, in the sense of the type of parsing toolchain you'd find in a full blown compiler, for example). Highlight.js and Prism.js are really very, very advanced tokenizers.

So I truly don't understand why you wouldn't pivot to writing your own parser that did EVERYTHING you wanted, and then just hook to the highlighters at the very end - or perhaps none of them are easily designed to do that? BUT...

Speaking for Highlight.js the highlighting part is trivial, vs the parsing (and grammars). If we were ever to split the parser/highlighter the highlighting (parse tree -> HTML) code is likely 20-30 lines... you're just looping over a nested tree and turning it into linear HTML.

So if your goal was to have the "Highlight.js look" or be compatible with our themes, all you need is your own parser and a tiny shim to convert to HTML. If you have your own grammar files you don't really want our parser as far as I can see.
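
A minimal sketch of the "parse tree -> HTML" shim mentioned above might look like the following. The node shape (`{ kind, children }` with plain strings as leaves) is an assumption made for illustration and is not highlight.js's internal tree format; only the `hljs-` class prefix follows the convention the bundled themes target.

```javascript
// Minimal sketch: turn a nested token tree into highlight.js-style HTML.
// The node shape ({ kind, children }) is an assumption for illustration;
// it is not highlight.js's internal tree format.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

function renderTree(node) {
  // Plain strings are leaf text.
  if (typeof node === "string") return escapeHtml(node);
  const inner = node.children.map(renderTree).join("");
  // Nodes without a kind contribute no wrapper (e.g. the root).
  return node.kind
    ? `<span class="hljs-${node.kind}">${inner}</span>`
    : inner;
}

// Example tree as a custom parser might produce it (hypothetical data).
const tree = {
  kind: null,
  children: [
    { kind: "keyword", children: ["var"] },
    " x = ",
    { kind: "string", children: ['"a<b"'] },
    ";"
  ]
};
const html = renderTree(tree);
```

With a shim like this, a custom parser only has to emit the tree; the existing CSS themes supply the look.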

So I guess I'm wondering why you didn't go that route as it would seem to be MUCH simpler than what you're trying to do.

@foo123
Author

foo123 commented Oct 21, 2019

All the *-grammar add-ons have a full-blown parser of their own (in fact the exact same parser engine; only the integration code changes). The integration with existing editors and highlighters is what is done. So the parser of the add-on is used and is integrated in some way into the highlighter or editor. The user simply defines a grammar, which can then be used as is for that language throughout all highlighters and editors of their choice. This is the intention and how it works.

Of course I can create my own syntax highlighting framework (and use my ready made grammars and parsers) but that is another question. The issue here is integration with existing highlighters as an addon.

I am aware that some highlighters are narrowly implemented, in that parsing and highlighting are a single process, strictly coupled and hardcoded, usually with regular expressions (which solve the parsing problem only partially at best).

Consider the grammar add-on integration as an add-on for your own framework. It is made purely for that purpose: an add-on which integrates with your framework as well as others, so people can choose freely and use it with yours. A kind of kudos, if you like, for this framework. Nothing more, nothing less.

I only made a suggestion that maybe the procedure can de-couple the parser from the highlighter so that other parsers (for example the parser of the grammar addon) can be plugged in in some modular way (while still allowing for the default parser and behaviour to work as is).

I have already resolved this issue with a hack. I wish I did not have to hack it, but it doesn't matter. It works (the packaged version does not work though, only the unpackaged version of the framework). So if you would like to dismiss this issue, that is fine with me.

@joshgoebel
Member

I learn by discussing things, hence this conversation. Plus this is related to:

#1086

I guess I just don't see why you thought you had to use our code at all. That's what perplexes me. You could still say "add-on for Highlight.js" if all you did was piggyback on the themes and write the 20 LOC or so wrapper to translate from your parser's output to highlight.js-style HTML.

That wouldn't do 100% of what we do, because we have some weird features, but that'd be like 95% - for what I'd imagine would have been a lot less effort. So I'm trying to ask if you considered that before you actually went the route you did. In hopes that I might learn something to educate my own undertaking. :-)

@foo123
Author

foo123 commented Oct 22, 2019

I totally support the request from that other issue #1086. In fact the two issues can be grouped together. De-couple parsing from highlighting, allow plugged-in parser to function if given and make html only one output format from possibly other output formats (eg pdf). I totally agree.

When something is integrated into something else, it has to hook in somewhere (into some existing code) in order to function transparently. So the add-on I made has to use and hook into the code of the parent framework. Is this what you don't fully understand? This makes the integration stable.

Maybe hjs offers features not covered by my hjs add-on. This is fine. Users who want to use the add-on with hjs can choose what they need. It is purely optional, a way to save time and complexity if they want to highlight a language of their own, for example, where no ready-made language definition exists and they simply define a grammar (much easier than writing a parser for a language, even if only regular expressions are used).

However, I am still not sure you understand the original issue and its suggestion. It is really very easy to add some option or callback in the mode which, if given, the highlight routine uses to retrieve the next token, instead of relying on the mode defining regular expressions and handling them directly. Really it is so simple.

@joshgoebel
Member

Really it is so simple.

I'd be happy to review a PR (proof of concept), but we have to watch out for "simple to add" things. Not everything that can be added should be added. And we also don't want to add things quickly without thought... if we're adding an API that's going to stick around a LONG time we'd rather take it slow and get it right. What you're describing is a pretty huge edge case for most users of Highlight.js.

Also, anything we add we have to maintain for the long-term, and troubleshoot, and answer questions about, etc. So it's also possible this kind of thing gets easier, but it may never be officially supported.

which if given the highlight routine retrieves the next token from that parameter or callback instead of relying on the mode defining regular expressions and handling them directly. Really it is so simple.

How would it even know what the next token is without the regex? The regexes define the tokens and modes.

allow plugged-in parser to function

I'm not sure that's an important goal, but there would be benefits of splitting the two processes and it would make this kind of thing easier. You could just grab which ever part of the pipeline you wanted and use that... that wouldn't be the same as "plug-in" though, at least in my mind.

If you already have a parser I think you'd just want to use a highlightParsetree function or some such - there is no need for the whole pipeline.

@foo123
Author

foo123 commented Oct 23, 2019

The thing with the callback option is that it allows the mode to have its own way of tokenizing the input stream, maybe not using regular expressions, or not using this kind of regular expressions that hjs assumes by default.

So the callback is simply an entry point for the mode's custom tokenizer into the hjs parsing and highlighting routine. If no callback is given, then the default hjs tokenizer as it is right now (but made into its own routine) is used, based on the regexes that the mode defines (everything that worked before continues to work).

This was my original point. And of course making the html rendering routine simply an option as well, hjs can allow rendering output into other formats as well (also given by mode for example, via a custom render function).

You can think about it, after all this is your framework and you have the last word.
Cheers!

@joshgoebel
Member

You can think about it, after all this is your framework and you have the last word.

No, I'm just a maintainer. :-)

maybe not using regular expressions

That just sounds like making us so generic that now we do almost anything. Maybe we'll get there eventually with refactoring, but I'm not sure that seems like a reasonable goal to start with. It feels a little like you're describing a framework/library that we'd build Highlight.js on TOP of... not what Highlight.js actually is or wants to be. I agree the idea sounds cool, I just think perhaps you're describing a NEW project. :-)

Along those lines, though, perhaps you'd find this interesting: #2212

Although I was imagining "Recompiling" the grammars, not just slurping them in by building the Prism parsing engine into Highlight.js. :-)


I'm still curious though...

Could you give a working example or walk me thru it? What would the callback do? Be passed the string and position and return a token? I can visualize the concept but not the detail. Or perhaps the details aren't really that fleshed out?

OTTOMH it still sounds like you'd want to use your OWN parser and then just pipe that into our "convert to styled HTML" pipeline (if we were to make that easier to do).


Do you have an example of a C++ or JSX grammar for your parser thingy?

@joshgoebel
Member

The kind of things I imagine being useful are things that fit into the existing modes/rules/regex model... ie, before/after match hooks, or things that allowed you to change the rules slightly while you parse or perhaps decide how and when particular rules should be applied (or not).

IE, seeing a function definition and then later knowing when you saw that identifier that it was a function you'd seen earlier, etc.
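
To make the hook idea concrete, here is a small sketch in which a regex still finds the matches and a callback decides afterwards how each one is classified. `tokenizeWithHook`, `trackFunctions`, and the `state` fields are hypothetical names for this example only; highlight.js did not expose such a hook at the time of this comment.

```javascript
// Hypothetical sketch of an "after match" hook: the regex still finds the
// tokens, and a callback decides afterwards how each match is classified.
function tokenizeWithHook(code, afterMatch) {
  const wordRe = /[A-Za-z_]\w*/g;
  const state = { knownFunctions: new Set(), previous: null };
  const tokens = [];
  let match;
  while ((match = wordRe.exec(code)) !== null) {
    const token = { text: match[0], type: "identifier" };
    afterMatch(token, state); // the hook may change token.type
    tokens.push(token);
    state.previous = token;
  }
  return tokens;
}

// Example hook: remember names that follow `function`, then tag later uses.
function trackFunctions(token, state) {
  if (token.text === "function") {
    token.type = "keyword";
  } else if (state.previous && state.previous.text === "function") {
    state.knownFunctions.add(token.text);
    token.type = "function";
  } else if (state.knownFunctions.has(token.text)) {
    token.type = "function";
  }
}

const tagged = tokenizeWithHook("function foo() {} foo(); bar();", trackFunctions)
  .map((t) => `${t.text}:${t.type}`);
```

Here the second `foo` is tagged as a function because its definition was seen earlier, while `bar` stays a plain identifier; the rules themselves never change.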

@foo123
Author

foo123 commented Oct 24, 2019

I already replied to that other issue and mentioned this add-on, which works the same for both highlightjs and prism (and others). Maybe the author of the issue will find it useful. After all, this is the kind of use case the add-on targets: maximum portability and ease of creating highly detailed language definitions.

It is very easy to have this optional callback. I provide an example based on the code fragment of highlightjs in my first comment.

function hjs_tokenizer(code, mode, state)
{
    // maybe first time entering the tokenizer, init state    
    state.top = state.top || mode;
    state.index = state.index || 0;
    // maybe add more things in state object if needed
    // ..
    var mode_buffer = '';
    var relevance = 0;
    try {
        var match, count;
        while (true) {
            state.top.terminators.lastIndex = state.index;
            match = state.top.terminators.exec(code);
            if (!match) break;
            count = processLexeme(code.substr(state.index, match.index - state.index), match[0]);
            state.index = match.index + count;
        }
        processLexeme(code.substr(state.index));
        for(current = state.top; current.parent; current = current.parent) { // close dangling modes
            if (current.className) {
                result += '</span>';
            }
        }
        return {
            relevance: relevance,
            value: result,
            language: name
        };
    } catch (e) { /*..*/ }
}


// then inside parsing and highlighting routine check if custom tokenizer given else use default
// ..
var tokenizer = mode.tokenizer || hjs_tokenizer, state = {} /* initially state is a blank object; the tokenizer can add whatever it needs to this object to keep state between calls */;
var token;
// parse
do {
    // call tokenizer repeatedly, until all tokens are exhausted
    token = tokenizer(code, mode, state);
    // process token
    // ..
} while (token);

This is rough but you get the idea, hopefully. The trick is to pass an empty object representing the tokenizer state. Initially it is blank, so the mode knows this is the first time it is called for this string of code. Then it is initialised, and on subsequent calls the state is kept and it proceeds normally, as every tokenizer can do. The tokenizer handles its own state, i.e. what it needs to store between subsequent calls. The state is tokenizer-specific. The framework simply facilitates this by passing an empty object which the tokenizer handles as needed by itself. The state, being an object, persists between calls.

Alternatively the tokenizer can parse the whole code with one call only (not calling it repeatedly). This is fine as well, for my use case, for example, I can use both approaches. In fact they are equivalent if a buffer is used to buffer results and return them all at once. So no big difference, if tokenizer is called repeatedly or just once and for all.
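
The claimed equivalence of the two calling conventions can be sketched as follows. `nextWord` is a toy incremental tokenizer invented for this example, and `tokenizeAll` / `makeIncremental` are hypothetical helper names; the state-object contract follows the description above.

```javascript
// Sketch of the two calling conventions and how one converts to the other.
// `nextWord` is a toy incremental tokenizer invented for this example.
function nextWord(code, state) {
  state.index = state.index || 0; // first call: initialise the state
  const re = /\S+/g;
  re.lastIndex = state.index;
  const match = re.exec(code);
  if (!match) return null; // no tokens left
  state.index = re.lastIndex;
  return { token: match[0], type: "word" };
}

// Incremental -> batch: call repeatedly and buffer the results.
function tokenizeAll(incremental, code) {
  const state = {};
  const buffer = [];
  let token;
  while ((token = incremental(code, state)) !== null) buffer.push(token);
  return buffer;
}

// Batch -> incremental: tokenize once, then hand out one token per call.
function makeIncremental(batch) {
  return function (code, state) {
    if (!state.tokens) {
      state.tokens = batch(code);
      state.position = 0;
    }
    return state.position < state.tokens.length
      ? state.tokens[state.position++]
      : null;
  };
}

// Round trip: incremental -> batch -> incremental -> batch.
const batch = (code) => tokenizeAll(nextWord, code);
const roundTrip = tokenizeAll(makeIncremental(batch), "var foo = 1");
```

Either form can therefore serve as the contract; the choice mostly affects whether the host or the tokenizer owns the buffer.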

@joshgoebel
Member

joshgoebel commented Oct 24, 2019

Alternatively the tokenizer can parse the whole code with one call only (not calling it repeatedly).

Now this is something else entirely: in this case you're saying you don't need the parser at all (which is true in your case, as I think we've both mentioned). In someone else's case I dunno how it would work, since the tokens themselves are derived from the regex matches... if you take away the regexes (as in our plaintext grammar) then the WHOLE text becomes a single huge "token" anyway, so there is really no iteration going on.

Take a look at:

#1492

Couldn't your whole project be built as an input plugin that just dumps the "code" (which you've already transformed however you want) thru plaintext (which wouldn't change it one bit)?

hljs.usePlugin("someOtherParser", {language: "pascal"})
hljs.highlight("plaintext", code)

And of course you could wrap those two lines to give it a nicer API... Ok, actually it's not that simple since we still have to figure out the parseTree/HTML division of labor and where that happens... hmmm...

@joshgoebel
Member

This is rough but you get the idea, hopefully.

I get the idea, except for what is considered a "token" by the lexer without any regex rules... if you'd like to elaborate that might be helpful, but it's obviously not super relevant in your case where you could just take the content whole.

@foo123
Author

foo123 commented Oct 25, 2019

Hmm, you are confused about a couple of things and maybe this is my fault.

A token (or lexeme or however you want to call it, but token is what is used in parser parlance) is simply a unit or fragment of text that is standalone and highlighted by itself. For example a string is one token, an identifier is another token, a reserved keyword is another token and so on..

So the tokenizer receives a stream of text and breaks it up into tokens according to some rules.

For example the following javascript code:

var foo = "bar";

is split into the following tokens:

[
{token: "var", type: "keyword"},
{token: "foo", type: "identifier"},
{token: "=", type: "operator"},
{token: "\"bar\"", type: "string"},
{token: ";", type: "delimiter"}
]

Hope all is clear so far.

So the default tokenizer of hjs (hjs_tokenizer above) uses some regex rules to do this splitting into tokens and highlights each one (e.g. by surrounding it with <span></span> tags).

But the mode can define its own tokenizer which uses some other way to split the input text stream into tokens (i.e. mode.tokenizer above). No problem: for the highlight routine these are equivalent, and the routine does not ask "how do you split into tokens since you don't give me any regexes?". It simply lets the mode's own custom tokenizer do that. But in order to treat both approaches uniformly we have to define some way or contract by which tokenizers are called (and make the default tokenizer conform to that contract).

This is also simple: the tokenizer can be called repeatedly as long as it finds tokens in the text, or be called once and return all the tokens at once (actually both approaches are equivalent, don't be confused by this, they are simply two ways of doing the same thing). One approach can be deterministically transformed into the other (that is why they are equivalent).

I present the repeated approach (rough) where a state object is used to track state between subsequent calls to the same tokenizer. Each call to tokenizer returns one token with its value and type (and possibly other info), as the above tokenizing example demonstrates.

Then (unfortunately this part is not well demonstrated in my previous comment) the highlight routine takes the token and creates an output by rendering it (eg in html by wrapping it between <span class="token-type">token-value</span> tags). This part is shown as being made inside the tokenizer and appended each time. In fact the hjs_tokenizer should return simply an object representing a single token (as above tokenizing example shows) or if called once and for all, an array of objects representing all tokens in the input text. It should not try to format to some output, this should be a separate step further down.
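
A sketch of that separation, under stated assumptions: `tokenize` returns plain token objects shaped like the example above, and `renderHtml` is a separate step that formats them. The rule table, the use of sticky regexes, and the bare class names are choices made for this illustration only; a mode-supplied tokenizer could produce the same objects by any other means.

```javascript
// Sketch only: a toy tokenizer that returns plain token objects, and a
// separate rendering step. The rule table and class names are assumptions.
const rules = [
  { type: "keyword", re: /var\b/y },
  { type: "identifier", re: /[A-Za-z_]\w*/y },
  { type: "operator", re: /=/y },
  { type: "string", re: /"[^"]*"/y },
  { type: "delimiter", re: /;/y },
  { type: null, re: /\s+/y } // whitespace: kept as untyped text
];

function tokenize(code) {
  const tokens = [];
  let index = 0;
  while (index < code.length) {
    let matched = null;
    for (const rule of rules) {
      rule.re.lastIndex = index; // sticky regexes match only at this position
      const m = rule.re.exec(code);
      if (m) { matched = { token: m[0], type: rule.type }; break; }
    }
    // Unknown character: emit it as untyped text so the loop always advances.
    if (!matched) matched = { token: code[index], type: null };
    tokens.push(matched);
    index += matched.token.length;
  }
  return tokens;
}

// Rendering is a separate step that knows nothing about how tokens were found.
function renderHtml(tokens) {
  const escape = (s) =>
    s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
  return tokens
    .map((t) => t.type
      ? `<span class="${t.type}">${escape(t.token)}</span>`
      : escape(t.token))
    .join("");
}

const tokens = tokenize('var foo = "bar";');
const typed = tokens.filter((t) => t.type !== null);
const output = renderHtml(tokens);
```

Because `renderHtml` only sees token objects, swapping in a different output format would mean replacing that one function.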

Couldn't your whole project be built as an input plugin that just dumps the "code" (which you've already transformed however you want) thru plaintext (which wouldn't change it one bit)?

hljs.usePlugin("someOtherParser", {language: "pascal"})
hljs.highlight("plaintext", code)

This issue is quite old and I am not aware of such functionality. If this is newer and it helps, I will give it a look. Can you explain how this works?

@joshgoebel joshgoebel added the plugin Specific plugin or plugin discussion label Oct 26, 2019
@joshgoebel
Member

joshgoebel commented Feb 16, 2020

See:

#2404
#2395

Soon it should be a LOT easier to do this than in the past... You'd use a before callback on highlight to insert your custom parser/tokenizer and render your parsed HTML. You're still responsible for doing all the tokenizing and HTML rendering yourself.
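
The flow described here can be sketched with a stand-in host. `runHighlight` below is NOT highlight.js; it only mimics the idea of a "before" hook that may supply a finished result so the built-in parsing is skipped. The `"before:highlight"` key and the `result` field mirror, as far as I understand them, the plugin callbacks referenced in this comment; treat the exact shape as an assumption and check the plugin docs before relying on it.

```javascript
// Stand-in host, NOT highlight.js itself: it only mimics the idea of a
// "before" hook that may supply a finished result and skip built-in parsing.
function runHighlight(plugins, language, code) {
  const context = { language, code, result: undefined };
  for (const plugin of plugins) {
    if (plugin["before:highlight"]) plugin["before:highlight"](context);
  }
  if (context.result) return context.result; // a plugin took over
  return { language, value: code, handledBy: "built-in" };
}

// Hypothetical plugin: does its own tokenizing and HTML rendering.
const customParserPlugin = {
  "before:highlight"(context) {
    if (context.language !== "toy") return; // leave other languages alone
    const value = context.code.replace(
      /\d+/g,
      (digits) => `<span class="hljs-number">${digits}</span>`
    );
    context.result = { language: context.language, value, handledBy: "plugin" };
  }
};

const custom = runHighlight([customParserPlugin], "toy", "x = 42");
const fallback = runHighlight([customParserPlugin], "other", "x = 42");
```

As the comment notes, the plugin remains responsible for all tokenizing and HTML rendering; the hook only provides the place to plug it in.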

There is still no way to tie DIRECTLY into the existing tokenizer real-time via a simple plugin, but if someone really needed to do that you can replace the whole token tree/html renderer now by swapping two lines of code in the source. The key lines being:

var emitter = new TokenTree();
// ...
result = new HTMLRenderer(emitter, options).value();

One could even imagine allowing to configure this:

configure({
  emitterClass: TokenTree,
  htmlRenderer: (emitter, options) => new HTMLRenderer(emitter, options).value()
})

I'm not sure we want to do that (yet or ever), but I'm thinking about it. The API is nice (I think) but it exposes a lot of internals and would make it harder to change the internals in the future, I think. I do think we'll expose the parse tree emitter somehow (right now it's exposed as emitter in the result)... so if someone wanted to play with the token tree afterwards, or replace the built-in HTML renderer, it'd be pretty easy to do that. We already have an issue for that.

@joshgoebel
Member

This issue is quite old and I am not aware of such functionality. If this is newer and it helps, I will give it a look. Can you explain how this works?

Yes, it's very new. Read the plugin docs and check out the PR regarding callbacks for highlight itself. The callbacks for highlightBlock are already in master.

@joshgoebel
Member

You might also find the plugin example here interesting:

#2391

@joshgoebel joshgoebel changed the title Custom token callbacks, or mode value property? Discuss: Custom token callbacks, or mode value property? Feb 17, 2020
@foo123
Author

foo123 commented Feb 17, 2020

+1 for developing a plugin-friendly culture! I will definitely check out the docs when I get some time.

@joshgoebel
Member

Closing due to new functionality and lack of any activity on this issue.
