Support Prism.js grammar parsing but use the Highlight.js HTML/theme pipeline #3619

Open
joshgoebel opened this issue Sep 13, 2022 · 5 comments
Labels
enhancement An enhancement or new feature good first issue Should be easier for first time contributors help welcome Could use help from community parser

Comments

@joshgoebel
Member

joshgoebel commented Sep 13, 2022

Related #2212.

I'm not sure this belongs in core yet, and I don't necessarily want a hard or soft Prism dependency. I suppose perhaps you could also pass Prism into the function itself though so it's late binding?

Caveats

  • Using custom parsers is still private API since the Emitter API isn't public yet. Hence the __emitTokens name of the property.
  • See private __emitTokens API #3620

What needs to be done

Example Usage

To replace our JavaScript support with Prism's:

import Prism from 'prismjs'
const prismJS = fromPrism(Prism.languages.javascript)
hljs.registerLanguage("javascript", prismJS)

Code

function prismTypeToScope(name) {
  // very poor results, must be improved
  return name;
}

function prismTokensToEmitter(tokens, emitter) {
  tokens.forEach(token => {
    if (typeof(token) === "string") {
      emitter.addText(token)
    } else if (token.type) {
      let scope = prismTypeToScope(token.type)
      if (typeof(token.content) === "string") {
        emitter.addKeyword(token.content, scope)
      } else { // array of tokens
        emitter.openNode(scope)
        prismTokensToEmitter(token.content, emitter)
        emitter.closeNode()
      }
    }
  })
}

function prismParserWrapper(code, emitter) {
  const tokens = Prism.tokenize(code, this._prism)
  prismTokensToEmitter(tokens, emitter)
}

function fromPrism(prismGrammar) {
  return function(hljs) {
    let self = {}

    Object.assign(self, {
      _prism: prismGrammar,
      __emitTokens: prismParserWrapper.bind(self)
    })
    return self
  }
}
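
To see which emitter calls the traversal above produces, here is a minimal sketch that feeds it a hand-written token stream. The recording emitter and the sample tokens are assumptions made up for illustration (the tokens only imitate the `{ type, content }` shape the code above expects), and the scope is passed through unchanged, as `prismTypeToScope` currently does.

```javascript
// Hypothetical recording emitter: it only logs the calls it receives.
// The method names are the ones used by prismTokensToEmitter above.
function createRecordingEmitter() {
  const calls = [];
  return {
    calls,
    addText(text) { calls.push(["addText", text]); },
    addKeyword(text, scope) { calls.push(["addKeyword", text, scope]); },
    openNode(scope) { calls.push(["openNode", scope]); },
    closeNode() { calls.push(["closeNode"]); },
  };
}

// Same traversal as above, repeated so this sketch is self-contained.
function prismTokensToEmitter(tokens, emitter) {
  for (const token of tokens) {
    if (typeof token === "string") {
      emitter.addText(token);
    } else if (token.type) {
      if (typeof token.content === "string") {
        emitter.addKeyword(token.content, token.type);
      } else { // array of nested tokens
        emitter.openNode(token.type);
        prismTokensToEmitter(token.content, emitter);
        emitter.closeNode();
      }
    }
  }
}

// Hand-written sample (an assumption, not real tokenizer output).
function sampleTokens() {
  return [
    { type: "keyword", content: "const" },
    " x = ",
    { type: "template-string", content: [{ type: "string", content: "`hi`" }] },
    ";",
  ];
}
```

Plain strings become `addText`, leaf tokens become `addKeyword`, and tokens whose content is an array are wrapped in an `openNode`/`closeNode` pair.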
@joshgoebel joshgoebel changed the title Support Prism.js grammar parsing but use the Highligh.js HTML/theme pipeline Support Prism.js grammar parsing but use the Highlight.js HTML/theme pipeline Sep 13, 2022
@joshgoebel
Member Author

joshgoebel commented Sep 13, 2022

@RunDevelopment If I wanted to package this as a small stand-alone ESM module, would I just import Prism and let people's bundlers/client-side figure it out or would it be simpler to just pass in the Prism object that the user is responsible to import on their own? I'm spoiled by the fact that Highlight.js has zero real runtime dependencies.

(not caring about CJS for this ATM)

Or perhaps Prism.languages.javascript has some backreference to Prism itself? That'd be useful here.

Thoughts? Actually I don't even know if Prism is ESM client-side yet, perhaps not?

@RunDevelopment

> Actually I don't even know if Prism is ESM client-side yet

We are going to be. ESM is very much planned as the only module system. We are likely also going to have monolithic files for compatibility, but that's about it.

> would I just import Prism

Hopefully not. Prism v2 will be very explicit about instances. One of the problems we had with v1 was that Prism was a global namespace. This made things like testing and typing really difficult. In v2, Prism is a class. There will be a global instance (for compatibility and convenience), but you shouldn't assume that people are going to use it.

So the fromPrism function should probably look like this:

// Take a Prism instance and the id of the language to adapt.
function fromPrism(prism: Prism, id: string) {
  return function(hljs) {
    return {
      __emitTokens(code, emitter) {
        let grammar = prism.components.getLanguage(id)
        if (!grammar) {
          // Decide how to handle missing grammars. I'm just gonna create an empty grammar.
          grammar = {} 
        }
        const tokens = prism.tokenize(code, grammar)
        prismTokensToEmitter(tokens, emitter)
      }
    }
  }
}

// or

// Take a component proto and add it to your own Prism instance.
// (Given its own name here so it doesn't shadow the two-argument version above.)
function fromPrismProto(proto: import("prismjs").ComponentProto) {
  const prism = getHLJSPrismInstance()
  prism.components.add(proto)
  // same as the above
  return fromPrism(prism, proto.id);
}

Also, language grammars are lazily evaluated in v2. They might even be re-evaluated later because of optional dependencies. So no matter what, your API must not take grammar objects. Use either ids or component protos.
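
A small sketch of why the id matters: if the grammar is looked up inside `__emitTokens` on every call, a grammar that gets re-evaluated later is picked up automatically. The stub below is a hypothetical stand-in for a Prism v2 instance, not the real thing. It only models the two calls used in the snippet above (`components.getLanguage(id)` and `tokenize(code, grammar)`); the `components.set` helper and the `{ keywords: [...] }` grammar shape are invented for the demo.

```javascript
// Hypothetical stand-in for a Prism v2 instance (assumption, not Prism's API
// beyond the two calls named in this thread).
function createStubPrism() {
  const grammars = new Map();
  return {
    components: {
      set(id, grammar) { grammars.set(id, grammar); }, // invented helper
      getLanguage(id) { return grammars.get(id); },
    },
    // Toy tokenizer: splits on whitespace and tags words listed in
    // grammar.keywords. Real grammars look nothing like this.
    tokenize(code, grammar) {
      const keywords = new Set(grammar.keywords || []);
      return code.split(/(\s+)/).filter(Boolean).map(part =>
        keywords.has(part) ? { type: "keyword", content: part } : part);
    },
  };
}

function fromPrism(prism, id) {
  return function (hljs) {
    return {
      __emitTokens(code, emitter) {
        // Resolved on every call, never captured up front.
        const grammar = prism.components.getLanguage(id) || {};
        for (const token of prism.tokenize(code, grammar)) {
          if (typeof token === "string") emitter.addText(token);
          else emitter.addKeyword(token.content, token.type);
        }
      },
    };
  };
}
```

Had `fromPrism` captured the grammar object when it was called, a later re-evaluation would leave the wrapper holding a stale grammar.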

@joshgoebel
Member Author

> getHLJSPrismInstance()

I'm not sure this would be a thing (or that I see the need?) If someone wanted a single prism instance they should just create one and always use it with fromPrism... if they wanted one Prism per grammar for some reason, they could do that... not sure we should care?

If they wanted to get the prism instance "attached" to a specific grammar we could expose those on the returned grammar object and then could just query it:

hljs.getLanguage("javascript")._prismInstance
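
Roughly what I have in mind, as a sketch only. `prism` is whatever instance the caller created, the `components.getLanguage` and `tokenize` calls are the ones from the snippet earlier in this thread, and `emitTokens` is a hypothetical helper name. Whether `hljs.getLanguage` hands back the extra property untouched is an assumption I haven't verified.

```javascript
// Hypothetical helper: walks tokens shaped like { type, content }.
function emitTokens(tokens, emitter) {
  for (const token of tokens) {
    if (typeof token === "string") {
      emitter.addText(token);
    } else if (typeof token.content === "string") {
      emitter.addKeyword(token.content, token.type);
    } else {
      emitter.openNode(token.type);
      emitTokens(token.content, emitter);
      emitter.closeNode();
    }
  }
}

function fromPrism(prism, id) {
  return function (hljs) {
    return {
      // Exposed so callers can find the instance behind this grammar.
      _prismInstance: prism,
      __emitTokens(code, emitter) {
        const grammar = prism.components.getLanguage(id) || {};
        emitTokens(prism.tokenize(code, grammar), emitter);
      },
    };
  };
}
```

A single shared instance or one instance per grammar both work with this shape, since the wrapper only remembers what it was given.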

@RunDevelopment

> I'm not sure this would be a thing (or that I see the need?)

Same. I just wanted to show how an API that only takes a component proto would be implemented. I just wasn't sure in which direction you want to take this.

@joshgoebel
Member Author

joshgoebel commented Sep 13, 2022

I'm not sure. I'm hopeful someone comes along who's interested in the capability. I'm not really looking to maintain further pieces of HLJS outside of core... so right now we'd need a good reason to have it in core - or it's fair game for anyone who wants to come along and just make a minimal wrapper library and release it. Right now it definitely feels like more of a plugin/add-on.

Long term I'm very curious what support like this will do for the bigger picture.

And of course it already works - I tested it. It just needs to be packaged up nicely with some tiny amount of docs, etc... (and of course the scope <-> class mapping effort)
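
For the mapping effort, a starting point could be a small lookup table in place of the identity `prismTypeToScope`. This is a sketch with an incomplete, hand-picked table. The pairings are my assumptions about which Highlight.js scope is the closest fit for common Prism token types, and some are debatable (Prism's `function` also covers calls, which Highlight.js would rather tag `title.function.invoke`).

```javascript
// Assumed, incomplete mapping from Prism token types to Highlight.js scopes.
// Types that already share a name with a scope (keyword, string, comment,
// number, operator, punctuation, property, symbol, variable) fall through.
const PRISM_TYPE_TO_SCOPE = new Map([
  ["boolean", "literal"],
  ["builtin", "built_in"],
  ["class-name", "title.class"],
  ["function", "title.function"],
  ["regex", "regexp"],
  ["attr-name", "attr"],
  ["constant", "variable.constant"],
]);

function prismTypeToScope(name) {
  // Unknown types are passed through unchanged, as the current code does.
  return PRISM_TYPE_TO_SCOPE.get(name) ?? name;
}
```

Language-specific Prism types (for example the ones used by markup or CSS grammars) would still need their own entries.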

I'm currently slightly more interested in wrapping CodeMirror's JSX/TSX lexer, since JSX/TSX is (IME) so hard to get right with pure regex and not a full parser. Though there's a size price to be paid for all that power.
