[plugin] Idea: Support YAML front matter in Markdown files #2391

Closed
gregives opened this issue Feb 5, 2020 · 31 comments
Labels
language plugin Specific plugin or plugin discussion

Comments

@gregives

gregives commented Feb 5, 2020

It's common to include YAML front matter at the top of a Markdown file, for example when using Jekyll. Currently, highlight.js parses the last line of YAML as a second-level heading because of the --- three dashes below it. Although YAML front matter isn't actually part of any Markdown specification, would it be possible to add this to the Markdown definition?

GitHub highlights this correctly:

---
title: Example Title
date: 2020-02-05
---

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Highlight.js highlights this incorrectly:

image
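For context, the misparse follows from Markdown's setext heading rule: a line of text followed by a line of dashes reads as a second-level heading. The sketch below uses a deliberately naive regular expression, not the pattern from the highlight.js Markdown grammar, and the helper name `findSetextHeadings` is hypothetical.

```javascript
// Naive setext-heading matcher: a non-empty line followed by a line made of
// two or more dashes or equals signs. Illustration only; this is NOT the
// pattern highlight.js itself uses.
const SETEXT_HEADING = /^([^\n]+)\n[-=]{2,}[ \t]*$/gm;

// Return the text of every line that the naive rule treats as a heading.
function findSetextHeadings(text) {
  return Array.from(text.matchAll(SETEXT_HEADING), (match) => match[1]);
}

const sample = [
  '---',
  'title: Example Title',
  'date: 2020-02-05',
  '---',
  '',
  'Lorem ipsum dolor sit amet.'
].join('\n');

// The last YAML line sits directly above the closing "---", so it is the
// line that gets picked up as a heading.
console.log(findSetextHeadings(sample));
```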

@joshgoebel
Member

joshgoebel commented Feb 6, 2020

It's common to include YAML front matter at the top of a Markdown file

But that doesn't make the content Markdown. We also can't differentiate between the top of a file and the middle of a file, so contextual things like this are a bit outside our grasp.

GitHub highlights this correctly:

Does it, really? All it does is color the date, no matter how it's included, YAML or just inline text... I don't think it has anything to do with the YAML.


I think this would be better solved with a plugin designed specifically for YAML + Markdown. Take a look at: https://github.com/highlightjs/highlight.js/blob/master/docs/plugin-api.rst

You'd detect if the file was YAML + markdown BEFORE highlighting, then split it yourself, highlight both chunks separately, then paste them back together and return that as the result. I think that would be possible with the new plugin support. All the glue might not be there, but if you wanted to work on a plug-in I'd look into the glue.

I'd guess 20-30 lines of JavaScript maybe.
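A minimal sketch of the split, highlight, and paste approach described above. The helper names `splitFrontMatter` and `highlightWithFrontMatter` are hypothetical, the delimiters are assumed to be `---`, and `highlight` is an injected function so the example runs without the library.

```javascript
// Hypothetical helper: split a document into front matter and body.
// Assumes "---" delimiter lines at the very start of the text.
function splitFrontMatter(text) {
  // No "m" flag, so ^ only matches at the start of the input.
  const match = /^---\n([\s\S]*?)\n---\n/.exec(text);
  if (!match) {
    return { frontMatter: null, body: text };
  }
  return { frontMatter: match[1], body: text.slice(match[0].length) };
}

// `highlight` is an injected function: (language, code) => HTML string.
function highlightWithFrontMatter(text, highlight) {
  const { frontMatter, body } = splitFrontMatter(text);
  if (frontMatter === null) {
    // No front matter found: highlight everything as Markdown.
    return highlight('markdown', text);
  }
  // Highlight both chunks separately, then paste them back together.
  return '---\n' + highlight('yaml', frontMatter) + '\n---\n' + highlight('markdown', body);
}
```

With highlight.js as it stood at the time of this thread, `highlight` could presumably wrap `hljs.highlight(language, code).value`.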

@gregives
Author

gregives commented Feb 6, 2020

But that doesn't make the content Markdown. We also can't differentiate between the top of a file and the middle of a file, so contextual things like this are a bit outside our grasp.

Okay, I thought that might be the case.

Does it, really? All it does is color the date, no matter how it's included, YAML or just inline text... I don't think it has anything to do with the YAML.

I suppose I meant that GitHub doesn't seem to highlight it incorrectly, for that small example I posted.


I wasn't aware there was plugin support for highlight.js! I will take a look at writing a plugin for this. I've just had a quick look at the docs and noticed that before:highlightBlock doesn't have a return value: how would I return the result as you mention?

@joshgoebel
Member

joshgoebel commented Feb 6, 2020

I suppose I meant that GitHub doesn't seem to highlight it incorrectly, for that small example I posted.

Now that would be a better question. That might be fixable. Does a header require a blank line before it? We could check for that but then the problem becomes I don't think JS has \A and \Z so there is no way to say "blank line OR start of file"... since many times the very first line is a header. So I think you can have it one way or the other, but not both.

@allejo Does that sound right to you?

Thoughts?

I've just had a quick look at the docs and noticed that before:highlightBlock doesn't have a return value: how would I return the result as you mention?

Ok, so you can't co-opt the process completely currently, but you can still do it. You need to hook before and after. You can effect changes with after simply by modifying the object it was passed. Before just needs to store the original text in the object state so it can parse it when after is called.
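A small sketch of that before/after pairing. The plugin object uses the hook names from the plugin docs; `runHooks` is a hypothetical stand-in for the library calling the hooks around its own highlighting, and the comment prefix written into the result is only there to make the effect visible.

```javascript
// A plugin object in the shape accepted by hljs.addPlugin(): keys are hook names.
const rememberOriginalText = {
  'before:highlightBlock'({ block }) {
    // Keep the untouched source so the "after" hook can work from it.
    this.originalText = block.textContent;
  },
  'after:highlightBlock'({ result }) {
    // Mutating `result` is how the "after" hook changes the output.
    result.value = '<!-- ' + this.originalText.length + ' chars -->' + result.value;
  }
};

// Simplified stand-in (assumption) for the library firing the two hooks.
function runHooks(plugin, block, highlight) {
  plugin['before:highlightBlock']({ block, language: 'markdown' });
  const result = { value: highlight(block.textContent) };
  plugin['after:highlightBlock']({ block, result });
  return result.value;
}
```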

@allejo
Member

allejo commented Feb 6, 2020

Now that would be a better question. That might be fixable. Does a header require a blank line before it? We could check for that but then the problem becomes I don't think JS has \A and \Z so there is no way to say "blank line OR start of file"... since many times the very first line is a header. So I think you can have it one way or the other, but not both.

@allejo Does that sound right to you?

Thoughts?

My first thought would be to introduce front matter as a new language definition instead of trying to introduce complexity to the markdown syntax by detecting headers or the start of files. My thought would be to allow modifiers/combinations to languages. e.g.

  • frontmatter--md
  • frontmatter--html
  • frontmatter--jsx

Depending on the tool using the front matter files, the body of it can be in a number of languages which is why I would be hesitant to introduce it just to markdown.
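To make the naming idea concrete, here is a hedged sketch of how such a combined name might be taken apart. The `--` separator comes from the list above; the function name `parseCombinedLanguage` and the returned shape are hypothetical.

```javascript
// Hypothetical naming scheme from the list above: "<wrapper>--<body language>".
function parseCombinedLanguage(name) {
  const separator = name.indexOf('--');
  if (separator === -1) {
    // A plain language name has no wrapper.
    return { wrapper: null, body: name };
  }
  return {
    wrapper: name.slice(0, separator),
    body: name.slice(separator + 2)
  };
}
```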

@joshgoebel
Member

joshgoebel commented Feb 6, 2020

Thanks for helping me clarify why this rubbed me the wrong way. This (front matter) is really more of a "concept" than a "language" itself. It's a convention I've seen used with blogging software but I assume it's also used elsewhere since it's kind of a useful pattern.

Also front matter doesn't necessarily have to be limited to YAML either... really if we had configurable grammars you could do this with something like:

// you could alias it however you wanted, I just used "markdown" here to overwrite markdown, as per the original request - if say that was the ONLY context you were using HLJS
registerLanguage("markdown", frontmatter("yaml", { content: "html" }))

As I said before, I don't think you could do this purely with a grammar right now, but I think you could do it with a grammar + plugin. The grammar I think would be really light and would lean on the plugin for doing the actual work (so it can run parse code, split the content, etc). Honestly I see grammar + plugin as a way to do all sorts of crazy complex behavior that you can't do inside grammars themselves.

Is this the best thing long-term, I dunno... It probably needs some thought so we can come up with a common pattern for 3rd party grammars who want to do this. Actually I think this might be a good way to START things and then eventually you simply fold the plugins into the grammar itself... so a grammar would have a beforeHighlight and afterHighlight key... and they would function the same as plugins - but grammar specific plugins.

That would require adding hooks for highlight also, but that's on the TODO list anyways.

@joshgoebel
Member

@gregives It'd be quite cool if you build a plugin for this. That'd probably motivate me to go in and actually add support for highlight also not just highlightBlock... Then I could walk you thru publishing it as a 3rd party plugin/grammar if you thought it'd be useful for other people and we could test out this whole idea. :-)

@joshgoebel
Member

joshgoebel commented Feb 6, 2020

Something like:

#2395

(very raw)

I need to think about how this works in light of autoHighlight (which would call your plugin a zillion times) - but perhaps that is a problem for plugin authors?

And also need to think about sublanguage and how this handles continuations... that's really why I didn't tackle it at first and went with the much simpler highlightBlock instead.

@gregives
Author

gregives commented Feb 6, 2020

Thanks for the great conversation around this! So at the moment, are we thinking a plugin is the best way forward for now? I can take a look at a plugin this weekend if so.

Out of interest, how does highlighting work for other 'nested' languages? For example, does highlight.js use the JavaScript definition for highlighting JavaScript within some HTML? Or is it duplicated within the HTML definition?

@joshgoebel
Member

joshgoebel commented Feb 6, 2020

@gregives Do you see how you might go about this? Define a dummy grammar and then some flag you store there enables the plug-in...

So given the example earlier:

registerLanguage("markdown", frontmatter("yaml", { content: "html" }))

Perhaps frontmatter would return something like:

return {
  name: "markdown with yaml front matter",
  frontmatterIs: "yaml",
  bodyLanguage: "markdown",
  usePlugin: "frontmatter",
  // and perhaps long-term
  "before:highlight" : function () {},
  "after:highlight" : function () {},
}

When your plugin runs, it looks for usePlugin == 'frontmatter' to know it should do its own magic.
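One way the dummy-grammar idea could be sketched. Everything here is hypothetical: the `frontmatter` factory, the `wantsFrontMatterPlugin` check, and the choice to take the body language from `options.content` are assumptions built on the object shown above, not an existing highlight.js API.

```javascript
// Hypothetical factory: builds a "dummy grammar" that only carries configuration.
function frontmatter(frontLanguage, options) {
  // Assumption: the body language comes from options.content, defaulting to markdown.
  const bodyLanguage = (options && options.content) || 'markdown';
  return {
    name: bodyLanguage + ' with ' + frontLanguage + ' front matter',
    frontmatterIs: frontLanguage,
    bodyLanguage: bodyLanguage,
    usePlugin: 'frontmatter',
    contains: [] // no rules of its own; the plugin is expected to do the work
  };
}

// Plugin side: decide whether a given grammar asked for front matter handling.
function wantsFrontMatterPlugin(grammar) {
  return Boolean(grammar) && grammar.usePlugin === 'frontmatter';
}
```

Since `hljs.registerLanguage` expects a function that returns the grammar, a real registration would likely wrap the call, for example `hljs.registerLanguage("markdown", () => frontmatter("yaml", { content: "html" }))`.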

@joshgoebel
Member

joshgoebel commented Feb 6, 2020

So at the moment, are we thinking a plugin is the best way forward for now?

Well a plugin that could also be/provide/generate its own grammar. :-) I think over time the line will blur, as mentioned above. We're playing with the future here. :-) I kind of showed you how you might go about it in the previous message.

Out of interest, how does highlighting work for other 'nested' languages? For example, does highlight.js use the JavaScript definition for highlighting JavaScript within some HTML? Or is it duplicated within the HTML definition?

Both, but often we do it right and use sublanguage which tells Highlight.js to switch modes in the middle of a file and process part of it with a whole other grammar. It's pretty complex and one of the few pieces I still don't completely understand in and out. It's also not perfect and doesn't work for all cases (see open issues).

And theoretically you could use sublanguage here IF you could properly detect the front matter at the beginning of the file... that'd probably be the "correct" way to do it... but per discussion above I don't think we'd do this by changing the markdown grammar, it would be another grammar entirely or an auto-generated grammar.

If JS supported \A this would be trivial to do with just the grammar stuff we already support.
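Outside a grammar, plain JavaScript can get the effect of `\A` by leaving off the `m` flag, because `^` then only matches at the start of the input. The sketch below shows the difference; it assumes (without confirmation from this thread) that highlight.js compiles grammar patterns with the multiline flag, which is why the same trick is not available inside a grammar. The function names are hypothetical.

```javascript
// Without the "m" flag, ^ only matches at the very start of the input,
// which is the behavior \A would give in other regex flavors.
function startsWithDelimiter(text) {
  return /^---\n/.test(text);
}

// With the "m" flag, ^ matches at the start of every line.
function hasDelimiterLine(text) {
  return /^---\n/m.test(text);
}

console.log(startsWithDelimiter('---\ntitle: x\n---\n')); // delimiter is at the start of the input
console.log(startsWithDelimiter('Heading\n---\n'));       // delimiter is on the second line
console.log(hasDelimiterLine('Heading\n---\n'));          // multiline ^ still finds it
```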

@joshgoebel
Member

  // and perhaps long-term
  "before:highlight" : function () {},
  "after:highlight" : function () {},

Actually one could add this type functionality via plugin even, without even changing core. LOL.
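A hedged sketch of that idea: a single plugin that forwards each hook to a same-named function on the grammar of the language being highlighted. `makeGrammarHookPlugin` is a hypothetical name, `getLanguage` is injected (with highlight.js it would presumably be `hljs.getLanguage`), and the hook names used are the `highlightBlock` ones that existed at the time rather than the proposed `highlight` ones.

```javascript
// Hypothetical: build a plugin that dispatches hooks to grammar-level functions.
function makeGrammarHookPlugin(getLanguage) {
  function forward(hookName) {
    return function (args) {
      // "before" receives the language directly; "after" carries it on the result.
      const languageName = args.language || (args.result && args.result.language);
      const grammar = languageName ? getLanguage(languageName) : undefined;
      if (grammar && typeof grammar[hookName] === 'function') {
        grammar[hookName](args);
      }
    };
  }
  return {
    'before:highlightBlock': forward('before:highlightBlock'),
    'after:highlightBlock': forward('after:highlightBlock')
  };
}
```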

@joshgoebel changed the title from "(markdown) YAML front matter in Markdown files" to "[plugin] Idea: Support YAML front matter in Markdown files" on Feb 6, 2020
@allejo
Member

allejo commented Feb 7, 2020

I agree that this should be a plugin. My main reason is because this concept can be expanded to highlighting any embedded language without complicating or bloating a language grammar. Say we want to highlight code in markdown blocks just like GitHub does:

# Hello World

I'm a markdown document and want to show some embedded HTML:

```html
<div>
  <p data-testid="highlightjs">Hi from HTML!</p>
</div>
```

Unless this is already possible?

@joshgoebel
Member

joshgoebel commented Feb 7, 2020

Well, I actually think that's a whole different case - and not comparable to the original example, but I have some very mixed feelings on that specifically. I think GitHub is only confusing the issue because it's actually RENDERING markdown, not highlighting it.

To me highlighting markdown would mean that you'd have markup like:

<div class="hljs-string">
```html
&lt;div&gt;
  &lt;p data-testid="highlightjs"&gt;Hi from HTML!&lt;/p&gt;
&lt;/div&gt;
```
</div>

Within the context of highlighting markdown that is really just a multi-line string... now when the markdown is actually rendered that string might be highlighted as code... but I see that as a job of the renderer... and we are not a renderer, we're a highlighter.

This is also related to how some people expect us to handle markdown... some people expect us to actually have styling that makes bold parts bold and italic parts italic... but that's not our job. We're a highlighter, NOT a renderer.

Even thinking about it makes my head hurt a little. :) Maybe I'm thinking about it wrong though - I know my text editor syntax highlighting does colorize markdown code snippets...

Unless this is already possible?

You can't do it dynamically, but you could "guess" or trust the auto-detect... look at how XML handles <script>, for example:

        begin: '<script(?=\\s|>)', end: '>',
        keywords: {name: 'script'},
        contains: [TAG_INTERNALS],
        starts: {
          end: '\<\/script\>', returnEnd: true,
          subLanguage: ['actionscript', 'javascript', 'handlebars', 'xml']
        }

@joshgoebel
Member

joshgoebel commented Feb 7, 2020

Or see http, which throws the whole "body" to auto-detect and lets it figure it out:

{
  begin: '\\n\\n',
  starts: {subLanguage: [], endsWithParent: true}
}

@allejo
Member

allejo commented Feb 7, 2020

To me, I would think this is still the job of the highlighter. See this example where I'm forcing the language within the GitHub highlighting:

```
<div>
  <p data-testid="highlightjs">Hi from HTML!</p>
</div>
```

```html
<div>
  <p data-testid="highlightjs">Hi from HTML!</p>
</div>
```

```python
<div>
  <p data-testid="highlightjs">Hi from HTML!</p>
</div>
```

```go
<div>
  <p data-testid="highlightjs">Hi from HTML!</p>
</div>
```

I saw this example related to front matter because the behavior seemed to be the same for me:

| Front matter | Markdown |
| --- | --- |
| Detect --- switch language to YAML (configurable) | Detect ``` switch language to X |
| Detect --- switch back to the body language | Detect ``` switch back to markdown |

@joshgoebel
Member

joshgoebel commented Feb 7, 2020

Well the big difference is that language snippets are actually a syntactic feature of the markdown language (or at least the GitHub variant)... whereas "front matter" isn't a feature of any language. It's a concept for front-loading meta-data about any type of textual content.

So it's the same kind of thing, but quite different conceptually. So if we had better support for this kind of thing I could potentially imagine the code snippet support being added to Markdown, but we wouldn't add front-matter support... because as pointed out earlier that's not specific to Markdown...

Whereas if we decided Markdown snippets should be highlighted as their declared code type, that would simply be an improvement to the existing Markdown highlighting - to better support the Markdown language.

@taufik-nurrohman
Member

taufik-nurrohman commented Feb 7, 2020

Or see http, which throws the whole "body" to auto-detect and lets it figure it out:

{
  begin: '\\n\\n',
  starts: {subLanguage: [], endsWithParent: true}
}

Agree with this. Consider looking for the ... marker as the end of the YAML stream too (see document suffix). It's just like the double line-break in the HTTP syntax above. This kind of spec could be improved in the YAML highlighter itself without using plugin hooks, as the ... marker is very specific. No idea how to highlight any characters next to --- though, as they should be treated as other YAML documents based on the spec.

Also see how CommonMark has written its metadata header.
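As a rough sketch of that suggestion, a front matter matcher could accept either `---` or the YAML document end marker `...` as the closing line. The constant and function names are hypothetical, and the pattern only covers the simple case of one block at the start of the input.

```javascript
// Accept either "---" or "..." as the closing line of a YAML front matter block.
// No "m" flag: ^ is the start of the input and $ is its end.
const YAML_FRONT_MATTER = /^---\n([\s\S]*?)\n(?:---|\.\.\.)(?:\n|$)/;

// Return the YAML between the delimiters, or null when there is no front matter.
function extractYamlFrontMatter(text) {
  const match = YAML_FRONT_MATTER.exec(text);
  return match ? match[1] : null;
}
```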

@joshgoebel
Member

YAML is ridiculously complex. :-) Yet what we have seems to be working OK for most people. :-)

@joshgoebel
Member

I mean we could tag the "..." as something, but I'm not sure it would really change anything since after the ... evidently you can have even more YAML... so it might already "just work". I've never seen an example like that before.

@taufik-nurrohman
Member

since after the ... evidently you can have even more YAML.

As long as it starts with another --- AFAIK.

@gregives
Author

gregives commented Feb 8, 2020

I was thinking a bit more about how to solve this problem, specifically YAML front matter in Markdown files, and I encountered the following in the docs.

Things we support now that we did not always:

As lookbehind matching is supported, albeit by around 70% of browsers, we can use a fairly simple regular expression to match --- at the start of the file:

{
  begin: '(?<!\\n)^---\\n', end: '\\n---\\n',
  subLanguage: 'yaml',
  relevance: 0
}

However, I'm aware that other formats of front matter are available, for example, Hugo supports four formats for front matter; would it be naive to add each format as I suggested, or would it be suitable for now? If this solution seems okay then I'd be happy to create a pull request.
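The effect of the lookbehind can be checked in isolation. This sketch compiles the `begin` pattern from the grammar above with the `m` flag (an assumption about how grammar patterns are compiled) and reports where it matches; `frontMatterStartIndex` is a hypothetical helper, and the code needs an engine with lookbehind support.

```javascript
// The begin pattern from the grammar above, compiled with the "m" flag.
// (?<!\n) rejects every line start that follows a newline, so only the
// very first line of the input can match.
const FRONT_MATTER_START = new RegExp('(?<!\\n)^---\\n', 'm');

// Index of the opening delimiter, or -1 when the input does not start with one.
function frontMatterStartIndex(text) {
  const match = FRONT_MATTER_START.exec(text);
  return match ? match.index : -1;
}

console.log(frontMatterStartIndex('---\ntitle: x\n---\n\nBody')); // delimiter on the first line
console.log(frontMatterStartIndex('Heading\n---\nmore\n'));       // delimiter on a later line
```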

@joshgoebel
Member

joshgoebel commented Feb 8, 2020

As lookbehind matching is supported, albeit by around 70% of browsers, we can use a fairly simple regular expression to match --- at the start of the file:

That is a neat trick.

I think one 3rd party language author is using look behind (but they test for it and use alternative regex if it's not available - but in this case there are no alternatives) , but I'm not sure how I feel about adding a feature to core that's only supported by 70% of green-field browsers. It seems very bad to me for Highlight.js to have different behavior in one modern browser than another.
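For reference, testing for lookbehind support usually comes down to trying to compile such a pattern and catching the syntax error. A minimal sketch, with the hypothetical name `supportsLookbehind`:

```javascript
// Engines without lookbehind throw a SyntaxError when compiling the pattern,
// so building it inside try/catch doubles as a feature test.
function supportsLookbehind() {
  try {
    new RegExp('(?<!a)b');
    return true;
  } catch (error) {
    return false;
  }
}
```

The pattern is built with the `RegExp` constructor rather than a literal because a literal would fail at parse time and take the whole script down with it.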

Hugo supports four formats for front matter; would it be naive to add each format as I suggested

Couldn't auto-detect try to figure it out? Are they all enclosed the same?
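Where the formats are enclosed differently, the opening delimiter itself can pick the language before any auto-detection is needed. The table below is an assumption based on common conventions (`---` for YAML, `+++` for TOML, a brace block for JSON), and `detectFrontMatter` is a hypothetical helper.

```javascript
// Assumed mapping from enclosing delimiters to the front matter language.
const FRONT_MATTER_FORMATS = [
  { language: 'yaml', pattern: /^---\n([\s\S]*?)\n---(?:\n|$)/ },
  { language: 'toml', pattern: /^\+\+\+\n([\s\S]*?)\n\+\+\+(?:\n|$)/ },
  // For JSON the braces are part of the content, so they stay inside the group.
  { language: 'json', pattern: /^(\{\n[\s\S]*?\n\})(?:\n|$)/ }
];

// Returns the language, the front matter content, and how many characters
// of the input the whole block (delimiters included) takes up.
function detectFrontMatter(text) {
  for (const { language, pattern } of FRONT_MATTER_FORMATS) {
    const match = pattern.exec(text);
    if (match) {
      return { language, content: match[1], length: match[0].length };
    }
  }
  return null;
}
```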

If this solution seems okay then I'd be happy to create a pull request.

I'm not sure what you're asking. As we discussed already "frontmatter" isn't a concept unique to markdown so I'm not sure where you are suggesting that we add it. One could conceive of a frontmatter grammar (using your trick above) that uses auto-detect for BOTH parts of the file, but we don't currently add new grammars to core - so that would make it a 3rd party language module or plugin.

@joshgoebel
Member

joshgoebel commented Feb 8, 2020

If you wrote a plugin that was small/simple enough it could possibly be included in plugin-recipes. It's intended to be a showcase of what's possible with small/simple plugins and our plugin system in general.

@gregives
Author

I've had a quick go at making a plugin for this, feedback would be very much appreciated! The plugin revolves around a regular expression which has three matching groups:

  • The beginning delimiter, e.g. ---
  • The content of the front matter, e.g. the YAML
  • The closing delimiter

The plugin separates the front matter into these three parts, highlights the content of the front matter, and then concatenates them back together, along with the original content of the Markdown. By default, it works with --- and +++ delimiters.

class FrontMatterPlugin {
  constructor(options) {
    // Three groups: opening delimiter, front matter content, closing delimiter.
    this.regexp = (options && options.regexp) || /(^[-+]{3}\n)([\s\S]*?)(\n\1\n)/;
    this.language = options && options.language;
    this.subLanguage = options && options.subLanguage;
  }

  'before:highlightBlock'({block, language}) {
    // Reset state so one block's front matter never leaks into the next block.
    this.frontMatterResult = null;

    if (this.language && this.language !== language) {
      return;
    }

    var content = block.innerText;
    var frontMatter = content.match(this.regexp);
    if (!frontMatter) {
      // No front matter found: leave the block to be highlighted normally.
      return;
    }

    // Use the configured sub-language if there is one, otherwise auto-detect.
    var frontMatterContent = frontMatter[2];
    var frontMatterResult = this.subLanguage
      ? hljs.highlight(this.subLanguage, frontMatterContent)
      : hljs.highlightAuto(frontMatterContent);

    this.frontMatterBegin = frontMatter[1];
    this.frontMatterResult = frontMatterResult.value;
    this.frontMatterEnd = frontMatter[3];
    // Strip the front matter so only the body goes through normal highlighting.
    block.innerText = content.replace(frontMatter[0], '');
  }

  'after:highlightBlock'({block, result}) {
    if (this.frontMatterResult) {
      result.value = this.frontMatterBegin + this.frontMatterResult + this.frontMatterEnd + result.value;
    }
  }
}

There are definitely some things I haven't considered yet, for example,

  • Passing options to highlight and highlightAuto
  • What happens if the highlighting goes wrong
  • Accepting language aliases for language and subLanguage
  • Coding standard for plugins?

Here's an example of how you'd use this plugin with AsciiDoc and JSON front matter (if that's even a thing):

hljs.addPlugin(new FrontMatterPlugin({
  regexp: /(^)({\n[\s\S]*?\n})(\n)/,
  language: 'asciidoc',
  subLanguage: 'json'
}));

You can see in this case that the first matching group is just (^), this is because we want the opening bracket of the JSON { to be passed to highlight() and the same for the closing bracket.
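The two regular expressions can be exercised on their own to see what lands in each group. `describeMatch` is a hypothetical helper. One detail worth noting for the default pattern: the first group includes the trailing newline, so the backreference in the third group requires a blank line after the closing delimiter.

```javascript
const DEFAULT_REGEXP = /(^[-+]{3}\n)([\s\S]*?)(\n\1\n)/;
const JSON_REGEXP = /(^)({\n[\s\S]*?\n})(\n)/;

// Name the three groups the plugin relies on, or return null when nothing matches.
function describeMatch(regexp, text) {
  const match = text.match(regexp);
  if (!match) {
    return null;
  }
  return { begin: match[1], content: match[2], end: match[3] };
}

// Closing "---" followed by a blank line: matches.
console.log(describeMatch(DEFAULT_REGEXP, '---\ntitle: Example Title\n---\n\nLorem'));
// Closing "---" followed directly by text: no match, because \1 is "---\n".
console.log(describeMatch(DEFAULT_REGEXP, '---\ntitle: Example Title\n---\nLorem'));
// JSON: the braces stay inside the content group so they get highlighted too.
console.log(describeMatch(JSON_REGEXP, '{\n  "title": "x"\n}\nBody'));
```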

@gregives
Author

I think one 3rd party language author is using look behind but they test for it and use alternative regex if it's not available

It would be possible in this plugin to check if lookbehind matching was supported and simply change the grammar if it was, otherwise fall back to the plugin itself. Would there be any advantage or disadvantage in doing this?

@joshgoebel
Member

joshgoebel commented Feb 10, 2020

Well you probably wouldn't "change" anything, but you could do a check first and then decide whether to install a plugin at all or simply to auto-generate a language grammar with negative look-behind and register it - but WHY would you do that? You're just making things twice as complex with no real upside - and creating the possibility of subtle differences in behavior between the two different ways of doing the same thing. A single solution is best, IMHO.

Grammars aren't necessarily better than plugins.

@joshgoebel
Member

joshgoebel commented Feb 10, 2020

Passing options to highlight and highlightAuto

Did you mean "pass thru"? There is no need. highlightBlock takes no options.

What happens if the highlighting goes wrong

I'd think if you couldn't find the front matter you'd just highlight the whole content normally.

Accepting language aliases for language and subLanguage

Not sure this is necessary but not difficult.

Coding standard for plugins?

I'm a believer in clean code. I'd have broken your plugin down into smaller functions... you have a whole class, take advantage of it to have some small helper functions to make your code easier to read.

Just one example:

  'after:highlightBlock'({block, result}) {
    result.value = this.highlightedFrontMatter() + result.value;
  }

Highlighted front matter returns the front portion, or a blank string... and you've pushed that complexity down a layer. The before callback could probably be broken into 2 or 3 smaller well named functions also.
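A sketch of what that refactoring might look like. The class name `FrontMatterHelper` and the `extract` helper are hypothetical, and `highlight` is an injected function standing in for `hljs.highlight` or `hljs.highlightAuto` so the example runs without the library.

```javascript
// Sketch only: `highlight` is an injected function (code) => HTML string.
class FrontMatterHelper {
  constructor(highlight, regexp) {
    this.highlight = highlight;
    this.regexp = regexp || /(^[-+]{3}\n)([\s\S]*?)(\n\1\n)/;
    this.match = null;
  }

  // Remember the front matter (if any) and return the text left to highlight.
  extract(text) {
    this.match = text.match(this.regexp);
    return this.match ? text.replace(this.match[0], '') : text;
  }

  // The highlighted front portion, or an empty string when there was none.
  highlightedFrontMatter() {
    if (!this.match) {
      return '';
    }
    const [, begin, content, end] = this.match;
    return begin + this.highlight(content) + end;
  }

  'before:highlightBlock'({ block }) {
    block.innerText = this.extract(block.innerText);
  }

  'after:highlightBlock'({ result }) {
    result.value = this.highlightedFrontMatter() + result.value;
  }
}
```

Because `extract` overwrites `this.match` on every call, a block without front matter also clears whatever the previous block stored.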

@joshgoebel
Member

joshgoebel commented Feb 10, 2020

I've had a quick go at making a plugin for this, feedback would be very much appreciated!

That's how you'd do it with highlightBlock, but really you'd probably want to hook highlight itself... and if you follow the discussion over in the other thread this would mean making sure you don't get into an infinite loop when you call highlight from within your callback. Otherwise it'd look about the same.

Although now I'm wondering if the callback system itself should protect from recursive plugins...

@gregives
Author

You're just making things twice as complex with no real upside

I agree, a plugin seems the way to go.

Did you mean "pass thru"? There is no need. highlightBlock takes no options.

For example, if you knew that your front matter was going to be either YAML or TOML, it would be nice to pass through languageSubset where the plugin calls highlightAuto.

you have a whole class, take advantage of it to have some small helper functions to make your code easier to read.

Thanks for the feedback, I will refactor it a bit.

and if you follow the discussion over in the other thread this would mean making sure you don't get into an infinite loop when you call highlight from within your callback

I've had a read of the other thread — in my opinion, recursive plugins seem like they might be useful, although I can't think of a use case off the top of my head.

In the case of this plugin, if you have a Markdown file with YAML front matter, you could specify to only run the plugin if the language is markdown. That would stop the recursion when you call highlight on the YAML.

@joshgoebel
Member

For example, if you knew that your front matter was going to be either YAML or TOML, it would be nice to pass through languageSubset where the plugin calls highlightAuto.

You might have to invent some of that yourself since now you're inventing things that only have to do with your plugin, not Highlight.js itself. I'd probably use data attributes in the HTML and then fetch configuration from there (if this was client-side).
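A minimal sketch of the data attribute idea. The attribute name `data-frontmatter-languages` and the helper `languageSubsetFrom` are invented for illustration; only the parsing is shown so that it runs without a DOM.

```javascript
// Hypothetical attribute: data-frontmatter-languages="yaml, toml"
// Turns the attribute value into a language subset, or undefined so that
// auto-detection is left unrestricted.
function languageSubsetFrom(attributeValue) {
  if (!attributeValue) {
    return undefined;
  }
  const names = attributeValue
    .split(',')
    .map((name) => name.trim())
    .filter(Boolean);
  return names.length > 0 ? names : undefined;
}
```

Inside the plugin this could presumably feed the second argument of `hljs.highlightAuto`, for example `hljs.highlightAuto(content, languageSubsetFrom(block.dataset.frontmatterLanguages))`.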

I've had a read of the other thread — in my opinion, recursive plugins seem like they might be useful, although I can't think of a use case off the top of my head.

I'm not sure it's useful for a plugin to be self-recursive. I mean there is no need - you could build it yourself inside your own plugin... and it's sure a pain for every plugin to add a check just to avoid recursion. And I think calling highlight from within a plugin might be a pretty common pattern.

On the other hand multiple plugins can nest within each other (which seems very useful)... so that would "just work". The only issue would be the order the plugins were registered, and I don't know how you avoid that.

if you have a Markdown file with YAML front matter, you could specify to only run the plugin if the language is markdown. That would stop the recursion when you call highlight on the YAML.

Ah, true. I forgot you're using the actual original highlight for the "base" content... another way to do it would be to call highlight twice yourself and skip the base highlight. In that case you would have a recursion problem.

But you're right that you avoid the issue the way you're doing it. :-)
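For the other approach, where the plugin calls highlight itself and would re-enter its own hook, a generic guard is one possible way to avoid the loop. This is a sketch of the pattern, not something highlight.js provides; `guardAgainstReentry` is a hypothetical name.

```javascript
// Wrap a hook so that calls made while it is already running are ignored.
function guardAgainstReentry(hook) {
  let running = false;
  return function (...args) {
    if (running) {
      return undefined; // nested call from inside the hook: skip it
    }
    running = true;
    try {
      return hook.apply(this, args);
    } finally {
      running = false; // always release the guard, even if the hook throws
    }
  };
}
```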

@joshgoebel
Member

Closing this as an issue since (as mentioned earlier) this is not an actual issue with HLJS or the Markdown grammar. More than happy to continue the plugin discussion.

4 participants