
Rework Lexer to use extendable array of tokenizer functions #1872

Closed
wants to merge 11 commits

Conversation

calculuschild
Contributor

Description

An attempt at #1695. Not sure about speed or elegance here, but wanted some feedback to see if this is even a reasonable route to take. If so, I would appreciate some troubleshooting to get this cleaner and to fix the issue with the broken test case.

Ideally, this would allow users to extend the Lexer by plugging in custom tokenizers at a chosen point in the lexer pipeline, while the params object exposes all of the required parameters so that functions with different signatures still work.

Notes:

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
  • no tests required for this PR.
  • If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

@vercel

vercel bot commented Dec 13, 2020

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/markedjs/markedjs/v16o5rhx5
✅ Preview: https://markedjs-git-fork-calculuschild-tokenizerarray.markedjs.now.sh

@UziTech UziTech added this to In Progress in vNext via automation Dec 14, 2020
@calculuschild
Contributor Author

Anyone want to take a whack at getting CommonMark #187 working again with this? I'm hitting a wall.

Member

@UziTech UziTech left a comment


links should only be on the top level tokens array

These fixes should fix example 187

@calculuschild
Contributor Author

Alright, now it's just the Linter getting mad at my use of a label to break out of nested loops. Can we give a pass to the linter on this one or is there a good alternative?

@calculuschild
Contributor Author

calculuschild commented Dec 16, 2020

Oof, now that it's all converted over, the benchmarks are not looking too hot on this one....

CommonMark  : 3878 (ms)
Markdown-it : 3726

Marked Commonmark es6
-----------------------------------
Current Master              : 4394
Block Arrays, Inline Arrays : 4720
Block Maps,   Inline Arrays : 5312
Block Maps,   Inline Maps   : 5846

So it looks like arrays of objects are back on the menu. Not sure what other speedups we could make to this specific code. I suspect rebuilding the "params" object so often (especially on every "inlineTokens" call) is causing some amount of slowdown?

Some variables aren't used until the params object. No need to separately declare them before.
@UziTech
Member

UziTech commented Dec 18, 2020

I think a jump to an outer loop is a fine reason to disable the linter for that line.

This definitely needs some work to get it faster before merging. I'll take a look at it soon.

@calculuschild
Contributor Author

This definitely needs some work to get it faster before merging. I'll take a look at it soon.

Is there any good Javascript profiling tool that might easily work with Marked?

@Monkatraz

Monkatraz commented Dec 22, 2020

Alright, now it's just the Linter getting mad at my use of a label to break out of nested loops. Can we give a pass to the linter on this one or is there a good alternative?

I found this:

if (this.inlineTokenizers.some(fn => fn.func.call(this, inlineParams))) continue;

On my system this has the same performance as the original loop, although I suspect this is highly variable considering it's using a callback. I'd check on your end (in fact I suspect it won't work). The performance of various methods is bizarre: if I create a bound (func.bind(this, inlineParams)) array and call those directly, it's actually slower. I'm almost not sure you can optimize this to be significantly faster, at least by playing around with the loops.

Also remove i, l, token from params. Generating/accessing these from the params object is slower than just declaring them within the tokenizer.
@calculuschild
Contributor Author

@Monkatraz Using array.some does seem to speed up a bit. Thanks for the suggestion!

We are getting there....

CommonMark  : 3878 (ms)
Markdown-it : 3726

Marked Commonmark es6
-----------------------------------
Current Master              : 4394


Previous best               : 4720
                               ↓
Array.some()                : 4569
                               ↓
Remove i, l, from params    : 4510

@styfle
Member

styfle commented Jan 27, 2021

what about a generator function ... very short and sweet in code size

It's short and sweet for ES6 but the emitted ES5 will likely grow much larger.

@Monkatraz

It's short and sweet for ES6 but the emitted ES5 will likely grow much larger.

I suspect more environments will support generator functions than not, so imo it's probably worth the trade, although only if it actually improves performance on ES6.

@UziTech
Member

UziTech commented Jan 28, 2021

It's short and sweet for ES6 but the emitted ES5 will likely grow much larger.

I think we are removing the ES5 version in v2 by the time this gets merged anyway.

@calculuschild
Contributor Author

calculuschild commented Jan 28, 2021

Now, we could go the other way... Instead of moving every step of the lexer out into an array of functions, what if we just leave the default functions as they are, but add checks for the existence of custom functions and execute them instead if they exist. Something like this (seems to work well enough):

// Custom Newline Function (inserted via Marked.use() )
newLine(src) {
  let token;
  if (token = this.tokenizer.space(src)) {
    return token;
  }
}

And then in Lexer.js:

// newline
if ((this.newLine && (token = this.newLine(src)))   // if a custom newline Tokenizer exists, execute it instead
    || (token = this.tokenizer.space(src))) {
  src = src.substring(token.raw.length);
  if (token.type) {
    tokens.push(token);
  }
  continue;
}

// if a custom Tokenizer was inserted after newline, execute it
if (this.newLine_after && (token = this.newLine_after(src))) {
  src = src.substring(token.raw.length);
  if (token.type) {
    tokens.push(token);
  }
  continue;
}

It's clunky... but at least the original performance is mostly unaffected unless custom functions actually exist.

Unfortunately I'm at my wit's end here trying to get this to work with the original performance, so I'm going to need to take a break from this PR for a bit. @Monkatraz If you are interested in testing out the generator function I'll gladly let you give it a shot. 👍

@Monkatraz

Aight, if I suddenly gain the Awake At 3AM Urge to take a crack at this, I'll try it. Hopefully it works... lol.

@calculuschild
Contributor Author

calculuschild commented Jan 28, 2021

Aight, if I suddenly gain the Awake At 3AM Urge to take a crack at this, I'll try it. Hopefully it works... lol.

Note that any overhead due to const createTokenizerIterator = function* () { ... } in the constructor (or wherever) is going to occur with each individual call to marked(), since a new Lexer is created every time a markdown string is parsed; that turned out to be the reason why bind was slowing things down so much. You may want to build off of #1909 so the setup only happens once and the Lexer object persists between calls in Bench.js.
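
For reference, here's a rough sketch of the generator idea being floated above; the names and structure are illustrative only, not code from this PR. Defining the generator once at module level (rather than inside the Lexer constructor) would avoid paying that setup cost on every marked() call.

// Rough sketch only: a generator that yields the default tokenizers in
// pipeline order, so the lexer can iterate until one returns a token.
// Custom tokenizers could be spliced in between these yields.
function* blockTokenizers(tokenizer) {
  yield tokenizer.space;
  yield tokenizer.fences;
  yield tokenizer.heading;
  yield tokenizer.paragraph;
}

// inside the lexer's block loop, something like:
// for (const fn of blockTokenizers(this.tokenizer)) {
//   const token = fn.call(this.tokenizer, src);
//   if (token) {
//     src = src.substring(token.raw.length);
//     tokens.push(token);
//     break;
//   }
// }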

@calculuschild
Contributor Author

calculuschild commented Apr 27, 2021

Well, it's been a while now and I haven't had time to really look at this for months. I feel like we might need to nail down a better structure for how we expect extensions to be made, which would help direct how some of the code in this PR is laid out. Rant incoming. Sorry for the wall of text.

Right now, a user can take a Renderer or Tokenizer object and edit its sub-functions, but there isn't a clear way to add a NEW sub-function to handle custom Markdown syntax. The user has to instead rely on intercepting and overriding, say, all the Text tokens before they get rendered and adding their custom handling in the tokenizer there. The user then has to copy/paste the existing tokenizer code in below their custom code to make sure the original functionality also works.

It would be great if there was a clear format for how these extensions are meant to be designed so as to cause minimal interference with other extensions and ease of maintenance.

Say, for simplicity, a user wants to add a custom Markdown symbol : at the beginning of the line. If a line starts with : the whole line needs to be underlined, generating a <u> ... </u> until the next newline character. Going in and editing the text tokens isn't obvious, because the user doesn't want to change the way text is rendered, but rather add an entirely new token. But say he does that and hijacks the text tokenizer or text renderer to insert <u> when it detects ^:\s, and then publishes that as an NPM package. How should the NPM package be consumed by Marked.js? I'm assuming we want packages to work in a way that allows marked.use(customNPMpackage) to be all that's necessary.

But, then, user2 comes along and makes another NPM package that takes { at the beginning of a line and makes it surround the line with a span. Now we have two NPM packages that function by hijacking the same Tokenizer function, and it seems like they will overwrite each other if you try to use both.

Not to mention, every time the Text tokenizer is updated here, extension maintainers need to copy and re-paste the code into their extension. I know the Text tokenizer isn't exactly complex, but some other Tokenizers depend on multiple helper functions that the maintainer now has to track down and duplicate in their extension.

So, with speed being kind of the priority of Marked.js, we need to decide which extensibility options we actually want to provide our users, because as we have seen in this PR, certain approaches tend to result in a lot of slowdown, and adding new features later only gets more complicated as Marked.js is pretty tightly interconnected between all its parts.

To start, I'm imagining something like this:

  1. We provide the list of all tokens we currently support, and the order in which they are handled so users know where in the process their custom token needs to appear. This is already in the Docs but the order might not be correct anymore.

  2. We provide an example Tokenizer function users can modify which takes input source text, and spits out a token object. Users currently can only modify the existing tokenizers.

  3. We allow the users to specify where in the processing order their custom tokenizer function executes, which essentially sets their priority (i.e., the user has a custom "table" that needs to be checked before defaulting to the normal table token if it isn't a match). Users also specify if the token is block-level or inline.

  4. We provide an example Renderer function users can modify, which takes an input token and generates an HTML string as output. Users can currently only modify the existing renderers. The Parser.js can just handle all of these at the end before the default error message. I don't think processing order matters at this point; the resulting HTML is the same. Also the renderer here will handle all of the token processing, basically performing the function of both parser.js and renderer.js.

  5. All of this related data is organized into an object that Marked.js consumes via marked.use() and references later on as it executes.

The extension for new custom Markdown looks like this maybe? When marked.use detects this format, we log it to some customMarkdown object somewhere rather than merging with and overwriting the existing functions. Then users can potentially add multiple custom extensions in this way without worrying about conflicts.

// Custom Markdown
const underline = {
  before : 'paragraph', // Leave blank to run after everything else...?
  level  : 'block',
  tokenizer : (src) => {
    const rule = /^:.*$/;
    const match = rule.exec(src);
    if (match) {
      return {
        type: 'underline',
        raw: match[0],        // This is the text that you want your token to consume from the source
        text: match[0].trim() // You can add additional properties to your tokens to pass along to the renderer
      };
    }
  },
  renderer : (token) => {
    return `<u>${token.text}</u>`;
  }
};

marked.use({ underline });

Otherwise, the current method for hijacking existing tokenizers/renderers can remain the same.

This is probably missing some key points but if we can narrow down how we want extensions like this to work we can build around that rather than fumbling around here like I was doing in this PR.

@UziTech
Member

UziTech commented Apr 28, 2021

That looks good. I would change a few things so we don't have breaking changes with the current way use works.

We should add a new property extensions (or something else) that underline is added to.

underline would also have to have a name property or some way to know which renderer to call for the underline token type.

marked.use({
  extensions: [
    underline,
  ]
});

@UziTech
Member

UziTech commented Apr 28, 2021

That would work for the block extension but we would need to figure out how to prevent the inlineText tokenizer from consuming the start of an inline extension.

I think we would have to give some function that returns the next potential start of the inline token so inlineText can only consume up to that point and then try all the inline tokenizers again.

For example if someone wanted to do an inline latex extension instead of overriding codespan:

const latex = {
  before : 'codespan',
  level   : 'inline',
  start : (src) => src.indexOf("$"),
  tokenizer : (src)=> {
    const match = src.match(/^\$+([^\$\n]+?)\$+/);
    if (match) {
      return {
        type: 'codespan',
        raw: match[0],
        text: match[1].trim()
      };
    }
  },
};

marked.use({ extensions: { latex } });

Then we know to end the inlineText tokenizer at or before that start index.

@calculuschild
Contributor Author

Let me make sure I understand your start function.

Currently InlineText consumes everything that we know can't be the start of an inline Token using a big RegEx, right? But now, we also make sure InlineText doesn't accidentally consume part of the custom token. So start returns the index of the next character to check for the custom token, and InlineText will work as normal, but only up to that index. Then we check again for the custom token, and if none is found, get a new index from start, and again consume up to that index, following the RegEx along the way.

So InlineText will consume in steps instead of all at once, stopping at the nearest index we get from any start functions in our Extensions object. Is that right? Yeah... And if no extensions are used, that should allow InlineText to run without any slowdowns for normal users.
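
To make that concrete, a minimal sketch of the stepping behavior, assuming a hypothetical extensions array whose inline entries expose an optional start(src) function (none of these names are a final API):

// Sketch: find the nearest potential extension start and only hand
// inlineText the text before it. The main inline loop then retries every
// tokenizer (extensions included) on whatever src remains.
function limitInlineText(src, extensions) {
  let nearest = Infinity;
  for (const ext of extensions) {
    if (ext.level !== 'inline' || typeof ext.start !== 'function') continue;
    const index = ext.start(src);
    // an index of 0 would already have been matched by the extension's own
    // tokenizer, so only indices greater than 0 shorten the inlineText input
    if (typeof index === 'number' && index > 0 && index < nearest) {
      nearest = index;
    }
  }
  return nearest === Infinity ? src : src.substring(0, nearest);
}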

@calculuschild
Contributor Author

If this format is good, I think we can just redo this PR and simplify it down to just inserting a check between each Lexer step as mentioned above. None of these crazy arrays of tokenizers and binding/calling functions that would look cleaner, but unfortunately just slow it all down.

//<- Code Tokenizer

...

// Check for custom Tokenizer   <- duplicate and paste this code in between EVERY TOKENIZER in lexer.js
// if a custom Tokenizer was inserted before Fences, execute it
if (this.fences_before && (token = this.fences_before(src))) {
  src = src.substring(token.raw.length);
  tokens.push(token);
  continue;
}

...

//Fences Tokenizer here ->

Although we probably need to account for multiple custom tokenizers inserted at the same location. So we would have to loop over the extensions object and execute each Tokenizer with before : 'fences', in the order they were provided by the user.

Hm... Maybe we can make this "check for custom tokenizer" bit a function that intelligently updates what position it is at with each call, so we can just insert a one-liner at each spot and the function knows whether we are at before : fences or before : code or wherever.
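
Something roughly like this, maybe, assuming marked.use() has already grouped the extensions by their before target (extensionsByBefore and the helper name are hypothetical, just to show the idea):

// Sketch: run every custom tokenizer registered before a given step,
// in the order the user provided them; first match wins.
function runExtensionsBefore(lexer, beforeName, src) {
  const group = (lexer.extensionsByBefore && lexer.extensionsByBefore[beforeName]) || [];
  for (const ext of group) {
    const token = ext.tokenizer.call(lexer, src);
    if (token) {
      return token;
    }
  }
  return null;
}

// then the one-liner between each pair of default tokenizers becomes:
// if (token = runExtensionsBefore(this, 'fences', src)) {
//   src = src.substring(token.raw.length);
//   tokens.push(token);
//   continue;
// }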

@UziTech
Member

UziTech commented Apr 28, 2021

Pretty much. The way I got my marked-linkify-it extension working was to override inlineText with the following tokenizer:

inlineText(src, ...args) {
  // find the next start index
  const match = linkify.match(src);
  if (match && match.length > 0) {
    // get `src` up to that index
    src = src.substring(0, match[0].index);
  }

  // run `inlineText` on the string up to that index
  return inlineText.call(this, src, ...args);
}

Here I find the next start index then truncate src and only run the inlineText tokenizer on that truncated string.

In my latex example the start function wouldn't necessarily return the actual start of the next custom token, just a potential start, to make the processing faster. Once inlineText found the next potential start of a token it would run src back through all of the inline tokenizers, including the custom ones. We would have to make sure start returns a number greater than 0, since if that is the actual start the custom tokenizer should have caught it before getting to inlineText.

@calculuschild
Contributor Author

How do we want to go about testing extensions? Do we have any other simple ones like your marked-linkify-it that we know of that we can plug in? I have some pretty ugly ones I hacked together for my own use but I'd need to rewrite them for this format.

I'm also not as versed in Javascript unit testing so adding an automated way to test extensions might need to fall to someone else. But for now I can manually plug things in and see if they work as this update is being built.

@UziTech
Member

UziTech commented Apr 29, 2021

I don't know of any other extensions that change the tokenizers. I couldn't find any with a quick search of npm. Most extensions just change a renderer.

We could create some extensions for Extended Markdown. Some of them, like heading ids, should be pretty simple.
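
A heading-ids extension in the format proposed above could look roughly like this (a sketch only; the field names follow the underline example, nothing here is a finished API, and inline markdown inside the heading text isn't handled):

// Hypothetical "heading IDs" extension, e.g. `## My heading {#custom-id}`
const headingIds = {
  before : 'heading',
  level  : 'block',
  tokenizer : (src) => {
    const match = /^(#{1,6}) +(.+?) +\{#([\w-]+)\}[ \t]*(?:\n+|$)/.exec(src);
    if (match) {
      return {
        type: 'headingId',
        raw: match[0],
        depth: match[1].length,
        text: match[2],
        id: match[3]
      };
    }
  },
  renderer : (token) => {
    return `<h${token.depth} id="${token.id}">${token.text}</h${token.depth}>`;
  }
};

marked.use({ extensions: [ headingIds ] });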

@UziTech
Member

UziTech commented Apr 29, 2021

Definition Lists looks like a good example that could use this functionality.

@calculuschild
Contributor Author

calculuschild commented May 8, 2021

As I'm working on this, I may have run into an issue with custom block tokenizers. The problem is this: Say I'm using the example from above to underline any line starting with : instead of creating a paragraph, and I want this to behave as a separate block outside of the paragraph:

// Custom Markdown
const underline = {
  before : 'paragraph', // Leave blank to run after everything else...?
  level  : 'block',
  tokenizer : (src) => {
    const rule = /^:.*$/;
    const match = rule.exec(src);
    if (match) {
      return {
        type: 'underline',
        raw: match[0],        // This is the text that you want your token to consume from the source
        text: match[0].trim() // You can add additional properties to your tokens to pass along to the renderer
      };
    }
  },
  renderer : (token) => {
    return `<u>${token.text}</u>`;
  }
};

marked.use({ extensions : { underline } });

If I use an input of

No underline
: Underline
No underline

I want to see if I can get the extension to do this:

<p>No underline</p>
<u>Underline</u>
<p>No underline</p>

The default paragraph regex in rules.js will always consume the whole block without ever checking the custom extension since it relies on explicit rules to interrupt the paragraph.

_paragraph: /^([^\n]+(?:\n(?!hr|heading|lheading|blockquote|fences|list|html| +\n)[^\n]+)*)/,

Does this mean any custom tokenizer will also need to override the default paragraph tokenizer? That could get very messy..... Is this something we can treat similarly to the inlineText tokenizer as you did above @UziTech ? I'm having trouble picturing how that should work.

@UziTech
Member

UziTech commented May 8, 2021

I think that is a problem we are going to run into. We could do the same as the inline text. Basically the extension would give a start index and the other tokenizers would only be given the src up to that index. It might be slow but we aren't guaranteeing any speed with extensions.

It would go like this:

  1. underline tokenizer would be given No underline\n: Underline\nNo underline and it wouldn't find a token.
  2. underline start would give index 13 as the next potential start
  3. all of the other tokenizers would be given src up to that index so No underline\n and paragraph would find a token
  4. when it comes back around underline tokenizer would be given : Underline\nNo underline and it would find a token
  5. and so on
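
For the walkthrough above, the underline extension's start function could be something like this (hypothetical, just to show where index 13 comes from):

// Returns the index of the next line beginning with ':' (or -1 if none),
// so paragraph and the other block tokenizers would only be handed the src
// before that index.
const underlineStart = (src) => {
  const match = /(^|\n)(?=:)/.exec(src);
  return match ? match.index + match[1].length : -1;
};

// underlineStart('No underline\n: Underline\nNo underline') === 13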

@UziTech
Member

UziTech commented May 8, 2021

I guess the issue might be that index 13 is not an actual start and should have been part of the paragraph

@UziTech
Member

UziTech commented May 8, 2021

Ya I'm not sure how that will work. I guess we could start by saying that they have to change paragraph if they want to do something that conflicts with it for now and maybe come up with a solution later. Block level tokens should have a blank line before and after anyways so we would still be able to finish this PR with that as a requirement. Really your example of the underline should be an inline token anyway if you don't want a blank line before and after.

@calculuschild
Contributor Author

I made a new PR for this #2043. I successfully got the dumb underline extension working, but I did manually go into the paragraph rules and add : as another interrupter for now.

@UziTech
Member

UziTech commented May 8, 2021

I really hate how CommonMark makes spec rules describing how CommonMark works, not necessarily how markdown should be written.

@calculuschild
Contributor Author

calculuschild commented May 8, 2021

Block level tokens should have a blank line before and after anyways

Yes, except there are so many exceptions to this (tables, hr, heading, fences, list, blockquote, html) that it seems like something we should be able to handle. That's the behavior I was hoping to emulate with the underline example.

@UziTech
Member

UziTech commented May 8, 2021

Realistically those exceptions should be considered garbage in/garbage out. There shouldn't be spec rules for badly written markdown.

@UziTech UziTech moved this from In Progress to Done in vNext Jun 15, 2021