
Rework Lexer to use extendable array of tokenizer functions #1872

Closed
wants to merge 11 commits

Conversation

calculuschild
Contributor

Description

An attempt at #1695. Not sure about speed or elegance here, but wanted some feedback to see if this is even a reasonable route to take. If so, I would appreciate some troubleshooting to get this cleaner and to fix the issue with the broken test case.

Ideally, this would allow users to extend the Lexer by plugging in custom tokenizers at a chosen point in the lexer pipeline, while the params object exposes all of the required parameters so that functions with different signatures still work.

Notes:

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
  • no tests required for this PR.
  • If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

@vercel

vercel bot commented Dec 13, 2020

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/markedjs/markedjs/v16o5rhx5
✅ Preview: https://markedjs-git-fork-calculuschild-tokenizerarray.markedjs.now.sh

@UziTech UziTech added this to In Progress in vNext via automation Dec 14, 2020
@calculuschild
Contributor Author

Anyone want to take a whack at getting CommonMark #187 working again with this? I'm hitting a wall.

Member

@UziTech UziTech left a comment


links should only be on the top level tokens array

These fixes should fix example 187

@calculuschild
Contributor Author

Alright, now it's just the Linter getting mad at my use of a label to break out of nested loops. Can we give a pass to the linter on this one or is there a good alternative?

@calculuschild
Contributor Author

calculuschild commented Dec 16, 2020

Oof, now that it's all converted over, the benchmarks are not looking too hot on this one....

CommonMark  : 3878 (ms)
Markdown-it : 3726

Marked Commonmark es6
-----------------------------------
Current Master              : 4394
Block Arrays, Inline Arrays : 4720
Block Maps,   Inline Arrays : 5312
Block Maps,   Inline Maps   : 5846

So it looks like arrays of objects are back on the menu. Not sure what other speedups we could make to this specific code. I suspect rebuilding the "params" object so often (especially on every "inlineTokens" call) is causing some amount of slowdown?

Some variables aren't used until the params object. No need to separately declare them before.
@UziTech
Member

UziTech commented Dec 18, 2020

I think a jump to an outer loop is a fine reason to disable the linter for that line.

This definitely needs some work to get it faster before merging. I'll take a look at it soon.

@calculuschild
Contributor Author

This definitely needs some work to get it faster before merging. I'll take a look at it soon.

Is there any good Javascript profiling tool that might easily work with Marked?

@Monkatraz

Monkatraz commented Dec 22, 2020

Alright, now it's just the Linter getting mad at my use of a label to break out of nested loops. Can we give a pass to the linter on this one or is there a good alternative?

I found this:

if (this.inlineTokenizers.some(fn => fn.func.call(this, inlineParams))) continue;

On my system this has the same performance as the original loop, although I suspect this is highly variable considering it's using a callback. I'd check on your end (in fact I suspect it won't work). The performance of various methods is bizarre: if I create a bound (func.bind(this, inlineParams)) array and call those directly, it's actually slower. I'm almost not sure you can optimize this to be significantly faster, at least by playing around with the loops.

Also remove i, l, token from params. Generating/accessing these from the params object is slower than just declaring them within the tokenizer.
@calculuschild
Contributor Author

@Monkatraz Using array.some does seem to speed up a bit. Thanks for the suggestion!

We are getting there....

CommonMark  : 3878 (ms)
Markdown-it : 3726

Marked Commonmark es6
-----------------------------------
Current Master              : 4394


Previous best               : 4720
                               ↓
Array.some()                : 4569
                               ↓
Remove i, l, from params    : 4510

@styfle
Member

styfle commented Jan 27, 2021

what about a generator function ... very short and sweet in code size

It's short and sweet for ES6 but the emitted ES5 will likely grow much larger.

@Monkatraz

It's short and sweet for ES6 but the emitted ES5 will likely grow much larger.

I suspect more environments will support generator functions than not, so imo it's probably worth the trade, although only if it actually improves performance on ES6.

@UziTech
Member

UziTech commented Jan 28, 2021

It's short and sweet for ES6 but the emitted ES5 will likely grow much larger.

I think we are removing the ES5 version in v2 by the time this gets merged anyway.

@calculuschild
Contributor Author

calculuschild commented Jan 28, 2021

Now, we could go the other way... Instead of moving every step of the lexer out into an array of functions, what if we just leave the default functions as they are, but add checks for the existence of custom functions and execute them instead if they exist. Something like this (seems to work well enough):

// Custom Newline Function (inserted via Marked.use() )
newLine(src) {
  let token;
  if (token = this.tokenizer.space(src)) {
    return token;
  }
}

And then in Lexer.js:

// newline
if ((this.newLine && (token = this.newLine(src)))   // if a custom newline Tokenizer exists, execute it instead
    || (token = this.tokenizer.space(src))) {
  src = src.substring(token.raw.length);
  if (token.type) {
    tokens.push(token);
  }
  continue;
}

// if a custom Tokenizer was inserted after newline, execute it
if (this.newLine_after && (token = this.newLine_after(src))) {
  src = src.substring(token.raw.length);
  if (token.type) {
    tokens.push(token);
  }
  continue;
}

It's clunky... but at least the original performance is mostly unaffected unless custom functions actually exist.

Unfortunately I'm at my wit's end here trying to get this to work with the original performance, so I'm going to need to take a break from this PR for a bit. @Monkatraz If you are interested in testing out the generator function I'll gladly let you give it a shot. 👍

@Monkatraz

Aight, if I suddenly gain the Awake At 3AM Urge to take a crack at this, I'll try it. Hopefully it works... lol.

@calculuschild
Contributor Author

calculuschild commented Jan 28, 2021

Aight, if I suddenly gain the Awake At 3AM Urge to take a crack at this, I'll try it. Hopefully it works... lol.

Note that any overhead due to const createTokenizerIterator = function* () { ... } in the constructor (or wherever) is going to occur with each individual call to marked(), since a new Lexer is created every time a markdown string is parsed; that turned out to be the reason why bind was slowing things down so much. You may want to build off of #1909 so the setup only happens once and the Lexer object persists between calls in Bench.js.
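
For reference, here's a rough sketch of the generator idea being floated above; the names and structure are illustrative only, not code from this PR. Defining the generator once at module level (rather than inside the Lexer constructor) would avoid paying that setup cost on every marked() call.

// Rough sketch only: a generator that yields the default tokenizers in
// pipeline order, so the lexer can iterate until one returns a token.
// Custom tokenizers could be spliced in between these yields.
function* blockTokenizers(tokenizer) {
  yield tokenizer.space;
  yield tokenizer.fences;
  yield tokenizer.heading;
  yield tokenizer.paragraph;
}

// inside the lexer's block loop, something like:
// for (const fn of blockTokenizers(this.tokenizer)) {
//   const token = fn.call(this.tokenizer, src);
//   if (token) {
//     src = src.substring(token.raw.length);
//     tokens.push(token);
//     break;
//   }
// }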

@calculuschild
Contributor Author

calculuschild commented Apr 27, 2021

Well, it's been a while now and I haven't had time to really look at this for months. I feel like we might need to nail down a better structure for how we expect extensions to be made, which would help direct how some of the code in this PR is laid out. Rant incoming. Sorry for the wall of text.

Right now, a user can take a Renderer or Tokenizer object and edit its sub-functions, but there isn't a clear way to add a NEW sub-function to handle custom Markdown syntax. The user has to instead rely on intercepting and overriding, say, all the Text tokens before they get rendered and adding their custom handling in the tokenizer there. The user then has to copy/paste the existing tokenizer code in below their custom code to make sure the original functionality also works.

It would be great if there was a clear format for how these extensions are meant to be designed so as to cause minimal interference with other extensions and ease of maintenance.

Say, for simplicity, a user wants to add a custom Markdown symbol : at the beginning of the line. If a line starts with : the whole line needs to be underlined, generating a <u> ... </u> until the next newline character. Going in and editing the text tokens isn't obvious, because the user doesn't want to change the way text is rendered, but rather add an entirely new token. But say he does that and hijacks the text tokenizer or text renderer to insert <u> when it detects ^:\s, and then publishes that as an NPM package. How should the NPM package be consumed by Marked.js? I'm assuming we want packages to work in a way that allows marked.use(customNPMpackage) to be all that's necessary.

But, then, user2 comes along and makes another NPM package that takes { at the beginning of a line and makes it surround the line with a span. Now we have two NPM packages that function by hijacking the same Tokenizer function, and it seems like they will overwrite each other if you try to use both.

Not to mention, every time the Text tokenizer is updated here, extension maintainers need to copy and re-paste the code into their extension. I know the Text tokenizer isn't exactly complex, but some other Tokenizers depend on multiple helper functions that the maintainer now has to track down and duplicate in their extension.

So, with speed being kind of the priority of Marked.js, we need to decide which extensibility options we actually want to provide our users, because as we have seen in this PR, certain approaches tend to result in a lot of slowdown, and adding new features later only gets more complicated as Marked.js is pretty tightly interconnected between all its parts.

To start, I'm imagining something like this:

  1. We provide the list of all tokens we currently support, and the order in which they are handled so users know where in the process their custom token needs to appear. This is already in the Docs but the order might not be correct anymore.

  2. We provide an example Tokenizer function users can modify which takes input source text, and spits out a token object. Users currently can only modify the existing tokenizers.

  3. We allow the users to specify where in the processing order their custom tokenizer function executes, which essentially sets their priority (i.e., the user has a custom "table" that needs to be checked before defaulting to the normal table token if it isn't a match). Users also specify if the token is block-level or inline.

  4. We provide an example Renderer function users can modify, which takes an input token and generates an HTML string as output. Users can currently only modify the existing renderers. The Parser.js can just handle all of these at the end before the default error message. I don't think processing order matters at this point; the resulting HTML is the same. Also the renderer here will handle all of the token processing, basically performing the function of both parser.js and renderer.js.

  5. All of this related data is organized into an object that Marked.js consumes via marked.use() and references later on as it executes.

The extension for new custom Markdown looks like this maybe? When marked.use detects this format, we log it to some customMarkdown object somewhere rather than merging with and overwriting the existing functions. Then users can potentially add multiple custom extensions in this way without worrying about conflicts.

// Custom Markdown
const underline = {
  before : 'paragraph', // Leave blank to run after everything else...?
  level  : 'block',
  tokenizer : (src) => {
    const rule = /^:.*$/;
    const match = rule.exec(src);
    if (match) {
      return {
        type: 'underline',
        raw: match[0],        // This is the text that you want your token to consume from the source
        text: match[0].trim() // You can add additional properties to your tokens to pass along to the renderer
      };
    }
  },
  renderer : (token) => {
    return `<u>${token.text}</u>`;
  }
};

marked.use({ underline });

Otherwise, the current method for hijacking existing tokenizers/renderers can remain the same.

This is probably missing some key points but if we can narrow down how we want extensions like this to work we can build around that rather than fumbling around here like I was doing in this PR.

@UziTech
Member

UziTech commented Apr 28, 2021

That looks good. I would change a few things so we don't have breaking changes with the current way use works.

We should add a new property extensions (or something else) that underline is added to.

underline would also have to have a name property or some way to know which renderer to call for the underline token type.

marked.use({
  extensions: [
    underline,
  ]
});

@UziTech
Member

UziTech commented Apr 28, 2021

That would work for the block extension but we would need to figure out how to prevent the inlineText tokenizer from consuming the start of an inline extension.

I think we would have to give some function that returns the next potential start of the inline token so inlineText can only consume up to that point and then try all the inline tokenizers again.

For example if someone wanted to do an inline latex extension instead of overriding codespan:

const latex = {
  before : 'codespan',
  level   : 'inline',
  start : (src) => src.indexOf("$"),
  tokenizer : (src)=> {
    const match = src.match(/^\$+([^\$\n]+?)\$+/);
    if (match) {
      return {
        type: 'codespan',
        raw: match[0],
        text: match[1].trim()
      };
    }
  },
};

marked.use({ extensions: { latex } });

Then we know to end the inlineText tokenizer at or before that start index.

@calculuschild
Contributor Author

Let me make sure I understand your start function.

Currently InlineText consumes everything that we know can't be the start of an inline Token using a big RegEx, right? But now, we also make sure InlineText doesn't accidentally consume part of the custom token. So start returns the index of the next character to check for the custom token, and InlineText will work as normal, but only up to that index. Then we check again for the custom token, and if none is found, get a new index from start, and again consume up to that index, following the RegEx along the way.

So InlineText will consume in steps instead of all at once, stopping at the nearest index we get from any start functions in our Extensions object. Is that right? Yeah... And if no extensions are used, that should allow InlineText to run without any slowdowns for normal users.
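
To make that concrete, a minimal sketch of the stepping behavior, assuming a hypothetical extensions array whose inline entries expose an optional start(src) function (none of these names are a final API):

// Sketch: find the nearest potential extension start and only hand
// inlineText the text before it. The main inline loop then retries every
// tokenizer (extensions included) on whatever src remains.
function limitInlineText(src, extensions) {
  let nearest = Infinity;
  for (const ext of extensions) {
    if (ext.level !== 'inline' || typeof ext.start !== 'function') continue;
    const index = ext.start(src);
    // an index of 0 would already have been matched by the extension's own
    // tokenizer, so only indices greater than 0 shorten the inlineText input
    if (typeof index === 'number' && index > 0 && index < nearest) {
      nearest = index;
    }
  }
  return nearest === Infinity ? src : src.substring(0, nearest);
}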

@calculuschild
Contributor Author

If this format is good, I think we can just redo this PR and simplify it down to just inserting a check between each Lexer step as mentioned above. None of these crazy arrays of tokenizers and binding/calling functions that would look cleaner, but unfortunately just slow it all down.

//<- Code Tokenizer

...

// Check for custom Tokenizer   <- duplicate and paste this code in between EVERY TOKENIZER in lexer.js
// if a custom Tokenizer was inserted before Fences, execute it
if (this.fences_before && (token = this.fences_before(src))) {
  src = src.substring(token.raw.length);
  tokens.push(token);
  continue;
}

...

//Fences Tokenizer here ->

Although we probably need to account for multiple custom tokenizers inserted at the same location. So we would have to loop over the extensions object and execute each Tokenizer with before : 'fences', in the order they were provided by the user.

Hm... Maybe we can make this "check for custom tokenizer" bit a function that intelligently updates what position it is at with each call, so we can just insert a one-liner at each spot and the function knows whether we are at before : fences or before : code or wherever.
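
Something roughly like this, maybe, assuming marked.use() has already grouped the extensions by their before target (extensionsByBefore and the helper name are hypothetical, just to show the idea):

// Sketch: run every custom tokenizer registered before a given step,
// in the order the user provided them; first match wins.
function runExtensionsBefore(lexer, beforeName, src) {
  const group = (lexer.extensionsByBefore && lexer.extensionsByBefore[beforeName]) || [];
  for (const ext of group) {
    const token = ext.tokenizer.call(lexer, src);
    if (token) {
      return token;
    }
  }
  return null;
}

// then the one-liner between each pair of default tokenizers becomes:
// if (token = runExtensionsBefore(this, 'fences', src)) {
//   src = src.substring(token.raw.length);
//   tokens.push(token);
//   continue;
// }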

@UziTech
Member

UziTech commented Apr 28, 2021

Pretty much. The way I got my marked-linkify-it extension working was to override inlineText with the following tokenizer:

inlineText(src, ...args) {
  // find the next start index
  const match = linkify.match(src);
  if (match && match.length > 0) {
    // get `src` up to that index
    src = src.substring(0, match[0].index);
  }

  // run `inlineText` on the string up to that index
  return inlineText.call(this, src, ...args);
}

Here I find the next start index then truncate src and only run the inlineText tokenizer on that truncated string.

In my latex example the start function wouldn't necessarily return the actual start of the next custom token, just a potential start, to make the processing faster. Once inlineText found the next potential start of a token it would run src back through all of the inline tokenizers, including the custom ones. We would have to make sure start returns a number greater than 0, since if that is the actual start the custom tokenizer should have caught it before getting to inlineText.

@calculuschild
Contributor Author

How do we want to go about testing extensions? Do we have any other simple ones like your marked-linkify-it that we know of that we can plug in? I have some pretty ugly ones I hacked together for my own use but I'd need to rewrite them for this format.

I'm also not as versed in Javascript unit testing so adding an automated way to test extensions might need to fall to someone else. But for now I can manually plug things in and see if they work as this update is being built.

@UziTech
Member

UziTech commented Apr 29, 2021

I don't know of any other extensions that change the tokenizers. I couldn't find any with a quick search of npm. Most extensions just change a renderer.

We could create some extensions for Extended Markdown. Some of them, like heading ids, should be pretty simple.
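
A heading-ids extension in the format proposed above could look roughly like this (a sketch only; the field names follow the underline example, nothing here is a finished API, and inline markdown inside the heading text isn't handled):

// Hypothetical "heading IDs" extension, e.g. `## My heading {#custom-id}`
const headingIds = {
  before : 'heading',
  level  : 'block',
  tokenizer : (src) => {
    const match = /^(#{1,6}) +(.+?) +\{#([\w-]+)\}[ \t]*(?:\n+|$)/.exec(src);
    if (match) {
      return {
        type: 'headingId',
        raw: match[0],
        depth: match[1].length,
        text: match[2],
        id: match[3]
      };
    }
  },
  renderer : (token) => {
    return `<h${token.depth} id="${token.id}">${token.text}</h${token.depth}>`;
  }
};

marked.use({ extensions: [ headingIds ] });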

@UziTech
Member

UziTech commented Apr 29, 2021

Definition Lists looks like a good example that could use this functionality.

@calculuschild
Contributor Author

calculuschild commented May 8, 2021

As I'm working on this, I may have run into an issue with custom block tokenizers. The problem is this: Say I'm using the example from above to underline any line starting with : instead of creating a paragraph, and I want this to behave as a separate block outside of the paragraph:

// Custom Markdown
const underline = {
  before : 'paragraph', // Leave blank to run after everything else...?
  level  : 'block',
  tokenizer : (src) => {
    const rule = /^:.*$/;
    const match = rule.exec(src);
    if (match) {
      return {
        type: 'underline',
        raw: match[0],        // This is the text that you want your token to consume from the source
        text: match[0].trim() // You can add additional properties to your tokens to pass along to the renderer
      };
    }
  },
  renderer : (token) => {
    return `<u>${token.text}</u>`;
  }
};

marked.use({ extensions : { underline } });

If I use an input of

No underline
: Underline
No underline

I want to see if I can get the extension to do this:

<p>No underline</p>
<u>Underline</u>
<p>No underline</p>

The default paragraph regex in rules.js will always consume the whole block without ever checking the custom extension since it relies on explicit rules to interrupt the paragraph.

_paragraph: /^([^\n]+(?:\n(?!hr|heading|lheading|blockquote|fences|list|html| +\n)[^\n]+)*)/,

Does this mean any custom tokenizer will also need to override the default paragraph tokenizer? That could get very messy..... Is this something we can treat similarly to the inlineText tokenizer as you did above @UziTech ? I'm having trouble picturing how that should work.

@UziTech
Member

UziTech commented May 8, 2021

I think that is a problem we are going to run into. We could do the same as the inline text. Basically the extension would give a start index and the other tokenizers would only be given the src up to that index. It might be slow but we aren't guaranteeing any speed with extensions.

It would go like this:

  1. underline tokenizer would be given No underline\n: Underline\nNo underline and it wouldn't find a token.
  2. underline start would give index 13 as the next potential start
  3. all of the other tokenizers would be given src up to that index so No underline\n and paragraph would find a token
  4. when it comes back around underline tokenizer would be given : Underline\nNo underline and it would find a token
  5. and so on
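
For the walkthrough above, the underline extension's start function could be something like this (hypothetical, just to show where index 13 comes from):

// Returns the index of the next line beginning with ':' (or -1 if none),
// so paragraph and the other block tokenizers would only be handed the src
// before that index.
const underlineStart = (src) => {
  const match = /(^|\n)(?=:)/.exec(src);
  return match ? match.index + match[1].length : -1;
};

// underlineStart('No underline\n: Underline\nNo underline') === 13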

@UziTech
Member

UziTech commented May 8, 2021

I guess the issue might be that index 13 is not an actual start and should have been part of the paragraph

@UziTech
Member

UziTech commented May 8, 2021

Ya I'm not sure how that will work. I guess we could start by saying that they have to change paragraph if they want to do something that conflicts with it for now and maybe come up with a solution later. Block level tokens should have a blank line before and after anyways so we would still be able to finish this PR with that as a requirement. Really your example of the underline should be an inline token anyway if you don't want a blank line before and after.

@calculuschild
Contributor Author

I made a new PR for this #2043. I successfully got the dumb underline extension working, but I did manually go into the paragraph rules and add : as another interrupter for now.

@UziTech
Member

UziTech commented May 8, 2021

I really hate how CommonMark makes spec rules describing how CommonMark works, not necessarily how markdown should be written.

@calculuschild
Contributor Author

calculuschild commented May 8, 2021

Block level tokens should have a blank line before and after anyways

Yes, except there are so many exceptions to this (tables, hr, heading, fences, list, blockquote, html) that it seems like something we should be able to handle. That's the behavior I was hoping to emulate with the underline example.

@UziTech
Member

UziTech commented May 8, 2021

Realistically those exceptions should be considered garbage in/garbage out. There shouldn't be spec rules for badly written markdown.

@UziTech UziTech moved this from In Progress to Done in vNext Jun 15, 2021