Rework Lexer to use extendable array of tokenizer functions #1872
Conversation
Anyone want to take a whack at getting CommonMark #187 working again with this? I'm hitting a wall.
`links` should only be on the top-level tokens array. These fixes should fix example 187.
Alright, now it's just the linter getting mad at my use of a label to break out of nested loops. Can we give a pass to the linter on this one, or is there a good alternative?
Oof, now that it's all converted over, the benchmarks are not looking too hot on this one....
So it looks like arrays of objects are back on the menu. Not sure what other speedups we could make to this specific code. I suspect rebuilding the `params` object so often (especially on every `inlineTokens` call) is causing some amount of slowdown?
Much faster
Some variables aren't used until the `params` object is built. No need to declare them separately beforehand.
I think a jump to an outer loop is a fine reason to disable the linter for that line. This definitely needs some work to get it faster before merging. I'll take a look at it soon.
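A minimal sketch of the pattern in question, with the relevant ESLint rule silenced around it (illustrative only, not the actual PR code):

```js
/* eslint-disable no-labels */
outer: while (src) {
  for (const tokenizer of tokenizers) {
    const token = tokenizer.call(this, src);
    if (token) {
      src = src.substring(token.raw.length);
      tokens.push(token);
      continue outer; // jump straight back to the outer loop
    }
  }
  break; // nothing matched: stop lexing
}
/* eslint-enable no-labels */
```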
Is there any good JavaScript profiling tool that might easily work with Marked?
I found this:

```js
if (this.inlineTokenizers.some(fn => fn.func.call(this, inlineParams))) continue;
```

On my system this has the same performance as the original loop, although I suspect this is highly variable considering it's using a callback. I'd check on your end (in fact, I suspect it won't work). The performance of various methods is bizarre; if I create a bound…
Also remove `i`, `l`, and `token` from `params`. Generating/accessing these from the `params` object is slower than just declaring them within the tokenizer.
@Monkatraz Using your suggestion, we are getting there....
It's short and sweet in ES6, but the emitted ES5 will likely grow much larger.
I suspect more environments support generator functions than not, so IMO it's probably worth the trade, although only if it actually improves performance on ES6.
I think we are removing the ES5 version in v2 by the time this gets merged anyway. |
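For reference, a generator-based lexing loop might look roughly like this (a hedged sketch; the actual generator under discussion may differ):

```js
function* lex(src, tokenizers) {
  let token;
  while (src) {
    token = undefined;
    for (const tokenizer of tokenizers) {
      token = tokenizer(src);
      if (token) break;
    }
    if (!token) break;                   // no tokenizer matched: stop
    src = src.substring(token.raw.length);
    yield token;                         // hand the token to the consumer lazily
  }
}
```

A consumer would then drain it with `for (const token of lex(src, tokenizers))`.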
Now, we could go the other way... Instead of moving every step of the lexer out into an array of functions, what if we just leave the default functions as they are, but add checks for the existence of custom functions and execute them instead if they exist? Something like this (seems to work well enough):
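A minimal sketch of that check on the Tokenizer side, assuming a hypothetical `custom` property (not the actual PR code):

```js
// In Tokenizer.js: each default tokenizer defers to a custom override
// first, if one was registered.
paragraph(src) {
  if (this.custom && this.custom.paragraph) {
    const token = this.custom.paragraph.call(this, src);
    if (token) return token;
  }
  // ...fall through to the original paragraph logic...
}
```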
And then in Lexer.js:
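Again as a sketch with assumed names, the Lexer-side check might look something like this:

```js
while (src) {
  // give registered custom tokenizers first crack at `src`
  let token = null;
  if (this.tokenizer.custom) {
    for (const fn of this.tokenizer.custom) {
      token = fn.call(this, src);
      if (token) break;
    }
  }
  if (token) {
    src = src.substring(token.raw.length);
    tokens.push(token);
    continue; // restart the scan from the custom tokenizers
  }
  // ...fall through to the built-in lexer steps...
}
```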
It's clunky... but at least the original performance is mostly unaffected unless custom functions actually exist. Unfortunately I'm at my wit's end here trying to get this to work with the original performance, so I'm going to need to take a break from this PR for a bit. @Monkatraz If you are interested in testing out the generator function I'll gladly let you give it a shot. 👍
Aight, if I suddenly gain the Awake At 3AM Urge to take a crack at this, I'll try it. Hopefully it works... lol. |
Note that any overhead due to…
Well, it's been a while now and I haven't had time to really look at this for months. I feel like we might need to nail down a better structure for how we expect extensions to be made, which would help direct how some of the code in this PR is laid out. Rant incoming. Sorry for the wall of text.

Right now, a user can take a Renderer or Tokenizer object and edit its sub-functions, but there isn't a clear way to add a NEW sub-function to handle custom Markdown syntax. The user instead has to rely on intercepting and overriding, say, all the Text tokens before they get rendered, and adding their custom handling in the tokenizer there. The user then has to copy/paste the existing tokenizer code in below their custom code to make sure the original functionality also works.

It would be great if there was a clear format for how these extensions are meant to be designed, so as to cause minimal interference with other extensions and for ease of maintenance. Say, for simplicity, a user wants to add a custom Markdown symbol… But then user2 comes along and makes another NPM package that takes…

Not to mention, every time the Text tokenizer is updated here, extension maintainers need to copy and re-paste the code into their extension. I know the Text tokenizer isn't exactly complex, but some other tokenizers depend on multiple helper functions that the maintainer now has to track down and duplicate in their extension.

So, with speed being kind of the priority of Marked.js, we need to decide which extensibility options we actually want to provide our users, because as we have seen in this PR, certain approaches tend to result in a lot of slowdown, and adding new features later only gets more complicated since Marked.js is pretty tightly interconnected between all its parts.

To start, I'm imagining something like the following.
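Sketching just the registration side of that idea, with every name assumed for illustration:

```js
// Users register custom tokenizers keyed by where in the pipeline they
// should run; the Lexer checks each list between its built-in steps.
const lexer = new marked.Lexer({
  extensions: {
    block: { beforeParagraph: [myBlockTokenizer] },
    inline: { beforeText: [myInlineTokenizer] }
  }
});
```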
The extension for new custom Markdown might look something like this, maybe?
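Something in roughly this shape, using the `underline` name that the follow-up comments refer to (the syntax and fields here are assumptions, not settled API):

```js
const underline = {
  name: 'underline',
  level: 'block',
  tokenizer(src) {
    // assumed syntax for illustration: a line wrapped in colons
    const match = src.match(/^:([^:\n]+):(?:\n+|$)/);
    if (match) {
      return {
        type: 'underline',
        raw: match[0],
        text: match[1].trim()
      };
    }
  },
  renderer(token) {
    return `<u>${token.text}</u>\n`;
  }
};
```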
Otherwise, the current method for hijacking existing tokenizers/renderers can remain the same. This is probably missing some key points, but if we can narrow down how we want extensions like this to work, we can build around that rather than fumbling around like I was doing in this PR.
That looks good. I would change a few things so we don't have breaking changes with the current way `use` works. We should add a new `extensions` property:
```js
marked.use({
  extensions: [
    underline,
  ]
});
```
That would work for the block extension, but we would need to figure out how to prevent the `inlineText` tokenizer from consuming custom inline syntax. I think we would have to give some function that returns the next potential start of the inline token, so `inlineText` can only consume up to that point and then try all the inline tokenizers again. For example, if someone wanted to do an inline LaTeX extension instead of overriding `codespan`:

```js
const latex = {
  before: 'codespan',
  level: 'inline',
  start: (src) => src.indexOf('$'),
  tokenizer: (src) => {
    const match = src.match(/^\$+([^\$\n]+?)\$+/);
    if (match) {
      return {
        type: 'codespan',
        raw: match[0],
        text: match[1].trim()
      };
    }
  },
};

marked.use({ extensions: { latex } });
```

Then we know to end the `inlineText` tokenizer at or before that `start` index.
Let me make sure I understand your `start` function. Currently, `inlineText` consumes everything up to the next inline token it recognizes. So the extension's `start` index would tell us where to cut `inlineText` off so the custom tokenizer gets a chance to run?
If this format is good, I think we can just redo this PR and simplify it down to just inserting a check between each Lexer step, as mentioned above. None of these crazy arrays of tokenizers and bound/called functions that would look cleaner but unfortunately just slow it all down.
Although we probably need to account for multiple custom tokenizers inserted at the same location. So we would have to loop over the extensions registered at each insertion point. Hm... Maybe we can make this…
Pretty much. The way I got my linkify extension working:

```js
inlineText(src, ...args) {
  // find the next start index
  const match = linkify.match(src);
  if (match && match.length > 0) {
    // get `src` up to that index
    src = src.substring(0, match[0].index);
  }
  // run `inlineText` on the string up to that index
  return inlineText.call(this, src, ...args);
}
```

Here I find the next start index, then truncate `src` to end there. In my latex example, the `start` function would supply that index.
How do we want to go about testing extensions? Do we have any other simple ones like your latex example? I'm also not as versed in JavaScript unit testing, so adding an automated way to test extensions might need to fall to someone else. But for now I can manually plug things in and see if they work as this update is being built.
I don't know of any other extensions that change the tokenizers; I couldn't find any with a quick search of npm. Most extensions just change a renderer. We could create some extensions for Extended Markdown. Some of them, like heading IDs, should be pretty simple.
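As a hedged illustration of how small such an extension could be (the renderer-override shape here is an assumption, not settled API):

```js
const headingIds = {
  renderer: {
    heading(text, level) {
      // derive an id from the heading text
      const id = text.toLowerCase().trim().replace(/[^\w]+/g, '-');
      return `<h${level} id="${id}">${text}</h${level}>\n`;
    }
  }
};

marked.use(headingIds);
```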
Definition Lists look like a good example that could use this functionality.
As I'm working on this, I may have run into an issue with custom block tokenizers. The problem is this: say I'm using the example from above to underline any line starting with the custom marker.

If I use an input where the underlined line sits inside a paragraph, with no blank lines around it, I want to see if I can get the extension to pull that line out while leaving the surrounding paragraph intact.

The default paragraph tokenizer, though, consumes the entire run of text, custom line included, before the extension ever sees it.
Does this mean any custom tokenizer will also need to override the default paragraph tokenizer? That could get very messy... Is this something we can treat similarly to the `inlineText` solution?
I think that is a problem we are going to run into. We could do the same as the inline text. Basically, the extension would give a start index, and the other tokenizers would only be given the `src` string up to that index. It would go like this:
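A hedged sketch of that flow, reconstructed from the surrounding discussion (names are illustrative):

```js
// Before running the default block tokenizers, find the earliest index
// at which any extension claims it could start.
let cutoff = src.length;
for (const ext of extensions) {
  if (ext.start) {
    const index = ext.start(src);
    if (index >= 0 && index < cutoff) cutoff = index;
  }
}
// Default tokenizers (e.g. paragraph) only see text before the cutoff;
// at the cutoff, the custom tokenizers get first crack again.
const limited = src.substring(0, cutoff);
```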
I guess the issue might be that index…
Ya, I'm not sure how that will work. I guess we could start by saying that they have to override paragraph if they want to do something that conflicts with it, and maybe come up with a solution later. Block-level tokens should have a blank line before and after anyway, so we would still be able to finish this PR with that as a requirement. Really, your underline example should be an inline token anyway if you don't want a blank line before and after.
I made a new PR for this: #2043. I successfully got the dumb underline example working.
I really hate how CommonMark's spec rules describe how CommonMark works, not necessarily how Markdown should be written.
Yes, except there are so many exceptions to this (tables, hr, heading, fences, list, blockquote, html) that it seems like something we should be able to handle. That's the behavior I was hoping to emulate with the `start` function.
Realistically those exceptions should be considered garbage in/garbage out. There shouldn't be spec rules for badly written markdown. |
Description
An attempt at #1695. Not sure about speed or elegance here, but wanted some feedback to see if this is even a reasonable route to take. If so, I would appreciate some troubleshooting to get this cleaner and to fix the issue with the broken test case.
Ideally, this would allow users to extend the Lexer by plugging in custom tokenizers at a chosen place in the lexer pipeline, and the `params` object exposes all of the required parameters to make functions with different signatures work.

Notes:

- Something is broken with the `def` tokenizer. I don't fully understand what, but it makes CommonMark examples 185 and 187 fail.
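To make the `params` convention concrete, a minimal sketch (field names are assumptions, not the actual PR code):

```js
// One object carries everything any tokenizer might need, so tokenizers
// with different signatures can share a single calling convention.
const params = {
  src,       // remaining markdown source
  tokens,    // tokens emitted so far
  top: true  // whether we are lexing at the top level
};

// Every tokenizer reads what it needs from `params`, mutating
// `params.src` as it consumes input.
for (const tokenizer of this.blockTokenizers) {
  if (tokenizer.call(this, params)) break; // token matched: restart the scan
}
```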