
Tokenizer #1637

Merged
merged 15 commits into from Apr 16, 2020

Conversation

UziTech
Member

@UziTech UziTech commented Apr 3, 2020

Description

This PR builds on #1627 to create a Tokenizer class that can be extended to influence the Lexer.

TODO:

  • Write unit tests
  • Make benchmarks faster (currently same as master)
  • Update docs

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
  • no tests required for this PR.
  • If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

  • Draft GitHub release notes have been updated.
  • CI is green (no forced merge required).
  • Merge PR


@calculuschild
Contributor

calculuschild commented Apr 3, 2020

Nice! Looks like you got it going. I think I'll go ahead and close #1632 then.

@UziTech
Member Author

UziTech commented Apr 3, 2020

This should probably be cleaned up. Right now the Tokenizer functions get passed an object with src and out properties that need to be edited in addition to returning a token. It seems like a weird way to make the user edit src and out.
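
For illustration, a rough sketch of the kind of interface being described (assumed shape, not necessarily the PR's actual code at this point):

// Assumed shape for illustration: a tokenizer function that both mutates the
// shared state (src and out) and returns a token.
const state = { src: '*emphasis* rest', out: '' };

function em(state) {
  const cap = /^\*([^*]+)\*/.exec(state.src);
  if (cap) {
    state.src = state.src.substring(cap[0].length); // the caller's src is edited here
    state.out += cap[1];                            // and so is the inline output text
    return { type: 'em', text: cap[1] };
  }
}

console.log(em(state), state); // the token plus the mutated { src, out }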

@calculuschild
Contributor

What does the out property do?

@UziTech
Member Author

UziTech commented Apr 3, 2020

It provides the text for inline tokens.

@UziTech
Member Author

UziTech commented Apr 3, 2020

After some refactoring, it looks like we could remove the out property if we don't care to have the text for inline tokens that can have other inline tokens inside of them (strong, em, and del).

@calculuschild
Contributor

calculuschild commented Apr 4, 2020

It seems like a weird way to make the user edit the src and out.

I would agree that the Tokenizer functions should probably just focus on generating a token. Making a user also handle all these side effects seems like it would make extension more convoluted than it needs to be.

In my attempt #1632 I also removed editing the src from the Token functions. I imagine the same could be done here, so src is only really needed to do the initial regex capture and then if a token is returned the lexer will update src appropriately.

@UziTech
Member Author

UziTech commented Apr 5, 2020

Yes, but I think the original regex capture is important to be able to extend/change. And how would we deal with backtracking? If we want to be 100% spec compliant, I think we will need to do more of that.

@calculuschild
Contributor

I think the original regex capture is important to be able to extend/change.

Right, performing the regex capture makes sense, so we pass in src for that, but I don't think it's necessary to edit src in the tokenizer. Then again, isn't that why the rules are already extendable?

And how would we deal with backtracking?

In #1632 I found ways around the backtracking in the list token. If that's not enough, perhaps we can simply use the length of the "raw" text in the tokens and consume that much of src.

@UziTech
Member Author

UziTech commented Apr 5, 2020

I like the idea of using raw length 👍. We will still need to pass the src to the tokenizer functions but they won't need to update it.
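
A standalone sketch of that idea (names assumed): the tokenizer only reads src and returns a token carrying its raw text, and the lexer consumes src by that raw length.

// Sketch only, with assumed rule names and a simplified loop.
const tokenizer = {
  space(src) {
    const cap = /^\n+/.exec(src);
    if (cap) return { type: 'space', raw: cap[0] };
  },
  paragraph(src) {
    const cap = /^[^\n]+/.exec(src);
    if (cap) return { type: 'paragraph', raw: cap[0], text: cap[0] };
  }
};

function lex(src) {
  const tokens = [];
  let token;
  while (src) {
    if (token = tokenizer.space(src) || tokenizer.paragraph(src)) {
      src = src.substring(token.raw.length); // the lexer, not the tokenizer, advances src
      tokens.push(token);
    } else {
      break; // nothing matched; bail out in this sketch
    }
  }
  return tokens;
}

console.log(lex('hello\n\nworld'));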

@UziTech
Member Author

UziTech commented Apr 6, 2020

I think this is just about ready. A few things to point out:

  1. The block text function is called text and the inline text function is called inlineText.
  2. I left smartypants and mangle as public functions so the user could extend them if they want.
  3. Some tokens (em, strong, del, link) don't have a text property since they have inline tokens. We could add a text property from the tokens property but I don't think it is necessary.
  4. I am passing src and other variables by value instead of in an object but I didn't see any slow down in the benchmarks.
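
For illustration, a rough sketch of how overriding one of these tokenizer functions might look under this design (assumes marked exposes the Tokenizer class and accepts a tokenizer option):

const marked = require('marked');

// Sketch only: override the block-level text tokenizer and defer to the original.
const tokenizer = new marked.Tokenizer();
const originalText = tokenizer.text;
tokenizer.text = function(src) {
  // custom handling of plain text could go here
  return originalText.apply(this, arguments);
};

console.log(marked('some *markdown*', { tokenizer }));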

@calculuschild
Contributor

I am passing src and other variables by value instead of in an object but I didn't see any slow down in the benchmarks.

My understanding is objects, strings, and arrays all default to pass by reference in Javascript anyway, so really top is the only one that would make a difference in speed, no?

@UziTech
Member Author

UziTech commented Apr 6, 2020

My understanding is objects, strings, and arrays all default to pass by reference in Javascript anyway, so really top is the only one that would make a difference in speed, no?

Correct.

@calculuschild
Contributor

Cool. This is awesome work!

@UziTech
Member Author

UziTech commented Apr 6, 2020

The other thing I would like to do for extensibility is to allow multiple extensions to alter different parts of the renderer and tokenizer. For example, if one extension only needs to alter code and another only affects html, they won't override each other.

I am thinking of something like this interface:

// marked_code_extension
module.exports = {
	tokenizer: {
		code(lexer, src, tokens, top) {
			// code tokenizer
		}
	},
	renderer: {
		code(code, infostring, escaped) {
			// code renderer
		}
	}
};

// marked_html_extension
module.exports = {
	tokenizer: {
		html(lexer, src, tokens, top) {
			// html tokenizer
		}
	},
	renderer: {
		html(html) {
			// html renderer
		}
	}
};

const marked = require('marked');
marked.use(require('marked_code_extension'));
marked.use(require('marked_html_extension'));

const html = marked(markdown);
// code and html extensions will both be used

@UziTech UziTech requested a review from styfle April 6, 2020 16:58
@calculuschild
Contributor

I am thinking of something like this interface:

This would be handy. Would this maybe fit in a separate PR?

@UziTech
Member Author

UziTech commented Apr 6, 2020

This would be handy. Would this maybe fit in a separate PR?

Yes, I would do a separate PR for this, but I wanted to get the idea out there before I start updating the docs.

Member

@joshbruce joshbruce left a comment

I appreciate where this is going and would like to be able to make more substantive reviews. Looking at it, with limited in-depth knowledge and wanting to make sure where we're heading, I see four major parts:

  1. Parser: Breaks apart the plain text into constituent pieces (content components).
  2. Lexer: Holds the rules to be applied to the parts.
  3. Tokenizer (??): Holds the results of the tokens.
  4. Renderer: Applies the lexer rules to the parsed components.

I'm confused on some of the divisions and why they exist.

@UziTech
Member Author

UziTech commented Apr 14, 2020

@joshbruce

  1. Lexer takes the markdown and sends it to the Tokenizer functions.
  2. Tokenizer uses rules to create tokens.
  3. Parser takes tokens and sends them to the Renderer functions.
  4. Renderer returns html for output.
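
A tiny sketch of that flow using the exported Lexer and Parser helpers (output abbreviated):

const marked = require('marked');

// Lexer + Tokenizer: markdown in, tokens out.
const tokens = marked.lexer('# heading\n\nsome *text*');

// Parser + Renderer: tokens in, HTML out.
const html = marked.parser(tokens);
console.log(html); // <h1 ...>heading</h1> <p>some <em>text</em></p>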

@calculuschild
Contributor

calculuschild commented Apr 14, 2020

I think this is a good compromise. With this, users can

  1. change the rules for capturing text (via extending rules) (is this possible?)
  2. change what that captured text means (via extending tokenizer)
  3. change what output is generated (via extending renderer)

This seems like a pretty reasonable (and more-or-less complete) amount of control while keeping functionality compartmentalized.

@UziTech
Member Author

UziTech commented Apr 14, 2020

change the rules for capturing text (via extending rules) (is this possible?)

They are currently extendable (though not very easily) by monkey patching them from the Lexer:

marked.Lexer.rules.block.html = // change default html rule

I would like to do a PR after this one gets merged to make extending the rules, Renderer, and Tokenizer much easier. Something like:

marked.use({
	rules: {
		block: {
			html: // change default html rule
		}
	}, 
	tokenizer: {
		html: // change html function in tokenizer
	}, 
	renderer: {
		html: // change html function in renderer
	}
});

That would allow extensions to just return that object so users could use marked with extensions like:

marked.use(require("some_marked_extension"));

// From the docs/USING_PRO.md example under review:
console.log(marked('$ latex code $', { tokenizer }));

Member

Nice example!

Member

Can we also achieve a dynamic TOC based on headings using a custom tokenizer?

Member Author

It seems like it would be easier to use the Renderer for that. example
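
For context, a minimal sketch of a Renderer-based TOC (details assumed; the linked example may differ):

const marked = require('marked');

// Collect headings while rendering, then build the TOC from the collected list.
const renderer = new marked.Renderer();
const headings = [];
const originalHeading = renderer.heading;
renderer.heading = function(text, level, raw, slugger) {
  headings.push({ text, level });
  return originalHeading.apply(this, arguments);
};

marked('# Title\n\n## Section', { renderer });
console.log(headings); // [{ text: 'Title', level: 1 }, { text: 'Section', level: 2 }]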

Member

Oh yeah. So then couldn't this LaTeX example also be achieved using the Renderer instead of Tokenizer?

Member Author

No, because the dollar sign ($) isn't normally a valid code token starter, so it would just be counted as text.

I'm sure the example code would fail on edge cases, but it is more about showing how to extend the Tokenizer than creating a robust LaTeX interpreter.
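
For reference, a sketch along the lines of that example (regex and details assumed; not meant to be a robust LaTeX handler):

const marked = require('marked');

// Override the inline codespan tokenizer so $...$ is captured as a code token.
const tokenizer = new marked.Tokenizer();
const originalCodespan = tokenizer.codespan;
tokenizer.codespan = function(src) {
  const match = src.match(/^\$+([^$\n]+?)\$+/);
  if (match) {
    return {
      type: 'codespan',
      raw: match[0],
      text: match[1].trim()
    };
  }
  return originalCodespan.apply(this, arguments);
};

console.log(marked('$ latex code $', { tokenizer }));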

@calculuschild
Contributor

These lines still mention rules as part of the Lexer. They are now only accessible from the Tokenizer, no? (GitHub is not letting me comment directly on those lines...)

@UziTech
Member Author

UziTech commented Apr 15, 2020

Nice catch. Yes, the rules are set on the Tokenizer based on the options, and the Lexer has a static rules property where the user can access all of the rules.
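
A quick illustration of that access point (assuming the static property keeps the block/inline split):

const { block, inline } = require('marked').Lexer.rules;
console.log(block.html, inline.em); // default rule regexes, available for inspection or patching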

Member

@styfle styfle left a comment

🥳

@UziTech
Member Author

UziTech commented Apr 16, 2020

@calculuschild have you had a chance to review this PR? I would like your approval before merging it, since you had the original idea for the Tokenizer. Does this provide the fix for your use case?

@calculuschild
Contributor

@UziTech Yes, this should work just fine for what I need. Thank you.

@UziTech UziTech merged commit 904c974 into markedjs:master Apr 16, 2020
@UziTech UziTech deleted the tokenizer branch April 16, 2020 18:31
@UziTech UziTech mentioned this pull request Apr 20, 2020
zhenalexfan pushed a commit to zhenalexfan/MarkdownHan that referenced this pull request Nov 8, 2021