
Split Lexer into extendable functions #1632

Closed

Conversation

@calculuschild (Contributor) commented Mar 31, 2020

List, NPTable, and Table functions slightly rewritten to remove the need to access the "src" variable within the functions (i.e., no "backtracking" steps).

Marked version: 8.2

Markdown flavor: Markdown.pl|CommonMark|GitHub Flavored Markdown|n/a

Description

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests were added, list the tests covering this PR); or,
  • no tests are required for this PR.
  • If submitting a new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

  • Draft GitHub release notes have been updated.
  • CI is green (no forced merge required).
  • Merge PR

vercel bot commented Mar 31, 2020

This pull request is being automatically deployed with ZEIT Now (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://zeit.co/markedjs/markedjs/5z6k4mrtb
✅ Preview: https://markedjs-git-fork-calculuschild-extendable-lexer.markedjs.now.sh

@calculuschild (Contributor, Author) commented Mar 31, 2020

This is my initial attempt at splitting the Lexer into extendable functions for each Markdown feature. It passes all tests, with no noticeable slowdown in the benchmark (that I can see). Each function returns true if it successfully generated tokens; e.g., the table function might detect an invalid table (header and delimiter rows with different cell counts) and simply return without consuming from src. Hope that makes sense.

Every function is passed cap, which contains the captured text from the regex in rules.js. Some functions (list and blockquote) also require top, as they can recursively contain other elements.
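A minimal sketch of that contract (the capture layout and names here are hypothetical, not the PR's actual code): each feature function receives the regex capture cap, pushes tokens on success, and returns true only when it consumed input.

```javascript
const tokens = [];

function table(cap) {
  // Hypothetical capture layout: cap[1] = header row, cap[2] = delimiter row.
  const header = cap[1].split('|').filter(Boolean).map(s => s.trim());
  const delimiter = cap[2].split('|').filter(Boolean);
  // Invalid table: cell counts differ, so generate no tokens and let the
  // caller leave `src` untouched.
  if (header.length !== delimiter.length) return false;
  tokens.push({ type: 'table', header });
  return true;
}
```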

So, further questions:

  1. This also generates changes in marked.js and marked.esm.js. Do I need to commit those?
  2. Since some functions require top, should we pass it to every function to allow users full customization in case they want some weird custom nesting behavior (and for consistency)?
  3. Should we also expose src to each function? I explicitly rewrote a couple sections so it would not be necessary, but perhaps some users would want to be able to view the current text being parsed? It should minimally impact performance as strings are passed by reference.
  4. Should the entire regex rule check (e.g., if (cap = this.rules.html.exec(src))) also just be moved into the functions as well?
  5. The remaining 14 copies of src = src.substring(cap[0].length); just bug me. Is there a clean way to eliminate that repetition?
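For reference, the repetition in question 5 could in principle be factored into a helper (a sketch only; the PR itself keeps the inline form): every successful match advances src past the consumed capture in the same way.

```javascript
// Advance `src` past the text the regex consumed.
function consume(src, cap) {
  return src.substring(cap[0].length);
}

let src = '# heading\nrest';
const cap = /^# [^\n]*\n/.exec(src);
src = consume(src, cap); // src is now 'rest'
```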

@UziTech (Member) commented Mar 31, 2020

  1. No, it will be built when the PR is merged.
  2. Maybe.
  3. Maybe we should pass an object with all of these variables?
  4. That might be better.
  5. Probably not without introducing overhead. There are quite a few things we do to keep it as fast as possible.

I am working on getting the InlineLexer to output tokens as well in PR #1627. Maybe you want to build on that PR to allow the inline tokens (e.g. strong, em, link, etc.) to be extendable as well.
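The "pass an object" idea from answer 3 might look something like this sketch (names are illustrative, not marked's actual API): bundle src, top, and the token array into one state object so every tokenizer function has a uniform signature.

```javascript
// One shared state object instead of separate src/top/cap parameters.
function paragraph(state) {
  const cap = /^[^\n]+/.exec(state.src);
  if (!cap) return false;
  state.src = state.src.substring(cap[0].length);
  state.tokens.push({ type: 'paragraph', text: cap[0] });
  return true;
}

const state = { src: 'hello world', top: true, tokens: [] };
paragraph(state);
```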

@calculuschild (Contributor, Author)

maybe you want to build on that PR to allow the inline tokens (e.g. strong, em, link, etc.) to be extendable as well.

@UziTech Is that PR pretty stable where it is? Do I just check out a new branch based on that PR?

@UziTech (Member) commented Mar 31, 2020

I'm thinking the easiest way for users to be able to extend the lexers would be to create a Tokenizer class that holds those functions and each function is passed an object with those variables. The functions can return false to continue to the next function or edit the src and return a token to add to the lexer's tokens array.

The user could modify the Tokenizer and add the modified Tokenizer as an option, the same way the Renderer is modified for the Parser.
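A sketch of that shape (the class and method names are assumptions here, not the final API): the base Tokenizer owns the default rules, and a user subclasses it the same way they would subclass the Renderer.

```javascript
class Tokenizer {
  // Default rule: inline code span.
  codespan(src) {
    const match = /^`([^`]+)`/.exec(src);
    if (!match) return false;
    return { type: 'codespan', raw: match[0], text: match[1] };
  }
}

class MyTokenizer extends Tokenizer {
  // Override one rule; everything else falls through to the defaults.
  codespan(src) {
    const token = super.codespan(src);
    if (token) token.text = token.text.toUpperCase();
    return token;
  }
}
```

The Lexer would then accept the custom tokenizer as an option, e.g. something like new Lexer({ tokenizer: new MyTokenizer() }) (hypothetical option name).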

@calculuschild (Contributor, Author) commented Apr 1, 2020

Ok, so let me make sure I have this right:

  1. Add a Tokenizer class in a new Tokenizer.js file, organized similarly to Renderer.js?
  2. The Tokenizer will hold functions that are all passed an object with src and top
    • note: if rules.js regex is checked within the function, no need to pass cap
  3. Each function will do the following:
    1. Compare the current src with regex rules from rules.js
    2. If match, parse the matched text into tokens
    3. If tokens are successfully added to the Lexer's tokens array, alter src and return true
    4. If some error occurs or no token is generated, return false
  4. Lexer will simply loop through each Tokenizer function in sequence until it receives a true to continue to the next iteration, ending when all of src is consumed.

If this makes sense to you I can use #1627 as the starting point for this.

@UziTech (Member) commented Apr 1, 2020

If tokens are successfully added to the Lexer's tokens array, alter src and return true

I think it would be better if the function returns the token if it is successful and alters src. Then the Lexer can add the token to the array and continue the while loop.
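That flow can be sketched as follows (the rules here are hypothetical examples, not marked's real ones): each tokenizer returns a token carrying the raw consumed text, or false; the Lexer advances src and collects the tokens.

```javascript
function lex(src, tokenizers) {
  const tokens = [];
  while (src) {
    let matched = false;
    for (const tokenize of tokenizers) {
      const token = tokenize(src);
      if (token) {
        // The token's `raw` field tells the Lexer how much src was consumed.
        src = src.substring(token.raw.length);
        tokens.push(token);
        matched = true;
        break;
      }
    }
    if (!matched) break; // nothing consumed: bail out rather than loop forever
  }
  return tokens;
}

// Example rules: an ATX heading and a catch-all text line.
const heading = src => {
  const cap = /^# ([^\n]*)\n?/.exec(src);
  return cap && { type: 'heading', raw: cap[0], text: cap[1] };
};
const text = src => {
  const cap = /^[^\n]+\n?/.exec(src);
  return cap && { type: 'text', raw: cap[0], text: cap[0].trim() };
};
```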

It might be easier if we just collaborate on #1627 together rather than trying to make two PRs to marked. You can contribute to that PR by sending PRs to the UziTech/marked inline-tokens branch.

@calculuschild (Contributor, Author) commented Apr 1, 2020

I think it would be better if the function returns the token if it is successful and alters src.

Gotcha. I misread your earlier post.
Edit: Some of these Markdown elements result in multiple tokens (begin/end tokens). I imagine we would instead return an array containing all the tokens generated?

Ok. I can work on that branch.

@calculuschild mentioned this pull request on Apr 3, 2020.
@calculuschild (Contributor, Author)

Superseded by #1637.

Successfully merging this pull request may close: Extending with custom tags.