
Split Lexer into extendable functions #1632

Closed

Conversation

@calculuschild (Contributor) commented Mar 31, 2020

List, NPTable, and Table functions slightly rewritten to remove the need to access the "src" variable within the functions (i.e., no "backtracking" steps).

Marked version: 8.2

Markdown flavor: Markdown.pl|CommonMark|GitHub Flavored Markdown|n/a

Description

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests were added, list the tests covering this PR); or,
  • no tests are required for this PR.
  • If submitting a new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

  • Draft GitHub release notes have been updated.
  • CI is green (no forced merge required).
  • Merge PR

vercel bot commented Mar 31, 2020

This pull request is being automatically deployed with ZEIT Now (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://zeit.co/markedjs/markedjs/5z6k4mrtb
✅ Preview: https://markedjs-git-fork-calculuschild-extendable-lexer.markedjs.now.sh

@calculuschild (Contributor, Author) commented Mar 31, 2020

This is my initial attempt at splitting the Lexer into extendable functions for each Markdown feature. It passes all tests, with no noticeable slowdown in the benchmark (that I can see). Each function returns true if it successfully generated tokens; e.g., the table function might detect an invalid table (header and delimiter rows with different cell counts) and simply return without consuming from src. Hope that makes sense.

Every function is passed cap, which contains the captured text from the regex in rules.js. Some functions (list and blockquote) also require top, as they can recursively contain other elements.
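A minimal sketch of that contract (the capture layout and names here are hypothetical, not the PR's actual code): each feature function receives the regex capture cap, pushes tokens on success, and returns true only when it consumed input.

```javascript
const tokens = [];

function table(cap) {
  // Hypothetical capture layout: cap[1] = header row, cap[2] = delimiter row.
  const header = cap[1].split('|').filter(Boolean).map(s => s.trim());
  const delimiter = cap[2].split('|').filter(Boolean);
  // Invalid table: cell counts differ, so generate no tokens and let the
  // caller leave `src` untouched.
  if (header.length !== delimiter.length) return false;
  tokens.push({ type: 'table', header });
  return true;
}
```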

So, further questions:

  1. This also generates changes in marked.js and marked.esm.js. Do I need to commit those?
  2. Since some functions require top, should we pass it to every function to allow users full customization in case they want some weird custom nesting behavior (and for consistency)?
  3. Should we also expose src to each function? I explicitly rewrote a couple sections so it would not be necessary, but perhaps some users would want to be able to view the current text being parsed? It should minimally impact performance as strings are passed by reference.
  4. Should the entire regex rule check (e.g., if (cap = this.rules.html.exec(src))) also just be moved into the functions as well?
  5. The remaining 14 copies of src = src.substring(cap[0].length); just bug me. Is there a clean way to eliminate that repetition?
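For reference, the repetition in question 5 could in principle be factored into a helper (a sketch only; the PR itself keeps the inline form): every successful match advances src past the consumed capture in the same way.

```javascript
// Advance `src` past the text the regex consumed.
function consume(src, cap) {
  return src.substring(cap[0].length);
}

let src = '# heading\nrest';
const cap = /^# [^\n]*\n/.exec(src);
src = consume(src, cap); // src is now 'rest'
```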

@UziTech (Member) commented Mar 31, 2020

  1. No, it will be built when the PR is merged.
  2. Maybe.
  3. Maybe we should pass an object with all of these variables?
  4. That might be better.
  5. Probably not without introducing overhead. There are quite a few things we do to keep it as fast as possible.

I am working on getting the InlineLexer to output tokens as well in PR #1627. Maybe you want to build on that PR to allow the inline tokens (e.g. strong, em, link, etc.) to be extendable as well.
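The "pass an object" idea from answer 3 might look something like this sketch (names are illustrative, not marked's actual API): bundle src, top, and the token array into one state object so every tokenizer function has a uniform signature.

```javascript
// One shared state object instead of separate src/top/cap parameters.
function paragraph(state) {
  const cap = /^[^\n]+/.exec(state.src);
  if (!cap) return false;
  state.src = state.src.substring(cap[0].length);
  state.tokens.push({ type: 'paragraph', text: cap[0] });
  return true;
}

const state = { src: 'hello world', top: true, tokens: [] };
paragraph(state);
```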

@calculuschild (Contributor, Author)

maybe you want to build on that PR to allow the inline tokens (e.g. strong, em, link, etc.) to be extendable as well.

@UziTech Is that PR pretty stable where it is? Do I just check out a new branch based on that PR?

@UziTech (Member) commented Mar 31, 2020

I'm thinking the easiest way for users to be able to extend the lexers would be to create a Tokenizer class that holds those functions and each function is passed an object with those variables. The functions can return false to continue to the next function or edit the src and return a token to add to the lexer's tokens array.

The user could modify the Tokenizer and add the modified Tokenizer as an option, the same way the Renderer is modified for the Parser.
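A sketch of that shape (the class and method names are assumptions here, not the final API): the base Tokenizer owns the default rules, and a user subclasses it the same way they would subclass the Renderer.

```javascript
class Tokenizer {
  // Default rule: inline code span.
  codespan(src) {
    const match = /^`([^`]+)`/.exec(src);
    if (!match) return false;
    return { type: 'codespan', raw: match[0], text: match[1] };
  }
}

class MyTokenizer extends Tokenizer {
  // Override one rule; everything else falls through to the defaults.
  codespan(src) {
    const token = super.codespan(src);
    if (token) token.text = token.text.toUpperCase();
    return token;
  }
}
```

The Lexer would then accept the custom tokenizer as an option, e.g. something like new Lexer({ tokenizer: new MyTokenizer() }) (hypothetical option name).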

@calculuschild (Contributor, Author) commented Apr 1, 2020

Ok, so let me make sure I have this right:

  1. Add a Tokenizer class in a new Tokenizer.js file, organized similarly to Renderer.js?
  2. The Tokenizer will hold functions that are all passed an object with src and top
    • note: if rules.js regex is checked within the function, no need to pass cap
  3. Each function will do the following:
    1. Compare the current src with regex rules from rules.js
    2. If match, parse the matched text into tokens
    3. If tokens are successfully added to the Lexer's tokens array, alter src and return true
    4. If some error occurs or no token is generated, return false
  4. Lexer will simply loop through each Tokenizer function in sequence until it receives a true to continue to the next iteration, ending when all of src is consumed.

If this makes sense to you I can use #1627 as the starting point for this.

@UziTech (Member) commented Apr 1, 2020

If tokens are successfully added to the Lexer's tokens array, alter src and return true

I think it would be better if the function returns the token if it is successful and alters src. Then the Lexer can add the token to the array and continue the while loop.
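That flow can be sketched as follows (the rules here are hypothetical examples, not marked's real ones): each tokenizer returns a token carrying the raw consumed text, or false; the Lexer advances src and collects the tokens.

```javascript
function lex(src, tokenizers) {
  const tokens = [];
  while (src) {
    let matched = false;
    for (const tokenize of tokenizers) {
      const token = tokenize(src);
      if (token) {
        // The token's `raw` field tells the Lexer how much src was consumed.
        src = src.substring(token.raw.length);
        tokens.push(token);
        matched = true;
        break;
      }
    }
    if (!matched) break; // nothing consumed: bail out rather than loop forever
  }
  return tokens;
}

// Example rules: an ATX heading and a catch-all text line.
const heading = src => {
  const cap = /^# ([^\n]*)\n?/.exec(src);
  return cap && { type: 'heading', raw: cap[0], text: cap[1] };
};
const text = src => {
  const cap = /^[^\n]+\n?/.exec(src);
  return cap && { type: 'text', raw: cap[0], text: cap[0].trim() };
};
```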

It might be easier if we just collaborate on #1627 together rather than trying to make two PRs to marked. You can contribute to that PR by sending PRs to the UziTech/marked inline-tokens branch.

@calculuschild (Contributor, Author) commented Apr 1, 2020

I think it would be better if the function returns the token if it is successful and alters src.

Gotcha. I misread your earlier post.
Edit: Some of these Markdown elements result in multiple tokens (begin/end tokens). I imagine we would instead return an array containing all the tokens generated?

Ok. I can work on that branch.

@calculuschild mentioned this pull request on Apr 3, 2020.
@calculuschild (Contributor, Author)

Superseded by #1637.

Successfully merging this pull request may close: Extending with custom tags.