Follow GFM spec on EM and STRONG delimiters #1686

calculuschild · 2020-05-21T17:07:32Z

Starting toward better adherence to the GFM spec on Emphasis, specifically Left-flanking-delimiter-runs.

Marked version: 1.1.0

Markdown flavor: CommonMark|GitHub Flavored Markdown

Description

Fixes Examples:
(em) 341, 367, 368, 371, 372, 379, 390, 406, 417, 441, 444
(strong) 391, 397, 399, 400, 401, 431, 443, 471, 475, 476, 479, 480

What was attempted

Applying the GFM spec for left-flanking and right-flanking delimiter runs more accurately for EM and STRONG tags.
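
For readers unfamiliar with the flanking rules, here is a minimal sketch of how a delimiter run can be classified. It follows the definitions in the CommonMark/GFM spec, but `classifyDelimiterRun`, `WHITESPACE`, and `PUNCTUATION` are hypothetical names for illustration and not marked's actual code. It approximates Unicode whitespace with `\s` and does not handle characters outside the Basic Multilingual Plane.

```javascript
// Sketch only: hypothetical helper, not marked's tokenizer.
const WHITESPACE = /\s/;
// ASCII punctuation plus Unicode punctuation categories.
const PUNCTUATION = /[!-\/:-@\[-\x60{-~]|\p{P}/u;

function classifyDelimiterRun(text, start, length) {
  // The start and end of the text count as whitespace for this purpose.
  const before = start > 0 ? text[start - 1] : ' ';
  const after = start + length < text.length ? text[start + length] : ' ';

  const wsBefore = WHITESPACE.test(before);
  const wsAfter = WHITESPACE.test(after);
  const punctBefore = PUNCTUATION.test(before);
  const punctAfter = PUNCTUATION.test(after);

  // Left-flanking: not followed by whitespace, and either not followed by
  // punctuation, or preceded by whitespace or punctuation.
  const left = !wsAfter && (!punctAfter || wsBefore || punctBefore);
  // Right-flanking: the mirror image of the rule above.
  const right = !wsBefore && (!punctBefore || wsAfter || punctAfter);

  return { left, right };
}
```

For instance, the `*` at index 1 of `a*"foo"*` is followed by punctuation and preceded by a letter, so it is not left-flanking and cannot open emphasis.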

Contributor

Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
no tests required for this PR.
If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

Draft GitHub release notes have been updated.
CI is green (no forced merge required).
Merge PR

vercel · 2020-05-21T17:07:36Z

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/markedjs/markedjs/qv2lelel1
✅ Preview: https://markedjs-git-fork-calculuschild-emphasisfixes.markedjs.vercel.app

UziTech · 2020-05-22T17:54:41Z

I don't have a lot of time to look at this right now, but it could be a conflict with lists, since * can be the start of a list as well as em and strong.
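
To illustrate the ambiguity, here is a rough sketch of a check that separates a bullet list marker from an emphasis delimiter at the start of a line. `looksLikeBulletListStart` is a hypothetical helper, not something in marked, and it deliberately ignores thematic breaks such as `* * *`.

```javascript
// Sketch only: hypothetical helper, not part of marked's API.
// A bullet list marker is -, + or * after at most three spaces of
// indentation, followed by a space, a tab, or the end of the line.
function looksLikeBulletListStart(line) {
  return /^ {0,3}[*+-](?:[ \t]|$)/.test(line);
}
```

Under this check `* item` starts a list, while `*em* text` and `**strong**` do not, because the `*` is followed directly by a non-space character.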

calculuschild · 2020-05-29T20:43:05Z

OK, so I figured out what the issue was. I now have it working quite well except for two test cases I could use some help on:

1. Original strong_and_em_together
   ~~This test passes fine normally, but when run with the 'pedantic' tag it breaks. I'm not totally sure what the pedantic tag is actually doing so I would like some feedback.~~ Edit: Fixed!
2. CommonMark Links example 519
   I have a pattern in there to skip any * inside \[ *stuff* \] to avoid conflict with links. This last example breaks on things between brackets that aren't actually links. I'm not sure how to handle this because it could be a reflink in certain cases but I can't tell without looking behind?

Note, I removed a line from the New em_2char test that was invalid according to the spec. I don't handle it properly here either, but my code isn't "breaking" it because it wasn't correct to begin with.

Also note, I haven't applied these changes to the _ version of em tags, but I expect several more tests will begin passing with that as well once these two issues are worked out.

UziTech · 2020-05-29T21:10:22Z

The pedantic option uses the pedantic rules to tokenize the markdown, basically making marked follow the original spec instead of CommonMark.

calculuschild · 2020-05-29T22:00:22Z

OK gotcha. My change to the em tokenizer confused it. I can fix this pretty easily.
Edit: This has been fixed. Just this last bit I'm struggling on:

Any insight on detecting whether a set of square brackets is a reflink?

calculuschild · 2020-06-01T04:01:55Z

Can anyone help me out with detecting when square brackets are part of a reflink or not for commonmark Example 519?

@UziTech Would you have any suggestion?

UziTech · 2020-06-10T15:03:17Z

> Can anyone help me out with detecting when square brackets are part of a reflink or not for commonmark Example 519?

You could try a lookahead or you might need to do some parsing in the em tokenizer.

calculuschild · 2020-06-12T19:47:30Z

Tada!! My solution was to inject known reflink labels into the em rules right after the block sequence in the lexer is finished. This way the em rule can properly skip over any links that might contain *, but still allow emphasis inside of [...] if it is not a link.

I'm hoping this tweaking of the lexer is alright. I'm not sure if there's some way people could inject malicious regex this way by giving their link labels some weird names, but I assume it would just require some further character escaping on the label names before injection.
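
A rough sketch of the idea, including the escaping step mentioned above. The names `escapeRegExp`, `buildReflinkPattern`, and `maskReflinks` are hypothetical and the real lexer code differs in detail. The sketch also matches labels case-sensitively, which is a simplification.

```javascript
// Sketch only: hypothetical helpers, not marked's actual lexer code.

// Escape regex metacharacters so a label is matched literally.
function escapeRegExp(label) {
  return label.replace(/[\\^$.*+?()[\]{}|]/g, '\\$&');
}

// Build one pattern that matches any known reflink label in brackets.
function buildReflinkPattern(labels) {
  if (labels.length === 0) return null;
  const alternation = labels.map(escapeRegExp).join('|');
  return new RegExp('\\[(?:' + alternation + ')\\]', 'g');
}

// Replace the inside of each known reflink with filler of the same length,
// so a later emphasis scan does not see any * or _ inside the link.
function maskReflinks(src, labels) {
  const pattern = buildReflinkPattern(labels);
  if (!pattern) return src;
  return src.replace(pattern, (match) => '[' + 'a'.repeat(match.length - 2) + ']');
}
```

With `['foo*']` as the only known label, `[foo*]` is masked while `[bar*]` is left alone, so emphasis inside brackets that are not links still works. Because the labels are escaped first, a label such as `a.b` cannot act as a wildcard.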

If this looks good, I can move forward with the _ variant of the em tags as well as **strong** which uses a lot of the same logic.

calculuschild · 2020-06-12T20:32:24Z

Underscore em rules added. Fixes 3 more examples (371, 372, 406)

I would love some feedback on this PR!

UziTech

This is great work!

calculuschild · 2020-07-12T23:09:54Z

What do you guys think? Is this baby ready to go?

davisjam

LGTM for ReDoS

UziTech

Looks great. Thanks for all your hard work. 💯

brainchild0 · 2020-07-14T02:14:39Z

So 424 and 425 are considered out of scope of this change set?

Or more simply:

$ git show -s
commit 6b729ed8cdb98ea75d4031f6218a1f58b9f02d8a (HEAD, EmphasisFixes)
Merge: e27e6f9 ad720c1
Author: Trevor Buckner <calculuschild@gmail.com>
Date:   Thu Jul 9 19:37:22 2020 -0400

    Merge branch 'EmphasisFixes' of https://github.com/calculuschild/marked into EmphasisFixes
$ 
$ echo '**a **b** c**' |  bin/marked 
<p><strong>a **b</strong> c**</p>

(Should be <p><strong>a <strong>b</strong> c</strong></p>.)

UziTech · 2020-07-14T02:44:47Z

@brainchild0 if you want to fix it I would be happy to review a PR 😁

brainchild0 · 2020-07-14T09:16:24Z

For me the starting point is understanding what makes this particular case more challenging than the ones that have been resolved in this merge. Intuitively they all seem roughly equally complex. Obviously, I make the remark abstractly, having no familiarity with the design.

Or another way to ask the question, starting from the top and moving down: The general rule is to collect a stack of emphasis delimiters, which may be any of *, _, **, or __. As long as this rule may be somehow fully implemented, along with also dropping those same items from the stack, because of close (right flanking) delimiters, then all cases would be supported. What is missing, currently, that leaves out support for the remaining cases?

calculuschild · 2020-07-14T17:20:36Z

> What is missing, currently, that leaves out support for the remaining cases?

The current implementation does not use a stack at all. It simply checks for the existence of a left delimiter and, if one is found, finds the first available matching right delimiter. Finally, it ensures the text between the two is valid, meaning it ignores any delimiters found inside links, code spans, etc., and requires any other delimiters inside to occur in even pairs. If the middle is not valid, it gets the next possible end delimiter and checks the middle again, until it runs out of matching delimiters or finds a valid middle.
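
As a heavily simplified sketch of that approach (not the actual tokenizer), the function below handles only single `*` delimiters and skips the link and code span handling entirely. `findEmphasis` is a hypothetical name, and the whitespace checks stand in for the full flanking rules.

```javascript
// Sketch only: hypothetical helper, far simpler than marked's em tokenizer.
// Tries to match *...* at the start of src without using a stack.
function findEmphasis(src) {
  // Simplified left delimiter check: a * not followed by whitespace.
  if (src[0] !== '*' || src.length < 2 || /\s/.test(src[1])) return null;

  let end = src.indexOf('*', 1);
  while (end !== -1) {
    // Simplified right delimiter check: a * not preceded by whitespace.
    if (!/\s/.test(src[end - 1])) {
      const middle = src.slice(1, end);
      const innerCount = (middle.match(/\*/g) || []).length;
      // The middle is valid when it is non-empty and any delimiters
      // inside it occur in even pairs.
      if (middle.length > 0 && innerCount % 2 === 0) {
        return { raw: src.slice(0, end + 1), text: middle };
      }
    }
    // Otherwise move on to the next candidate end delimiter.
    end = src.indexOf('*', end + 1);
  }
  return null;
}
```

For `*a *b* c* tail`, the first two candidate end delimiters are rejected (one is preceded by a space, the other leaves an odd number of `*` in the middle), and the third is accepted.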

We already have regex for the left and right delimiters, so it would just be the extra effort of building up a stack in the tokenizer.

fredck · 2020-09-03T08:53:12Z

There is a good chance that this PR introduced #1754.

Follow GFM spec on Left-flanking-delimiter-runs

40493bb

calculuschild marked this pull request as draft May 22, 2020 13:45

Now passes several more tests

4e2ec90

Added a check for the previous character to the *em* Tokenizer. Needed to pass any tests where the em block starts with a punctuation character (e.g. commonmark example 368)

calculuschild marked this pull request as ready for review May 29, 2020 20:28

Deleted an extra line while removing comments

283ab9c

calculuschild changed the title ~~Follow GFM spec on Left-flanking-delimiter-runs~~ Follow GFM spec on EM tags May 29, 2020

Fix Pedantic

c38ee23

calculuschild mentioned this pull request Jun 10, 2020

Two different emphasis are recognized as a single one in a paragraph node #1676

Closed

Properly handle reflinks that should be escaped

7c6551e

Modifies the em rule after the block tokens are generated to detect known reflinks and skip over them so they don't get mistakenly italicized.

Lint

bc17ded

Lint 2

ea203cf

Updated rules for underscore em

556070b

Now fixes three more cases

UziTech requested changes Jun 12, 2020

test/specs/commonmark/commonmark.0.29.json (outdated)

src/Lexer.js (outdated)

src/Lexer.js (outdated)

calculuschild added 2 commits July 9, 2020 19:35

Sorted strong and em into sub-objects

e27e6f9

Merge branch 'EmphasisFixes' of https://github.com/calculuschild/marked into EmphasisFixes

6b729ed

UziTech mentioned this pull request Jul 11, 2020

Markdown `*emphasis* inside a paragraph emits incorrect HTML unless it's on the first or last line #1727

Closed

davisjam approved these changes Jul 13, 2020

UziTech approved these changes Jul 13, 2020

UziTech merged commit dddf9ae into markedjs:master Jul 13, 2020

This was referenced Jul 13, 2020

Single emphasis on comments is not consistent with double emphasis #1679

Closed

Broken after #1666

Closed

UziTech mentioned this pull request Jul 13, 2020

Release v1.1.1 #1731

Merged

12 tasks

stevenjoezhang added a commit to stevenjoezhang/mmp-build that referenced this pull request Sep 2, 2020

Fix marked@1.1.1

6f8f52e

* See: markedjs/marked#1686

fredck mentioned this pull request Sep 4, 2020

Touching italic/bold doesn't render right (v1.1.1 regression) #1754

Closed

UziTech mentioned this pull request Sep 8, 2020

fix underscore adjacent to asterisk #1755

Merged

4 tasks

UziTech mentioned this pull request Oct 22, 2020

Fix backtick for code in links #1794

Closed

5 tasks

UziTech mentioned this pull request Nov 3, 2020

*"Yo"* not outputted correctly #1358

Closed

UziTech mentioned this pull request Nov 17, 2020

fix: em and strong starting with special char #1832

Merged

3 tasks

UziTech mentioned this pull request Dec 4, 2020

em and strong (***〜***) #1860

Closed

calculuschild mentioned this pull request Dec 11, 2020

Add cmd+i and cmd+b for Mac users naturalcrit/homebrewery#1148

Closed

CMaheshBL mentioned this pull request May 6, 2022

Cx816df59e-1cc9 @ Npm-marked-0.3.9 CMaheshBL/NodeGoat#95

Open

cxronen mentioned this pull request May 20, 2022

Cx816df59e-1cc9 @ Npm-marked-0.3.9 cxronen/AST_BookStore#223

Open

cxronen mentioned this pull request Dec 20, 2022

Cx816df59e-1cc9 @ Npm-marked-0.3.9 cxronen/BookStore#310

Open

cxronen mentioned this pull request Feb 17, 2023

Cx816df59e-1cc9 @ Npm-marked-0.3.9 cxronen/BookStore#514

Open

RobertMickleCx mentioned this pull request Mar 7, 2023

Cx816df59e-1cc9 @ Npm-marked-0.3.9 RobertMickleCx/NodeGoat#166

Open

cxronen mentioned this pull request Mar 2, 2023

Cx816df59e-1cc9 @ Npm-marked-0.3.9 cxronen/AST_BookStore#455

Open
