
fix: Total rework of Emphasis/Strong #1864

Merged
merged 15 commits on Feb 7, 2021
Changes from 5 commits
2 changes: 2 additions & 0 deletions lib/marked.esm.js
@@ -824,6 +824,8 @@ var Tokenizer_1 = class Tokenizer {
 strong(src, maskedSrc, prevChar = '') {
   let match = this.rules.inline.strong.start.exec(src);

+  console.log(match);
+
   if (match && (!match[1] || (match[1] && (prevChar === '' || this.rules.inline.punctuation.exec(prevChar))))) {
     maskedSrc = maskedSrc.slice(-1 * src.length);
     const endReg = match[0] === '**' ? this.rules.inline.strong.endAst : this.rules.inline.strong.endUnd;

calculuschild marked this conversation as resolved.
1 change: 1 addition & 0 deletions lib/marked.js
@@ -926,6 +926,7 @@
 }

 var match = this.rules.inline.strong.start.exec(src);
+console.log(match);

 if (match && (!match[1] || match[1] && (prevChar === '' || this.rules.inline.punctuation.exec(prevChar)))) {
   maskedSrc = maskedSrc.slice(-1 * src.length);

calculuschild marked this conversation as resolved.
2 changes: 1 addition & 1 deletion marked.min.js

Large diffs are not rendered by default.

18 changes: 8 additions & 10 deletions src/Lexer.js
@@ -352,11 +352,17 @@ module.exports = class Lexer {
   maskedSrc = maskedSrc.slice(0, match.index) + '[' + repeatString('a', match[0].length - 2) + ']' + maskedSrc.slice(this.tokenizer.rules.inline.blockSkip.lastIndex);
 }

+// Mask out escaped em & strong delimiters
+while ((match = this.tokenizer.rules.inline.escapedEmSt.exec(maskedSrc)) != null) {
+  maskedSrc = maskedSrc.slice(0, match.index) + '++' + maskedSrc.slice(this.tokenizer.rules.inline.escapedEmSt.lastIndex);
+}
+
 while (src) {
   if (!keepPrevChar) {
     prevChar = '';
   }
   keepPrevChar = false;

   // escape
   if (token = this.tokenizer.escape(src)) {
     src = src.substring(token.raw.length);
@@ -393,16 +399,8 @@ module.exports = class Lexer {
     continue;
   }

-  // strong
-  if (token = this.tokenizer.strong(src, maskedSrc, prevChar)) {
-    src = src.substring(token.raw.length);
-    token.tokens = this.inlineTokens(token.text, [], inLink, inRawBlock);
-    tokens.push(token);
-    continue;
-  }
-
-  // em
-  if (token = this.tokenizer.em(src, maskedSrc, prevChar)) {
+  // em & strong
+  if (token = this.tokenizer.emStrong(src, maskedSrc, prevChar)) {
Member:

This would definitely be a breaking change since the tokenizers are part of the public API. Can we do this without combining them? Can we just switch the order of em and strong so that <strong><em>a</em></strong> becomes <em><strong>a</strong></em>?

Contributor Author:

Unfortunately, they kind of need to be tackled together to get the right sequence of <em><strong> to work, which isn't just a stylistic thing. Even though it renders the same as <strong><em>, the processing to get to that point also clears up several other bugs, especially regarding uneven **text***** delimiters on both sides.

Edit for clarification: processing em/strong in this way allows following more of the CommonMark specs in a "natural" way that I think will be much easier to maintain (instead of a monstrous, fiddly regex). However, this also means you don't really know if the output is going to be an em or a strong until the very end of the process (see the very end of the Tokenizer).

This might be worth putting into a v2.
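The point about not knowing whether the output is em or strong until the very end can be sketched with a toy helper (hypothetical, not marked's API): the token type falls out of the left and right delimiter run lengths, the same way the `Math.min(lLength, rLength) % 2` check in the reworked Tokenizer (below) decides it.

```javascript
// Hypothetical sketch: given the lengths of the matched left and right
// delimiter runs, an odd overlap opens <em>, an even overlap opens <strong>.
function emOrStrong(lLength, rLength) {
  return Math.min(lLength, rLength) % 2 ? 'em' : 'strong';
}

emOrStrong(1, 1); // '*a*'     -> 'em'
emOrStrong(2, 2); // '**a**'   -> 'strong'
emOrStrong(3, 3); // '***a***' -> 'em' (wrapping a nested strong)
```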

Member:

After researching quite a few dependants, I think it should be fine to combine them, since most dependants will change the renderer instead of the tokenizer. This will have to be a major bump to v2, though. I do want to get a few other breaking changes together before releasing v2, so it might be a while before I get to fully reviewing this PR.

Contributor Author:

That sounds appropriate. I need to review the other PRs you have waiting as well; those should go out before this anyway...

There are some other changes I've seen in the issues list that I'd like to lump into a v2 bump as well.

src = src.substring(token.raw.length);
token.tokens = this.inlineTokens(token.text, [], inLink, inRawBlock);
tokens.push(token);
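The masking loop added in the Lexer diff above can be exercised standalone. This is a sketch under the assumption that replacing each two-character escape with '++' keeps the masked copy the same length, so offsets still line up with `src`:

```javascript
// Sketch of the new escaped-delimiter masking (escapedEmSt = /\\\*|\\_/g):
// '\*' and '\_' become '++' in the masked copy, so the end-delimiter regexes
// can't mistake them for emphasis delimiters, while `src` keeps the original
// text for slicing token.raw and token.text.
const escapedEmSt = /\\\*|\\_/g;

function maskEscapedDelims(maskedSrc) {
  let match;
  while ((match = escapedEmSt.exec(maskedSrc)) != null) {
    maskedSrc = maskedSrc.slice(0, match.index) + '++' + maskedSrc.slice(escapedEmSt.lastIndex);
  }
  return maskedSrc;
}

maskEscapedDelims('a \\*not em\\* b'); // 'a ++not em++ b'
```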
65 changes: 36 additions & 29 deletions src/Tokenizer.js
@@ -507,46 +507,53 @@ module.exports = class Tokenizer {
}
}

-strong(src, maskedSrc, prevChar = '') {
-  let match = this.rules.inline.strong.start.exec(src);
+emStrong(src, maskedSrc, prevChar = '') {
+  let match = this.rules.inline.emStrong.lDelim.exec(src);
+  if (!match) return;

-  if (match && (!match[1] || (match[1] && (prevChar === '' || this.rules.inline.punctuation.exec(prevChar))))) {
-    maskedSrc = maskedSrc.slice(-1 * src.length);
-    const endReg = match[0] === '**' ? this.rules.inline.strong.endAst : this.rules.inline.strong.endUnd;
+  if (match[3] && prevChar.match(/\w/)) return; // _ can't be between two \w
+
+  const nextChar = match[1] || match[2] || '';
+
+  if (!nextChar || (nextChar && (prevChar === '' || this.rules.inline.punctuation.exec(prevChar)))) {
+    const lLength = match[0].length - 1;
+    let rDelim, rLength, delimTotal = lLength;
+
+    const endReg = match[0][0] === '*' ? this.rules.inline.emStrong.rDelimAst : this.rules.inline.emStrong.rDelimUnd;
     endReg.lastIndex = 0;

-    let cap;
+    maskedSrc = maskedSrc.slice(-1 * src.length + lLength); // Bump maskedSrc to same section of string as src (move to lexer?)
+
     while ((match = endReg.exec(maskedSrc)) != null) {
-      cap = this.rules.inline.strong.middle.exec(maskedSrc.slice(0, match.index + 3));
-      if (cap) {
-        return {
-          type: 'strong',
-          raw: src.slice(0, cap[0].length),
-          text: src.slice(2, cap[0].length - 2)
-        };
-      }
-    }
-  }
-}
+      rDelim = match[1] || match[2] || match[3] || match[4] || match[5] || match[6];

-em(src, maskedSrc, prevChar = '') {
-  let match = this.rules.inline.em.start.exec(src);
+      if (!rDelim) continue; // matched the first alternative in rules.js (skip the * in __abc*abc__)

-  if (match && (!match[1] || (match[1] && (prevChar === '' || this.rules.inline.punctuation.exec(prevChar))))) {
-    maskedSrc = maskedSrc.slice(-1 * src.length);
-    const endReg = match[0] === '*' ? this.rules.inline.em.endAst : this.rules.inline.em.endUnd;
+      rLength = rDelim.length;

-    endReg.lastIndex = 0;
+      if (match[3] || match[4]) { // found another left Delim
+        delimTotal += rLength;
+        continue;
+      } else if (match[5] || match[6]) {
+        if (lLength % 3 && !((lLength + rLength) % 3)) continue; // CommonMark Emphasis Rules 9-10
+      }

-    let cap;
-    while ((match = endReg.exec(maskedSrc)) != null) {
-      cap = this.rules.inline.em.middle.exec(maskedSrc.slice(0, match.index + 2));
-      if (cap) {
+      delimTotal -= rLength;
+
+      if (delimTotal > 0) continue; // Haven't found enough closing delimiters
+
+      if (Math.min(lLength, rLength) % 2) {
         return {
           type: 'em',
-          raw: src.slice(0, cap[0].length),
-          text: src.slice(1, cap[0].length - 1)
+          raw: src.slice(0, lLength + match.index + rLength + 1),
+          text: src.slice(1, lLength + match.index + rLength)
         };
       }
+      if (Math.min(lLength, rLength) % 2 === 0) {
+        return {
+          type: 'strong',
+          raw: src.slice(0, lLength + match.index + rLength + 1),
+          text: src.slice(2, lLength + match.index + rLength - 1)
+        };
+      }
     }
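The `lLength % 3 && !((lLength + rLength) % 3)` guard in the diff above is the new code's take on CommonMark emphasis rules 9-10 (the "multiple of 3" rule). A minimal sketch of just that predicate, written with explicit comparisons instead of truthiness:

```javascript
// Sketch: a closing delimiter run is rejected when the opening run is not a
// multiple of 3 but the two runs together sum to one (CommonMark rules 9-10).
function rejectedByRuleOfThree(lLength, rLength) {
  return lLength % 3 !== 0 && (lLength + rLength) % 3 === 0;
}

rejectedByRuleOfThree(1, 2); // true  -> keep scanning for another closer
rejectedByRuleOfThree(3, 3); // false -> both runs are multiples of 3
```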
67 changes: 17 additions & 50 deletions src/rules.js
@@ -173,74 +173,41 @@ const inline = {
reflink: /^!?\[(label)\]\[(?!\s*\])((?:\\[\[\]]?|[^\[\]\\])+)\]/,
nolink: /^!?\[(?!\s*\])((?:\[[^\[\]]*\]|\\[\[\]]|[^\[\]])*)\](?:\[\])?/,
reflinkSearch: 'reflink|nolink(?!\\()',
-strong: {
-  start: /^(?:(\*\*(?=[*punctuation]))|\*\*)(?![\s])|__/, // (1) returns if starts w/ punctuation
-  middle: /^\*\*(?:(?:(?!overlapSkip)(?:[^*]|\\\*)|overlapSkip)|\*(?:(?!overlapSkip)(?:[^*]|\\\*)|overlapSkip)*?\*)+?\*\*$|^__(?![\s])((?:(?:(?!overlapSkip)(?:[^_]|\\_)|overlapSkip)|_(?:(?!overlapSkip)(?:[^_]|\\_)|overlapSkip)*?_)+?)__$/,
-  endAst: /[^punctuation\s]\*\*(?!\*)|[punctuation]\*\*(?!\*)(?:(?=[punctuation_\s]|$))/, // last char can't be punct, or final * must also be followed by punct (or endline)
-  endUnd: /[^\s]__(?!_)(?:(?=[punctuation*\s])|$)/ // last char can't be a space, and final _ must precede punct or \s (or endline)
-},
-em: {
-  start: /^(?:(\*(?=[punctuation]))|\*)(?![*\s])|_/, // (1) returns if starts w/ punctuation
-  middle: /^\*(?:(?:(?!overlapSkip)(?:[^*]|\\\*)|overlapSkip)|\*(?:(?!overlapSkip)(?:[^*]|\\\*)|overlapSkip)*?\*)+?\*$|^_(?![_\s])(?:(?:(?!overlapSkip)(?:[^_]|\\_)|overlapSkip)|_(?:(?!overlapSkip)(?:[^_]|\\_)|overlapSkip)*?_)+?_$/,
-  endAst: /[^punctuation\s]\*(?!\*)|[punctuation]\*(?!\*)(?:(?=[punctuation_\s]|$))/, // last char can't be punct, or final * must also be followed by punct (or endline)
-  endUnd: /[^\s]_(?!_)(?:(?=[punctuation*\s])|$)/ // last char can't be a space, and final _ must precede punct or \s (or endline)
-},
+emStrong: {
+  lDelim: /^(?:\*+(?:([punct_])|[^\s*]))|^(?:_+(?:([punct*])|([^\s_])))/,
+  // (1) and (2) can only be a Right Delimiter. (3) and (4) can only be Left. (5) and (6) can be either Left or Right.
+  // () Skip other delimiter (1) #*** (2) a***#, a*** (3) #***a, ***a (4) ***# (5) #***# (6) a***a
+  rDelimAst: /\_\_[^_]*?\*[^_]*?\_\_|[punct_](\*+)(?=[\s]|$)|[^punct*_\s](\*+)(?=[punct_\s]|$)|[punct_\s](\*+)(?=[^punct*_\s])|[\s](\*+)(?=[punct_])|[punct_](\*+)(?=[punct_])|[^punct*_\s](\*+)(?=[^punct*_\s])/,
+  rDelimUnd: /\*\*[^*]*?\_[^*]*?\*\*|[punct*](\_+)(?=[\s]|$)|[^punct*_\s](\_+)(?=[punct*\s]|$)|[punct*\s](\_+)(?=[^punct*_\s])|[\s](\_+)(?=[punct*])|[punct*](\_+)(?=[punct*])/ // ^ Not allowed for _
+},
 code: /^(`+)([^`]|[^`][\s\S]*?[^`])\1(?!`)/,
 br: /^( {2,}|\\)\n(?!\s*$)/,
 del: noopTest,
 text: /^(`+|[^`])(?:(?= {2,}\n)|[\s\S]*?(?:(?=[\\<!\[`*]|\b_|$)|[^ ](?= {2,}\n)))/,
-punctuation: /^([\s*punctuation])/
+punctuation: /^([\spunctuation])/
};

-// list of punctuation marks from common mark spec
-// without * and _ to workaround cases with double emphasis
+// list of punctuation marks from CommonMark spec
+// without * and _ to handle the different emphasis markers * and _
inline._punctuation = '!"#$%&\'()+\\-.,/:;<=>?@\\[\\]`^{|}~';
inline.punctuation = edit(inline.punctuation).replace(/punctuation/g, inline._punctuation).getRegex();

 // sequences em should skip over [title](link), `code`, <html>
-inline._blockSkip = '\\[[^\\]]*?\\]\\([^\\)]*?\\)|`[^`]*?`|<[^>]*?>';
-inline._overlapSkip = '__[^_]*?__|\\*\\*\\[^\\*\\]*?\\*\\*';
+inline.blockSkip = /\[[^\]]*?\]\([^\)]*?\)|`[^`]*?`|<[^>]*?>/g;
+inline.escapedEmSt = /\\\*|\\_/g;

inline._comment = edit(block._comment).replace('(?:-->|$)', '-->').getRegex();

-inline.em.start = edit(inline.em.start)
-  .replace(/punctuation/g, inline._punctuation)
-  .getRegex();
-
-inline.em.middle = edit(inline.em.middle)
-  .replace(/punctuation/g, inline._punctuation)
-  .replace(/overlapSkip/g, inline._overlapSkip)
-  .getRegex();
-
-inline.em.endAst = edit(inline.em.endAst, 'g')
-  .replace(/punctuation/g, inline._punctuation)
-  .getRegex();
-
-inline.em.endUnd = edit(inline.em.endUnd, 'g')
-  .replace(/punctuation/g, inline._punctuation)
-  .getRegex();
-
-inline.strong.start = edit(inline.strong.start)
-  .replace(/punctuation/g, inline._punctuation)
-  .getRegex();
-
-inline.strong.middle = edit(inline.strong.middle)
-  .replace(/punctuation/g, inline._punctuation)
-  .replace(/overlapSkip/g, inline._overlapSkip)
-  .getRegex();
-
-inline.strong.endAst = edit(inline.strong.endAst, 'g')
-  .replace(/punctuation/g, inline._punctuation)
-  .getRegex();
-
-inline.strong.endUnd = edit(inline.strong.endUnd, 'g')
-  .replace(/punctuation/g, inline._punctuation)
+inline.emStrong.lDelim = edit(inline.emStrong.lDelim)
+  .replace(/punct/g, inline._punctuation)
   .getRegex();

-inline.blockSkip = edit(inline._blockSkip, 'g')
+inline.emStrong.rDelimAst = edit(inline.emStrong.rDelimAst, 'g')
+  .replace(/punct/g, inline._punctuation)
   .getRegex();

-inline.overlapSkip = edit(inline._overlapSkip, 'g')
+inline.emStrong.rDelimUnd = edit(inline.emStrong.rDelimUnd, 'g')
+  .replace(/punct/g, inline._punctuation)
   .getRegex();

inline._escapes = /\\([!"#$%&'()*+,\-./:;<=>?@\[\]\\^_`{|}~])/g;
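To see what lDelim matches once the `punct` placeholder is filled in, here is a hand-expanded sketch; the real code builds this with the `edit().replace()` helper, and `punct` below copies `inline._punctuation` from the diff above:

```javascript
// punct = CommonMark punctuation minus * and _, per inline._punctuation above.
const punct = '!"#$%&\'()+\\-.,/:;<=>?@\\[\\]`^{|}~';
const lDelim = new RegExp(
  '^(?:\\*+(?:([' + punct + '_])|[^\\s*]))|^(?:_+(?:([' + punct + '*])|([^\\s_])))'
);

lDelim.exec('**foo')[0]; // '**f' -> a '*' run followed by a word char
lDelim.exec('_foo')[3];  // 'f'   -> group 3: a '_' run that can only open
lDelim.exec('* foo');    // null  -> whitespace after the run, not a delimiter
```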
1 change: 1 addition & 0 deletions test/specs/bug/adjacent_lists.html
@@ -0,0 +1 @@
+<p><em>foo <strong>bar *baz bim</strong> bam</em></p>
1 change: 1 addition & 0 deletions test/specs/bug/adjacent_lists.md
@@ -0,0 +1 @@
+*foo __bar *baz bim__ bam*
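This fixture exercises the first alternative of rDelimAst (`\_\_[^_]*?\*[^_]*?\_\_` in the rules.js diff above): the stray * inside __bar *baz bim__ is swallowed as part of the __...__ span instead of being read as an emphasis delimiter. A standalone sketch of just that alternative:

```javascript
// Sketch: the "skip" alternative of rDelimAst on its own. When it matches,
// the tokenizer's `if (!rDelim) continue;` branch ignores the inner '*'.
const skipUndSpan = /__[^_]*?\*[^_]*?__/;

skipUndSpan.test('__bar *baz bim__'); // true  -> the inner '*' is skipped
skipUndSpan.test('__bar baz bim__');  // false -> no stray '*' to skip
```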
30 changes: 10 additions & 20 deletions test/specs/commonmark/commonmark.0.29.json
@@ -3135,8 +3135,7 @@
"example": 388,
"start_line": 6648,
"end_line": 6652,
-"section": "Emphasis and strong emphasis",
-"shouldFail": true
+"section": "Emphasis and strong emphasis"
},
{
"markdown": "foo-__(bar)__\n",
@@ -3288,8 +3287,7 @@
"example": 407,
"start_line": 6831,
"end_line": 6835,
-"section": "Emphasis and strong emphasis",
-"shouldFail": true
+"section": "Emphasis and strong emphasis"
},
{
"markdown": "*foo *bar**\n",
@@ -3329,8 +3327,7 @@
"example": 412,
"start_line": 6888,
"end_line": 6892,
-"section": "Emphasis and strong emphasis",
-"shouldFail": true
+"section": "Emphasis and strong emphasis"
},
{
"markdown": "*foo **bar***\n",
@@ -3354,8 +3351,7 @@
"example": 415,
"start_line": 6913,
"end_line": 6917,
-"section": "Emphasis and strong emphasis",
-"shouldFail": true
+"section": "Emphasis and strong emphasis"
},
{
"markdown": "foo******bar*********baz\n",
@@ -3428,17 +3424,15 @@
"example": 424,
"start_line": 6990,
"end_line": 6994,
-"section": "Emphasis and strong emphasis",
-"shouldFail": true
+"section": "Emphasis and strong emphasis"
},
{
"markdown": "____foo__ bar__\n",
"html": "<p><strong><strong>foo</strong> bar</strong></p>\n",
"example": 425,
"start_line": 6997,
"end_line": 7001,
-"section": "Emphasis and strong emphasis",
-"shouldFail": true
+"section": "Emphasis and strong emphasis"
},
{
"markdown": "**foo **bar****\n",
@@ -3767,26 +3761,23 @@
"example": 465,
"start_line": 7307,
"end_line": 7311,
-"section": "Emphasis and strong emphasis",
-"shouldFail": true
+"section": "Emphasis and strong emphasis"
},
{
"markdown": "***foo***\n",
"html": "<p><em><strong>foo</strong></em></p>\n",
"example": 466,
"start_line": 7316,
"end_line": 7320,
-"section": "Emphasis and strong emphasis",
-"shouldFail": true
+"section": "Emphasis and strong emphasis"
},
{
"markdown": "_____foo_____\n",
"html": "<p><em><strong><strong>foo</strong></strong></em></p>\n",
"example": 467,
"start_line": 7323,
"end_line": 7327,
-"section": "Emphasis and strong emphasis",
-"shouldFail": true
+"section": "Emphasis and strong emphasis"
},
{
"markdown": "*foo _bar* baz_\n",
@@ -3810,8 +3801,7 @@
"example": 470,
"start_line": 7348,
"end_line": 7352,
-"section": "Emphasis and strong emphasis",
-"shouldFail": true
+"section": "Emphasis and strong emphasis"
},
{
"markdown": "*foo *bar baz*\n",