Various <em> issues #1546

oliversturm · 2019-09-13T11:55:15Z

Describe the bug
Formatting of blocks like *em text* and _em text_ breaks in various different ways

To Reproduce
Steps to reproduce the behavior:

The difference between the Knockout front-end and the basic jQuery one is the widget instantiation mechanism. With jQuery, I had *div*s with *id*s in the HTML and I used jQuery instructions to instantiate DevExtreme widgets for these *div*s. With Knockout, I use *data-bind* attributes instead

The result is this:

<p>The difference between the Knockout front-end and the basic jQuery one is the widget instantiation mechanism. With jQuery, I had <em>div<em>s with *id</em>s in the HTML and I used jQuery instructions to instantiate DevExtreme widgets for these <em>div</em>s. With Knockout, I use *data-bind</em> attributes instead </p>

Clearly, the <em> tags are mangled (two of them surrounding the first div) or missing.

Commonmark renders this block correctly.

Interesting note: Marked renders the block correctly if you remove the last part (involving *data-bind*).

Now try this text - it's the same as before, just using _ instead of *: (Marked Demo link)

The difference between the Knockout front-end and the basic jQuery one is the widget instantiation mechanism. With jQuery, I had _div_s with _id_s in the HTML and I used jQuery instructions to instantiate DevExtreme widgets for these _div_s. With Knockout, I use _data-bind_ attributes instead

The result is this:

<p>The difference between the Knockout front-end and the basic jQuery one is the widget instantiation mechanism. With jQuery, I had <em>div_s with _id_s in the HTML and I used jQuery instructions to instantiate DevExtreme widgets for these _div_s. With Knockout, I use _data-bind</em> attributes instead </p>

The result is interesting because is is broken in a different way - everything from the first _ to the last one is wrapped in one tag. This reminds me of my previous issue #1390 (but it's not a regression as such - my test cases for that issue still work correctly today.)

With this scenario, Commonmark doesn't render as expected either:

<p>The difference between the Knockout front-end and the basic jQuery one is the widget instantiation mechanism. With jQuery, I had _div_s with _id_s in the HTML and I used jQuery instructions to instantiate DevExtreme widgets for these _div_s. With Knockout, I use <em>data-bind</em> attributes instead</p>

However, their result makes more sense if you want to argue that samples like _div_s are invalid because the ending _ is not followed by whitespace.

Now I'm not a markdown spec expert, but it is my impression that * and _ can be regarded as interchangeable and should render the same result. I may be wrong in this understanding.

Locally I'm using marked version 0.7.0. I assume that's what the Marked Demo also uses at this time.

The text was updated successfully, but these errors were encountered:

UziTech · 2019-09-13T12:47:52Z

it is my impression that * and _ can be regarded as interchangeable and should render the same result

Yes, but * can also act like - in an unordered list so they require different processing.

If you would like to create a PR working on this I would be happy to review it 😁

oliversturm · 2019-09-13T12:57:01Z

Yes, but * can also act like - in an unordered list so they require different processing.

That is a good point. However (without looking at any code) I would imagine that once the unordered list use case of a particular * has been eliminated during parsing, further handling should be the same. Since that use case only involves *s matching \s*\*, I don't think this should influence the samples I tested. But I'm sure I'm underestimating the complexity of the issue :)

Will have to see about time to work on that PR...

oliversturm · 2019-09-13T14:57:44Z

I've looked around a bit and the problem is clearly that the em regex assumes that *emphasis* is invalid when followed by a non-punctuation and non-special character like s (in my test case).

I tried to apply some simple fixes and I found that the test for example 373 explicitly expects this behavior.

The results I got are still peculiar in that

When using *, Commonmark supports my scenario, apparently in spite of the specs
When using _, marked differs from Commonmark, and the Commonmark result seems to make more sense
Both marked and Commonmark render different results for * and _.

There is a clear difference in the regex for the handling of _text_ and *text*.

For _, there are these three regex parts:

^_([^\s_])_(?!_)
^_([^\s<][\s\S]*?[^\s_])_(?!_|[^\spunctuation])
^_([^\s_<][\s\S]*?[^\s])_(?!_|[^\spunctuation])

For * there are also three parts:

^\*([^\s*<\[])\*(?!\*)
^\*([^\s<"][\s\S]*?[^\s\*])\*(?!\*|[^\spunctuation])
^\*([^\s*"<\[][\s\S]*?[^\s])\*(?!\*)

For some reason, these regexes seem to assume that the sets of characters allowed to follow starting and ending _ are quite different from those for *. It would be easy enough to change these so both delimiters behave the same way, but I wouldn't want to do so without understanding first why they were clearly kept separate by whoever wrote these regexes.

UziTech · 2019-09-13T16:18:53Z

You could look at the CommonMark spec to see the differences for * and _

UziTech · 2019-09-13T16:21:17Z

It seems like the difference in *em*s and _em_s is:

A single * character can open emphasis iff (if and only if) it is part of a left-flanking delimiter run.

A single _ character can open emphasis iff it is part of a left-flanking delimiter run and either (a) not part of a right-flanking delimiter run or (b) part of a right-flanking delimiter run preceded by punctuation.

oliversturm · 2019-09-13T21:19:22Z

Right... I see that Example 354 allows emphasis in situations like mine, for *. I'll check again later what changes I made - they may have been incorrectly applied to both * and _, which caused the example 373 test to fail. However, the specific case I was testing failed in spite of all existing tests passing, so there might still be more complicated reasons for that.

Scrum · 2020-04-13T14:05:15Z

@oliversturm @UziTech This is already fixed in the version v0.8.2

UziTech · 2020-04-13T14:52:18Z

The demo does look correct in master. The fix in #1636 will be in the next release.

UziTech added category: inline elements L1 - broken Valid usage causes incorrect output OR a crash AND there is no known workaround for the issue labels Sep 13, 2019

UziTech closed this as completed Apr 13, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Various <em> issues #1546

Various <em> issues #1546

oliversturm commented Sep 13, 2019 •

edited

UziTech commented Sep 13, 2019

oliversturm commented Sep 13, 2019

oliversturm commented Sep 13, 2019

UziTech commented Sep 13, 2019

UziTech commented Sep 13, 2019

oliversturm commented Sep 13, 2019

Scrum commented Apr 13, 2020

UziTech commented Apr 13, 2020

Various <em> issues #1546

Various <em> issues #1546

Comments

oliversturm commented Sep 13, 2019 • edited

UziTech commented Sep 13, 2019

oliversturm commented Sep 13, 2019

oliversturm commented Sep 13, 2019

UziTech commented Sep 13, 2019

UziTech commented Sep 13, 2019

oliversturm commented Sep 13, 2019

Scrum commented Apr 13, 2020

UziTech commented Apr 13, 2020

oliversturm commented Sep 13, 2019 •

edited