Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various <em> issues #1546

Closed
oliversturm opened this issue Sep 13, 2019 · 8 comments
Closed

Various <em> issues #1546

oliversturm opened this issue Sep 13, 2019 · 8 comments
Labels
category: inline elements L1 - broken Valid usage causes incorrect output OR a crash AND there is no known workaround for the issue

Comments

@oliversturm
Copy link

oliversturm commented Sep 13, 2019

Describe the bug
Formatting of blocks like *em text* and _em text_ breaks in various different ways

To Reproduce
Steps to reproduce the behavior:

Try this text (Marked Demo link):

The difference between the Knockout front-end and the basic jQuery one is the widget instantiation mechanism. With jQuery, I had *div*s with *id*s in the HTML and I used jQuery instructions to instantiate DevExtreme widgets for these *div*s. With Knockout, I use *data-bind* attributes instead 

The result is this:

<p>The difference between the Knockout front-end and the basic jQuery one is the widget instantiation mechanism. With jQuery, I had <em>div<em>s with *id</em>s in the HTML and I used jQuery instructions to instantiate DevExtreme widgets for these <em>div</em>s. With Knockout, I use *data-bind</em> attributes instead </p>

Clearly, the <em> tags are mangled (two of them surrounding the first div) or missing.

Commonmark renders this block correctly.

Interesting note: Marked renders the block correctly if you remove the last part (involving *data-bind*).

Now try this text - it's the same as before, just using _ instead of *: (Marked Demo link)

The difference between the Knockout front-end and the basic jQuery one is the widget instantiation mechanism. With jQuery, I had _div_s with _id_s in the HTML and I used jQuery instructions to instantiate DevExtreme widgets for these _div_s. With Knockout, I use _data-bind_ attributes instead 

The result is this:

<p>The difference between the Knockout front-end and the basic jQuery one is the widget instantiation mechanism. With jQuery, I had <em>div_s with _id_s in the HTML and I used jQuery instructions to instantiate DevExtreme widgets for these _div_s. With Knockout, I use _data-bind</em> attributes instead </p>

The result is interesting because is is broken in a different way - everything from the first _ to the last one is wrapped in one tag. This reminds me of my previous issue #1390 (but it's not a regression as such - my test cases for that issue still work correctly today.)

With this scenario, Commonmark doesn't render as expected either:

<p>The difference between the Knockout front-end and the basic jQuery one is the widget instantiation mechanism. With jQuery, I had _div_s with _id_s in the HTML and I used jQuery instructions to instantiate DevExtreme widgets for these _div_s. With Knockout, I use <em>data-bind</em> attributes instead</p>

However, their result makes more sense if you want to argue that samples like _div_s are invalid because the ending _ is not followed by whitespace.

Now I'm not a markdown spec expert, but it is my impression that * and _ can be regarded as interchangeable and should render the same result. I may be wrong in this understanding.

Locally I'm using marked version 0.7.0. I assume that's what the Marked Demo also uses at this time.

@UziTech UziTech added category: inline elements L1 - broken Valid usage causes incorrect output OR a crash AND there is no known workaround for the issue labels Sep 13, 2019
@UziTech
Copy link
Member

UziTech commented Sep 13, 2019

it is my impression that * and _ can be regarded as interchangeable and should render the same result

Yes, but * can also act like - in an unordered list so they require different processing.

If you would like to create a PR working on this I would be happy to review it 😁

@oliversturm
Copy link
Author

Yes, but * can also act like - in an unordered list so they require different processing.

That is a good point. However (without looking at any code) I would imagine that once the unordered list use case of a particular * has been eliminated during parsing, further handling should be the same. Since that use case only involves *s matching \s*\*, I don't think this should influence the samples I tested. But I'm sure I'm underestimating the complexity of the issue :)

Will have to see about time to work on that PR...

@oliversturm
Copy link
Author

I've looked around a bit and the problem is clearly that the em regex assumes that *emphasis* is invalid when followed by a non-punctuation and non-special character like s (in my test case).

I tried to apply some simple fixes and I found that the test for example 373 explicitly expects this behavior.

The results I got are still peculiar in that

  • When using *, Commonmark supports my scenario, apparently in spite of the specs
  • When using _, marked differs from Commonmark, and the Commonmark result seems to make more sense
  • Both marked and Commonmark render different results for * and _.

There is a clear difference in the regex for the handling of _text_ and *text*.

For _, there are these three regex parts:

  • ^_([^\s_])_(?!_)
  • ^_([^\s<][\s\S]*?[^\s_])_(?!_|[^\spunctuation])
  • ^_([^\s_<][\s\S]*?[^\s])_(?!_|[^\spunctuation])

For * there are also three parts:

  • ^\*([^\s*<\[])\*(?!\*)
  • ^\*([^\s<"][\s\S]*?[^\s\*])\*(?!\*|[^\spunctuation])
  • ^\*([^\s*"<\[][\s\S]*?[^\s])\*(?!\*)

For some reason, these regexes seem to assume that the sets of characters allowed to follow starting and ending _ are quite different from those for *. It would be easy enough to change these so both delimiters behave the same way, but I wouldn't want to do so without understanding first why they were clearly kept separate by whoever wrote these regexes.

@UziTech
Copy link
Member

UziTech commented Sep 13, 2019

You could look at the CommonMark spec to see the differences for * and _

@UziTech
Copy link
Member

UziTech commented Sep 13, 2019

It seems like the difference in *em*s and _em_s is:

  1. A single * character can open emphasis iff (if and only if) it is part of a left-flanking delimiter run.

  2. A single _ character can open emphasis iff it is part of a left-flanking delimiter run and either (a) not part of a right-flanking delimiter run or (b) part of a right-flanking delimiter run preceded by punctuation.

@oliversturm
Copy link
Author

Right... I see that Example 354 allows emphasis in situations like mine, for *. I'll check again later what changes I made - they may have been incorrectly applied to both * and _, which caused the example 373 test to fail. However, the specific case I was testing failed in spite of all existing tests passing, so there might still be more complicated reasons for that.

@Scrum
Copy link
Contributor

Scrum commented Apr 13, 2020

@oliversturm @UziTech This is already fixed in the version v0.8.2

@UziTech
Copy link
Member

UziTech commented Apr 13, 2020

The demo does look correct in master. The fix in #1636 will be in the next release.

@UziTech UziTech closed this as completed Apr 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: inline elements L1 - broken Valid usage causes incorrect output OR a crash AND there is no known workaround for the issue
Projects
None yet
Development

No branches or pull requests

3 participants