Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect handling of tabs in link resources compared to spaces, newlines #191

Open
wooorm opened this issue May 28, 2020 · 13 comments · May be fixed by #259
Open

Incorrect handling of tabs in link resources compared to spaces, newlines #191

wooorm opened this issue May 28, 2020 · 13 comments · May be fixed by #259

Comments

@wooorm
Copy link

wooorm commented May 28, 2020


The bug probably stems from here:

this.spnl() &&
, which seems to include spaces and newlines but not tabs.


This bug can be reproduced with the following permalink to the dingus: https://spec.commonmark.org/dingus/?text=tab%3A%20%5Bx%5D(%09y)%0Aspace%3A%20%5Bx%5D(%20y)%0A%0Atab%3A%20%5Bx%5D(y%09)%0Aspace%3A%20%5Bx%5D(y%20)%0A%0Atab%3A%20%5Bx%5D(%09%3Cy%3E)%0Aspace%3A%20%5Bx%5D(%20%3Cy%3E)%0A%0Atab%3A%20%5Bx%5D(y%09%22z%22)%0Aspace%3A%20%5Bx%5D(y%20%22z%22)%0A.

The expected behavior is that tabs and spaces behave the same.

@mikeando
Copy link

mikeando commented Jun 23, 2022

It looks like all that would be required is to change one line in inlines.js: https://github.com/commonmark/commonmark.js/blob/master/lib/inlines.js#L74

var reSpnl = /^ *(?:\n *)?/;

to

var reSpnl = /^[ \t]*(?:\n[ \t]*)?/;

but I'v not built commonmark.js before, so I'm not sure how to go about testing it.

@mikeando
Copy link

Actually, that new regex only covers this one case, but the 0.30 spec says we should interpret "whitespace" as:

A Unicode whitespace character is any code point in the Unicode Zs general category, or a tab (U+0009), line feed (U+000A), form feed (U+000C), or carriage return (U+000D).

so while we're in here we should add those.

@mikeando
Copy link

It looks like all of the whitespace definitions in that part of the code could do with a cleanup.

var reWhitespaceChar = /^[ \t\n\x0b\x0c\x0d]/;
var reUnicodeWhitespaceChar = /^\s/;
var reFinalSpace = / *$/;
var reInitialSpace = /^ */;
  • reWhitespaceChar includes (U+000B), which it should not.
  • reUnicodeWhitespacChar \s includes (U+000B), which it should not - and I believe the regex is not unicode aware.
  • Final space and initial space should also allow other space characters.

I'm not even sure that we need to differentiate between unicode-white-space and white-space, the spec seems to have changed between 0.29 and 0.30 to only have the definition for unicode-white-space.

@wooorm
Copy link
Author

wooorm commented Jun 23, 2022

Where did you see that this case should be covered by unicode whitespace?
I don’t believe it is. I believe only emphasis/strong use unicode whitespace.

@mikeando
Copy link

Where did you see that this case should be covered by unicode whitespace?

The spec says "white-space" and the only definition for whitespace is now "unicode whitespace".
In 0.29 there was a definition for whitespace in the same location, but it was removed.

Maybe ascii-whitespace was intended, and the definition should not have been removed...
IIRC in 0.29 whitespace did include (U+000B) so the changes would be fewer.

@wooorm
Copy link
Author

wooorm commented Jun 23, 2022

In 0.29 it says whitespace. In 0.30 it doesn’t. I believe I fixed that confusion. Or, where do you see whitespace?

@mikeando
Copy link

You're right the spec does explicitly call it out as spaces/tabs in links.

These four components may be separated by spaces, tabs, and up to one line ending.

@mikeando
Copy link

In that case, my initial fix should be sufficient I think.

I'm miss-remembering from some other issues I've been digging into for space handling in HTML elements... sorry.

@wooorm
Copy link
Author

wooorm commented Jun 23, 2022

Yeah. I also found this confusing. That’s why I fixed it :)
In markdown, only [ \t], and in some cases \r\n|\n|\r are considered as “whitespace”.
In markdown, only ASCII punctuation is considered as punctuation.
Except for emphasis/strong, which checks for unicode whitespace and unicode punctuation.

@mikeando
Copy link

This is wandering off-topic a little - but this is enlightening for me.

Does this mean the current parsing of <T\b> as an HTML entry is wrong?
The \b cant be part of the tag-name (since it can only be ASCII digits, letters or hypen).
It similarly can't form an attribute-name as that must start with an ascii letter, and it
is also not a space or tab which is allowed between the elements.

cmark, commonmark.js and comrak (rust parser) all accept it as HTML though...

@wooorm
Copy link
Author

wooorm commented Jun 23, 2022

I believe that cmark, commonmark.js, and comrak, are incorrect for accepting it as HTML according to the CommonMark spec.
Note though, that Unicode is allowed in markdown. There are also cases where any character is allowed. That is to say, certain states check for X characters and so something, and then for Y characters to do something else, and finally allow any other character. The unquoted attribute value is such a case.
Though, your case can’t be an unquoted attribute value, as there is not attribute name.

@jgm
Copy link
Member

jgm commented Jun 23, 2022

@mikeando want to submit a PR for your simple fix above?

@mikeando mikeando linked a pull request Jun 23, 2022 that will close this issue
@mikeando
Copy link

@jgm PR here: #259

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants