Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML renderer should distinguish between block and inline HTML #1745

Closed
Martii opened this issue Aug 11, 2020 · 11 comments · Fixed by #2768
Closed

HTML renderer should distinguish between block and inline HTML #1745

Martii opened this issue Aug 11, 2020 · 11 comments · Fixed by #2768
Labels
Projects

Comments

@Martii
Copy link
Contributor

Martii commented Aug 11, 2020

Describe the bug
<kbd>key</kbd> doesn't seem to be generating correctly for us. See final rendered output at https://openuserjs.org/about/Falkon#debugging and source for this file here.

See also:

To Reproduce
Steps to reproduce the behavior:

  1. Backed out the dependency here to 0.8.0 and the issue disappears.
  2. Rechecked our sanitizer and even didn't sanitize html (code point here with plain return of string) since this is what I thought it might initially be. Still doesn't seem to be generating correctly even at release 1.1.1 with marked.
  3. Removed markdown rendering (code point here with plain return of string) and key restored working as expected in the browser.

Expected behavior
A clear and concise description of what you expected to happen.

With anything after 0.8.0 the output gets rendered to html roughly as

<kbd></kbd>
key

... and more specifically at our document page source with html raw copy of inspection i.e. no line break for clarity:

<kbd></kbd>Ctrl + <kbd></kbd>Shift + <kbd></kbd>i

I realize the demo page doesn't appear to be doing this over there however this is one weird issue.

Should be (and used to be with 0.8.0 of marked):

<kbd>key</kbd>

See also:

  • Our server side marked manipulation here with most set options floating around here.
  • Our DOM manipulation settings at here.

Any help on understanding this change from 0.8.0 would be greatly appreciated. :) I also hope this issue is not replicated but I did do a brief search on this repo including code for kbd and only found the smartypants option which we've never used.

Tested in:

  • Mozilla/5.0 (X11; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0 , Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36 and under Windows to rule out any browser issue.
@UziTech
Copy link
Member

UziTech commented Aug 11, 2020

It looks like it is the sanitization that is closing the open tag before the text then removing the close tag after the text.

Opening and closing inline html tags are tokenized separately so sanitize html is seeing <kbd> and changing it to <kbd></kbd> the seeing </kbd> and removing it.

You can see the difference in tokenization in the v0.8.0 demo and the v1.1.1 demo

@Martii
Copy link
Contributor Author

Martii commented Aug 11, 2020

Appreciate the look... I'll retest it again... however...

Re:

It looks like it is the sanitization

But my test in dev is this from right above that code a few lines:

function sanitize(aHtml) {
  //return sanitizeHtml(aHtml, htmlWhitelistPost);
  return aHtml;  // Test disabling sanitizer
}

... and it still comes back with:

<kbd></kbd>Ctrl + <kbd></kbd>Shift + <kbd></kbd>i

Here's the retest with no sanitization in dev. Same issue which seems to be highly pointing to marked:

marked#1745

If I do this here:

exports.renderMd = function (aText) {
  //return marked(aText);
  return aText;
};

... then all is well with the native html (md isn't processed so it's mostly md but the Ctrl + Shift + i is not having issues with sanitizing):

marked#1745 1

@UziTech
Copy link
Member

UziTech commented Aug 12, 2020

it looks like the line hookNode.innerHTML = sanitized; also does the same thing

@Martii
Copy link
Contributor Author

Martii commented Aug 12, 2020

the line hookNode.innerHTML = sanitized; also does the same thing

That's just the final sanitized String results (which are bypassed in dev atm from item 2 above and retested/reshown in the first screen cap above) i.e. no change for this equivalent of documentFragment implementation in node.

I'm thinking more along the lines of the renderer (Camel caps too) portion of the line of var sanitized = sanitize(marked.Renderer.prototype[aType].apply(renderer, arguments));. If i'm bypassing any and all sanitizing by item number 2 up there with just calling the function and returning no manipulation it could either be some Renderer prototype change in this project or perhaps some worse issue in node.

When I get a few extra cycles I'll try it in latest node@14.x and see if it's a latest node@12.x issue specific to something this package utilizes. EDIT: Same issue in node@14.x.

As I mentioned above this is one weird issue. :\


Misc references for myself here:

@UziTech
Copy link
Member

UziTech commented Aug 12, 2020

That's just the final sanitized String results

yes but it does change the string. JSDom must be doing some sort of sanitization when setting innerHTML

hookNode.innerHTML = '<kbd>';
console.log(hookNode.innerHTML); // '<kbd></kbd>'

hookNode.innerHTML = '</kbd>';
console.log(hookNode.innerHTML); // ''

@Martii
Copy link
Contributor Author

Martii commented Aug 13, 2020

JSDom must be doing some sort of sanitization

Interesting guess and still appreciate the thoughts but...

It's just a filter for @mentions. I took the jsdom dep out of the picture early this morning by commenting them out. Same marked issue. Then I redisabled sanitize-html dep and then it went to the output that is on marked demo page (dumped our whole document into marked's server).

It is highly improbable that two extensions/packages would cause this issue not to mention the fact that marked@0.8.0 works perfectly. It's more probable that there is something broken in marked. If given a choice between insecure HTML, losing @mentions, and using the latest marked, we'll pick secure with sanitize-html along with jsdom and unfortunately have to migrate to another markdown processor or lower the version of marked... but I'd like to keep trying here as I like the project.

Those renderer changes are quite notable between 0.8.0 an 0.8.1 and perhaps it can't be utilized the way we are doing it with overriding the rendering or it's broken for extending the renderer in marked or there is a node issue floating around. It's going to take me quite some time to figure it out and I'll have to reread the advanced and pro docs here to see if any changes. Sizzle wrote the most of the initial markdown library so I get to go figure out what he did, any changes since then, and verify it.

I'll try backing out the relevant changes in marked manually in dev to see if that helps (or breaks which is always possible).


NOTE: This will change a bit during debugging.

  1. Replacing 0.8.1's lib/marked.js with 0.8.0's lib/marked.js yields the kbd issue.
  2. Checking marked's dependency changes between 0.8.0 and 0.8.1 ...
@@ -1,6 +1,6 @@
      "version": "7.1.0",
      "resolved": "https://registry.npmjs.org/acorn/-/acorn-7.1.0.tgz",
      "integrity": "sha512-kL5CuoXA/dgxlBbVrflsflzQ3PAas7RYZB52NOm/6839iVYJgKMJ3cQJD+t2i5+qFa8h3MDpEOJiS64E8JLnSQ==",

      "version": "7.1.1",
      "resolved": "https://registry.npmjs.org/acorn/-/acorn-7.1.1.tgz",
      "integrity": "sha512-add7dgA5ppRPxCFJoAGfMDi7PIBXq1RtGo7BhbLaxwrXPOmw8gq48Y9ozT01hUKy9byMjlR20EJhu5zlkErEkg==",

... acorn changed.

  1. Replacing 0.8.1's lib/marked.esm.js with 0.8.0's lib/marked.esm.js yields the kbd issue.
  2. Replacing 0.8.1's bin/marked with 0.8.0's bin/marked yields the kbd issue.
  3. ... Continue replacing one file at a time in 0.8.1's with 0.8.0's ...
  4. BINGO .... marked/src/InlineLexer.js fixes 0.8.1 ... resetting marked dependency and retesting... CONFIRMED

Now I get to go see what changed there.

  1. From send inline html to renderer #1602 :
--- ./InlineLexer.js    1985-10-26 02:15:00.000000000 -0600
+++ ~/tmp/marked.0.8.0/src/InlineLexer.js      1985-10-26 02:15:00.000000000 -0600
@@ -82,11 +82,11 @@
         }
 
         src = src.substring(cap[0].length);
-        out += this.renderer.html(this.options.sanitize
-          ? (this.options.sanitizer
+        out += this.options.sanitize
+          ? this.options.sanitizer
             ? this.options.sanitizer(cap[0])
-            : escape(cap[0]))
-          : cap[0]);
+            : escape(cap[0])
+          : cap[0];
         continue;
       }
  1. Misc note: Some WEIRD file date/times from npmjs.com installs and here. My clock is set by NIST.

@Martii
Copy link
Contributor Author

Martii commented Aug 13, 2020

@UziTech

So now to find this change source from #1602 in the new tree structure for marked@1.1.1. Any quick knowing where? ( #1627 ...) I'll still look in the current HEAD. Will also have to compare the bug at #1601 and understand it fully.

Thanks in advance. :)

@UziTech
Copy link
Member

UziTech commented Aug 13, 2020

It was actually the fix in #1602 that you pointed out that makes this happen.

Here is what is happening:

  • In v0.8.0 inline HTML would not be sent to the renderer so your renderer would just gets called with '<kbd>key</kbd>'.
  • send inline html to renderer #1602 fixed that so in v0.8.1+ your render gets called with '<kbd>' (which your renderer turns into '<kbd></kbd>'), then '</kbd>' (which your renderer turns into ''), and finally '<kbd></kbd>key'

The issue is that your renderer is treating inline html tags as a block of html.

Maybe we could tell the html renderer whether it is inline or not and your render function can ignore inline html tokens?

@Martii
Copy link
Contributor Author

Martii commented Aug 13, 2020

Understand what you mean now. The granularity of the html renderer method changed and increased. This was good for for #1601 but bad for the way we sanitize, mention, and whatever else we add in the future.

A couple of things that need clarification before I attempt to answer your question.

  1. So in v0.8.0- it used to be solely a Block level rendering method and in v0.8.1 it became also an Inline level rendering method. Would this be a fair conclusion to make? If yes, are there other "cross-overs" like this? ❓

  2. We tried back in the day to move sanitizing out of the renderer and it hosed syntax highlighting. There's not enough time and effort to cover 100% of their classes and other attributes they add in to successfully write exceptions to filter whether transformed or ignored.

  3. We tried the text renderer but it was splitting up certain text elements into two, or more pieces depending on other md codes floating around. i.e. misses of hyper linking. Our users/authors have a sense of humor sometimes with showing us bugs in the rendering so mentions would be clipped under certain circumstances.

  4. We ended up with what we've been dealing with during this issue because it was usually distinct complete blocks of HTML code. e.g. <kbd>key</kbd> as we've been using in this issue. The sanitizer is a must and the mentioner needs well formed HTML code or XPath won't be able to create the object needed for parsing.

So basically we've tried just about everything that we know of in positioning the sanitizer and the mentioner in various places.

So to attempt to have an answer for your question of:

Maybe we could tell the html renderer whether it is inline or not and your render function can ignore inline html tokens?

Some sort of flags argument, or equivalent, would definitely help to have an "end stage" value but my concern is still on my question Number 1 above in this comment. If Number 1 is a "yes, that's how it works logically"... Would it be better to have an inlinehtml render method and restore the block level html or use some sort of flag to indicate that html is at a different stage of the process like you mentioned? ❓ i.e. wouldn't want #1601 to recur depending on the answer here.

Another note loosely related to this: When using HTML <hr class="foo"> it never hits the renderers that we extend... I even tried the hr renderer but I presume that is for --- md codes. Without going too far off the topic, what did I miss? If I didn't miss anything... how does this relate to the granularity of html renderer? ❓

@UziTech
Copy link
Member

UziTech commented Aug 14, 2020

  1. Yes the text renderer gets inline and block text.
  2. v1.0.0 introduced the tokenizer that might be a better place to do the sanitization before the tokens are created. Or you could run a function on every token to do the sanitization with the walkTokens option.
  3. That could be a bug but without the offending markdown I can't be sure.
  4. see 2. for better places for the sanitizer and mentioner.

Would it be better to have an inlinehtml render method and restore the block level html

Maybe but that would be a breaking change and we would like to reduce the number of breaking changes due to the number of dependents. If there is a non-breaking way it would be better to do that now and implement the breaking change later.

When using HTML <hr class="foo"> it never hits the renderers that we extend

It does when I try it. It calls the html renderer.

@Martii
Copy link
Contributor Author

Martii commented Aug 14, 2020

It does when I try it.

Hmmm... dev which is node@14.x atm isn't sanitizing it. On production it's sanitizing it (temporarily put it up something similar at the same section of "Debugging" under Falkon) which is node@12.x. Will look into this later. Could be another weird artifact of the breaking change from 0.8.0 to 0.8.1.

Ref:
Mising sanitizing

Thanks for the newer feature recommendations and clarifications on which ones are dual. Will look into tokenizer as soon as I can.

If there is a non-breaking way it would be better to do that now and implement the breaking change later.

Understood. That's why I asked if there were any other renderers that have the same possibilities as html... you said yes so your idea is probably the best esp. if they grow. :)

@UziTech UziTech changed the title Weirdness with kbd tag HTML renderer should distinguish between block and inline HTML Dec 8, 2020
@UziTech UziTech added this to To Do in vNext via automation Dec 8, 2020
@UziTech UziTech moved this from To Do to In Progress in vNext Apr 1, 2023
vNext automation moved this from In Progress to Done May 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
vNext
Done
Development

Successfully merging a pull request may close this issue.

2 participants