String after < is completely removed, if it is not followed by a space #160

Open
ajayRaghav37 opened this issue Jun 14, 2018 · 7 comments

@ajayRaghav37

ajayRaghav37 commented Jun 14, 2018

SNIPPET TO REPRODUCE

const htmlToText = require('html-to-text');

let textResponse = htmlToText.fromString('<p>there are definitely <10,000 terrestrial planets in the universe. Only few of them would be habitable for future human.</p>', {
    wordwrap: false
});

console.log(textResponse);

EXPECTED
there are definitely <10,000 terrestrial planets in the universe. Only few of them would be habitable for future human.

ACTUAL OUTPUT
there are definitely

@mlegenhausen
Member

The problem is that < is interpreted as the start of an opening tag, so you need to replace it with &lt;. This is not a problem in this module; the problem seems to be in the HTML parser it uses.

@ajayRaghav37
Author

I cannot replace every < with &lt; because the input is dynamic and I need to preserve both the HTML markup and the plain text. It wouldn't be a problem if users punctuated correctly 😄

Anyway, I will try to do a workaround on this and will post once I am done.
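
A minimal sketch of one possible workaround, assuming it is acceptable to pre-process the input before conversion. The helper name `escapeStrayLessThan` is hypothetical and not part of html-to-text. It escapes only those `<` characters that cannot begin markup, so real tags are left alone:

```javascript
// Hypothetical helper (not part of html-to-text): escape "<" characters
// that cannot start a tag, so the parser treats them as text.
function escapeStrayLessThan(html) {
  // A "<" can begin markup only when followed by an ASCII letter (start tag),
  // "/" (end tag), "!" (comment or doctype) or "?" (processing instruction).
  return html.replace(/<(?![A-Za-z\/!?])/g, '&lt;');
}

const input = '<p>there are definitely <10,000 terrestrial planets.</p>';
console.log(escapeStrayLessThan(input));
// <p>there are definitely &lt;10,000 terrestrial planets.</p>
```

The result can then be passed to html-to-text as usual. This does not help when the text after `<` starts with a letter, because that is indistinguishable from a real tag without further heuristics.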

@KillyMXI
Member

This particular example doesn't reproduce in version 7.
htmlparser2 got smarter recently and doesn't consider <10,000 ... as a tag anymore.
Still, it's not perfect and can be confused in other situations, such as <ten thousand ....

@KillyMXI
Member

KillyMXI commented Feb 17, 2021

You know what? Even Blink (Chrome's engine) is confused by <ten thousand ....
I suppose it might be a performance optimization - being ready to roll back the parser state when something doesn't make sense might be costly and not worth it at scale.

HTML spec also doesn't seem to be helpful - it is really permissive about tag attributes and doesn't even ban < character from occurrence inside a tag as far as I can see.

It requires some effort to collect the behavior across numerous JS HTML parsers. So far I know that Angular has a particularly smart parser, but that's probably not a great dependency for a project like html-to-text. The majority seems to allow out-of-spec stuff such as non-alphanumeric tag names, much like Blink.
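
A rough sketch of why the rest of the sentence disappears once a `<` followed by a letter is accepted as a tag start: everything up to the next `>` is consumed as the tag name and attributes. The function name `findSwallowedSpan` is hypothetical, and the model is deliberately simplified (it ignores quoted attribute values, which may contain `>`), so it illustrates the behavior and does not reproduce any particular parser:

```javascript
// Hypothetical illustration, not html-to-text or htmlparser2 code.
// Returns the part of `html` that a tokenizer would consume as one tag
// when a "<" followed by a letter appears in running text.
// Simplification: quoted attribute values (which may contain ">") are ignored.
function findSwallowedSpan(html) {
  const match = /<[A-Za-z]/.exec(html);
  if (!match) return null;
  const start = match.index;
  const end = html.indexOf('>', start);
  // Without a closing ">", the rest of the input is consumed.
  return html.slice(start, end === -1 ? html.length : end + 1);
}

const sample = 'definitely <ten thousand planets. Only <b>few</b> are habitable.';
console.log(findSwallowedSpan(sample));
// <ten thousand planets. Only <b>
```

In this model the second `<` does not end the span; it simply becomes part of the same tag, which matches the observation that `<` is not banned inside a tag.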

@KillyMXI
Member

KillyMXI commented Feb 21, 2021

Ok, now I'm pretty confident there is no parser to switch to in order to address this issue.
https://astexplorer.net/ contains most of the ones worth looking at, and I made a PR there for the only one missing.
There are more projects but those are either unhealthy or reusing one of the parsers such as parse5.

@angular/compiler contains a nice parser but in itself it doesn't look like a good dependency. Forking it might be a way to go but I'm not convinced it is the right way to go. I would prefer not to maintain a parser too...

If there is a good example of how a certain HTML fragment should be interpreted according to the spec and how that differs from what AST explorer shows, it would be better filed upstream (in the parser repo, htmlparser2).

I'll keep this issue open as a reference, but there is nothing more I can do about it, for now at least.

@sairupesh

sairupesh commented Apr 13, 2021

I am facing the same issue even when the HTML I pass has &lt; instead of <.
My html:

<div>
    <ul>
        <li><i>Point 1 - this is point 1</i></li>
        <li><span style="font-weight: 700;">Point 2 - &lt;this is point 2&gt;</span></li>
    </ul>
</div>

The output completely skips "this is point 2".

@KillyMXI
Member

@sairupesh I can't reproduce this.
Sounds like you're unescaping html somewhere in your pipeline before calling html-to-text.

const { htmlToText } = require('html-to-text');

const text = htmlToText(
  `<div>
  <ul>
      <li><i>Point 1 - this is point 1</i></li>
      <li><span style="font-weight: 700;">Point 2 - &lt;this is point 2&gt;</span></li>
  </ul>
</div>`
);
console.log(text);

Output:
 * Point 1 - this is point 1
 * Point 2 - <this is point 2>
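
To illustrate the unescaping hypothesis above, here is a small sketch of what an entity-decoding step earlier in a pipeline would do to this input. The function `unescapeBasicEntities` is a hypothetical stand-in for whatever such a step might be. Nothing in the report confirms that one exists:

```javascript
// Hypothetical pipeline step, shown only to illustrate the suspected cause.
function unescapeBasicEntities(html) {
  // "&amp;" is decoded last so that "&amp;lt;" does not become "<".
  return html.replace(/&lt;/g, '<').replace(/&gt;/g, '>').replace(/&amp;/g, '&');
}

const escaped = '<span>Point 2 - &lt;this is point 2&gt;</span>';
const unescaped = unescapeBasicEntities(escaped);

console.log(unescaped);
// <span>Point 2 - <this is point 2></span>
// "<this is point 2>" now looks like a tag named "this", so a parser
// treats it as markup instead of text.
```

Logging the exact string right before it is handed to html-to-text is a quick way to check whether something like this is happening.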
