Malformed HTML parsed differently from browsers #147

Open
demurgos opened this issue Oct 1, 2023 · 0 comments

demurgos commented Oct 1, 2023

I have the following input HTML file:

<html><body><div><a hr</div><div><div></div>
<div><a href="/">bar</a></div></div></body></html>

Notice the unclosed <a tag. This is a minimal repro; in my case the input comes from an accidentally truncated DB value.

If I open it in a browser (Firefox/Chrome) and print its DOM with document.getElementsByTagName("html")[0].outerHTML, I get:

<html><head></head><body>
<div id="div0">
  <a hr="" <="" div="">
</a><div id="div1"><a hr="" <="" div="">
  <div id="div2"></div>
  </a><div id="div3"><a hr="" <="" div="">
    </a><a href="/">bar</a>
  </div>
</div>
</body></html>

With scraper, if I parse it with Html::parse_document and print it with doc.root_element().html(), I get:

<html><head></head><body><div><a hr<="" div=""></a><div><a hr<="" div=""><div></div>
</div>
</div></body></html>

Notice that the anchor tag with text bar is missing!
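The difference can be pinned down with a small check on the two serialized outputs quoted above. This is only a sketch using the standard library: the constant and function names are hypothetical, and in a real regression test the second string would come from doc.root_element().html() rather than being pasted in.

```rust
// Browser serialization, copied verbatim from the output quoted above.
// (BROWSER_OUTPUT is a hypothetical name used for illustration.)
const BROWSER_OUTPUT: &str = r#"<html><head></head><body>
<div id="div0">
  <a hr="" <="" div="">
</a><div id="div1"><a hr="" <="" div="">
  <div id="div2"></div>
  </a><div id="div3"><a hr="" <="" div="">
    </a><a href="/">bar</a>
  </div>
</div>
</body></html>"#;

// scraper serialization, copied verbatim from the output quoted above.
// In an actual test this would be produced by doc.root_element().html().
const SCRAPER_OUTPUT: &str = r#"<html><head></head><body><div><a hr<="" div=""></a><div><a hr<="" div=""><div></div>
</div>
</div></body></html>"#;

/// Returns true if the serialized HTML still contains the well-formed
/// anchor from the input. This is a plain substring check, not a parse.
fn has_bar_anchor(html: &str) -> bool {
    html.contains(r#"<a href="/">bar</a>"#)
}

fn main() {
    println!("browser keeps anchor: {}", has_bar_anchor(BROWSER_OUTPUT));
    println!("scraper keeps anchor: {}", has_bar_anchor(SCRAPER_OUTPUT));
}
```

The check reports the anchor as present in the browser output and absent from the scraper output, which is the discrepancy this issue is about.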

Running this input through html5ever's example sinks, I get output close to what browsers produce (but still not identical, see servo/html5ever#512).

This suggests that the issue lies in scraper's TreeSink implementation.
