Malformed HTML parsed differently from browsers #512

Open
demurgos opened this issue Oct 1, 2023 · 1 comment

demurgos commented Oct 1, 2023

I have an HTML file with markup that can be reduced to the following:

<html>
<body>
<div id="div0">
  <a hr
</div>
<div id="div1">
  <div id="div2"></div>
  <div id="div3">
    <a href="/">bar</a>
  </div>
</div>
</body>
</html>

Notice the truncated <a tag on line 4 (caused by an HTML fragment accidentally truncated in the DB).

If I create a file with this content, load it in Firefox, and print the resulting DOM with document.getElementsByTagName("html")[0].outerHTML, Firefox returns:

<html><head></head><body>
<div id="div0">
  <a hr="" <="" div="">
</a><div id="div1"><a hr="" <="" div="">
  <div id="div2"></div>
  </a><div id="div3"><a hr="" <="" div="">
    </a><a href="/">bar</a>
  </div>
</div>
</body></html>
  • The truncated link results in 3 nodes in the DOM
  • The well-formed tag with text bar is still present in the output

However, if I parse the input with html5ever and print back the result, I get:

<html><head></head><body>
<div id="div0">
  <a hr="" <="" div="">
</a><div id="div1"><a hr="" <="" div="">
  <div id="div2"></div>
  </div>


</div></body></html>
  • The truncated link only appears twice
  • The well-formed link with bar completely disappeared!
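
The two observations above can be checked mechanically. This is a minimal sketch, assuming the two serialized outputs are exactly as pasted in this comment; `count_occurrences` and the constant names are hypothetical helpers for illustration, not part of html5ever or scraper.

```rust
/// Counts non-overlapping occurrences of `needle` in `haystack`.
fn count_occurrences(haystack: &str, needle: &str) -> usize {
    haystack.matches(needle).count()
}

// Serialized DOM reported for Firefox in this issue.
const FIREFOX: &str = r#"<html><head></head><body>
<div id="div0">
  <a hr="" <="" div="">
</a><div id="div1"><a hr="" <="" div="">
  <div id="div2"></div>
  </a><div id="div3"><a hr="" <="" div="">
    </a><a href="/">bar</a>
  </div>
</div>
</body></html>"#;

// Serialized output reported for html5ever (through the author's TreeSink).
const HTML5EVER: &str = r#"<html><head></head><body>
<div id="div0">
  <a hr="" <="" div="">
</a><div id="div1"><a hr="" <="" div="">
  <div id="div2"></div>
  </div>


</div></body></html>"#;

// The start tag produced by the truncated link, and the intact link.
const BROKEN_ANCHOR: &str = r#"<a hr="" <="" div="">"#;
const GOOD_ANCHOR: &str = r#"<a href="/">bar</a>"#;

fn main() {
    for (label, html) in [("firefox", FIREFOX), ("html5ever", HTML5EVER)] {
        println!(
            "{label}: broken anchors = {}, intact anchors = {}",
            count_occurrences(html, BROKEN_ANCHOR),
            count_occurrences(html, GOOD_ANCHOR),
        );
    }
}
```

Comparing substring counts only confirms how many times each anchor was serialized; it says nothing about where the nodes sit in the tree.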

EDIT: See next message, there are still some differences but the ones here seem to be caused by the TreeSink impl I used, not the parser.

This difference in interpretation between Firefox/Chrome and html5ever causes problems when I process these documents to recover their content. I'm well aware that the input is broken, but I would expect html5ever to produce the same structure as real browsers.


EDIT: Here is an even smaller repro. Removing the newline fixes the mismatch.

<html><body><div><a hr</div><div><div></div>
<div><a href="/">bar</a></div></div></body></html>
demurgos changed the title from "Malformed HTML parsed differently from Firefox and Chrome" to "Malformed HTML parsed differently from browsers" on Oct 1, 2023

demurgos commented Oct 1, 2023

Running the arena example, I actually get a result close to what real browsers produce.

I added Debug to html5ever/examples/arena:

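impl<'arena> std::fmt::Debug for Node<'arena> {
    // Fully qualified `Formatter` so the snippet needs no extra `use` line.
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        // Only downward and forward links are printed; also printing
        // `parent` would recurse forever through the cyclic references.
        f.debug_struct("Node")
            .field("data", &self.data)
            .field("first_child", &self.first_child)
            .field("next_sibling", &self.next_sibling)
            .finish()
    }
}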

And then executed:

$ cat ./malformed.html
<html><body><div><a hr</div><div><div></div>
<div><a href="/">bar</a></div></div></body></html>
$ cargo run --example arena < ./malformed.html

This produced a tree corresponding to:

<document>
  <html>
    <head></head>
    <body>
      <div>
        <a hr<="" div=""></a>
        <div>
          <a hr<="" div="">
            <div></div>
            "\n"
          </a>
          <div>
            <a hr<="" div=""></a>
            <a href="/">bar</a>
          </div>
        </div>
        "\n"
      </div>
    </body>
  </html>
</document>

The differences from real browsers are:

  • there is a <div> inside the second anchor, whereas in browsers that anchor is empty.
  • the broken anchors have two attributes instead of three.
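
The attribute count may depend on the input rather than the parser: the original markup has a newline between `<a hr` and `</div>`, while the reduced repro does not. Below is a simplified sketch of the start-tag attribute-name states as I understand them from the WHATWG HTML tokenizer rules. It is not html5ever's code; `start_tag_attribute_names` and the `State` enum are hypothetical names, and attribute values (`=`) are deliberately not modeled.

```rust
/// Tokenizer states relevant to attribute names (values are not modeled).
enum State {
    TagName,
    BeforeAttrName,
    AttrName,
    AfterAttrName,
    SelfClosing,
}

/// Stores the attribute name being built; duplicate names are dropped.
fn finish_attr(names: &mut Vec<String>, current: &mut String) {
    let name = std::mem::take(current);
    if !names.contains(&name) {
        names.push(name);
    }
}

/// Returns the tag name and attribute names of the start tag that begins
/// `input`, or `None` on EOF inside the tag or on an unsupported `=`.
fn start_tag_attribute_names(input: &str) -> Option<(String, Vec<String>)> {
    let mut chars = input.strip_prefix('<')?.chars().peekable();
    let mut state = State::TagName;
    let mut tag = String::new();
    let mut names = Vec::new();
    let mut current = String::new();

    while let Some(&c) = chars.peek() {
        let is_ws = matches!(c, '\t' | '\n' | '\x0C' | ' ');
        // Set to false when a state "reconsumes" the current character.
        let mut consume = true;
        match state {
            State::TagName => match c {
                _ if is_ws => state = State::BeforeAttrName,
                '/' => state = State::SelfClosing,
                '>' => return Some((tag, names)),
                _ => tag.push(c.to_ascii_lowercase()),
            },
            State::BeforeAttrName => match c {
                _ if is_ws => {}
                '/' | '>' => { state = State::AfterAttrName; consume = false; }
                '=' => return None, // values are out of scope for this sketch
                _ => { state = State::AttrName; consume = false; }
            },
            State::AttrName => match c {
                _ if is_ws || c == '/' || c == '>' => {
                    finish_attr(&mut names, &mut current);
                    state = State::AfterAttrName;
                    consume = false;
                }
                '=' => return None,
                // `<`, `"` and `'` are parse errors but still join the name.
                _ => current.push(c.to_ascii_lowercase()),
            },
            State::AfterAttrName => match c {
                _ if is_ws => {}
                '/' => state = State::SelfClosing,
                '>' => return Some((tag, names)),
                '=' => return None,
                _ => { state = State::AttrName; consume = false; }
            },
            State::SelfClosing => match c {
                '>' => return Some((tag, names)),
                // Parse error: the `/` is dropped and attribute parsing resumes.
                _ => { state = State::BeforeAttrName; consume = false; }
            },
        }
        if consume {
            chars.next();
        }
    }
    None
}

fn main() {
    // Reduced repro: no whitespace between `hr` and `</div>`.
    println!("{:?}", start_tag_attribute_names("<a hr</div>"));
    // Original markup: a newline separates `hr` from `</div>`.
    println!("{:?}", start_tag_attribute_names("<a hr\n</div>"));
}
```

Under these assumptions, the input without whitespace yields the names `hr<` and `div`, and the input with a newline yields `hr`, `<`, and `div`, which would match the arena output and the Firefox output respectively.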

The other differences may be caused by my TreeSink. I'm using html5ever through scraper, so I'll check there too.
