Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to Nokogiri::HTML for html writer #46

Closed
wants to merge 1 commit into from

Conversation

hotgazpacho
Copy link
Contributor

By switching to the HTML parser from the XML parser, we can better handle malformed html documents and save out something that at least marginally resembles the input HTML.

Resolves #44

By switching to the HTML parser from the XML parser, we can better handle malformed html documents and save out something that at least marginally resembles the input HTML.
@@ -7,7 +7,7 @@ def extension
end

def format(data)
Nokogiri::XML(data.to_s.strip,&:noblanks).to_xhtml(:indent => 2, :encoding => 'UTF-8')
Nokogiri::HTML(data.to_s.strip,&:noblanks).to_xhtml(:indent => 2, :encoding => 'UTF-8')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really like to move to to_html, but only to_xhtml allows for the nice formatting and indenting. Making the switch would be a breaking change.

@markijbema
Copy link
Contributor

Is this no breaking change (that is, is the formatting exactly the same)? I wouldn't mind if it was (I think this is a great improvement), but in that case, please add it to the changelog (though I think also in this case it's worth mentioning in the changelog. We had to discard approvals for html in a project because of this, so I guess it might be interesting for others to hear as well).

@hotgazpacho
Copy link
Contributor Author

As implemented, this is not a breaking change. However, I do have concerns about when the input is say, HTML5; markup that is perfectly valid there isn't necessarily valid XHTML (for example, in HTML5, you need not close certain tags). On the one hand, you're effectively changing the output. On the other hand, this was already happening (in order to get the pretty indenting).

@markijbema
Copy link
Contributor

You still have the choice to compare the exact output using the text comparison. In my perception, the html should say documents are the same if the browser renders them the same. Of course, that isn't feasible to check, so wouldn't "if the dom tree is the same" be a good approximation? If this writer does not change the output, what is the value of it (as in, why use it over text)?

@hotgazpacho
Copy link
Contributor Author

Aye, there's the rub: transliterating from HTML5 to XHTML is no guarantee that the browser will render it the same. I dare say that it will almost guarantee that it won't. Especially if your input is much looser than XHTML allows. Not even sure that anyone uses XHTML anymore.

I guess it really comes down to what your expectations are with this formatter.

@markijbema
Copy link
Contributor

No sure. I was trying to fish for what you expect out of this formatter.

I expect it allows me to detect differences, but ignore small 'formatting' changes (double spaces, where the newlines are, etc). I don't expect it to do much more, because then I'd use a screenshot test. I wouldn't expect it to do much less, because then I would use a text approval.

Basically, I would want those two to be the same:

<div><a>yo</a></div>
<div>
  <a>yo</a>
</div>

But I don't have deep investment into a specific behaviour. I don't see need for formatting as xhtml at all, as long as there is some consistency. So what documents would you expect to be considered the same?

@hotgazpacho
Copy link
Contributor Author

So, the reason I ask is that we only get the pretty formatting/indenting if and only if we export as xhtml (nokogiri under the hood). This has effects on the doctype declaration, the head, the script declarations, boolean attributes (per your original issue), etc.

That said, <p>some paragraph<p>another paragraph is valid HTML, but when we export it to XHTML, it becomes <p>some paragraph</p><p>another paragraph</p>. Semantically the same, but different. But, as you mention, if you're concerned about the exact content, you'd probably want the text formatter.

The test I wrote for this feature uses the HTML fragment you originally reported, and the change transforms it to valid HTML.

@kytrinyx
Copy link
Contributor

kytrinyx commented Apr 4, 2014

I'd be happy to go with something other than nokogiri if there's some way to get:

  • normalizing whitespace
  • no addition/removal of markup

Right now this is all pretty broken (see #7)

@markijbema
Copy link
Contributor

@hotgazpacho ah sorry, I didn't notice that. This change seems sensible to me. I'd say the html interpretation is also very valid.

I vote merge :)

@hotgazpacho
Copy link
Contributor Author

So, if the goal is really to freeze some legacy stuff with potentially invalid markup, then normalizing whitespace is not going to be a good idea. In cases with invalid markup, it throws the browser into quirks mode, and whitespace can often be very significant.

Throwing it into nokogiri is also not going to be a good idea in this scenario, because nokogiri is a parser, and it is going to try to make sense of it. Then, when you go to output, it has to transform it back from its internal representation of the nodes into XHTML. Unless you started with valid XHTML, you're pretty much guaranteed to change the output.

I don't know that the goals of normalizing whitespace while not touching the markup can be achieved with this formatter. Nor can they really be achieved in combination with the goal of making a golden master of the existing invalid markup. You'd really have to use the text formatter and a really good visual diff tool in this scenario.

Perhaps a better idea might be to warn the user when we detect invalid markup, and suggest that they switch to the text formatter, while outlining the reasoning for it.

@kytrinyx
Copy link
Contributor

kytrinyx commented Apr 5, 2014

I like the idea of warning about invalid markup.

OK, I agree -- the goal for this formatter is (henceforth :-)) as you've described.

It would be lovely to get some documentation about this in the README eventually (there's a lot of "it would be lovely" to be had in the README).

Anyway: Merging. ❤️

kytrinyx added a commit that referenced this pull request Apr 5, 2014
@kytrinyx kytrinyx closed this Apr 5, 2014
@kytrinyx
Copy link
Contributor

kytrinyx commented Apr 5, 2014

I've released v0.0.15 with this change in it.

@hotgazpacho
Copy link
Contributor Author

Nice :)

I'll look into adding the warning bits in a new issue and pull request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Html approver mangles html with boolean attributes
3 participants