Switch to Nokogiri::HTML for html writer #46

hotgazpacho · 2014-04-04T03:37:39Z

By switching to the HTML parser from the XML parser, we can better handle malformed html documents and save out something that at least marginally resembles the input HTML.

Resolves #44

hotgazpacho · 2014-04-04T03:39:40Z

lib/approvals/writers/html_writer.rb

@@ -7,7 +7,7 @@ def extension
      end

      def format(data)
-        Nokogiri::XML(data.to_s.strip,&:noblanks).to_xhtml(:indent => 2, :encoding => 'UTF-8')
+        Nokogiri::HTML(data.to_s.strip,&:noblanks).to_xhtml(:indent => 2, :encoding => 'UTF-8')


I'd really like to move to to_html, but only to_xhtml allows for the nice formatting and indenting. Making the switch would be a breaking change.

markijbema · 2014-04-04T08:16:36Z

Is this not a breaking change (that is, is the formatting exactly the same)? I wouldn't mind if it was (I think this is a great improvement), but in that case, please add it to the changelog (though I think it's worth mentioning in the changelog in this case as well. We had to discard approvals for html in a project because of this, so I guess it might be interesting for others to hear as well).

hotgazpacho · 2014-04-04T11:14:17Z

As implemented, this is not a breaking change. However, I do have concerns about when the input is, say, HTML5; markup that is perfectly valid there isn't necessarily valid XHTML (for example, in HTML5, you need not close certain tags). On the one hand, you're effectively changing the output. On the other hand, this was already happening (in order to get the pretty indenting).

markijbema · 2014-04-04T11:37:27Z

You still have the choice to compare the exact output using the text comparison. In my perception, the html should say documents are the same if the browser renders them the same. Of course, that isn't feasible to check, so wouldn't "if the dom tree is the same" be a good approximation? If this writer does not change the output, what is the value of it (as in, why use it over text)?

hotgazpacho · 2014-04-04T13:05:17Z

Aye, there's the rub: transliterating from HTML5 to XHTML is no guarantee that the browser will render it the same. I dare say that it will almost guarantee that it won't. Especially if your input is much looser than XHTML allows. Not even sure that anyone uses XHTML anymore.

I guess it really comes down to what your expectations are with this formatter.

markijbema · 2014-04-04T14:41:54Z

No, sure. I was trying to fish for what you expect out of this formatter.

I expect it allows me to detect differences, but ignore small 'formatting' changes (double spaces, where the newlines are, etc). I don't expect it to do much more, because then I'd use a screenshot test. I wouldn't expect it to do much less, because then I would use a text approval.

Basically, I would want those two to be the same:

<div><a>yo</a></div>

<div>
  <a>yo</a>
</div>

But I don't have deep investment into a specific behaviour. I don't see need for formatting as xhtml at all, as long as there is some consistency. So what documents would you expect to be considered the same?
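
For illustration, the equivalence asked for above can be sketched in plain Ruby without any parser. This is only a rough sketch: `normalize_markup_whitespace` is a hypothetical helper, not part of the approvals gem, and it assumes that whitespace between tags and runs of whitespace inside text are insignificant. That assumption does not hold for `<pre>` content or for whitespace between inline elements.

```ruby
# Hypothetical helper, not part of the approvals gem.
# Assumption: whitespace that sits only between tags carries no meaning,
# and runs of whitespace inside text can be collapsed to a single space.
def normalize_markup_whitespace(markup)
  markup.to_s
        .gsub(/>\s+</, '><') # drop whitespace found only between two tags
        .gsub(/\s+/, ' ')    # collapse any remaining runs of whitespace
        .strip
end

compact  = '<div><a>yo</a></div>'
indented = "<div>\n  <a>yo</a>\n</div>"

# The two documents from the comment above compare equal once normalized.
puts normalize_markup_whitespace(compact) == normalize_markup_whitespace(indented)
```

A regular expression knows nothing about the DOM, so this only approximates the "same DOM tree" idea; a parser-based formatter trades that limitation for the markup rewriting discussed in this thread.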

hotgazpacho · 2014-04-04T15:04:43Z

So, the reason I ask is that we only get the pretty formatting/indenting if and only if we export as xhtml (nokogiri under the hood). This has effects on the doctype declaration, the head, the script declarations, boolean attributes (per your original issue), etc.

That said, <p>some paragraph<p>another paragraph is valid HTML, but when we export it to XHTML, it becomes <p>some paragraph</p><p>another paragraph</p>. Semantically the same, but different. But, as you mention, if you're concerned about the exact content, you'd probably want the text formatter.

The test I wrote for this feature uses the HTML fragment you originally reported, and the change transforms it to valid HTML.

kytrinyx · 2014-04-04T15:33:29Z

I'd be happy to go with something other than nokogiri if there's some way to get:

normalizing whitespace
no addition/removal of markup

Right now this is all pretty broken (see #7)

markijbema · 2014-04-05T09:33:47Z

@hotgazpacho ah sorry, I didn't notice that. This change seems sensible to me. I'd say the html interpretation is also very valid.

I vote merge :)

hotgazpacho · 2014-04-05T13:13:32Z

So, if the goal is really to freeze some legacy stuff with potentially invalid markup, then normalizing whitespace is not going to be a good idea. In cases with invalid markup, it throws the browser into quirks mode, and whitespace can often be very significant.

Throwing it into nokogiri is also not going to be a good idea in this scenario, because nokogiri is a parser, and it is going to try to make sense of it. Then, when you go to output, it has to transform it back from its internal representation of the nodes into XHTML. Unless you started with valid XHTML, you're pretty much guaranteed to change the output.

I don't know that the goals of normalizing whitespace while not touching the markup can be achieved with this formatter. Nor can they really be achieved in combination with the goal of making a golden master of the existing invalid markup. You'd really have to use the text formatter and a really good visual diff tool in this scenario.

Perhaps a better idea might be to warn the user when we detect invalid markup, and suggest that they switch to the text formatter, while outlining the reasoning for it.
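
A rough sketch of what such a warning could look like, kept independent of the parser so it runs on its own. `invalid_markup_warning` is a hypothetical helper, not something in the approvals gem, and the wording of the message is an assumption. With Nokogiri, the error list would presumably come from the parsed document's `errors` collection; here a plain array of strings stands in for it.

```ruby
# Hypothetical helper, not part of the approvals gem.
# Takes the list of problems a parser reported (any objects that respond
# to to_s) and returns a warning message, or nil when there were none.
def invalid_markup_warning(errors)
  return nil if errors.empty?

  lines = ['Warning: the HTML input contains markup the parser had to repair:']
  errors.each { |error| lines << "  - #{error}" }
  lines << 'The formatted output will not match the input markup exactly.'
  lines << 'Consider the text formatter if the exact markup matters.'
  lines.join("\n")
end

# Stand-in for a parser's error list; the message text is made up.
sample_errors = ['Opening and ending tag mismatch: div and span']

message = invalid_markup_warning(sample_errors)
warn(message) if message # Kernel#warn writes to standard error
```

Returning the message instead of printing it directly leaves the caller free to decide whether to warn once per run, once per approval, or not at all.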

kytrinyx · 2014-04-05T18:11:44Z

I like the idea of warning about invalid markup.

OK, I agree -- the goal for this formatter is (henceforth :-)) as you've described.

It would be lovely to get some documentation about this in the README eventually (there's a lot of "it would be lovely" to be had in the README).

Anyway: Merging. ❤️

kytrinyx · 2014-04-05T18:15:23Z

I've released v0.0.15 with this change in it.

hotgazpacho · 2014-04-05T19:12:57Z

Nice :)

I'll look into adding the warning bits in a new issue and pull request.

Switch to Nokogiri::HTML for html writer

823d770

hotgazpacho reviewed Apr 4, 2014
kytrinyx added a commit that referenced this pull request Apr 5, 2014

Merge pull request #46 from hotgazpacho/html-writer

89a4552

kytrinyx closed this Apr 5, 2014
