Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] Node#to_xhtml does not have self-closing tags when using libxml 2.9.11 and later #2324

Closed
haxxxton opened this issue Sep 6, 2021 · 7 comments

Comments

@haxxxton
Copy link

haxxxton commented Sep 6, 2021

What problem are you trying to solve?

We have been using nokogiri in combination with approvals files in our rspecs. We've just version bumped up to 1.11.4 and found that all our self-closing tags have changed to include closing tags. Is there a configuration option im missing, or is this the way things should be now? (is this changed in a later version is should move to more recently?)

Please show your code!
Note the <col /> to <col></col>

-<table id="invoice-items"><col width="80" /><col width="60" /><col /><col width="100" /><col width="100" /><col width="100" /><thead><tr><th class="service-date">Service date</th>
+<table id="invoice-items"><col width="80"></col><col width="60"></col><col></col><col width="100"></col><col width="100"></col><col width="100"></col><thead><tr><th class="service-date">Service date</th>

Environment

# Nokogiri (1.11.4)
    ---
    warnings: []
    nokogiri:
      version: 1.11.4
      cppflags:
      - "-I/Users/REDACTED/.rbenv/versions/2.5.9/lib/ruby/gems/2.5.0/gems/nokogiri-1.11.4-x86_64-darwin/ext/nokogiri"
      - "-I/Users/REDACTED/.rbenv/versions/2.5.9/lib/ruby/gems/2.5.0/gems/nokogiri-1.11.4-x86_64-darwin/ext/nokogiri/include"
      - "-I/Users/REDACTED/.rbenv/versions/2.5.9/lib/ruby/gems/2.5.0/gems/nokogiri-1.11.4-x86_64-darwin/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 2.5.9
      platform: x86_64-darwin19
      gem_platform: x86_64-darwin-19
      description: ruby 2.5.9p229 (2021-04-05 revision 67939) [x86_64-darwin19]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - 0004-use-glibc-strlen.patch
      - 0005-avoid-isnan-isinf.patch
      - 0006-update-automake-files-for-arm64.patch
      libxml2_path: "/Users/REDACTED/.rbenv/versions/2.5.9/lib/ruby/gems/2.5.0/gems/nokogiri-1.11.4-x86_64-darwin/ext/nokogiri"
      iconv_enabled: true
      compiled: 2.9.12
      loaded: 2.9.12
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-automake-files-for-arm64.patch
      - 0002-Fix-xml2-config-check-in-configure-script.patch
      compiled: 1.1.34
      loaded: 1.1.34
    other_libraries:
      zlib: 1.2.11
      libiconv: '1.15'
@flavorjones
Copy link
Member

flavorjones commented Sep 6, 2021

Hi @haxxxton, thanks for opening this issue, and sorry you're seeing something unexpected.

Without working code, I can't confidently diagnose what you're seeing. However, v1.11.4 upgraded libxml2 from 2.9.10 to 2.9.12 which may have introduced some behavior changes in serialization.

My guess is that you're using #to_xhtml to serialize a document? I can reproduce what you're seeing if that's what I try. After some git-bisecting, commit https://gitlab.gnome.org/GNOME/libxml2/-/commit/dc6f009 looks like it may have affected this behavior.

Because Nokogiri is just a thin wrapper around the underlying parser, there's nothing we can easily to do change this behavior in the gem.

The best thing to do might be to send a bug report upstream and see what the maintainers say -- perhaps this was an unintended side effect of that commit.

@haxxxton
Copy link
Author

haxxxton commented Sep 6, 2021

@flavorjones thanks for the speedy reply, apologies for missing the code block that actually calls nokogiri. You are correct, we are using to_xhtml. I have raised this in their issues section (linking here for cross reference)

Nokogiri::HTML5(data.to_s.strip, Nokogiri::XML::ParseOptions::NOBLANKS).to_xhtml(indent: 2, encoding: 'UTF-8')

As an aside, all the docs and code i can see suggest that i should be passing NOBLANKS as the 4th arg into HTML5, but when i do i get an error:

Nokogiri::HTML5(data.to_s.strip, nil, nil, Nokogiri::XML::ParseOptions::NOBLANKS)

ArgumentError:
       wrong number of arguments (given 4, expected 1..3)

Is this the correct place to be putting NOBLANKS? is my environment potentially still looking at an older version of nokogiri?

@flavorjones
Copy link
Member

@haxxxton Ugh, so I spent some more time looking at this this morning, and the good news is that I have a workaround for you.

First, a bit of explanation and context-setting.

Here's a script that reproduces the issue:

#! /usr/bin/env ruby
require 'nokogiri'
doc = Nokogiri::HTML("<html><body><table><colgroup><col span='2'>")
puts doc.to_xhtml

With Nokogiri v1.10.10 using libxml 2.9.10, the output is:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table>
      <colgroup>
        <col span="2" />
      </colgroup>
    </table>
  </body>
</html>

With Nokogiri v1.12.3 using libxml 2.9.12, the output is:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <body>
    <table>
      <colgroup>
        <col span="2"></col>
      </colgroup>
    </table>
  </body>
</html>

The method Node#to_xhtml calls Node#serialize using specific Nokogiri::XML::SaveOptions suitable for emitting XHTML. You can see these default options at:

https://github.com/sparklemotion/nokogiri/blob/v1.12.4/lib/nokogiri/xml/node/save_options.rb#L37

which says:

        DEFAULT_XHTML = FORMAT | NO_DECLARATION | NO_EMPTY_TAGS | AS_XHTML

The NO_EMPTY_TAGS bit dates back to commit 1d06b4f in 2009 which doesn't have great test coverage and I believe was an error at the time, because ... obviously ... libxml2 did not respect this bit when serializing XHTML.

It appears as though https://gitlab.gnome.org/GNOME/libxml2/-/commit/dc6f009 has fixed that bug and is now respecting that flag when it's passed in. Ugh!

So here's the workaround: pass in your own SaveOptions for now, like this:

#! /usr/bin/env ruby
require 'nokogiri'
save_options = Nokogiri::XML::Node::SaveOptions::DEFAULT_XHTML & ~Nokogiri::XML::Node::SaveOptions::NO_EMPTY_TAGS
doc = Nokogiri::HTML("<html><body><table><colgroup><col span='2'>")
puts doc.to_xhtml(save_with: save_options)

I will get a fix into the next patch release of Nokogiri that omits this flag to restore the old behavior by default.

I'm going to comment on the upstream issue you opened (thank you for that!) letting them know that this is a Nokogiri issue, but giving some of this additional context.

@flavorjones
Copy link
Member

Also, re: your NOBLANKS question, the HTML5.parse method does not support the libxml2 parse options. You might want to use HTML4.parse; if so this section explains how to use the parse options:

https://nokogiri.org/tutorials/parsing_an_html_xml_document.html#parse-options

HTML5 parsing was taken from Nokogumbo and is explained here:

https://nokogiri.org/tutorials/parsing_an_html5_document.html#parsing-options

TL;DR there aren't as many options available in gumbo as there are in libxml2.

@flavorjones flavorjones added this to the v1.12.x patch releases milestone Sep 7, 2021
@flavorjones flavorjones changed the title Upgrade to 1.11.4 changes self-closing tags[help] [bug] Node#to_xhtml does not have self-closing tags when using libxml 2.9.11 and later Sep 7, 2021
@haxxxton
Copy link
Author

haxxxton commented Sep 7, 2021

@flavorjones wow! such a huge amount of investigation! thank you so much, and for providing a work around.

My diffs on my files now look like:

-<table class="genie-table"><col width="80" /><col width="60" /><col /><col width="100" /><thead><tr><th class="service-date">Service date</th>
+<table class="genie-table"><col width="80"/><col width="60"/><col/><col width="100"/><thead><tr><th class="service-date">Service date</th>

(Note the missing space between the self-closing element name). This is absolutely not an issue, but before i go and rerun all my snapshots to use this new format, do you anticipate the patch fix will go back to the style with the space (pre 1.11.4), or without (workaround)?

Regarding the above reply to the parsing options stuff, am i reading this wrong:

Nokogiri.HTML5(html, url = nil, encoding = nil, options = {})
Nokogiri::HTML5.parse(html, url = nil, encoding = nil, options = {})
Nokogiri::HTML5::Document.parse(html, url = nil, encoding = nil, options = {})

These three methods appear to accept 4 args (and, while im not using a valid option), im getting the error that ive providing too many arguments

ArgumentError:
       wrong number of arguments (given 4, expected 1..3)

@flavorjones
Copy link
Member

@haxxxton The trailing space in self-closing tags appears to be gone for good in libxml 2.9.11 and later, I'd suggest adopting that in any tests you have. (As an aside, I've found it's more robust to test semantic aspects of the markup whenever possible, instead of comparing the serialized text string.)

Regarding the HTML5 method signatures (which are a very recent addition to Nokogiri via merger with Nokogumbo, and we do need to spruce up the API docs as well as the tutorial), the final argument must be a Hash of options.

So this won't work:

Nokogiri::HTML5("<div>", nil, nil, 1234)

but these will be fine:

puts Nokogiri::HTML5("<div>", nil, nil, max_errors: 1234)
puts Nokogiri::HTML5("<div>", nil, max_errors: 1234)
puts Nokogiri::HTML5("<div>", max_errors: 1234)

Hope that helps!

flavorjones added a commit that referenced this issue Sep 23, 2021
Commit 1d06b4f introduced NO_EMPTY_TAGS into
SaveOptions::DEFAULT_XHTML which libxml2 ignored due to a
long-standing bug in serialization.

libxml2 v2.9.11 fixed that serialization bug
(https://gitlab.gnome.org/GNOME/libxml2/-/commit/dc6f009) and started
paying attention to the NO_EMPTY_TAGS save option, resulting in seeing
output containing, e.g. `<col></col>` instead of `<col/>`.

This commit updates the default XHTML save options to drop the
NO_EMPTY_TAGS flag, restoring this behavior.

Closes #2324
@flavorjones
Copy link
Member

This is now fixed on main and I expect to ship a patch release in the next few days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants