Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.13.0 XML::Document drops ampersand when parsing in-context fragments #2655

Closed
tylerjc opened this issue Sep 30, 2022 · 6 comments · Fixed by #2656
Closed

v1.13.0 XML::Document drops ampersand when parsing in-context fragments #2655

tylerjc opened this issue Sep 30, 2022 · 6 comments · Fixed by #2656

Comments

@tylerjc
Copy link

tylerjc commented Sep 30, 2022

Please describe the bug

  • https://github.com/sparklemotion/nokogiri/releases/tag/v1.13.0
  • There's some kind of issue/change between these versions that is eating ampersand characters (&). As far as I'm aware, this isn't happening with any other characters, but definitely could be.
  • We had upgraded to v1.13.8 during our Rails upgrade, and that's when we started noticing this issue. We backtracked the versions to figure out when it started happening, and narrowed it down to v1.12.5 last working as expected.

Help us reproduce what you're seeing

image

image

Environment

# Nokogiri (1.13.8)
    ---
    warnings: []
    nokogiri:
      version: 1.13.8
      cppflags:
      - "-I/Users/tylercook/.rvm/gems/ruby-2.6.1/gems/nokogiri-1.13.8-x86_64-darwin/ext/nokogiri"
      - "-I/Users/tylercook/.rvm/gems/ruby-2.6.1/gems/nokogiri-1.13.8-x86_64-darwin/ext/nokogiri/include"
      - "-I/Users/tylercook/.rvm/gems/ruby-2.6.1/gems/nokogiri-1.13.8-x86_64-darwin/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 2.6.1
      platform: x86_64-darwin20
      gem_platform: x86_64-darwin-20
      description: ruby 2.6.1p33 (2019-01-30 revision 66950) [x86_64-darwin20]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - 0004-use-glibc-strlen.patch
      - 0005-avoid-isnan-isinf.patch
      - 0006-update-automake-files-for-arm64.patch
      - '0008-htmlParseComment-handle-abruptly-closed-comments.patch'
      - '0009-allow-wildcard-namespaces.patch'
      libxml2_path: "/Users/tylercook/.rvm/gems/ruby-2.6.1/gems/nokogiri-1.13.8-x86_64-darwin/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.9.14
      loaded: 2.9.14
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-automake-files-for-arm64.patch
      datetime_enabled: true
      compiled: 1.1.35
      loaded: 1.1.35
    other_libraries:
      zlib: 1.2.12
      libiconv: '1.16'
      libgumbo: 1.0.0-nokogiri
@tylerjc tylerjc added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Sep 30, 2022
@tylerjc
Copy link
Author

tylerjc commented Sep 30, 2022

image

@flavorjones
Copy link
Member

flavorjones commented Oct 1, 2022

Hi, @tylerjc! Thanks for asking this question. I spent a little bit of time investigating this afternoon and this is a super interesting issue!

First, let me just ask a favor, which is that in the future please don't use screenshots. It's easier for maintainers to read and reproduce problems if you use regular old text.

characterizing the behavior change

OK, let's do this. Here's the script I'm going to use to explain what's going on here:

#! /usr/bin/env ruby

require "bundler/inline"

ENV["NV"] ||= "1.12.5" # =>

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", "=#{ENV["NV"]}"
end

Nokogiri::VERSION # =>

doc = Nokogiri::XML::Document.parse("<p></p>")
frag = doc.at_css("p").parse("H&M Storage") # =>

doc = Nokogiri::HTML::Document.parse("<p></p>")
frag = doc.at_css("p").parse("H&M Storage") # =>

When I run this with Nokogiri v1.12.5, we see:

#! /usr/bin/env ruby

require "bundler/inline"

ENV["NV"] ||= "1.12.5" # => "1.12.5"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", "=#{ENV["NV"]}"
end

Nokogiri::VERSION # => "1.12.5"

doc = Nokogiri::XML::Document.parse("<p></p>")
frag = doc.at_css("p").parse("H&M Storage") # => [#<Nokogiri::XML::Text:0x384 "H&M Storage">]

doc = Nokogiri::HTML::Document.parse("<p></p>")
frag = doc.at_css("p").parse("H&M Storage") # => [#<Nokogiri::XML::Text:0x398 "H&M Storage">]

when I run it with v1.13.0 we see:

#! /usr/bin/env ruby

require "bundler/inline"

ENV["NV"] ||= "1.12.5" # => "1.13.0"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", "=#{ENV["NV"]}"
end

Nokogiri::VERSION # => "1.13.0"

doc = Nokogiri::XML::Document.parse("<p></p>")
frag = doc.at_css("p").parse("H&M Storage") # => [#<Nokogiri::XML::Text:0x384 "H Storage">]

doc = Nokogiri::HTML::Document.parse("<p></p>")
frag = doc.at_css("p").parse("H&M Storage") # => [#<Nokogiri::XML::Text:0x398 "H&M Storage">]

OK, so take note that v1.12.5 and v1.13.0 behave the same if we are working with a Nokogiri::HTML::Document, but the behavior changes if we are using a Nokogiri::XML::Document.

diagnosis

I git-bisected when this change was introduced, and it is commit 38c2f16 from #2388, which addressed an issue described at #1158. The summary of that change from the changelog is:

Error recovery from in-context parsing (e.g., Node#parse) now always uses the correct DocumentFragment class. Previously Nokogiri::HTML4::DocumentFragment was always used, even for XML documents.

What's happening here is that, in 1.12.5:

  • Node#inner_html parses its argument as an XML document fragment (XML instead of HTML because the original document is XML::Document)
  • initial parsing fails because there is invalid markup (the bare & character) and libxml2 does not recover from errors during "in-context fragment parsing")
  • the fallback behavior at https://github.com/sparklemotion/nokogiri/blob/v1.12.5/lib/nokogiri/xml/node.rb#L839 is to use Nokogiri::HTML::DocumentFragment to perform fragment parsing (using the more robust but less appropriate fragment parsing without a node context)

but in v1.13.0:

  • Node#inner_html still parses its argument as an XML document fragment
  • initial parsing still fails because there is invalid markup
  • but the fallback behavior is to use Nokogiri::XML::DocumentFragment because the original document is an XML::Document

The key bit is that the HTML parser is more forgiving than the XML parser and will autocorrect the bare & to &amp;.

We feel this new behavior is more correct, though I appreciate it breaks in your use case in an unexpected way.

possible workarounds for you

If this is an HTML document, I'd suggest starting with a Nokogiri::HTML::Document instead of a Nokogiri::XML::Document, and the behavior should be what you want.

If this is an XML document, but you truly want to parse HTML fragments, then I'd suggest being explicit about that by using code like this:

doc = Nokogiri::XML::Document.parse("<p></p>")
frag = Nokogiri::HTML::DocumentFragment.parse("H&M Storage")
doc.at_css("p").children = frag.children
doc.to_s # => "<?xml version=\"1.0\"?>\n<p>H&amp;M Storage</p>\n"

thinking about Node#inner_html=

You would be justified in thinking, "well jeez, I used a method called inner_html, why didn't it parse it as HTML", but the awful truth is that Node#inner_html= is just a convenience alias for Node#children= and it was never our intention for those two methods to behave differently.

I'm thinking about whether #inner_html= should always parse HTML. Although I suspect there would be a lot of people who think either way is "more intuitive," it mostly doesn't make sense to me to pretend to provide a way for an XML document to contain HTML content, given that they are different specs with different validity expectations and parsing behaviors. I think we'd end up with a lot of weird inconsistencies in behavior.

So I'm going to leave it the way it is, but I will add some clarifying documentation to that method to make it clear it's a convenience alias for children= and may not actually parse the argument as HTML.

Does all this make sense? Happy to answer questions.

@flavorjones flavorjones added meta/user-help and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Oct 1, 2022
@flavorjones flavorjones changed the title [bug] Upgrading from v1.12.5 to v1.13.0 introduced a character-eating issue. v1.13.0 XML::Document drops ampersand when parsing in-context fragments Oct 1, 2022
@flavorjones
Copy link
Member

I made a couple of small updates to my previous post for clarity.

flavorjones added a commit that referenced this issue Oct 3, 2022
because it's not always parsed as HTML.

Closes #2655

[skip ci]
@flavorjones
Copy link
Member

I've drafted #2656 which clarifies this method in the doc string. Any feedback for me on this? If not I'll merge that in the next day or so and close this. Happy to discuss after that if you'd like!

flavorjones added a commit that referenced this issue Oct 7, 2022
because it's not always parsed as HTML.

Closes #2655

[skip ci]
@flavorjones
Copy link
Member

@tylerjc Does what I said above make sense? Do the docs I've written at #2656 explain the situation more clearly?

@flavorjones
Copy link
Member

OK, closing and merging the docs at #2656. Please let me know if you have any other thoughts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants