v1.13.0 XML::Document drops ampersand when parsing in-context fragments #2655
Hi, @tylerjc! Thanks for asking this question. I spent a little bit of time investigating this afternoon and this is a super interesting issue!

First, let me just ask a favor, which is that in the future please don't use screenshots. It's easier for maintainers to read and reproduce problems if you use regular old text.

characterizing the behavior change

OK, let's do this. Here's the script I'm going to use to explain what's going on here:

#! /usr/bin/env ruby
require "bundler/inline"
ENV["NV"] ||= "1.12.5" # =>
gemfile do
source "https://rubygems.org"
gem "nokogiri", "=#{ENV["NV"]}"
end
Nokogiri::VERSION # =>
doc = Nokogiri::XML::Document.parse("<p></p>")
frag = doc.at_css("p").parse("H&M Storage") # =>
doc = Nokogiri::HTML::Document.parse("<p></p>")
frag = doc.at_css("p").parse("H&M Storage") # =>

When I run this with Nokogiri v1.12.5, we see:

#! /usr/bin/env ruby
require "bundler/inline"
ENV["NV"] ||= "1.12.5" # => "1.12.5"
gemfile do
source "https://rubygems.org"
gem "nokogiri", "=#{ENV["NV"]}"
end
Nokogiri::VERSION # => "1.12.5"
doc = Nokogiri::XML::Document.parse("<p></p>")
frag = doc.at_css("p").parse("H&M Storage") # => [#<Nokogiri::XML::Text:0x384 "H&M Storage">]
doc = Nokogiri::HTML::Document.parse("<p></p>")
frag = doc.at_css("p").parse("H&M Storage") # => [#<Nokogiri::XML::Text:0x398 "H&M Storage">]

When I run it with v1.13.0, we see:

#! /usr/bin/env ruby
require "bundler/inline"
ENV["NV"] ||= "1.12.5" # => "1.13.0"
gemfile do
source "https://rubygems.org"
gem "nokogiri", "=#{ENV["NV"]}"
end
Nokogiri::VERSION # => "1.13.0"
doc = Nokogiri::XML::Document.parse("<p></p>")
frag = doc.at_css("p").parse("H&M Storage") # => [#<Nokogiri::XML::Text:0x384 "H Storage">]
doc = Nokogiri::HTML::Document.parse("<p></p>")
frag = doc.at_css("p").parse("H&M Storage") # => [#<Nokogiri::XML::Text:0x398 "H&M Storage">]

OK, so take note that v1.12.5 and v1.13.0 behave the same if we are working with an HTML document; only the result for the XML document changes.

diagnosis

I git-bisected when this change was introduced, and it is commit 38c2f16 from #2388, which addressed an issue described at #1158. The summary of that change from the changelog is:
What's happening here is that, in 1.12.5:
but in v1.13.0:
The key bit is that the HTML parser is more forgiving than the XML parser and will autocorrect the bare ampersand.

We feel this new behavior is more correct, though I appreciate it breaks in your use case in an unexpected way.

possible workarounds for you

If this is an HTML document, I'd suggest starting with a Nokogiri::HTML::Document.

If this is an XML document, but you truly want to parse HTML fragments, then I'd suggest being explicit about that by using code like this:

doc = Nokogiri::XML::Document.parse("<p></p>")
frag = Nokogiri::HTML::DocumentFragment.parse("H&M Storage")
doc.at_css("p").children = frag.children
doc.to_s # => "<?xml version=\"1.0\"?>\n<p>H&amp;M Storage</p>\n"

thinking about
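One more option worth considering, if the input is plain text rather than markup: escape the bare ampersand before handing the string to the XML parser, so there is nothing for it to autocorrect or drop. The sketch below uses only the Ruby standard library. The helper name `escape_bare_ampersands` and the constant `BARE_AMPERSAND` are hypothetical, not part of Nokogiri, and the regex is a simplification that assumes anything shaped like `&name;`, `&#38;`, or `&#x26;` is an intentional reference.

```ruby
# Matches an "&" that does NOT start something shaped like an entity or
# character reference (e.g. &amp; &#38; &#x26;).
BARE_AMPERSAND = /&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/

# Hypothetical helper (not a Nokogiri API): escape only bare ampersands,
# leaving anything that already looks like a reference untouched.
def escape_bare_ampersands(text)
  text.gsub(BARE_AMPERSAND, "&amp;")
end

escape_bare_ampersands("H&M Storage")     # => "H&amp;M Storage"
escape_bare_ampersands("H&amp;M Storage") # => "H&amp;M Storage"
```

The escaped string could then be passed to `doc.at_css("p").parse(...)` on the XML document in place of the raw one. Note that this does not help with HTML-only named entities, which the XML parser has no definitions for; for input like that, the explicit `Nokogiri::HTML::DocumentFragment` approach above is probably the better fit.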
I made a couple of small updates to my previous post for clarity.
because it's not always parsed as HTML. Closes #2655 [skip ci]
I've drafted #2656 which clarifies this method in the doc string. Any feedback for me on this? If not I'll merge that in the next day or so and close this. Happy to discuss after that if you'd like!
OK, closing and merging the docs at #2656. Please let me know if you have any other thoughts.