Entity reference causes DocumentFragment to lowercase attribute names #961

Closed
bhollis opened this issue Aug 25, 2013 · 3 comments

Comments

@bhollis

bhollis commented Aug 25, 2013

Using Nokogiri 1.6.0 on Ruby 2.0.0p0 (OSX)

require 'nokogiri'

xml = '<foo>
<svg viewBox="0 0 30 16"/>
Hey what?
</foo>'

bad_xml = '<foo>
<svg viewBox="0 0 30 16"/>
Hey &otimes; what?
</foo>'

d = Nokogiri::XML::Document.new
frag = Nokogiri::XML::DocumentFragment.new(d, xml, d)
bad_frag = Nokogiri::XML::DocumentFragment.new(d, bad_xml, d)
puts "GOOD: #{frag.to_s}"
puts "BAD: #{bad_frag.to_s}"

parse = Nokogiri::XML::DocumentFragment.parse(bad_xml)
puts "BONUS: #{parse.to_s}"

This prints:

GOOD: <foo>
<svg viewBox="0 0 30 16"/>
Hey what?
</foo>
BAD: <foo><svg viewbox="0 0 30 16"/>
Hey &#x2297; what?
</foo>
BONUS: <foo>
<svg viewBox="0 0 30 16"/>
Hey  what?
</foo>

What's happening here is that adding the &otimes; reference causes the document fragment to lowercase the viewBox attribute!

As a bonus, if I use the normal DocumentFragment.parse method, the entity is dropped from the output entirely.

It's important to use an "uncommon" entity reference - stuff like &amp; or &rarr; doesn't trigger this bug.

@bhollis
Author

bhollis commented Aug 31, 2013

More weirdness:

bad_xml = '<foo>
<svg viewBox="0 0 30 16"/>
Hey &otimes; what?
</foo>'

bad_xml_with_doctype = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<foo>
<svg viewBox="0 0 30 16"/>
Hey &otimes; what?
</foo>'

puts "With DOCTYPE"
puts Nokogiri::XML(bad_xml_with_doctype).root.to_xml

puts "But not as a Fragment:"
puts Nokogiri::XML(bad_xml_with_doctype).fragment(bad_xml).to_xml

Produces:

With DOCTYPE
<foo>
<svg viewBox="0 0 30 16"/>
Hey &otimes; what?
</foo>
But not as a Fragment:
<foo><svg viewbox="0 0 30 16"/>
Hey &#x2297; what?
</foo>

So even when I have a document parsed with a doctype, creating a fragment from it still lowercases the attribute names!

@stadelmanma

Was this ever determined to be a bug or expected behavior? Needing to add .gsub(/&(?!amp;)/, '&amp;') in odd places doesn't seem like a good fix to me.
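
For readers unfamiliar with the workaround quoted above, here is a minimal sketch of what that regex does to a string before it is handed to the parser. The helper name `escape_bare_ampersands` is hypothetical and not part of Nokogiri or sablon; only the regex itself comes from the comment. Note that the negative lookahead only recognizes `&amp;`, so other entity references such as `&lt;` also get escaped.

```ruby
# Hypothetical helper (not a Nokogiri or sablon API); it wraps the regex
# quoted in the comment above so its behavior can be inspected in isolation.
def escape_bare_ampersands(text)
  # Replace every "&" that is not already the start of "&amp;".
  text.gsub(/&(?!amp;)/, '&amp;')
end

puts escape_bare_ampersands('Hey &otimes; what?')  # Hey &amp;otimes; what?
puts escape_bare_ampersands('a=1&b=2')             # a=1&amp;b=2
puts escape_bare_ampersands('a=1&amp;b=2')         # a=1&amp;b=2 (unchanged)
puts escape_bare_ampersands('1 &lt; 2')            # 1 &amp;lt; 2 (escaped too)
```

Because the entity reference no longer reaches the parser as an entity, the text comes back as the literal characters `&otimes;` rather than the symbol, which is part of why this reads as a workaround and not a fix.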

stadelmanma added a commit to stadelmanma/sablon that referenced this issue Mar 5, 2018
Without this Nokogiri will downcase the attributes for some reason
which I assume must be expected behavior since the issue hasn't been
fixed.
Issue Reference: sparklemotion/nokogiri#961
stadelmanma added a commit to senny/sablon that referenced this issue Mar 7, 2018
This appears to be needed due to a long-standing bug in Nokogiri; see the commits below. All '&' now get replaced with '&amp;'.

* Escape all ampersands in relationship nodes with '&amp;'

Without this Nokogiri will downcase the attributes for some reason
which I assume must be expected behavior since the issue hasn't been
fixed.
Issue Reference: sparklemotion/nokogiri#961

* Update HTML insertion integration test

This test ensures URLs get their ampersands escaped correctly
without interfering with ones that are already escaped.
@flavorjones
Member

This is fixed on main by 38c2f16; see #1158 for a related bug report and additional context. The fix will ship in the next release of Nokogiri, v1.13.0 (sometime in the next few days).
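
If a project still carries the ampersand-escaping workaround, one way to decide whether it can be dropped is to compare the installed version against the release named above. This is a rough sketch that assumes the fix did land in v1.13.0 as stated; the method name `includes_fragment_fix?` is hypothetical. `Gem::Version` comes from RubyGems, and in a real project the argument would typically be `Nokogiri::VERSION`.

```ruby
require 'rubygems'

# Assumption: the fix shipped in 1.13.0, as the comment above anticipates.
FIXED_IN = Gem::Version.new('1.13.0')

# Hypothetical helper: true when the given version string is at or past
# the release expected to contain the fix.
def includes_fragment_fix?(version_string)
  Gem::Version.new(version_string) >= FIXED_IN
end

puts includes_fragment_fix?('1.6.0')   # false
puts includes_fragment_fix?('1.13.0')  # true
```

A version check is only a proxy, so re-running the reproduction script from the original report against the installed gem is the more direct test.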
