element name case changes when added to document with different encoding #1158

Closed
dcorrigan opened this issue Sep 10, 2014 · 3 comments

Comments

@dcorrigan

Hi,

It looks like element names are downcased when the element is added as a string to a document that declares a different encoding:

# encoding: utf-8
require 'nokogiri'

f = <<-EOT
<?xml version="1.0" encoding="us-ascii"?>
<root>
  <p>foo</p>
</root>
EOT
r = Nokogiri.XML(f)
# Replace <p> with a mixed-case element that contains a non-ASCII character
r.at_css('p').replace('<yYy>©bar</yYy>')
puts r.to_s

This produces:

<?xml version="1.0" encoding="us-ascii"?>
<root>
  <yyy>&#169;bar</yyy>
</root>

I tested this on nokogiri 1.6.3.1, libxml2 2.8.0. Is this a libxml2 problem, maybe?

@knu
Member

knu commented Sep 16, 2014

It seems that replace(Nokogiri::XML::DocumentFragment.parse('<yYy>©bar</yYy>')) works around the problem. node.replace(string) internally calls node.fragment(string), so that is the place to start investigating.

@dcorrigan
Author

Thanks, @knu. I'll use the work-around for now.

flavorjones added a commit that referenced this issue Dec 16, 2021
Introduce ClassResolver which will do the class lookup correctly, and
use it for Builder and for fragment parsing.

Closes #1158
@flavorjones
Member

Apologies for not replying for such an embarrassingly long time.

Fragment parsing is a bit complicated because libxml2 doesn't recover from errors while parsing a fragment in the context of a specific node (e.g., node.parse(...) as opposed to Nokogiri::XML::DocumentFragment.parse(...)).

When an error (like the encoding error here, or any other syntax error) is encountered, Nokogiri's fallback behavior is to parse the fragment outside the context of the node. Unfortunately, though, it was using a hardcoded class to do this: Nokogiri::HTML4::DocumentFragment. This uses the HTML parser and causes all elements to be downcased.

See #2388 for the fix.
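
To illustrate the idea behind the ClassResolver named in the commit message, here is a rough, self-contained sketch of looking up a sibling class in the caller's own namespace instead of hardcoding one. This is not Nokogiri's actual code: the Demo module, its classes, and this related_class method are hypothetical stand-ins written only for this example.

```ruby
# Hypothetical stand-ins for a library with parallel XML and HTML class trees.
module Demo
  module XML
    class DocumentFragment; end

    class Node
      # Walk up the ancestry and return the first class named class_name
      # that lives in the same namespace as one of the ancestors.
      def related_class(class_name)
        klass = self.class
        while klass && klass.name
          namespace = klass.name.split('::')[0..-2].join('::')
          unless namespace.empty?
            mod = Object.const_get(namespace)
            return mod.const_get(class_name, false) if mod.const_defined?(class_name, false)
          end
          klass = klass.superclass
        end
        nil
      end
    end
  end

  module HTML
    class DocumentFragment; end
    class Node < Demo::XML::Node; end
  end
end

xml_node = Demo::XML::Node.new
html_node = Demo::HTML::Node.new

puts xml_node.related_class('DocumentFragment')   # Demo::XML::DocumentFragment
puts html_node.related_class('DocumentFragment')  # Demo::HTML::DocumentFragment
```

With a lookup of this kind, an XML node that falls back to out-of-context parsing picks the XML fragment class rather than an HTML one, so element names are not downcased.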
