Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nokogiri::XML::Node#add_child_node_and_reparent_attrs behaves incorrectly if an attribute name has a colon #1790

Open
stevecheckoway opened this issue Aug 22, 2018 · 4 comments

Comments

@stevecheckoway
Copy link
Contributor

What problems are you experiencing?
Nokogiri::XML::Node#add_child_node_and_reparent_attrs uses an incorrect test to see if a node's namespaces need to be reparented (at least, I think that's the purpose of this code).

It's testing a.name =~ /:/ to decide to reparent. This breaks HTML elements which are allowed to have colons in their attribute names. (Essentially, HTML only allows foreign elements to have explicit namespaces.)

It's a little convoluted to demonstrate the problem using the Nokogiri API.

require 'nokogiri'

doc = Nokogiri::HTML::Document.new
html = Nokogiri::XML::Element.new('html', doc)
html['a'] = 'en'
attr = html.attribute('a')
attr.name = 'xml:lang'

puts(html.attribute_nodes[0].inspect)
doc.add_child(html)
puts(html.attribute_nodes[0].inspect)

This prints

#<Nokogiri::XML::Attr:0x3ff75a84b6d8 name="xml:lang" value="en">
#<Nokogiri::XML::Attr:0x3ff75a84b340 name="lang" namespace=#<Nokogiri::XML::Namespace:0x3ff75a84b2dc prefix="xml" href="http://www.w3.org/XML/1998/namespace"> value="en">

I note that using doc.root = html doesn't do this reparenting.

Here's a backwards compatible fix.

require 'nokogiri'

module Nokogiri
  module XML
    class Node
      # HTML elements can have attributes that contain colons.
      # Nokogiri::XML::Node#[]= treats names with colons as a prefixed QName
      # and tries to create an attribute in a namespace. This is especially
      # annoying with attribute names like xml:lang since libxml2 will
      # actually create the xml namespace if it doesn't exist already.
      def add_child_node_and_reparent_attrs node
        add_child_node(node)
        node.attribute_nodes.find_all { |a| a.namespace }.each do |attr|
          attr.remove
          node[attr.name] = attr.value
        end
      end
    end
  end
end

doc = Nokogiri::HTML::Document.new
html = Nokogiri::XML::Element.new('html', doc)
html['a'] = 'en'
attr = html.attribute('a')
attr.name = 'xml:lang'

puts(html.attribute_nodes[0].inspect)
doc.add_child(html)
puts(html.attribute_nodes[0].inspect)

This prints

#<Nokogiri::XML::Attr:0x3fdb7a09b5b0 name="xml:lang" value="en">
#<Nokogiri::XML::Attr:0x3fdb7a09b5b0 name="xml:lang" value="en">

Ideally, an API for manipulating attributes in a given namespace that doesn't do the parsing based on colon (e.g., uses xmlNewProp rather than xmlSetProp) would be fantastic.

What's the output from nokogiri -v?

# Nokogiri (1.8.4)
    ---
    warnings: []
    nokogiri: 1.8.4
    ruby:
      version: 2.4.4
      platform: x86_64-darwin17
      description: ruby 2.4.4p296 (2018-03-28 revision 63013) [x86_64-darwin17]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/Users/steve/programming/nokogumbo/vendor/bundle/ruby/2.4.0/gems/nokogiri-1.8.4/ports/x86_64-apple-darwin17.4.0/libxml2/2.9.8"
      libxslt_path: "/Users/steve/programming/nokogumbo/vendor/bundle/ruby/2.4.0/gems/nokogiri-1.8.4/ports/x86_64-apple-darwin17.4.0/libxslt/1.1.32"
      libxml2_patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      libxslt_patches: []
      compiled: 2.9.8
      loaded: 2.9.8

Can you provide a self-contained script that reproduces what you're seeing?

@flavorjones
Copy link
Member

Thank you for submitting this! I'm just back from vacation and it will take a few days to catch up on everything. Thanks for your patience.

@stevecheckoway
Copy link
Contributor Author

In retrospect, I'm not sure what this function is supposed to do at all. If I'm reading xml_node.c correctly (and I'm not at all sure that I am since I'm having a separate issue), all of the namespace reparenting should be happening in relink_namespace.

@flavorjones
Copy link
Member

Hey, @stevecheckoway! Finally getting around to looking at this.

all of the namespace reparenting should be happening in relink_namespace

Yes, you're right. This method originally came from #870 which was a fix for #869; but the actual root cause was that relink_namespaces was bailing early -- skipping attributes -- if the element didn't have a namespace itself.

I just wrote #2310 which fixes the underlying implementation of relink_namespaces, but now we have a different issue, which is: there don't appear to be any/many tests of the HTML5::Node behavior implemented in 7b3c972. Hang on for one sec on that point ...

Side note: this refactor became much easier after #1712 was fixed by #2246 and #2228 was fixed by #2230, because the reparenting/namespacing behavior was much less buggy.

I'm not sure what this function is supposed to do at all

Again, this method came from #870 which was a fix for #869. Specifically, it's intended to handle cases like this:

Nokogiri::XML::Builder.new do |xml|
  xml.root('xmlns:foo' => 'bar') {
    xml.obj('foo:attr' => 'baz')
  }
end

We're trying to resolve this ambiguity:

  • is the attribute named foo:attr
  • or is it named attr and it's in the namespace identified by the prefix foo?

If we were re-designing the Builder API, we could probably allow for explicitly setting the namespace for that attribute. However, there's an analogous use case that doesn't use Builder:

        doc1 = Nokogiri::XML('<root/>')
        doc2 = Nokogiri::XML('<root xmlns:foo="http://foo.io"/>')
        node = doc1.create_element('obj', 'foo:attr' => 'baz')
        # at this point, the attribute is named `foo:attr`
        # #<Nokogiri::XML::Element:0xba4 name="obj" attributes=[#<Nokogiri::XML::Attr:0xb90 name="foo:attr" value="baz">]>
        doc2.root.add_child(node)
        # but now the attribute is `attr` with a namespace
        # #<Nokogiri::XML::Element:0xba4 name="obj" attributes=[#<Nokogiri::XML::Attr:0xbcc name="attr" namespace=#<Nokogiri::XML::Namespace:0xbb8 prefix="foo" href="http://foo.io"> value="baz">]>

Your suggested fix breaks this "namespacing" process for the attribute in question.

Normally this is not an issue if the implied prefix doesn't correspond to a namespace in the document -- for example, using foo:lang instead of xml:lang correctly creates this attribute:

#<Nokogiri::XML::Attr:0x550 name="foo:lang" value="en">

but unfortunately in an XML document the xml prefix is reserved and will always resolve.

@stevecheckoway WDYT of an alternative approach: what if we skipped relink_namespace in HTML4+5 documents? I'm going to start a new branch where I try to write a test for 7b3c972

@flavorjones
Copy link
Member

See #2310 for what I think is a good fix for all of this that in fewer lines of code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants