Skip to content
This repository has been archived by the owner on Aug 26, 2023. It is now read-only.

coerce(str) results in Text node rather than html #160

Closed
dan42 opened this issue Nov 27, 2020 · 8 comments
Closed

coerce(str) results in Text node rather than html #160

dan42 opened this issue Nov 27, 2020 · 8 comments

Comments

@dan42
Copy link

dan42 commented Nov 27, 2020

I ran into a surprise when trying to insert content into a document using element.before("<div/>")
For regular elements it works, but if it's a script element, "<div/>" is escaped and inserted as text.
Apparently due to coerce behaving weirdly for this case?

Nokogiri.HTML('<script>').at('//script').send(:coerce, '<div/>')
#=> [#<Nokogiri::XML::Element:0x2adb350529dc name="div">]

Nokogiri.HTML5('<script>').at('//script').send(:coerce, '<div/>')
#=> [#<Nokogiri::XML::Text:0x2adb3505bb2c "<div>">]
Nokogiri.HTML5('<span>').at('//span').send(:coerce, '<div/>')
#=> [#<Nokogiri::XML::Element:0x2adb3505b014 name="div">]
@stevecheckoway
Copy link
Collaborator

I think the source of this bug is in Nokogiri but not in a way that matters for Nokogiri.

What's happening is that when #before is called, it ends up calling #add_sibling which calls #coerce, as you've discovered. #coerce sees that the input is a string and then calls #fragment on it.

Here's the documentation for #fragment.

      ###
      # Create a DocumentFragment containing +tags+ that is relative to _this_
      # context node.

And this explains what's happening. self is a script element. When Nokogumbo parses an HTML fragment with a script element as the context element, it does what the standard says to do in this case, namely to start tokenizing in the script data state. Inside a script element, <div/> has no special meaning, so it returns a text node.

I think what should happen is the parent element's #coerce should be called rather than the current element. That is the #add_sibling should be changed from

      def add_sibling(next_or_previous, node_or_tags)
        impl = (next_or_previous == :next) ? :add_next_sibling_node : :add_previous_sibling_node
        iter = (next_or_previous == :next) ? :reverse_each : :each

        node_or_tags = coerce node_or_tags
        ...

to

      def add_sibling(next_or_previous, node_or_tags)
        impl = (next_or_previous == :next) ? :add_next_sibling_node : :add_previous_sibling_node
        iter = (next_or_previous == :next) ? :reverse_each : :each

        node_or_tags = parent.coerce node_or_tags
        ...

I verified that passing the parent element as the context element for the DocumentFragment works in this case.

irb(main):001:0> doc = Nokogiri.HTML5('<script>')
=> #<Nokogiri::HTML5::Document:0x4c4 name="document" children=[#<Nokogiri::XML::Element:0x4b0 name="html" children=[#<Nokogiri::XML::Element:0x488 name="he...
irb(main):002:0> doc.at('//script').before('<div/>')
=> #<Nokogiri::XML::Element:0x474 name="script">
irb(main):003:0> doc.to_s
=> "<html><head><div></div><script></script></head><body></body></html>"
irb(main):004:0> doc
=> #<Nokogiri::HTML5::Document:0x4c4 name="document" children=[#<Nokogiri::XML::Element:0x4b0 name="html" children=[#<Nokogiri::XML::Element:0x488 name="head" children=[#<Nokogiri::XML::Element:0x4d8 name="div">, #<Nokogiri::XML::Element:0x474 name="script">]>, #<Nokogiri::XML::Element:0x49c name="body">]>]>

but I did that by modifying Nokogiri::HTML5::Node#fragment which is going to be incorrect in general.

I'm not sure why this doesn't affect Nokogiri. Maybe the rules are different for earlier HTML or maybe there's some bug in xmlParseInNodeContext.

@flavorjones Do you have any thoughts on this? Should I create a PR with this change for Nokogiri? I'm not really sure how I would test it though.

@flavorjones
Copy link
Collaborator

Hey, @stevecheckoway. I agree with your take, that Nokogiri should be calling #coerce on the parent node when adding siblings.

I'm looking at how to test this by examining what's going wrong. You're right that xmlParseInNodeContext does not differentiate between a script tag and any other element when it parses in-context, which is why we haven't caught this bug before now.

I'm going to spike on a testing approach that will assert that the method is called on the parent, and see how I feel about it.

@flavorjones
Copy link
Collaborator

@stevecheckoway PR sparklemotion/nokogiri#2116 is making its way through CI now. If it goes green, and I think it will, it'll be in v1.11.0.rc4, hopefully within the next week.

@stevecheckoway
Copy link
Collaborator

@flavorjones Great! Thanks so much for looking into this and getting it fixed so quickly.

@flavorjones
Copy link
Collaborator

flavorjones commented Nov 29, 2020

@stevecheckoway I'm going to spend some time today getting nokogiri master in shape to integration-test this with nokogumbo. I'm going to reopen this until I can do that, if that's OK?

EDIT: the work I'm referring to is at sparklemotion/nokogiri#1788

@flavorjones flavorjones reopened this Nov 29, 2020
@stevecheckoway
Copy link
Collaborator

By all means!

@dan42
Copy link
Author

dan42 commented Nov 30, 2020

Thanks to everyone for the very fast fix!

@flavorjones
Copy link
Collaborator

OK, I'm going to close this since Nokogiri master is once again compatible with Nokogumbo, and I've also got a draft PR at #163 to further improve the integration. Thanks all for your patience!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants