Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] HTML5 foreign element namespaces should not be required by CSS queries #2376

Closed
danblaker opened this issue Nov 25, 2021 · 21 comments
Closed

Comments

@danblaker
Copy link

After upgrading to Nokogiri v1.12.0, I am unable to target inline svg elements via css or xpath selectors. Sample console commands:

html = <<-HTML.strip_heredoc
<!DOCTYPE html>
<html>
<head>
     <meta charset="utf-8">
     <meta name="viewport" content="width=device-width, initial-scale=1">
     <title>svg test</title>
</head>
<body>
  <div id="svg-container">
    <svg version="1.1" width="300" height="200" xmlns="http://www.w3.org/2000/svg">

      <rect width="100%" height="100%" fill="red" />

      <circle cx="150" cy="100" r="80" fill="green" />

      <text x="150" y="125" font-size="60" text-anchor="middle" fill="white">SVG</text>
    </svg>
  </div>
</body>
</html>
HTML

doc = Nokogiri::HTML5.parse(html)
doc.at('svg') # nil
doc.at_css('svg') # nil
doc.at_xpath('//svg') # nil

Expected behavior

Using the previous versions of Nokogiri (v 1.10.10) + nokogumbo (1.5.0), I was able to target SVG elements without any problems.

Environment

# Nokogiri (1.12.5)
    ---
    warnings: []
    nokogiri:
      version: 1.12.5
      cppflags:
      - "-I/usr/local/bundle/gems/nokogiri-1.12.5/ext/nokogiri"
      - "-I/usr/local/bundle/gems/nokogiri-1.12.5/ext/nokogiri/include"
      - "-I/usr/local/bundle/gems/nokogiri-1.12.5/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 2.6.8
      platform: aarch64-linux-musl
      gem_platform: aarch64-linux
      description: ruby 2.6.8p205 (2021-07-07 revision 67951) [aarch64-linux-musl]
      engine: ruby
    libxml:
      source: packaged
      precompiled: false
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - 0004-use-glibc-strlen.patch
      - 0005-avoid-isnan-isinf.patch
      - 0006-update-automake-files-for-arm64.patch
      - 0007-Fix-XPath-recursion-limit.patch
      libxml2_path: "/usr/local/bundle/gems/nokogiri-1.12.5/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.9.12
      loaded: 2.9.12
    libxslt:
      source: packaged
      precompiled: false
      patches:
      - 0001-update-automake-files-for-arm64.patch
      - 0002-Fix-xml2-config-check-in-configure-script.patch
      datetime_enabled: true
      compiled: 1.1.34
      loaded: 1.1.34
    other_libraries:
      libgumbo: 1.0.0-nokogiri
@danblaker danblaker added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Nov 25, 2021
@flavorjones
Copy link
Member

@danblaker Thanks for opening this issue, and sorry you're having problems.

The behavior you're describing here seems to have changed in Nokogumbo at some point between v1.5.0 and v2.0.3 (note that Nokogiri imported Nokogumbo v2.0.5).

I wasn't easily able to find the specific commit that changed this behavior because it was challenging to build old versions of Nokogumbo. (Maybe @stevecheckoway knows?)

In any case, the document you've provided, when parsed by libgumbo, has a namespace of http://www.w3.org/2000/svg and so you can work around this by calling:

doc.at_css('svg', "xmlns" => "http://www.w3.org/2000/svg")
doc.at_xpath('//xmlns:svg', "xmlns" => "http://www.w3.org/2000/svg")

@flavorjones flavorjones added meta/user-help state/will-close topic/gumbo Gumbo HTML5 parser topic/HTML5 and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Nov 26, 2021
@pvande
Copy link

pvande commented Nov 29, 2021

@flavorjones From what I've been able to dig up, this likely changed in Nokogumbo v2.0.0 as a part of their support for the embedded HTML5 namespaces.

While the proposed workaround is functional, I'm still uncertain whether it's desirable. HTML5 has two documented syntaxes: HTML and XML. The XML syntax is not particularly difficult to handle overall, since it's an XML document at its core. The HTML syntax (which is presumably what's handled by Nokogumbo) is different in how it is parsed, processed, and serialized.

Part of what's specified in HTML syntax is that <math> and <svg> tags name foreign elements that are validated according to an external namespace. However, referring to that namespace (beyond an optional XML-compatible xmlns declaration) is forbidden. In general, agents processing this content — and more relevantly, users querying this content — are expected to behave as though the namespaces don't exist (i.e. these elements all coexist in a single namespace).

So I suppose the "correct" resolution to this issue depends on what Nokogiri's goals are. If Nokogiri is functionally intended to be an XML manipulation tool (with support for HTML -> XML -> HTML processing), the described solution above accurately describes the XML representation of the problem space. If Nokogiri is intended to be a DOM manipulation tool, however, it would seem preferable to adhere to the conventions of the document type.

@stevecheckoway
Copy link
Contributor

stevecheckoway commented Nov 29, 2021

This is almost certainly my fault. Namespaces on elements in HTML is tricky.

Let me just point out as an aside that the HTML spec doesn't really define how to write HTML documents in XML. It says (emphasis mine),

An XML parser, for the purposes of this specification, is a construct that follows the rules given in XML to map a string of bytes or characters into a Document object. At the time of writing, no such rules actually exist.

I completely agree that querying by CSS selector should definitely work as you expect. I.e., it should follow the rules for CSS.

I have no idea what the "correct" behavior for XPATH querying should be though. HTML documents use a modified XPATH 1.0. This change is incompatible with XPATH 1.0. (There are similar changes to XSLT.) This suggests that we'd need different behavior depending on whether the document is an XML document or an HTML document.

One idea (and I think the original Nokogumbo implementation, until I changed it), is to not have foreign nodes live in a namespace. This violates the HTML standard and thus I thought it important to not do that.

I took your example document and opened it in Firefox and ran

document.querySelector("rect").namespaceURI

in the development console. As expected, document.querySelector("rect") returned the rect element and its namespace is http://www.w3.org/2000/svg.

I converted your example to an XHTML document.

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
     <meta charset="utf-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1" />
     <title>svg test</title>
</head>
<body>
  <div id="svg-container">
    <svg version="1.1" width="300" height="200" xmlns="http://www.w3.org/2000/svg">

      <rect width="100%" height="100%" fill="red" />

      <circle cx="150" cy="100" r="80" fill="green" />

      <text x="150" y="125" font-size="60" text-anchor="middle" fill="white">SVG</text>
    </svg>
  </div>
</body>
</html>

With both the HTML syntax and the XHTML syntax, the following code has the same behavior.

document.querySelector("rect") // returns the rect
document.evaluate('//rect', document.body).iterateNext() // returns null
document.evaluate('//svg:rect', document.body, (prefix) => prefix === 'svg' && "http://www.w3.org/2000/svg").iterateNext() // returns the rect

So the first and third return the rect and the second returns null.

I think that suggests the correct way forward here. When operating on an HTML document (which XHTML is, as confirmed by document instanceof HTMLDocument and document instanceof XMLDocument in the browser),

  1. the CSS query should work without namespaces; and
  2. the XPATH query should require namespaces on foreign elements.

I'm not familiar with how Nokogiri implements at_css() and at_xpath(), although I assume that at least the XPATH one is handled by libxml2.

@flavorjones Any idea how difficult it would be to change at_css() to do the right thing here?

@flavorjones
Copy link
Member

👋 Thanks for jumping in here @stevecheckoway.

  1. the CSS query should work without namespaces; and
  2. the XPATH query should require namespaces on foreign elements.

Your path of reasoning leading to this suggestion feels very reasonable to me, and I can easily get behind this as the "right" behavior. I'm curious to head what @pvande and @danblaker think!

I'm not familiar with how Nokogiri implements at_css() and at_xpath(), although I assume that at least the XPATH one is handled by libxml2

CSS queries are converted into XPath queries by Nokogiri::CSS::Parser, so everything gets run through libxml2's xpath evaluator.

Any idea how difficult it would be to change at_css() to do the right thing here?

I need to explore a bit to figure out the best option, but we do have the ability to modify (or augment) the compilation process for CSS queries. Probably this is the best bet: to compile CSS queries into XPath that ignores the namespace of the node and selects only on the bare node name.

@flavorjones flavorjones changed the title [bug] inline SVG elements are ignored in the latest Nokogiri version [bug] HTML5 foreign element namespaces should not be required by CSS queries Nov 29, 2021
@danblaker
Copy link
Author

I agree with the approach @stevecheckoway describes. If Nokogiri's selector methods on an HTML5 doc return identical results compared to selectors in the Firefox console for the same doc, I think that would meet the expectation of any user trying to parse HTML5 with Nokogiri.

@pvande
Copy link

pvande commented Nov 29, 2021

@danblaker's said it best — my expectation is parity with browsers. Thanks for digging in here, @stevecheckoway!

@flavorjones
Copy link
Member

OK, I'll play around with it a bit this week.

In the meantime, you may want to try to emulate this mode by calling Document#remove_namespaces!. Some documentation for this method is also available in the "searching" tutorial.

Note that this approach affects both CSS and XPath queries because it's nuked all the namespace declarations and references from the document:

# before
doc.at_css('svg', "xmlns" => "http://www.w3.org/2000/svg")
doc.at_xpath('//xmlns:svg', "xmlns" => "http://www.w3.org/2000/svg")

doc.remove_namespaces!

# after
doc.at_css('svg')
doc.at_xpath('//svg')

@flavorjones
Copy link
Member

Would love a 👍 or a clarifying comment on these tests, which is what I'm going to try to make pass on my feature branch:

      describe "HTML5 foreign elements" do
        let(:html) { <<~HTML }
          <!DOCTYPE html>
          <html>
            <head>
               <meta charset="utf-8">
               <meta name="viewport" content="width=device-width, initial-scale=1">
               <title>svg test</title>
            </head>
            <body>
              <div id="svg-container">
                <svg version="1.1" width="300" height="200" xmlns="http://www.w3.org/2000/svg">
                  <rect width="100%" height="100%" fill="red" />
                  <circle cx="150" cy="100" r="80" fill="green" />
                  <text x="150" y="125" font-size="60" text-anchor="middle" fill="white">SVG</text>
                </svg>
              </div>
            </body>
          </html>
        HTML

        let(:ns) { { "xmlns" => "http://www.w3.org/2000/svg" } }

        describe "in an XML doc" do
          let(:doc) { Nokogiri::XML::Document.parse(html) }

          it "requires namespace in XPath queries" do
            assert_empty(doc.xpath("//svg"))
            refute_empty(doc.xpath("//xmlns:svg", ns))
          end

          it "requires namespace in CSS queries" do
            assert_empty(doc.css("svg"))
            refute_empty(doc.css("xmlns|svg", ns))
          end
        end

        describe "in an HTML4 doc" do
          let(:doc) { Nokogiri::HTML4::Document.parse(html) }

          it "omits namespace in XPath queries" do
            refute_empty(doc.xpath("//svg"))
            assert_empty(doc.xpath("//xmlns:svg", ns))
          end

          it "omits namespace in CSS queries" do
            refute_empty(doc.css("svg"))
            assert_empty(doc.css("xmlns|svg", ns))
          end
        end

        describe "in an HTML5 doc" do
          let(:doc) { Nokogiri::HTML5::Document.parse(html) }

          it "requires namespace in XPath queries" do
            assert_empty(doc.xpath("//svg"))
            refute_empty(doc.xpath("//xmlns:svg", ns))
          end

          it "omits namespace in CSS queries" do
            refute_empty(doc.css("svg"))
            assert_empty(doc.css("xmlns|svg", ns))
          end
        end
      end

Note that the only failing test right now is HTML5 foreign elements::in an HTML5 doc#test_0002_omits namespace in CSS queries.

@pvande
Copy link

pvande commented Nov 30, 2021

@flavorjones My only caution about this change is that I don't actually know what the correct HTML4 behavior should be. I suspect your proposal is correct, but IIRC svg didn't become an implicitly recognized foreign element until HTML5…

Regardless, the behavior may match the expectation.

@stevecheckoway
Copy link
Contributor

I'm 👍 but I don't think xmlns should be used as the prefix. It's a reserved prefix.

@flavorjones
Copy link
Member

@stevecheckoway Sorry, do you mean the use of "xmlns" in the test? If so, I'm happy to change because in the context of the test it's an arbitrary prefix (and supported by the underlying libraries despite its reserved status).

@flavorjones
Copy link
Member

i.e., this change is what you're proposing?

diff --git a/test/xml/test_xpath.rb b/test/xml/test_xpath.rb
index 8a898f9..3a185b7 100644
--- a/test/xml/test_xpath.rb
+++ b/test/xml/test_xpath.rb
@@ -628,19 +628,19 @@ def collision(nodes)
           </html>
         HTML
 
-        let(:ns) { { "xmlns" => "http://www.w3.org/2000/svg" } }
+        let(:ns) { { "nsfoo" => "http://www.w3.org/2000/svg" } }
 
         describe "in an XML doc" do
           let(:doc) { Nokogiri::XML::Document.parse(html) }
 
           it "requires namespace in XPath queries" do
             assert_empty(doc.xpath("//svg"))
-            refute_empty(doc.xpath("//xmlns:svg", ns))
+            refute_empty(doc.xpath("//nsfoo:svg", ns))
           end
 
           it "requires namespace in CSS queries" do
             assert_empty(doc.css("svg"))
-            refute_empty(doc.css("xmlns|svg", ns))
+            refute_empty(doc.css("nsfoo|svg", ns))
           end
         end
 
@@ -649,12 +649,12 @@ def collision(nodes)
 
           it "omits namespace in XPath queries" do
             refute_empty(doc.xpath("//svg"))
-            assert_empty(doc.xpath("//xmlns:svg", ns))
+            assert_empty(doc.xpath("//nsfoo:svg", ns))
           end
 
           it "omits namespace in CSS queries" do
             refute_empty(doc.css("svg"))
-            assert_empty(doc.css("xmlns|svg", ns))
+            assert_empty(doc.css("nsfoo|svg", ns))
           end
         end
 
@@ -663,12 +663,12 @@ def collision(nodes)
 
           it "requires namespace in XPath queries" do
             assert_empty(doc.xpath("//svg"))
-            refute_empty(doc.xpath("//xmlns:svg", ns))
+            refute_empty(doc.xpath("//nsfoo:svg", ns))
           end
 
           it "omits namespace in CSS queries" do
             refute_empty(doc.css("svg"))
-            assert_empty(doc.css("xmlns|svg", ns))
+            assert_empty(doc.css("nsfoo|svg", ns))
           end
         end
       end

@stevecheckoway
Copy link
Contributor

Yeah, that's precisely what I was suggesting.

@flavorjones
Copy link
Member

flavorjones commented Dec 13, 2021

OK, I've got a branch that allows the XPathVisitor class (which is doing the translation from CSS selector to XPath query) to know whether it's operating on an HTML5 document. I cleaned up some rough edges along the way and feel pretty good about that code.

However, now I'm faced with the decision about what XPath to use. I believe the right choice is to use local-name(), but the benchmarks are Not Good™:

#! /usr/bin/env ruby

require "nokogiri"
require "benchmark/ips"

doc = Nokogiri::HTML5::Document.parse(File.read("test/files/tlm.html"))

q_element_name = "//span"
q_local_name = "//*[local-name()='span']"

raise("queries are not equivalent") unless doc.xpath(q_local_name).length == doc.xpath(q_element_name).length

Benchmark.ips do |x|
  x.report("element name") do
    doc.xpath(q_element_name).length
  end  
  x.report("local name") do
    doc.xpath(q_local_name).length
  end  
  x.compare!
end
Warming up --------------------------------------
        element name     2.057k i/100ms
          local name   191.000  i/100ms
Calculating -------------------------------------
        element name     18.912k (± 6.7%) i/s -     94.622k in   5.026911s
          local name      1.848k (± 3.3%) i/s -      9.359k in   5.069819s

Comparison:
        element name:    18911.5 i/s
          local name:     1848.1 i/s - 10.23x  (± 0.00) slower

That is, the local-name() implementation is 10x slower in libxml2 than the plain alternative. (On JRuby it's less bad but still 1.8x slower.)

I may play with a builtin function implemented in C to speed this up; but if anyone has other ideas please comment!

@flavorjones
Copy link
Member

Sigh, libxml2 already has this implemented pretty efficiently as xmlXPathLocalNameFunction, I don't think we can reasonably do better in a nokogiri builtin function.

Is everybody OK taking a 10x performance hit on CSS queries in HTML5 docs? If not, what alternatives am I not thinking of?

@stevecheckoway
Copy link
Contributor

Ouch! 10x performance hit is pretty significant. Any idea what makes it so much slower?

@flavorjones
Copy link
Member

Plain element name matching is inlined as part of xpath evaluation:

element-name

But calling an xpath function introduces that function call overhead, as well as internal stack management overhead; plus unnecessary string allocations are being made; and all that overhead is invoked for every element in the document:

local-name

Looking at these graphs, though, the memory allocation component seems to dominate, and I bet that we could avoid that with a builtin that did the comparison without returning strings (similar to what I did with the builtin nokogiri:css-class function in #2137). Let me see what I can do with that approach.

@flavorjones
Copy link
Member

Following up: writing a C builtin improved things slightly, from ~10x slower to ~6x slower. The xpath function overhead and memory allocation is still pretty noticeable:

builtin

So then I dug into libxml2 and was able to hack in limited namespace wildcarding, which is almost as fast as a standard child-axis search, about 1.1x slower. Here's the stack analysis for posterity:

wildcard

and here's the benchmark:

      //span
           18.923k (± 8.9%) i/s -     93.906k in   5.010792s
      //*[local-name()='span']
            1.849k (± 2.8%) i/s -      9.261k in   5.011560s
      //*[nokogiri-builtin:local-name-is('span')]
            3.191k (± 2.4%) i/s -     16.150k in   5.064798s
      //*:span
           18.016k (± 4.6%) i/s -     89.900k in   5.003444s
    
      Comparison:
      //span:
          18922.5 i/s
      //*:span:
          18016.5 i/s - same-ish: difference falls within error
      //*[nokogiri-builtin:local-name-is('span')]:
           3190.6 i/s - 5.93x  (± 0.00) slower
      //*[local-name()='span']:
           1849.4 i/s - 10.23x  (± 0.00) slower

@flavorjones
Copy link
Member

So I'm about to submit a PR that I'm hoping to get into the next release that will use:

  • local-name() as a fallback if we need to use standard XPath 1.0
  • nokogiri-builtin:local-name-is() in libxml2 if it's using system libraries
  • the patched libxml2 wildcard support in libxml2 if it's using packaged libraries

@flavorjones
Copy link
Member

PR is #2403

@flavorjones
Copy link
Member

PR has been merged, this will be fixed in v1.13.0 (release is imminent)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants