Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML5 documents should not require namespaces in CSS selector queries #2403

Merged
merged 13 commits into from Jan 4, 2022

Conversation

flavorjones
Copy link
Member

What problem is this PR intended to solve?

Discussion at #2376 led to the decision to continue to parse HTML5 foreign elements with namespaces, but not to require that namespace in CSS queries.

So, in the canonical example:

<!DOCTYPE html>
<html>
  <body>
    <div id="svg-container">
      <svg version="1.1" width="300" height="200" xmlns="http://www.w3.org/2000/svg">
        <rect width="100%" height="100%" fill="red" />
        <circle cx="150" cy="100" r="80" fill="green" />
        <text x="150" y="125" font-size="60" text-anchor="middle" fill="white">SVG</text>
      </svg>
    </div>
  </body>
</html>

Nokogiri will respond to the following searches:

doc.css("svg") # ⇒ returns the svg element
doc.css("nsfoo|svg", { "nsfoo" => "http://www.w3.org/2000/svg" }) # ⇒ also returns the svg element

In order to accomplish this, this PR is rather invasive, but leaves a lot of code better factored, better documented, more flexible, and more extensible.

Specific changes:

  • Nokogiri::CSS.xpath_for allows a Nokogiri::CSS::XPathVisitor to be injected for fine-grained control over how CSS selectors are translated into XPath queries
  • the CSS cache is extended to account for XPathVisitor configuration
  • patch libxml2 to support wildcard namespaces in XPath queries (nonstandard in XPath 1.0)
  • HTML5 documents and fragments translate child-axis searches into a wildcard-namespace query if libxml2 is patched; else use a builtin xpath function if libxml2; else use standard XPath local-name() (this order of preference reflects performance, faster-is-more-preferred)

Have you included adequate test coverage?

Oh yesssss.

Also here are the benchmarks of the various implementations:

      //span
           18.923k (± 8.9%) i/s -     93.906k in   5.010792s
      //*[local-name()='span']
            1.849k (± 2.8%) i/s -      9.261k in   5.011560s
      //*[nokogiri-builtin:local-name-is('span')]
            3.191k (± 2.4%) i/s -     16.150k in   5.064798s
      //*:span
           18.016k (± 4.6%) i/s -     89.900k in   5.003444s
    
      Comparison:
      //span:
          18922.5 i/s
      //*:span:
          18016.5 i/s - same-ish: difference falls within error
      //*[nokogiri-builtin:local-name-is('span')]:
           3190.6 i/s - 5.93x  (± 0.00) slower
      //*[local-name()='span']:
           1849.4 i/s - 10.23x  (± 0.00) slower

I'll also drop in some graphs of the callstacks below.

Does this change affect the behavior of either the C or the Java implementations?

HTML5 is already only supported by CRuby, and so much of this functionality is simply not going to be turned on in JRuby. That said, if we need this functionality in JRuby, the fallback use of standard XPath local-name() would suffice until a need arose for performance optimizations there.

@flavorjones
Copy link
Member Author

Callstacks! You'll notice that the queries that use xpath functions end up spending a lot of time in the xpath function stack, and in allocating memory for the objects on that "stack" including the short-lived return values. The standard query and the wildcard query are faster because the comparisons are being done "inline" in the XPath query engine, without invoking xpath functions.

Callstack for standard child-axis query:
element-name

Callstack for local-name query:
local-name

Callstack for the builtin function query:
builtin

Callstack for the wildcard namespace query:
wildcard

@flavorjones
Copy link
Member Author

I could use a review in particular of the libxml2 patch b52dde7, because this feels a little bit risky.

Worth noting, though: with that patch applied, libxml2's full test suite still passes.

@stevecheckoway
Copy link
Contributor

I spent a little while looking at the xpath.c patch. Nothing jumps out at me as being wrong, but the file is huge and it's difficult to know what the implications are precisely. I'll try to look more closely this coming week.

@flavorjones flavorjones force-pushed the 2376-html5-namespaces-in-css-queries branch from b52dde7 to f3af677 Compare January 3, 2022 20:04
@flavorjones
Copy link
Member Author

@stevecheckoway Thanks for the spot check. I'm going to merge this and accept a little bit of risk, because both libxml2 and nokogiri's full test suites still pass with that libxml2 patch applied.

@stevecheckoway
Copy link
Contributor

I think that makes sense. Sorry for not doing a better review in a reasonable amount of time.

@flavorjones
Copy link
Member Author

Oh, please don't apologize! I appreciate your attention.

This makes it explicit that visitor must be injected.

Although this is technically a breaking change, CSS::Node is
essentially an internal API and I would be extremely surprised if it
was being used directly by anyone.

This commit also ensures that CSS::Node, CSS::Parser, and
CSS::Tokenizer are omitted from API documentation (to reflect their
status as an internal-only and unstable API).
Note that the method signature of internal-only method
CSS::Parser.xpath_for has changed to prefer required parameters rather
than an options hash with assumed defaults.

Also note that I've introduced some XPath constants for XPath query
prefixes, which we should start using.
which prevents incorrect cache results being returned in cases where
different visitor configurations are used.

We also deprecate XPathVisitorAlwaysUseBuiltins and
XPathVisitorOptimallyUseBuiltins in favor of a simple XPathVisitor
with constructor arguments.
which will allow us to more easily special-case element names in an
upcoming commit.
and refactor the XPathVisitor#css_class to not rely on aliasing during
object initialization, which feels funny to me
we're about to introduce variations in how CSS selectors are
translated into XPath and I want to ensure we have adequate test coverage.
In HTML5, foreign elements have namespaces; but those namespaces
should not be considered for the purposes of CSS searches.

Unfortunately, this implementation's use of local-name() is ~10x
slower than using the normal inline element name matching, which
subsequent commits will explore.
Also tweak :nodoc: directives to avoid accidentally excluding
important modules (like Nokogiri::CSS)

and avoid including code that's not usually relevant to the user
API (like Nokogiri::XML:PP, Nokogiri::CSS:Tokenizer, and
Nokogiri::HTML4::Document::EncodingReader)
and other small changes to doc strings
This is almost as fast as a standard child-axis search, and much
faster than the builtin or using local-name():

  //span
       18.923k (± 8.9%) i/s -     93.906k in   5.010792s
  //*[local-name()='span']
        1.849k (± 2.8%) i/s -      9.261k in   5.011560s
  //*[nokogiri-builtin:local-name-is('span')]
        3.191k (± 2.4%) i/s -     16.150k in   5.064798s
  //*:span
       18.016k (± 4.6%) i/s -     89.900k in   5.003444s

  Comparison:
  //span:
      18922.5 i/s
  //*:span:
      18016.5 i/s - same-ish: difference falls within error
  //*[nokogiri-builtin:local-name-is('span')]:
       3190.6 i/s - 5.93x  (± 0.00) slower
  //*[local-name()='span']:
       1849.4 i/s - 10.23x  (± 0.00) slower
@flavorjones flavorjones force-pushed the 2376-html5-namespaces-in-css-queries branch from af58bbc to d1a710e Compare January 3, 2022 22:10
@flavorjones flavorjones merged commit dcccf72 into main Jan 4, 2022
@flavorjones flavorjones deleted the 2376-html5-namespaces-in-css-queries branch January 4, 2022 01:59
flavorjones added a commit that referenced this pull request Nov 28, 2023
**What problem is this PR intended to solve?**

Before a minor release, I generally review deprecations and look for
things we can remove.

* Removed `Nokogiri::HTML5.get` which was deprecated in v1.12.0. [#2278]
(@flavorjones)
* Removed the CSS-to-XPath utility modules
`XPathVisitorAlwaysUseBuiltins` and `XPathVisitorOptimallyUseBuiltins`,
which were deprecated in v1.13.0 in favor of `XPathVisitor` constructor
args. [#2403] (@flavorjones)
* Removed `XML::Reader#attribute_nodes` which was deprecated in v1.13.8
in favor of `#attribute_hash`. [#2598, #2599] (@flavorjones)

Also we're now specifying version numbers in remaining deprecation
warnings.

**Have you included adequate test coverage?**

Tests have been removed, otherwise no new coverage needed.

**Does this change affect the behavior of either the C or the Java
implementations?**

As documented above.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants