HTML5 documents should not require namespaces in CSS selector queries #2403

Merged
merged 13 commits into from Jan 4, 2022
7 changes: 6 additions & 1 deletion CHANGELOG.md
@@ -35,6 +35,7 @@ This release ends support for:
### Improved

* `{XML,HTML4}::DocumentFragment` constructors all now take an optional parse options parameter or block (similar to Document constructors). [[#1692](https://github.com/sparklemotion/nokogiri/issues/1692)] (Thanks, [@JackMc](https://github.com/JackMc)!)
* `Nokogiri::CSS.xpath_for` allows an `XPathVisitor` to be injected, for finer-grained control over how CSS queries are translated into XPath.
* [CRuby] `XML::Reader#encoding` will return the encoding detected by the parser when it's not passed to the constructor. [[#980](https://github.com/sparklemotion/nokogiri/issues/980)]
* [CRuby] Handle abruptly-closed HTML comments as recommended by WHATWG. (Thanks to [tehryanx](https://hackerone.com/tehryanx?type=user) for reporting!)
* [CRuby] `Node#line` is no longer capped at 65535. libxml v2.9.0 and later support a new parse option, exposed as `Nokogiri::XML::ParseOptions::PARSE_BIG_LINES`, which is turned on by default in `ParseOptions::DEFAULT_{XML,XSLT,HTML,SCHEMA}`. (Note that JRuby already supported large line numbers.) [[#1764](https://github.com/sparklemotion/nokogiri/issues/1764), [#1493](https://github.com/sparklemotion/nokogiri/issues/1493), [#1617](https://github.com/sparklemotion/nokogiri/issues/1617), [#1505](https://github.com/sparklemotion/nokogiri/issues/1505), [#1003](https://github.com/sparklemotion/nokogiri/issues/1003), [#533](https://github.com/sparklemotion/nokogiri/issues/533)]
@@ -45,7 +46,9 @@ This release ends support for:

### Fixed

* XML::Builder blocks restore context properly when exceptions are raised. [[#2372](https://github.com/sparklemotion/nokogiri/issues/2372)] (Thanks, [@ric2b](https://github.com/ric2b) and [@rinthedev](https://github.com/rinthedev)!)
* CSS queries on HTML5 documents now correctly match foreign elements (SVG, MathML) when namespaces are not specified in the query. [[#2376](https://github.com/sparklemotion/nokogiri/issues/2376)]
* `XML::Builder` blocks restore context properly when exceptions are raised. [[#2372](https://github.com/sparklemotion/nokogiri/issues/2372)] (Thanks, [@ric2b](https://github.com/ric2b) and [@rinthedev](https://github.com/rinthedev)!)
* The `Nokogiri::CSS::Parser` cache now uses the `XPathVisitor` configuration as part of the cache key, preventing incorrect cache results from being returned when multiple `XPathVisitor` options are being used.
* Error recovery from in-context parsing (e.g., `Node#parse`) now always uses the correct `DocumentFragment` class. Previously `Nokogiri::HTML4::DocumentFragment` was always used, even for XML documents. [[#1158](https://github.com/sparklemotion/nokogiri/issues/1158)]
* `DocumentFragment#>` now works properly, matching a CSS selector against only the fragment roots. [[#1857](https://github.com/sparklemotion/nokogiri/issues/1857)]
* `XML::DocumentFragment#errors` now correctly contains any parsing errors encountered. Previously this was always empty. (Note that `HTML::DocumentFragment#errors` already did this.)
@@ -61,6 +64,8 @@ This release ends support for:
### Deprecated

* Passing a `Nokogiri::XML::Node` as the second parameter to `Node.new` is deprecated and will generate a warning. This will become an error in a future version of Nokogiri. [[#975](https://github.com/sparklemotion/nokogiri/issues/975)]
* `Nokogiri::CSS::Parser`, `Nokogiri::CSS::Tokenizer`, and `Nokogiri::CSS::Node` are now internal-only APIs that are no longer documented, and should not be considered stable. With the introduction of `XPathVisitor` injection into `Nokogiri::CSS.xpath_for`, there should be no reason to rely on these internal APIs.
* CSS-to-XPath utility classes `Nokogiri::CSS::XPathVisitorAlwaysUseBuiltins` and `XPathVisitorOptimallyUseBuiltins` are deprecated. Prefer `Nokogiri::CSS::XPathVisitor` with appropriate constructor arguments. These classes will be removed in a future version of Nokogiri.


## 1.12.5 / 2021-09-27
4 changes: 2 additions & 2 deletions ext/nokogiri/xml_dtd.c
@@ -57,9 +57,9 @@ entities(VALUE self)

/*
* call-seq:
* notations
* notations() → Hash<name(String)⇒Notation>
*
* Get a hash of the notations for this DTD.
* [Returns] All the notations for this DTD in a Hash of Notation +name+ to Notation.
*/
static VALUE
notations(VALUE self)
22 changes: 22 additions & 0 deletions ext/nokogiri/xml_xpath_context.c
@@ -86,6 +86,26 @@ xpath_builtin_css_class(xmlXPathParserContextPtr ctxt, int nargs)
xmlXPathFreeObject(needle);
}


/* xmlXPathFunction to select nodes whose local name matches, for HTML5 CSS queries that should ignore namespaces */
static void
xpath_builtin_local_name_is(xmlXPathParserContextPtr ctxt, int nargs)
{
xmlXPathObjectPtr element_name;

assert(ctxt->context->node);

CHECK_ARITY(1);
CAST_TO_STRING;
CHECK_TYPE(XPATH_STRING);
element_name = valuePop(ctxt);

valuePush(ctxt, xmlXPathNewBoolean(xmlStrEqual(ctxt->context->node->name, element_name->stringval)));

xmlXPathFreeObject(element_name);
}


/*
* call-seq:
* register_ns(prefix, uri)
@@ -361,6 +381,8 @@ new (VALUE klass, VALUE nodeobj)
xmlXPathRegisterNs(ctx, NOKOGIRI_BUILTIN_PREFIX, NOKOGIRI_BUILTIN_URI);
xmlXPathRegisterFuncNS(ctx, (const xmlChar *)"css-class", NOKOGIRI_BUILTIN_URI,
xpath_builtin_css_class);
xmlXPathRegisterFuncNS(ctx, (const xmlChar *)"local-name-is", NOKOGIRI_BUILTIN_URI,
xpath_builtin_local_name_is);

self = Data_Wrap_Struct(klass, 0, deallocate, ctx);
return self;
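Reviewer note: the new `local-name-is` builtin compares only the node's name and ignores its namespace. A minimal Ruby sketch of that idea follows; `SketchElement`, `strict_name_match?`, and `local_name_is?` are hypothetical names used for illustration and are not part of Nokogiri. It assumes HTML elements carry no namespace while foreign elements such as SVG do.

```ruby
# Hypothetical stand-in for a parsed element: a local name plus an optional namespace URI.
SketchElement = Struct.new(:local_name, :namespace_uri)

SVG_NS = "http://www.w3.org/2000/svg"

# Namespace-aware match: in XPath 1.0 an unprefixed name test only matches
# elements that are in no namespace.
def strict_name_match?(element, name)
  element.namespace_uri.nil? && element.local_name == name
end

# Namespace-agnostic match, in the spirit of the local-name-is builtin:
# only the element's local name is compared.
def local_name_is?(element, name)
  element.local_name == name
end

div = SketchElement.new("div", nil)
svg = SketchElement.new("svg", SVG_NS)

strict = [div, svg].select { |e| strict_name_match?(e, "svg") }
loose  = [div, svg].select { |e| local_name_is?(e, "svg") }

puts strict.length # the namespaced svg element is missed
puts loose.length  # the svg element is found
```

In this sketch the strict match finds nothing for `svg`, which is the behavior the PR title describes, while the local-name comparison finds the element without any namespace in the query.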
43 changes: 37 additions & 6 deletions lib/nokogiri/css.rb
@@ -1,18 +1,49 @@
# coding: utf-8
# frozen_string_literal: true

module Nokogiri
# Translate a CSS selector into an XPath 1.0 query
module CSS
class << self
###
# Parse this CSS selector in +selector+. Returns an AST.
def parse(selector)
# TODO: Deprecate this method ahead of 2.0 and delete it in 2.0.
# It is not used by Nokogiri and shouldn't be part of the public API.
def parse(selector) # :nodoc:
Parser.new.parse(selector)
end

###
# Get the XPath for +selector+.
# :call-seq:
# xpath_for(selector) → String
# xpath_for(selector [, prefix:] [, visitor:] [, ns:]) → String
#
# Translate a CSS selector to the equivalent XPath query.
#
# [Parameters]
# - +selector+ (String) The CSS selector to be translated into XPath
#
# - +prefix:+ (String)
#
# The XPath prefix for the query, see Nokogiri::XML::XPath for some options. Default is
# +XML::XPath::GLOBAL_SEARCH_PREFIX+.
#
# - +visitor:+ (Nokogiri::CSS::XPathVisitor)
#
# The visitor instance used to transform the AST into XPath. Default is
# +Nokogiri::CSS::XPathVisitor.new+.
#
# - +ns:+ (Hash<String ⇒ String>)
#
# The namespaces that are referenced in the query, if any. This is a hash where the keys are
# the namespace prefix and the values are the namespace URIs. Default is an empty Hash.
#
# [Returns] (String) The equivalent XPath query for +selector+
#
# 💡 Note that translated queries are cached for performance.
#
def xpath_for(selector, options = {})
Parser.new(options[:ns] || {}).xpath_for(selector, options)
prefix = options.fetch(:prefix, Nokogiri::XML::XPath::GLOBAL_SEARCH_PREFIX)
visitor = options.fetch(:visitor) { Nokogiri::CSS::XPathVisitor.new }
ns = options.fetch(:ns, {})
Parser.new(ns).xpath_for(selector, prefix, visitor)
end
end
end
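Reviewer note: the rewritten `xpath_for` resolves its options with `Hash#fetch` rather than `||`. The sketch below shows how that behaves, under stated assumptions: `resolve_options`, `DEFAULT_PREFIX`, and `DefaultVisitor` are hypothetical stand-ins, and `"//"` is assumed as the default prefix because the removed code used it.

```ruby
# Hypothetical stand-ins for GLOBAL_SEARCH_PREFIX and Nokogiri::CSS::XPathVisitor.
DEFAULT_PREFIX = "//"
DefaultVisitor = Class.new

# Resolve the three options the same way the new xpath_for body does.
def resolve_options(options = {})
  prefix  = options.fetch(:prefix, DEFAULT_PREFIX)
  # The block form builds the default visitor only when :visitor is absent.
  visitor = options.fetch(:visitor) { DefaultVisitor.new }
  ns      = options.fetch(:ns, {})
  [prefix, visitor, ns]
end

defaults = resolve_options
custom   = resolve_options(prefix: "./", ns: { "svg" => "http://www.w3.org/2000/svg" })
explicit_nil = resolve_options(prefix: nil)

puts defaults[0]           # "//"
puts custom[0]             # "./"
puts explicit_nil[0].nil?  # true: fetch does not replace an explicit nil
```

One consequence worth noting: unlike the old `options[:prefix] || "//"`, `fetch` passes an explicit `nil` through instead of substituting the default, so callers that pass `prefix: nil` would see different behavior.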
4 changes: 2 additions & 2 deletions lib/nokogiri/css/node.rb
@@ -2,7 +2,7 @@

module Nokogiri
module CSS
class Node
class Node # :nodoc:
ALLOW_COMBINATOR_ON_SELF = [:DIRECT_ADJACENT_SELECTOR, :FOLLOWING_SELECTOR, :CHILD_SELECTOR]

# Get the type of this node
@@ -23,7 +23,7 @@ def accept(visitor)

###
# Convert this CSS node to xpath with +prefix+ using +visitor+
def to_xpath(prefix = "//", visitor = XPathVisitor.new)
def to_xpath(prefix, visitor)
prefix = "." if ALLOW_COMBINATOR_ON_SELF.include?(type) && value.first.nil?
prefix + visitor.accept(self)
end
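Reviewer note: `to_xpath` now requires both arguments, and it still swaps the prefix for `"."` when a selector starts with a combinator. A small sketch of that rule follows; `PrefixSketchNode` and `PrefixSketchVisitor` are hypothetical, and the visitor returns fixed fragments rather than doing a real translation.

```ruby
ALLOW_COMBINATOR_ON_SELF = [:DIRECT_ADJACENT_SELECTOR, :FOLLOWING_SELECTOR, :CHILD_SELECTOR].freeze

# Hypothetical minimal node carrying a type and a value list, with the same
# to_xpath body as the diff above.
PrefixSketchNode = Struct.new(:type, :value) do
  def to_xpath(prefix, visitor)
    prefix = "." if ALLOW_COMBINATOR_ON_SELF.include?(type) && value.first.nil?
    prefix + visitor.accept(self)
  end
end

# Hypothetical visitor that returns a fixed XPath fragment per node type.
class PrefixSketchVisitor
  def accept(node)
    node.type == :CHILD_SELECTOR ? "/p" : "div"
  end
end

visitor = PrefixSketchVisitor.new
leading_child = PrefixSketchNode.new(:CHILD_SELECTOR, [nil, "p"]) # models a selector like "> p"
plain         = PrefixSketchNode.new(:ELEMENT_NAME, ["div"])

puts leading_child.to_xpath("//", visitor) # => "./p"
puts plain.to_xpath("//", visitor)         # => "//div"
```

Because the defaults were removed, a bare `to_xpath` call in this sketch raises `ArgumentError`; callers are expected to go through `Nokogiri::CSS.xpath_for`, which supplies both values.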
16 changes: 12 additions & 4 deletions lib/nokogiri/css/parser.rb
@@ -1,7 +1,7 @@
# frozen_string_literal: true
#
# DO NOT MODIFY!!!!
# This file is automatically generated by Racc 1.5.2
# This file is automatically generated by Racc 1.6.0
# from Racc grammar file "".
#

@@ -10,6 +10,14 @@

require_relative "parser_extras"

module Nokogiri
module CSS
# :nodoc: all
class Parser < Racc::Parser
end
end
end

module Nokogiri
module CSS
class Parser < Racc::Parser
@@ -247,7 +255,7 @@ def unescape_css_string(str)
"." => 27,
"*" => 28,
"|" => 29,
":" => 30, }
":" => 30 }

racc_nt_base = 31

@@ -485,7 +493,7 @@ def _reduce_27(val, _values, result)
end

def _reduce_28(val, _values, result)
result = Node.new(:ELEMENT_NAME,
result = Node.new(:ATTRIB_NAME,
[[val.first, val.last].compact.join(':')]
)

@@ -495,7 +503,7 @@ def _reduce_28(val, _values, result)
def _reduce_29(val, _values, result)
# Default namespace is not applied to attributes.
# So we don't add prefix "xmlns:" as in namespaced_ident.
result = Node.new(:ELEMENT_NAME, [val.first])
result = Node.new(:ATTRIB_NAME, [val.first])

result
end
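Reviewer note: giving attribute names their own `:ATTRIB_NAME` node type presumably lets a visitor rewrite element names without touching attribute names. The sketch below assumes the visitor dispatches on a method named after the node type; `DispatchSketchNode` and `Html5SketchVisitor` are hypothetical, and the emitted strings are illustrative rather than the exact XPath Nokogiri generates.

```ruby
# Hypothetical node that dispatches to visit_<type> on the visitor.
DispatchSketchNode = Struct.new(:type, :value) do
  def accept(visitor)
    visitor.public_send(:"visit_#{type.to_s.downcase}", self)
  end
end

# Hypothetical visitor: element names become a namespace-agnostic test,
# attribute names stay a plain attribute reference.
class Html5SketchVisitor
  def visit_element_name(node)
    "*[local-name-is('#{node.value.first}')]"
  end

  def visit_attrib_name(node)
    "@#{node.value.first}"
  end
end

visitor = Html5SketchVisitor.new
element = DispatchSketchNode.new(:ELEMENT_NAME, ["svg"])
attrib  = DispatchSketchNode.new(:ATTRIB_NAME, ["href"])

puts element.accept(visitor) # => "*[local-name-is('svg')]"
puts attrib.accept(visitor)  # => "@href"
```

If both kinds of name shared the `:ELEMENT_NAME` type, as before this change, the element-name rewrite in this sketch would also be applied to `href`.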
12 changes: 10 additions & 2 deletions lib/nokogiri/css/parser.y
@@ -96,14 +96,14 @@ rule
;
attrib_name
: namespace '|' IDENT {
result = Node.new(:ELEMENT_NAME,
result = Node.new(:ATTRIB_NAME,
[[val.first, val.last].compact.join(':')]
)
}
| IDENT {
# Default namespace is not applied to attributes.
# So we don't add prefix "xmlns:" as in namespaced_ident.
result = Node.new(:ELEMENT_NAME, [val.first])
result = Node.new(:ATTRIB_NAME, [val.first])
}
;
function
@@ -255,6 +255,14 @@ end

require_relative "parser_extras"

module Nokogiri
module CSS
# :nodoc: all
class Parser < Racc::Parser
end
end
end

---- inner

def unescape_css_identifier(identifier)
25 changes: 12 additions & 13 deletions lib/nokogiri/css/parser_extras.rb
@@ -4,7 +4,7 @@

module Nokogiri
module CSS
class Parser < Racc::Parser
class Parser < Racc::Parser # :nodoc:
CACHE_SWITCH_NAME = :nokogiri_css_parser_cache_is_off

@cache = {}
@@ -23,7 +23,7 @@ def set_cache(value) # rubocop:disable Naming/AccessorMethodName

# Get the css selector in +string+ from the cache
def [](string)
return unless cache_on?
return nil unless cache_on?
@mutex.synchronize { @cache[string] }
end

@@ -71,17 +71,10 @@ def next_token
end

# Get the xpath for +string+ using +prefix+ and +visitor+
def xpath_for(string, options = {})
key = "#{string}#{options[:ns]}#{options[:prefix]}"
v = self.class[key]
return v if v

args = [
options[:prefix] || "//",
options[:visitor] || XPathVisitor.new,
]
self.class[key] = parse(string).map do |ast|
ast.to_xpath(*args)
def xpath_for(string, prefix, visitor)
key = cache_key(string, prefix, visitor)
self.class[key] ||= parse(string).map do |ast|
ast.to_xpath(prefix, visitor)
end
end

@@ -90,6 +83,12 @@ def on_error(error_token_id, error_value, value_stack)
after = value_stack.compact.last
raise SyntaxError, "unexpected '#{error_value}' after '#{after}'"
end

def cache_key(query, prefix, visitor)
if self.class.cache_on?
[query, prefix, @namespaces, visitor.config]
end
end
end
end
end
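Reviewer note: the new `cache_key` includes `visitor.config`, which is what stops one visitor's translation from being served to another. The sketch below reproduces the stale-result problem and the fix in isolation; `ConfigurableVisitor` and `SketchTranslator` are hypothetical, and namespaces are left out of the key for brevity.

```ruby
# Hypothetical visitor whose output depends on its configuration.
class ConfigurableVisitor
  attr_reader :config

  def initialize(mode)
    @config = { mode: mode }.freeze
  end

  def translate(selector)
    config[:mode] == :ignore_namespaces ? "*[local-name()='#{selector}']" : selector
  end
end

# Hypothetical translator with two caching strategies.
class SketchTranslator
  def initialize
    @cache = {}
  end

  # Old-style key: the visitor is not part of it.
  def translate_with_string_key(selector, prefix, visitor)
    @cache["#{selector}#{prefix}"] ||= prefix + visitor.translate(selector)
  end

  # New-style key: an array that includes the visitor's configuration.
  def translate_with_full_key(selector, prefix, visitor)
    @cache[[selector, prefix, visitor.config]] ||= prefix + visitor.translate(selector)
  end
end

namespaced = ConfigurableVisitor.new(:namespaced)
html5      = ConfigurableVisitor.new(:ignore_namespaces)

stale = SketchTranslator.new
stale.translate_with_string_key("svg", "//", namespaced)
stale_result = stale.translate_with_string_key("svg", "//", html5)

fresh = SketchTranslator.new
fresh.translate_with_full_key("svg", "//", namespaced)
fresh_result = fresh.translate_with_full_key("svg", "//", html5)

puts stale_result # => "//svg" (cached result from the other visitor)
puts fresh_result # => "//*[local-name()='svg']"
```

Using an array as the key also avoids the ambiguity of string concatenation, where different selector and prefix pairs could collapse into the same string.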
3 changes: 2 additions & 1 deletion lib/nokogiri/css/tokenizer.rb
@@ -7,7 +7,8 @@

module Nokogiri
module CSS
class Tokenizer # :nodoc:
# :nodoc: all
class Tokenizer
require 'strscan'

class ScanError < StandardError ; end
3 changes: 2 additions & 1 deletion lib/nokogiri/css/tokenizer.rex
@@ -1,6 +1,7 @@
module Nokogiri
module CSS
class Tokenizer # :nodoc:
# :nodoc: all
class Tokenizer

macro
nl \n|\r\n|\r|\f