Skip to content

Commit

Permalink
fix: ensure parsed fragments are always of the right type
Browse files Browse the repository at this point in the history
Introduce ClassResolver which will do the class lookup correctly, and
use it for Builder and for fragment parsing.

Closes #1158
  • Loading branch information
flavorjones committed Dec 16, 2021
1 parent 86504a1 commit 38c2f16
Show file tree
Hide file tree
Showing 8 changed files with 108 additions and 15 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
Expand Up @@ -24,6 +24,7 @@ A related discussion about Trust exists at [#2357](https://github.com/sparklemot
### Fixed

* XML::Builder blocks restore context properly when exceptions are raised. [[#2372](https://github.com/sparklemotion/nokogiri/issues/2372)] (Thanks, [@ric2b](https://github.com/ric2b) and [@rinthedev](https://github.com/rinthedev)!)
* Error recovery from in-context parsing (e.g., `Node#parse`) now always uses the correct `DocumentFragment` class. Previously `Nokogiri::HTML4::DocumentFragment` was always used, even for XML documents. [[#1158](https://github.com/sparklemotion/nokogiri/issues/1158)]
* [CRuby] Fix memory leak in `Document#canonicalize` when inclusive namespaces are passed in. [[#2345](https://github.com/sparklemotion/nokogiri/issues/2345)]
* [CRuby] Fix memory leak in `Document#canonicalize` when an argument type error is raised. [[#2345](https://github.com/sparklemotion/nokogiri/issues/2345)]
* [CRuby] Fix memory leak in `EncodingHandler` where iconv handlers were not being cleaned up. [[#2345](https://github.com/sparklemotion/nokogiri/issues/2345)]
Expand Down
1 change: 1 addition & 0 deletions lib/nokogiri.rb
Expand Up @@ -115,6 +115,7 @@ def Nokogiri(*args, &block)
end

require_relative "nokogiri/version"
require_relative "nokogiri/class_resolver"
require_relative "nokogiri/syntax_error"
require_relative "nokogiri/xml"
require_relative "nokogiri/xslt"
Expand Down
67 changes: 67 additions & 0 deletions lib/nokogiri/class_resolver.rb
@@ -0,0 +1,67 @@
# coding: utf-8
# frozen_string_literal: true

require "set"

module Nokogiri
#
# Some classes in Nokogiri are namespaced as a group, for example
# Document, DocumentFragment, and Builder.
#
# It's sometimes necessary to look up the related class, e.g.:
#
# XML::Builder → XML::Document
# HTML4::Builder → HTML4::Document
# HTML5::Document → HTML5::DocumentFragment
#
# This module is included into those key classes who need to do this.
#
module ClassResolver
# #related_class restricts matching namespaces to those matching this set.
VALID_NAMESPACES = Set.new(["HTML", "HTML4", "HTML5", "XML"])

# :call-seq:
# related_class(class_name) → Class
#
# Find a class constant within the
#
# Some examples:
#
# Nokogiri::XML::Document.new.related_class("DocumentFragment")
# # => Nokogiri::XML::DocumentFragment
# Nokogiri::HTML4::Document.new.related_class("DocumentFragment")
# # => Nokogiri::HTML4::DocumentFragment
#
# Note this will also work for subclasses that follow the same convention, e.g.:
#
# Loofah::HTML::Document.new.related_class("DocumentFragment")
# # => Loofah::HTML::DocumentFragment
#
# And even if it's a subclass, this will iterate through the superclasses:
#
# class ThisIsATopLevelClass < Nokogiri::HTML4::Builder ; end
# ThisIsATopLevelClass.new.related_class("Document")
# # => Nokogiri::HTML4::Document
#
def related_class(class_name)
klass = nil
inspecting = self.class

while inspecting
namespace_path = inspecting.name.split("::")[0..-2]
inspecting = inspecting.superclass

next unless VALID_NAMESPACES.include?(namespace_path.last)

related_class_name = (namespace_path << class_name).join("::")
klass = begin
Object.const_get(related_class_name)
rescue NameError
nil
end
break if klass
end
klass
end
end
end
10 changes: 3 additions & 7 deletions lib/nokogiri/xml/builder.rb
Expand Up @@ -262,6 +262,8 @@ module XML
# </root>
#
class Builder
include Nokogiri::ClassResolver

DEFAULT_DOCUMENT_OPTIONS = { namespace_inheritance: true }

# The current Document object being built
Expand Down Expand Up @@ -307,13 +309,7 @@ def initialize(options = {}, root = nil, &block)
@doc = root.document
@parent = root
else
klassname = "::" + (self.class.name.split("::")[0..-2] + ["Document"]).join("::")
klass = begin
Object.const_get(klassname)
rescue NameError
Nokogiri::XML::Document
end
@parent = @doc = klass.new
@parent = @doc = related_class("Document").new
end

@context = nil
Expand Down
6 changes: 3 additions & 3 deletions lib/nokogiri/xml/node.rb
Expand Up @@ -56,6 +56,7 @@ module XML
class Node
include Nokogiri::XML::PP::Node
include Nokogiri::XML::Searchable
include Nokogiri::ClassResolver
include Enumerable

# Element node type, see Nokogiri::XML::Node#element?
Expand Down Expand Up @@ -936,8 +937,7 @@ def matches?(selector)
# Create a DocumentFragment containing +tags+ that is relative to _this_
# context node.
def fragment(tags)
type = document.html? ? Nokogiri::HTML : Nokogiri::XML
type::DocumentFragment.new(document, tags, self)
document.related_class("DocumentFragment").new(document, tags, self)
end

###
Expand Down Expand Up @@ -988,7 +988,7 @@ def parse(string_or_io, options = nil)
node_set = in_context(contents, options.to_i)
if node_set.empty? && (document.errors.length > error_count)
if options.recover?
fragment = Nokogiri::HTML4::DocumentFragment.parse(contents)
fragment = document.related_class("DocumentFragment").parse(contents)
node_set = fragment.children
else
raise document.errors[error_count]
Expand Down
1 change: 1 addition & 0 deletions nokogiri.gemspec
Expand Up @@ -232,6 +232,7 @@ Gem::Specification.new do |spec|
"lib/nekodtd.jar",
"lib/nekohtml.jar",
"lib/nokogiri.rb",
"lib/nokogiri/class_resolver.rb",
"lib/nokogiri/css.rb",
"lib/nokogiri/css/node.rb",
"lib/nokogiri/css/parser.rb",
Expand Down
11 changes: 9 additions & 2 deletions test/xml/test_builder.rb
Expand Up @@ -424,11 +424,14 @@ def test_builder_uses_proper_document_class
xml_builder = Nokogiri::XML::Builder.new
assert_instance_of(Nokogiri::XML::Document, xml_builder.doc)

html_builder = Nokogiri::HTML::Builder.new
assert_instance_of(Nokogiri::HTML::Document, html_builder.doc)
html_builder = Nokogiri::HTML4::Builder.new
assert_instance_of(Nokogiri::HTML4::Document, html_builder.doc)

foo_builder = ThisIsATestBuilder.new
assert_instance_of(Nokogiri::XML::Document, foo_builder.doc)

foo_builder = ThisIsAnotherTestBuilder.new
assert_instance_of(Nokogiri::HTML4::Document, foo_builder.doc)
end

private
Expand All @@ -443,3 +446,7 @@ def namespaces_defined_on(node)
class ThisIsATestBuilder < Nokogiri::XML::Builder
# this exists for the test_builder_uses_proper_document_class and should be empty
end

class ThisIsAnotherTestBuilder < Nokogiri::HTML4::Builder
# this exists for the test_builder_uses_proper_document_class and should be empty
end
26 changes: 23 additions & 3 deletions test/xml/test_node.rb
Expand Up @@ -91,33 +91,45 @@ def test_parse_config_option
end

def test_node_context_parsing_of_malformed_html_fragment
doc = HTML.parse("<html><body><div></div></body></html>")
doc = HTML4.parse("<html><body><div></div></body></html>")
context_node = doc.at_css("div")
nodeset = context_node.parse("<div </div>")

assert_equal(1, doc.errors.length)
assert_equal(1, nodeset.length)
assert_equal("<div></div>", nodeset.to_s)
assert_instance_of(Nokogiri::HTML4::Document, nodeset.document)
assert_instance_of(Nokogiri::HTML4::Document, nodeset.first.document)
end

def test_node_context_parsing_of_malformed_html_fragment_with_recover_is_corrected
doc = HTML.parse("<html><body><div></div></body></html>")
doc = HTML4.parse("<html><body><div></div></body></html>")
context_node = doc.at_css("div")
nodeset = context_node.parse("<div </div>", &:recover)

assert_equal(1, doc.errors.length)
assert_equal(1, nodeset.length)
assert_equal("<div></div>", nodeset.to_s)
assert_instance_of(Nokogiri::HTML4::Document, nodeset.document)
assert_instance_of(Nokogiri::HTML4::Document, nodeset.first.document)
end

def test_node_context_parsing_of_malformed_html_fragment_without_recover_is_not_corrected
doc = HTML.parse("<html><body><div></div></body></html>")
doc = HTML4.parse("<html><body><div></div></body></html>")
context_node = doc.at_css("div")
assert_raises(Nokogiri::XML::SyntaxError) do
context_node.parse("<div </div>", &:strict)
end
end

def test_node_context_parsing_of_malformed_xml_fragment_uses_the_right_class_to_recover
doc = XML.parse("<root><body><div></div></body></root>")
context_node = doc.at_css("div")
nodeset = context_node.parse("<div </div") # causes an error and recovers
assert_instance_of(Nokogiri::XML::Document, nodeset.document)
assert_instance_of(Nokogiri::XML::Document, nodeset.first.document)
end

def test_parse_error_list
error_count = xml.errors.length
xml.root.parse("<hello>")
Expand Down Expand Up @@ -421,6 +433,14 @@ def test_duplicate_node_removes_namespace
end
end

def test_fragment_creates_appropriate_class
frag = Nokogiri.XML("<root><child/></root>").at_css("child").fragment("<thing/>")
assert_instance_of(Nokogiri::XML::DocumentFragment, frag)

frag = Nokogiri.HTML4("<root><child/></root>").at_css("child").fragment("<thing/>")
assert_instance_of(Nokogiri::HTML4::DocumentFragment, frag)
end

def test_fragment_creates_elements
apple = xml.fragment("<Apple/>")
apple.children.each do |child|
Expand Down

0 comments on commit 38c2f16

Please sign in to comment.