Skip to content

Commit

Permalink
bug: remove HTML comments that exist outside the html element
Browse files Browse the repository at this point in the history
fixes #80
  • Loading branch information
flavorjones committed Apr 5, 2020
1 parent 5e02888 commit 9aa508c
Show file tree
Hide file tree
Showing 3 changed files with 67 additions and 1 deletion.
5 changes: 5 additions & 0 deletions CHANGELOG.md
Expand Up @@ -7,6 +7,11 @@
* Allow more CSS length units: "ch", "vw", "vh", "Q", "lh", "vmin", "vmax". [#178] (Thanks, @JuanitoFatas!)


### Fixes

* Remove comments from `Loofah::HTML::Document`s that exist outside the `html` element. [#80]


### Other changes

* Gem metadata being set [#181] (Thanks, @JuanitoFatas!)
Expand Down
20 changes: 19 additions & 1 deletion lib/loofah.rb
Expand Up @@ -35,7 +35,7 @@ class << self
# Shortcut for Loofah::HTML::Document.parse
# This method accepts the same parameters as Nokogiri::HTML::Document.parse
def document(*args, &block)
Loofah::HTML::Document.parse(*args, &block)
remove_comments_before_html_element Loofah::HTML::Document.parse(*args, &block)
end

# Shortcut for Loofah::HTML::DocumentFragment.parse
Expand Down Expand Up @@ -80,5 +80,23 @@ def scrub_xml_document(string_or_io, method)
def remove_extraneous_whitespace(string)
string.gsub(/\n\s*\n\s*\n/, "\n\n")
end

private

# remove comments that exist outside of the HTML element.
#
# these comments are allowed by the HTML spec:
#
# https://www.w3.org/TR/html401/struct/global.html#h-7.1
#
# but are not scrubbed by Loofah because these nodes don't meet
# the contract that scrubbers expect of a node (e.g., it can be
# replaced, sibling and children nodes can be created).
def remove_comments_before_html_element(doc)
doc.children.each do |child|
child.unlink if child.comment?
end
doc
end
end
end
43 changes: 43 additions & 0 deletions test/integration/test_ad_hoc.rb
Expand Up @@ -216,5 +216,48 @@ def test_dont_remove_whitespace_between_tags
assert_nil sanitized.at_css("animate")["values"]
end
end

#
# brought up by https://github.com/flavorjones/loofah/issues/80
#
context "comments outside html" do
context "bare comments" do
HTML = "<!-- --!><script>alert(1)</script><!-- -->"

it "Loofah.document removes the comment" do
sanitized = Loofah.document(HTML)
sanitized_html = sanitized.to_html
refute_match(/--/, sanitized_html)
end

it "Loofah.scrub_document removes the comment" do
sanitized = Loofah.scrub_document(HTML, :prune)
sanitized_html = sanitized.to_html
refute_match(/--/, sanitized_html)
end
end

context "doc with comments outside HTML" do
HTML = <<~EOF
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<!-- spaces ->
<!-- tabs -->
<!-- more spaces -->
<html><body><div>hello
EOF

it "Loofah.document removes the comment" do
sanitized = Loofah.document(HTML)
sanitized_html = sanitized.to_html
refute_match(/--/, sanitized_html)
end

it "Loofah.scrub_document removes the comment" do
sanitized = Loofah.scrub_document(HTML, :prune)
sanitized_html = sanitized.to_html
refute_match(/--/, sanitized_html)
end
end
end
end
end

0 comments on commit 9aa508c

Please sign in to comment.