Skip to content

Commit

Permalink
HTML5 parsing using Nokogumbo
Browse files Browse the repository at this point in the history
Introduces `--report-missing-doctype`, defaulting to false, to report on
missing `<!DOCTYPE html>` at the beginning of the document.

Depends on Nokogumbo >= 1.4.10 for rubys/nokogumbo#46

Fixes #318.
  • Loading branch information
jeremy committed Nov 2, 2016
1 parent be5fd7a commit 0a55dd4
Show file tree
Hide file tree
Showing 14 changed files with 65 additions and 68 deletions.
2 changes: 1 addition & 1 deletion Gemfile
@@ -1,3 +1,3 @@
source 'http://rubygems.org'
source 'https://rubygems.org'

gemspec
10 changes: 5 additions & 5 deletions README.md
Expand Up @@ -61,7 +61,7 @@ You can enable or disable most of the following checks.

### HTML

* Whether your HTML markup is valid. This is done via [Nokogiri](http://www.nokogiri.org/tutorials/ensuring_well_formed_markup.html), to ensure well-formed markup.
* Whether your HTML markup is valid. This is done via [Nokogumbo](https://github.com/rubys/nokogumbo) to validate well-formed HTML5 markup.

## Usage

Expand Down Expand Up @@ -196,7 +196,7 @@ htmlproofer ./_site

### Using through Docker

If you have trouble with (or don't want to) install Ruby/Nokogiri, the command-line tool can be run through Docker. See [html-proofer-docker](https://github.com/18F/html-proofer-docker) for more information.
If you have trouble with (or don't want to) install Ruby/Nokogumbo, the command-line tool can be run through Docker. See [html-proofer-docker](https://github.com/18F/html-proofer-docker) for more information.

## Ignoring content

Expand All @@ -218,7 +218,7 @@ The `HTMLProofer` constructor takes an optional hash of additional options:
| `check_external_hash` | Checks whether external hashes exist (even if the webpage exists). This slows the checker down. | `false` |
| `check_favicon` | Enables the favicon checker. | `false` |
| `check_opengraph` | Enables the Open Graph checker. | `false` |
| `check_html` | Enables HTML validation errors from Nokogiri | `false` |
| `check_html` | Enables HTML validation errors from Nokogumbo | `false` |
|`checks_to_ignore`| An array of Strings indicating which checks you'd like to not perform. | `[]`
| `directory_index_file` | Sets the file to look for when a link refers to a directory. | `index.html` |
| `disable_external` | If `true`, does not run the external link checker, which can take a lot of time. | `false` |
Expand Down Expand Up @@ -247,13 +247,13 @@ See below for more information.

### Configuring HTML validation rules

If `check_html` is `true`, Nokogiri performs additional validation on your HTML.
If `check_html` is `true`, Nokogumbo performs additional validation on your HTML.

You can pass in additional options to configure this validation.

| Option | Description | Default |
| :----- | :---------- | :------ |
| `report_invalid_tags` | When `check_html` is enabled, HTML markup that is unknown to Nokogiri are reported as errors. | `false`
| `report_invalid_tags` | When `check_html` is enabled, HTML markup that is unknown to Nokogumbo are reported as errors. | `false`
| `report_missing_names` | When `check_html` is enabled, HTML markup that are missing entity names are reported as errors. | `false`
| `report_script_embeds` | When `check_html` is enabled, `script` tags containing markup [are reported as errors](http://git.io/vOovv). Enabling this option ignores those errors. | `false`

Expand Down
10 changes: 6 additions & 4 deletions bin/htmlproofer
Expand Up @@ -20,7 +20,7 @@ Mercenary.program(:htmlproofer) do |p|
p.option 'checks_to_ignore', '--checks-to-ignore check1,[check2,...]', Array, ' An array of Strings indicating which checks you\'d like to not perform.'
p.option 'check_external_hash', '--check-external-hash', 'Checks whether external hashes exist (even if the webpage exists). This slows the checker down (default: `false`).'
p.option 'check_favicon', '--check-favicon', 'Enables the favicon checker (default: `false`).'
p.option 'check_html', '--check-html', 'Enables HTML validation errors from Nokogiri (default: `false`).'
p.option 'check_html', '--check-html', 'Enables HTML validation errors from Nokogumbo (default: `false`).'
p.option 'check_img_http', '--check-img-http', 'Fails an image if it\'s marked as `http` (default: `false`).'
p.option 'check_opengraph', '--check-opengraph', 'Enables the Open Graph checker (default: `false`).'
p.option 'directory_index_file', '--directory-index-file', String, 'Sets the file to look for when a link refers to a directory. (default: `index.html`)'
Expand All @@ -32,9 +32,10 @@ Mercenary.program(:htmlproofer) do |p|
p.option 'external_only', '--external_only', 'Only checks problems with external references'
p.option 'file_ignore', '--file-ignore file1,[file2,...]', Array, 'A comma-separated list of Strings or RegExps containing file paths that are safe to ignore'
p.option 'http_status_ignore', '--http-status-ignore 123,[xxx, ...]', Array, 'A comma-separated list of numbers representing status codes to ignore.'
p.option 'report_invalid_tags', '--report-invalid-tags', 'Ignore `check_html` errors associated with unknown markup (default: `false`)'
p.option 'report_missing_names', '--report-missing-names', 'Ignore `check_html` errors associated with missing entities (default: `false`)'
p.option 'report_script_embeds', '--report-script-embeds', 'Ignore `check_html` errors associated with `script`s (default: `false`)'
p.option 'report_invalid_tags', '--report-invalid-tags', 'Report `check_html` errors associated with unknown markup (default: `false`)'
p.option 'report_missing_names', '--report-missing-names', 'Report `check_html` errors associated with missing entities (default: `false`)'
p.option 'report_script_embeds', '--report-script-embeds', 'Report `check_html` errors associated with `script`s (default: `false`)'
p.option 'report_missing_doctype', '--report-missing-doctype', 'Report `check_html` errors associated with missing or out-of-order DOCTYPE (default: `false`)'
p.option 'log_level', '--log-level <level>', String, 'Sets the logging level, as determined by Yell'
p.option 'only_4xx', '--only-4xx', 'Only reports errors for links that fall within the 4xx status code range'
p.option 'timeframe', '--timeframe <time>', String, 'A string representing the caching timeframe.'
Expand Down Expand Up @@ -76,6 +77,7 @@ Mercenary.program(:htmlproofer) do |p|
options[:validation][:report_script_embeds] = opts['report_script_embeds']
options[:validation][:report_missing_names] = opts['report_missing_names']
options[:validation][:report_invalid_tags] = opts['report_invalid_tags']
options[:validation][:report_missing_doctype] = opts['report_missing_doctype']

options[:cache] = {}
options[:cache][:timeframe] = opts['timeframe'] unless opts['timeframe'].nil?
Expand Down
2 changes: 1 addition & 1 deletion html-proofer.gemspec
Expand Up @@ -18,7 +18,7 @@ Gem::Specification.new do |gem|
gem.require_paths = ['lib']

gem.add_dependency 'mercenary', '~> 0.3.2'
gem.add_dependency 'nokogiri', '~> 1.5'
gem.add_dependency 'nokogumbo', '~> 1.4', '>= 1.4.10'
gem.add_dependency 'colored', '~> 1.2'
gem.add_dependency 'typhoeus', '~> 0.7'
gem.add_dependency 'yell', '~> 2.0'
Expand Down
31 changes: 17 additions & 14 deletions lib/html-proofer/check/html.rb
@@ -1,26 +1,29 @@
class HtmlCheck < ::HTMLProofer::Check
# tags embedded in scripts are used in templating languages: http://git.io/vOovv
SCRIPT_EMBEDS_MSG = /Element script embeds close tag/
INVALID_TAG_MSG = /Tag ([\w\-:]+) invalid/
INVALID_PREFIX = /Namespace prefix/
PARSE_ENTITY_REF = /htmlParseEntityRef: no name/
DOCTYPE_MSG = /The doctype must be the first token in the document/

def run
@html.errors.each do |error|
message = error.message
line = error.line

if message =~ INVALID_TAG_MSG || message =~ INVALID_PREFIX
next unless options[:validation][:report_invalid_tags]
end

if message =~ PARSE_ENTITY_REF
next unless options[:validation][:report_missing_names]
end

# tags embedded in scripts are used in templating languages: http://git.io/vOovv
next if !options[:validation][:report_script_embeds] && message =~ SCRIPT_EMBEDS_MSG
add_issue(error.message, line: error.line) if report?(error.message)
end
end

add_issue(message, line: line)
def report?(message)
case message
when SCRIPT_EMBEDS_MSG
options[:validation][:report_script_embeds]
when INVALID_TAG_MSG, INVALID_PREFIX
options[:validation][:report_invalid_tags]
when PARSE_ENTITY_REF
options[:validation][:report_missing_names]
when DOCTYPE_MSG
options[:validation][:report_missing_doctype]
else
true
end
end
end
2 changes: 1 addition & 1 deletion lib/html-proofer/check/links.rb
Expand Up @@ -31,7 +31,7 @@ def run
# is there even an href?
if missing_href?
# HTML5 allows dropping the href: http://git.io/vBX0z
next if @html.internal_subset.name == 'html' && @html.internal_subset.external_id.nil?
next if @html.internal_subset.nil? || (@html.internal_subset.name == 'html' && @html.internal_subset.external_id.nil?)
add_issue('anchor has no href attribute', line: line, content: content)
next
end
Expand Down
3 changes: 2 additions & 1 deletion lib/html-proofer/configuration.rb
Expand Up @@ -44,7 +44,8 @@ module Configuration
VALIDATION_DEFAULTS = {
:report_script_embeds => false,
:report_missing_names => false,
:report_invalid_tags => false
:report_invalid_tags => false,
:report_missing_doctype => false
}

CACHE_DEFAULTS = {}
Expand Down
14 changes: 2 additions & 12 deletions lib/html-proofer/utils.rb
@@ -1,4 +1,4 @@
require 'nokogiri'
require 'nokogumbo'

module HTMLProofer
module Utils
Expand All @@ -15,7 +15,7 @@ def create_nokogiri(path)
content = path
end

Nokogiri::HTML(clean_content(content))
Nokogiri::HTML5(content)
end
module_function :create_nokogiri

Expand All @@ -26,15 +26,5 @@ def swap(href, replacement)
href
end
module_function :swap

# address a problem with Nokogiri's parsing URL entities
# problem from http://git.io/vBYU1
# solution from http://git.io/vBYUi
def clean_content(string)
string.gsub(%r{https?://([^>]+)}i) do |url|
url.gsub(/&(?!amp;)/, '&amp;')
end
end
module_function :clean_content
end
end
2 changes: 1 addition & 1 deletion spec/html-proofer/command_spec.rb
Expand Up @@ -87,7 +87,7 @@
end

it 'works with check-html' do
broken = "#{FIXTURES_DIR}/html/invalid_tag.html"
broken = "#{FIXTURES_DIR}/html/unmatched_end_tag.html"
output = make_bin('--check-html --report-invalid-tags', broken)
expect(output).to match('1 failure')
end
Expand Down
16 changes: 8 additions & 8 deletions spec/html-proofer/element_spec.rb
Expand Up @@ -2,54 +2,54 @@

describe HTMLProofer::Element do
before(:each) do
@check = HTMLProofer::Check.new('', '', Nokogiri::HTML(''), HTMLProofer::Configuration::PROOFER_DEFAULTS)
@check = HTMLProofer::Check.new('', '', Nokogiri::HTML5(''), HTMLProofer::Configuration::PROOFER_DEFAULTS)
end

describe '#initialize' do
it 'accepts the xmlns attribute' do
nokogiri = Nokogiri::HTML('<a xmlns:cc="http://creativecommons.org/ns#">Creative Commons</a>')
nokogiri = Nokogiri::HTML5('<a xmlns:cc="http://creativecommons.org/ns#">Creative Commons</a>')
checkable = HTMLProofer::Element.new(nokogiri.css('a').first, @check)
expect(checkable.instance_variable_get(:@xmlns_cc)).to eq 'http://creativecommons.org/ns#'
end

it 'assignes the text node' do
nokogiri = Nokogiri::HTML('<p>One')
nokogiri = Nokogiri::HTML5('<p>One')
checkable = HTMLProofer::Element.new(nokogiri.css('p').first, @check)
expect(checkable.instance_variable_get(:@text)).to eq 'One'
end

it 'accepts the content attribute' do
nokogiri = Nokogiri::HTML('<meta name="twitter:card" content="summary">')
nokogiri = Nokogiri::HTML5('<meta name="twitter:card" content="summary">')
checkable = HTMLProofer::Element.new(nokogiri.css('meta').first, @check)
expect(checkable.instance_variable_get(:@content)).to eq 'summary'
end
end

describe '#ignores_pattern_check' do
it 'works for regex patterns' do
nokogiri = Nokogiri::HTML('<script src=/assets/main.js></script>')
nokogiri = Nokogiri::HTML5('<script src=/assets/main.js></script>')
checkable = HTMLProofer::Element.new(nokogiri.css('script').first, @check)
expect(checkable.ignores_pattern_check([/\/assets\/.*(js|css|png|svg)/])).to eq true
end

it 'works for string patterns' do
nokogiri = Nokogiri::HTML('<script src=/assets/main.js></script>')
nokogiri = Nokogiri::HTML5('<script src=/assets/main.js></script>')
checkable = HTMLProofer::Element.new(nokogiri.css('script').first, @check)
expect(checkable.ignores_pattern_check(['/assets/main.js'])).to eq true
end
end

describe '#url' do
it 'works for src attributes' do
nokogiri = Nokogiri::HTML('<img src=image.png />')
nokogiri = Nokogiri::HTML5('<img src=image.png />')
checkable = HTMLProofer::Element.new(nokogiri.css('img').first, @check)
expect(checkable.url).to eq 'image.png'
end
end

describe '#ignore' do
it 'works for twitter cards' do
nokogiri = Nokogiri::HTML('<meta name="twitter:url" data-proofer-ignore content="http://example.com/soon-to-be-published-url">')
nokogiri = Nokogiri::HTML5('<meta name="twitter:url" data-proofer-ignore content="http://example.com/soon-to-be-published-url">')
checkable = HTMLProofer::Element.new(nokogiri.css('meta').first, @check)
expect(checkable.ignore?).to eq true
end
Expand Down
2 changes: 1 addition & 1 deletion spec/html-proofer/fixtures/html/weird_onclick.html
@@ -1 +1 @@
<a href="http://www.google.com" onClick="window.open('https://calendar.google.com/calendar/embed?src=o3uckm13gkcc5tu4u3846ik0pc%40group.calendar.google.com&ctz=America/Los_Angeles','Team Calendar','')">
<a href="http://www.google.com" onClick="window.open('https://calendar.google.com/calendar/embed?src=o3uckm13gkcc5tu4u3846ik0pc%40group.calendar.google.com&ctz=America/Los_Angeles','Team Calendar','')"></a>
22 changes: 11 additions & 11 deletions spec/html-proofer/html_spec.rb
Expand Up @@ -19,40 +19,40 @@
expect(proofer.failed_tests).to eq []
end

it 'fails for an invalid tag' do
it 'works for custom tags' do
html = "#{FIXTURES_DIR}/html/invalid_tag.html"
proofer = run_proofer(html, :file, { :check_html => true, :validation => { :report_invalid_tags => true} })
expect(proofer.failed_tests.first).to match(/Tag myfancytag invalid \(line 2\)/)
expect(proofer.failed_tests).to eq []
end

it 'fails for an unmatched end tag' do
html = "#{FIXTURES_DIR}/html/unmatched_end_tag.html"
proofer = run_proofer(html, :file, { :check_html => true })
expect(proofer.failed_tests.first).to match(/Unexpected end tag : div \(line 3\)/)
expect(proofer.failed_tests.first).to include("@3:1: That tag isn't allowed here Currently open tags: html, body..\n</div>\n^ (line 3)")
end

it 'fails for an unescaped ampersand in attribute' do
it 'allows an unescaped ampersand in attribute' do
html = "#{FIXTURES_DIR}/html/unescaped_ampersand_in_attribute.html"
proofer = run_proofer(html, :file, { :check_html => true })
expect(proofer.failed_tests.first).to match(/htmlParseEntityRef: expecting ';' \(line 2\)/)
expect(proofer.failed_tests).to eq []
end

it 'fails for mismatch between opening and ending tag' do
html = "#{FIXTURES_DIR}/html/opening_and_ending_tag_mismatch.html"
proofer = run_proofer(html, :file, { :check_html => true })
expect(proofer.failed_tests.first).to match(/Opening and ending tag mismatch: p and strong/)
expect(proofer.failed_tests.first).to include("@3:31: That tag isn't allowed here Currently open tags: html, body, p, strong..\n<p>The quick <strong>brown fox</p>\n ^ (line 3)")
end

it 'fails for div inside head' do
html = "#{FIXTURES_DIR}/html/div_inside_head.html"
proofer = run_proofer(html, :file, { :check_html => true })
expect(proofer.failed_tests.first).to match(/Unexpected end tag : head \(line 5\)/)
expect(proofer.failed_tests.first).to include("@5:1: That tag isn't allowed here Currently open tags: html, body..\n</head>\n^ (line 5)")
end

it 'fails for missing closing quotation mark in href' do
html = "#{FIXTURES_DIR}/html/missing_closing_quotes.html"
proofer = run_proofer(html, :file, { :check_html => true })
expect(proofer.failed_tests.to_s).to match(/Couldn't find end of Start Tag a \(line 6\)/)
expect(proofer.failed_tests.to_s).to include("@6:1: Tokenizer error with an unimplemented error message.\\n\\n^ (line 6)\"]")
end

it "doesn't fail for single ampersand" do
Expand All @@ -61,10 +61,10 @@
expect(proofer.failed_tests).to eq []
end

it "fails for single ampersand when asked" do
it "allows single ampersand" do
html = "#{FIXTURES_DIR}/html/single_amp.html"
proofer = run_proofer(html, :file, { :check_html => true, :validation => { :report_missing_names => true } })
expect(proofer.failed_tests.first).to match('htmlParseEntityRef: no name')
expect(proofer.failed_tests).to eq []
end

it 'ignores embeded scripts when asked' do
Expand All @@ -78,7 +78,7 @@
opts = { :check_html => true, :validation => { :report_script_embeds => true } }
ignorableScript = "#{FIXTURES_DIR}/html/ignore_script_embeds.html"
proofer = run_proofer(ignorableScript, :file, opts)
expect(proofer.failed_tests.length).to eq 2
expect(proofer.failed_tests).to eq []
end

it 'does not fail for weird iframe sources' do
Expand Down
14 changes: 7 additions & 7 deletions spec/html-proofer/links_spec.rb
Expand Up @@ -64,10 +64,10 @@
expect(proofer.failed_tests.first).to match(/internally linking to .\/notreal.html, which does not exist/)
end

it 'fails for link with no href' do
it 'allows link with no href' do
missingLinkHrefFilepath = "#{FIXTURES_DIR}/links/missingLinkHref.html"
proofer = run_proofer(missingLinkHrefFilepath, :file)
expect(proofer.failed_tests.first).to match(/anchor has no href attribute/)
expect(proofer.failed_tests).to eq []
end

it 'should follow redirects' do
Expand Down Expand Up @@ -208,16 +208,16 @@
expect(proofer.failed_tests).to eq []
end

it 'fails for empty href within link elements' do
it 'allows empty href on link elements' do
head_link = "#{FIXTURES_DIR}/links/head_link_href_empty.html"
proofer = run_proofer(head_link, :file)
expect(proofer.failed_tests.first).to match(/anchor has no href attribute/)
expect(proofer.failed_tests).to eq []
end

it 'fails for absent href within link elements' do
it 'allows missing href on link elements' do
head_link = "#{FIXTURES_DIR}/links/head_link_href_absent.html"
proofer = run_proofer(head_link, :file)
expect(proofer.failed_tests.first).to match(/anchor has no href attribute/)
expect(proofer.failed_tests).to eq []
end

it 'fails for internal linking to a directory without trailing slash' do
Expand Down Expand Up @@ -465,7 +465,7 @@

missing_href = "#{FIXTURES_DIR}/links/blank_href_htmlunknown.html"
proofer = run_proofer(missing_href, :file)
expect(proofer.failed_tests.length).to eq 1
expect(proofer.failed_tests).to eq []
end

it 'works with internal_domains' do
Expand Down
3 changes: 2 additions & 1 deletion spec/html-proofer/scripts_spec.rb
Expand Up @@ -16,7 +16,8 @@
it 'fails for missing internal src' do
file = "#{FIXTURES_DIR}/scripts/script_missing_internal.html"
proofer = run_proofer(file, :file)
expect(proofer.failed_tests.first).to match(/doesnotexist.js does not exist \(line 5\)/)
expect(proofer.failed_tests.length).to eq 1
expect(proofer.failed_tests.first).to include('spec/html-proofer/fixtures/scripts/script_missing_internal.html: internal script doesnotexist.js does not exist (line 5)')
end

it 'works for present content' do
Expand Down

0 comments on commit 0a55dd4

Please sign in to comment.