Nokogiri HTML refactor #6

justinthec · 2016-06-08T05:25:27Z

Why?

The current use of the StartTagHelper module in DeprecatedClasses for analyzing HTML start tags is not robust (as demonstrated by #5), difficult to maintain due to excessive use of regular expressions, and limited in functionality (no multiline start tags, catches tags within quotes, etc).

Refactoring ERBLint to use a proper parser, one supported by over 100 contributors, will solve the problems listed above and make it easier for other developers to contribute linters to the gem.

What?

This PR changes the interface that linters use to analyze the file from file_content/lines to a file_tree.

This file_tree is a Nokogiri::HTML::DocumentFragment, which can be traversed and examined just like any other Nokogiri::XML::Node.

3 small modifications are being made during the transformation from the bare text file_content to the file_tree HTML fragment.

All ERB tags (<%..%>, <%=..%>, <%..-%>, <%#..%>, etc) have been ~~replaced by <erb>...</erb>~~ removed from the file content and replaced with corresponding whitespace and newlines.
- Example: <% apples \n %> --> __________\n___
All ERB tag literals (<%%, %%>) have been escaped via htmlentities entity encoding.
- Example: <%% --> &lt;%%
An erb_lint_end_marker tag has been appended to the end of the document fragment for calculating end-of-file line numbers. This is a workaround for this Nokogiri line number bug.
~~All content within ERB tags has had all of its contained entities encoded (special character escaping using htmlentities)~~
~~All ERB tags (now <erb>...</erb>) within strings ("..." or '...') have been replaced by _erb_..._/erb_ to prevent them from being parsed as tags by Nokogiri::HTML.~~
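The first transformation (blanking ERB tags while keeping line numbers intact) can be sketched in plain Ruby. This is a minimal illustration, not the code in parser.rb: the `blank_erb_tags` name and the `ERB_TAG` regular expression are assumptions, and the sketch does not handle an ERB tag whose Ruby code itself contains `%>`.

```ruby
# Hypothetical sketch: blank out ERB tags so that every remaining character
# keeps its original line and column. Not the actual ERBLint::Parser code.

# Matches <% ... %>, <%= ... %>, <%# ... %> and <% ... -%>, across newlines.
# The (?!%) lookahead skips the <%% literal, which is handled separately.
ERB_TAG = /<%(?!%).*?%>/m

def blank_erb_tags(source)
  source.gsub(ERB_TAG) do |tag|
    # Keep newlines so line numbers stay stable; blank everything else.
    tag.gsub(/[^\n]/, ' ')
  end
end

blanked = blank_erb_tags("<p>\n<% apples \n %>\n</p>")
puts blanked.lines.length
```

Because only non-newline characters are replaced, the blanked string has the same number of lines as the input, which is what lets line numbers reported against the tree map back to the original file.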

How?

The StartTagHelper has been removed.
ERBLint::Parser has been added, exposing a parse method that generates a valid Nokogiri::HTML::DocumentFragment from a file_content string.
The ERBLint::Runner parses the file using the parse method before passing the resulting file_tree to each linter's lint_file method.
ERBLint::Linter#lint_lines has been removed.

Todo

Finish adding test cases to parser_spec.rb

For review:
@edward @volmer @lemonmade

lemonmade · 2016-06-13T15:42:17Z

lib/erb_lint/linters/final_newline.rb

-      protected
-
-      def lint_lines(lines)
+      def lint_file(file_tree)


Is there no access to the source of the file? You're jumping through a lot of hoops to make this work with the parsed tree.

lemonmade · 2016-06-13T15:50:48Z

Can't say I am in love with the hackery, but I understand that this is probably better in the long run. Good job getting this to work 👍

edward · 2016-06-14T02:14:42Z

lib/erb_lint/linters/deprecated_classes.rb

-              end
+        html_elements = Parser.filter_erb_nodes(file_tree)
+        html_elements.each do |html_element|
+          html_element.attribute_nodes.select { |attribute| attribute.name.casecmp('class') == 0 }.each do |class_attr|


These lines are a little hard to read.

Could you please break this out into some well-named variables? Something like:

    html_element_class_attributes = html_element.attribute_nodes.select { |attribute| attribute.name.casecmp('class') == 0 }
    html_element_class_attributes.each do |class_attr|
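For illustration, the suggested shape can be run against stand-in objects. `Attribute` and `Element` below are hypothetical Structs that only mimic the `name`, `value`, and `attribute_nodes` methods used in the diff, and `class_attributes` is an assumed helper name. Real code would receive Nokogiri nodes instead.

```ruby
# Stand-ins for parsed nodes (assumption: only the methods used below matter).
Attribute = Struct.new(:name, :value)
Element = Struct.new(:attribute_nodes)

# Hypothetical helper extracted from the long chained expression.
def class_attributes(html_element)
  html_element.attribute_nodes.select do |attribute|
    # Case-insensitive match, as in the original diff.
    attribute.name.casecmp('class').zero?
  end
end

html_elements = [
  Element.new([Attribute.new('CLASS', 'btn old-btn'), Attribute.new('id', 'save')]),
  Element.new([Attribute.new('href', '/')])
]

html_elements.each do |html_element|
  class_attributes(html_element).each do |class_attr|
    puts class_attr.value
  end
end
```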

edward · 2016-06-14T02:38:51Z

Looks like progress!

Give me a shout tomorrow if you’d like help with the Rubocop warnings.

jeremyhansonfinger · 2016-06-16T20:30:55Z

@justinthec I think you already have a note on this, but I just rediscovered that single quotes in text strings cause parser to throw errors when called by content-style-checker (in my branch based off of your branch). Perhaps something that can be addressed as part of any rewrites?

     Failure/Error: raise ParsingError, 'Unclosed string found.'

     ERBLint::Parser::ParsingError:
       Unclosed string found.
     # ./lib/erb_lint/parser.rb:45:in `escape_erb_tags_in_strings'
     # ./lib/erb_lint/parser.rb:21:in `parse'

justinthec · 2016-06-16T22:03:40Z

Yup, I'll fix the string validation tonight, thanks for making note here 👍

jhansonf · 2016-06-16T22:21:40Z

Thanks again!

justinthec · 2016-06-23T05:37:05Z

lib/erb_lint/linters/final_newline.rb

-            line: lines.length,
-            message: 'Remove the trailing newline at the end of the file.'
-          )
+      def lint_file(file_tree)


@lemonmade this comment got unintentionally marked as resolved by Github.

> Is there no access to the source of the file? You're jumping through a lot of hoops to make this work with the parsed tree.

The idea behind always using a parse tree is that it provides a consistent interface for all linters, and that most future linters (data-attributes, product content style, etc.) will be complex enough to benefit from a tree.

The reason that I'm jumping through the hoops here in FinalNewline is only because of the Nokogiri line number bug. Once that is fixed, the hacky end_marker will be completely removed and this linter will be much simpler, just finding the last child and calling #line.

As a compromise, would you suggest a wrapper class around both the parse tree and a copy of the original source?

Explanation makes sense to me, I think what you have is good 👍
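The wrapper idea floated above was not adopted in this PR, but a minimal sketch shows what it could look like. `ParsedFile`, `lint_final_newline`, and the message text are all hypothetical names chosen for illustration.

```ruby
# Hypothetical wrapper (not part of this PR): carries the raw source next to
# the parsed tree so simple linters need not go through the tree at all.
ParsedFile = Struct.new(:source, :tree)

# Sketch of a FinalNewline-style check that reads only the raw source.
# The message text is illustrative.
def lint_final_newline(parsed_file)
  source = parsed_file.source
  return [] if source.empty? || source.end_with?("\n")

  [{ line: source.lines.length, message: 'Add a newline at the end of the file.' }]
end

# The tree is left nil here; a real caller would pass the parser's result.
missing = ParsedFile.new("<p>one</p>\n<p>two</p>", nil)
p lint_final_newline(missing)
```

The trade-off is that linters would then have two views of the file to choose from, which weakens the "one consistent interface" argument made in the comment above.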

justinthec · 2016-06-23T05:56:26Z

The PR description has been updated to reflect the latest version of this refactor. All previous feedback has been addressed except for one line comment by Chris which I've re-added manually.

Ready for 👀 round 2; thanks again in advance :)

jeremyhansonfinger · 2016-06-23T13:27:55Z

@justinthec This looks great. I rebased content-linter on your latest commit and adjusted accordingly and it's resolved the issue with dumb apostrophes.

jeremyhansonfinger · 2016-06-28T20:15:56Z

Hey @justinthec I'm trying to do something that calls this branch of erb-lint and seems like <br> tags throw parser off, since it's looking for a closing tag. Ideally I'd like parser not to escape <br> tags by default since I want to be able to work with the knowledge that there's a line break within text content, but let me know if that seems like a bad idea, or not feasible.

Anyway, if I run erb-lint on

<p>Check out this sweet support page.<br>\nIt&#39;s really a program.</p>\n

It throws an error:

justinthec · 2016-07-21T04:18:05Z

I've gone and done a self review of the code after not looking at most of it for 28 days and I'm impressed that I still understand it and can walk through it clearly (especially parser.rb).

The Nokogiri Line number bug that I was originally waiting for has not been addressed for exactly a month now with no end in sight so I'm going to go ahead and ship this.

I'll still keep my eye out on that bug to see when I can remove my workaround but it is pretty small and well documented in the code.

edward · 2016-07-22T21:17:00Z

👍


justinthec · 2016-09-28T14:53:31Z

🎉 thanks for getting this through the last mile Jeremy!

jeremyhansonfinger · 2016-09-28T15:36:00Z

No problem man, thanks for all your work!

volmer · 2016-10-05T16:32:21Z

Tried to use the gem at this revision, and now it explodes with invalid markup:

ERBLint::Parser::ParseError
File could not be successfully parsed. Ensure all tags are properly closed

Not sure if it's the expected behaviour after this PR, but I think it would have been better if we got a linter message instead of an exception.

jeremyhansonfinger · 2016-10-05T17:30:18Z

Thanks @volmer, I'll look into this ASAP.

jeremyhansonfinger · 2016-10-05T19:11:53Z

Trying to figure out how to best provide a linter error message. This seems pretty hacky but it’s the first thing I’ve been able to puzzle out:

Rather than raising a ParseError of class StandardError while parsing here, instead it could write to a variable that’s accessible to Runner.run.

Then I could stick an if statement between these two lines telling Runner.run to use a specific line/message combination for errors if that variable isn’t nil?

I'm sure there's a better way to do this, @justinthec . . .

justinthec · 2016-10-06T03:09:01Z

I think raising a ParseError is fine as long as we rescue it in a begin/rescue block.

So https://github.com/Shopify/erb-lint/blob/master/lib/erb_lint/runner.rb#L18 would look like:

    def run(filename, file_content)
      file_tree = begin
        Parser.parse(file_content)
      rescue Parser::ParseError
        nil
      end
      return unless file_tree

      linters_for_file = @linters.select { |linter| !linter_excludes_file?(linter, filename) }
      linters_for_file.map do |linter|
        {
          linter_name: linter.class.simple_name,
          errors: linter.lint_file(file_tree)
        }
      end
    end

This would fail silently; if we instead wanted to generate a violation for Policial.io then we could modify it to return an error from a pseudo HTMLValidity linter like:

return {
  linter_name: 'HTMLValidity',
  errors: ['File is not HTML valid and could not be successfully parsed. Ensure all tags are properly closed']
}

jeremyhansonfinger · 2016-10-06T14:37:13Z

@volmer @justinthec Am testing in Policial.io. Ran into an error, am troubleshooting:

undefined method `map' for nil:NilClass
Did you mean? tap

justinthec · 2016-10-06T15:30:43Z

linters_for_file might be nil since the file couldn't be parsed?

jeremyhansonfinger · 2016-10-06T17:02:24Z

@justinthec Yes, that's the case. I need to figure out a way to pass the error directly to Policial as a linter message without erb-lint trying to actually lint the file.

justinthec · 2016-10-06T17:31:57Z

You can use the snippet I provided. Return the HTMLValidity error in the rescue block.

jeremyhansonfinger · 2016-10-06T18:22:11Z

@justinthec Ah, got it! I was just missing some brackets and :line and :message.

        return [
          { linter_name: 'HTMLValidity', errors: [
            { line: 1,
              message: 'File is not HTML valid and could not be successfully parsed. ' \
                       'Ensure all tags are properly closed.' }
          ] }
        ]
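Putting the thread together, the rescue path can return a result in the same shape as a normal lint run, so callers that call `map` on it keep working. The sketch below is an assumption-laden illustration rather than the shipped runner.rb: `StubParser` and `NoopLinter` are hypothetical stand-ins that let it run without Nokogiri, and `run` takes the linters as an argument instead of reading `@linters`.

```ruby
# Stand-in parser (assumption): raises ParseError on an unbalanced <p> count,
# purely so the rescue path can be exercised without Nokogiri.
module StubParser
  class ParseError < StandardError; end

  def self.parse(file_content)
    opened = file_content.scan(/<p\b/).length
    closed = file_content.scan(%r{</p>}).length
    raise ParseError, 'Unclosed tag' unless opened == closed

    file_content # a real parser would return a tree here
  end
end

# Hypothetical linter used only to show the normal path.
class NoopLinter
  def lint_file(_file_tree)
    []
  end
end

PARSE_FAILURE_MESSAGE = 'File is not HTML valid and could not be ' \
                        'successfully parsed. Ensure all tags are properly closed.'

def run(linters, file_content)
  file_tree = StubParser.parse(file_content)
  linters.map do |linter|
    { linter_name: linter.class.name, errors: linter.lint_file(file_tree) }
  end
rescue StubParser::ParseError
  # Same shape as a normal result, so callers can still call map/each on it.
  [{ linter_name: 'HTMLValidity',
     errors: [{ line: 1, message: PARSE_FAILURE_MESSAGE }] }]
end

p run([NoopLinter.new], '<p>ok</p>')
p run([NoopLinter.new], '<p>broken')
```

Returning an array in both branches is the point: the earlier `undefined method map for nil` error came from the parse-failure branch returning nil.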


lemonmade reviewed Jun 13, 2016

edward reviewed Jun 14, 2016

justinthec force-pushed the nokogiri-xml-refactor branch from 16db763 to 04bce1b Compare June 20, 2016 14:08

justinthec reviewed Jun 23, 2016

justinthec changed the title ~~Nokogiri XML refactor~~ Nokogiri XML/HTML refactor Jul 23, 2016

justinthec changed the title ~~Nokogiri XML/HTML refactor~~ Nokogiri HTML refactor Jul 23, 2016

justinthec added 6 commits August 2, 2016 18:00

Working, need to finish parser specs

f5af029

Fixed all Rubocop except for ABC and complexity

7f06f08

Added filter_erb_nodes parser helper

cfe57c1

Fix string escaping and add test suite for it

719c256

Change Parser to strip all ERB tags and finish tests + policial

f636bbb

Removed the need to turn off 2 cops

408d024

Switched from XML parser to HTML parser

a456aae

justinthec force-pushed the nokogiri-xml-refactor branch from b7dca9d to a456aae Compare August 2, 2016 22:00

justinthec and others added 6 commits August 2, 2016 22:07

Update runner.rb

fa889a7

Refactor to use pre and post processing on the tree to preserve ERB c…

ae7aab1

…ontent.

Fixed some tests

e298b5b

Fixed all tests

f4f9380

Disable ModuleLength, RedundantReturn, AbcSize cops

a2e0f03

Resolve Rubocop errors

bd1b9f6

jeremyhansonfinger merged commit 4cf4cb7 into master Sep 28, 2016

jeremyhansonfinger mentioned this pull request Oct 6, 2016

Add ContentStyle #8

Merged

1 task

justinthec mentioned this pull request Oct 6, 2016

DeprecatedClasses code review #3

Closed

jeremyhansonfinger deleted the nokogiri-xml-refactor branch October 17, 2016 21:55

jeremyhansonfinger added a commit that referenced this pull request Oct 25, 2017

Revert "Nokogiri HTML refactor (#6)"

0e6daa3

This reverts commit 4cf4cb7.

jeremyhansonfinger mentioned this pull request Oct 25, 2017

Revert Nokogiri refactor #11

Merged
