Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nokogiri HTML refactor #6

Merged
merged 13 commits into from Sep 28, 2016
Merged

Nokogiri HTML refactor #6

merged 13 commits into from Sep 28, 2016

Conversation

justinthec
Copy link

@justinthec justinthec commented Jun 8, 2016

Why?

The current use of the StartTagHelper module in DeprecatedClasses for analyzing HTML start tags is not robust (as demonstrated by #5), difficult to maintain due to excessive use of regular expressions, and limited in functionality (no multiline start tags, catches tags within quotes, etc).

Refactoring ERBLint to use a proper parser, supported by over 100 contributors will solve all the problems of the current solution and make it easier for other developers to contribute linters to the gem.

What?

This PR changes the interface that linters use to analyze the file from file_content/lines to a file_tree.

This file_tree is a Nokogiri::HTML::Fragment which can be traversed and examined just like any other Nokogiri::HTML::Node.

3 small modifications are being made during the transformation from the bare text file_content to the file_tree HTML fragment.

  • All ERB tags (<%..%>, <%=..%>, <%..-%>, <%#..%>, etc) have been replaced by <erb>...</erb> removed from the file content and replaced with corresponding whitespace and newlines.
    • Example: <% apples \n %> --> __________\n___
  • All ERB tag literals (<%%, %%>) have been escaped via htmlentities entity encoding.
    • Example: <%% --> &lt;%%
  • An erb_lint_end_marker tag has been appended to the end of the document fragment for calculating end-of-file line numbers. This is a workaround for this Nokogiri line number bug.
  • All content within ERB tags has had all of its contained entities encoded (special character escaping using htmlentities)
  • All ERB tags (now <erb>...</erb>) within strings ("..." or '...') have been replaced by _erb_..._/erb_ to prevent them from being parsed as tags by Nokogiri::HTML.

How?

  • The StartTagHelper has been removed.
  • ERBLint::Parser has been added, exposing a parse method to generate the valid Nokogiri::HTML::Fragment from a file_content string.
  • The ERBLint::Runner parses the file using the parse method before passing the resulting file_tree to each linter's lint_file method.
  • ERBLint::Linter#lint_lines has been removed.

Todo

  • Finish adding test cases to parser_spec.rb

For review:
@edward @volmer @lemonmade

protected

def lint_lines(lines)
def lint_file(file_tree)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there no access to the source of the file? You're jumping through a lot of hoops to make this work with the parsed tree.

@lemonmade
Copy link
Member

Can't say I am in love with the hackery, but I understand that this is probably better in the long run. Good job getting this to work 👍

end
html_elements = Parser.filter_erb_nodes(file_tree)
html_elements.each do |html_element|
html_element.attribute_nodes.select { |attribute| attribute.name.casecmp('class') == 0 }.each do |class_attr|
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines are a little hard to read.

Could you please break this out into some well-named variables? Something like:

html_element_class_attributes = html_element.attribute_nodes.select { |attribute| attribute.name.casecmp('class') == 0 }

html_element_class_attributes.each do |class_attr|

@edward
Copy link

edward commented Jun 14, 2016

Looks like progress!

Give me a shout tomorrow if you’d like help with the Rubocop warnings.

@jeremyhansonfinger
Copy link

jeremyhansonfinger commented Jun 16, 2016

@justinthec I think you already have a note on this, but I just rediscovered that single quotes in text strings cause parser to throw errors when called by content-style-checker (in my branch based off of your branch). Perhaps something that can be addressed as part of any rewrites?

     Failure/Error: raise ParsingError, 'Unclosed string found.'

     ERBLint::Parser::ParsingError:
       Unclosed string found.
     # ./lib/erb_lint/parser.rb:45:in `escape_erb_tags_in_strings'
     # ./lib/erb_lint/parser.rb:21:in `parse'

@justinthec
Copy link
Author

Yup, I'll fix the string validation tonight, thanks for making note here 👍

@jhansonf
Copy link

Thanks again!

line: lines.length,
message: 'Remove the trailing newline at the end of the file.'
)
def lint_file(file_tree)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lemonmade this comment got unintentionally marked as resolved by Github.

Is there no access to the source of the file? You're jumping through a lot of hoops to make this work with the parsed tree.

The idea behind using a parse tree always is that it provides a consistent interface for all linters to use and that most future linters (data-attributes, product content style, etc) will be complex enough to leverage the benefits a tree brings.

The reason that I'm jumping through the hoops here in FinalNewline is only because of the Nokogiri line number bug. Once that is fixed, the hacky end_marker will be completely removed and this linter will be much simpler, just finding the last child and calling #line.

As a compromise would you suggest a wrapper class around both the Parse tree and a copy of the original source?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explanation makes sense to me, I think what you have is good 👍

@justinthec
Copy link
Author

The PR description has been updated to reflect the latest version of this refactor. All previous feedback has been addressed except for one line comment by Chris which I've re-added manually.

Ready for 👀 round 2; thanks again in advance :)

@jeremyhansonfinger
Copy link

@justinthec This looks great. I rebased content-linter on your latest commit and adjusted accordingly and it's resolved the issue with dumb apostrophes.

@jeremyhansonfinger
Copy link

jeremyhansonfinger commented Jun 28, 2016

Hey @justinthec I'm trying to do something that calls this branch of erb-lint and seems like <br> tags throw parser off, since it's looking for a closing tag. Ideally I'd like parser not to escape <br> tags by default since I want to be able to work with the knowledge that there's a line break within text content, but let me know if that seems like a bad idea, or not feasible.

Anyway, if I run erb-lint on

<p>Check out this sweet support page.<br>\nIt&#39;s really a program.</p>\n

It throws an error:

image

@justinthec
Copy link
Author

justinthec commented Jul 21, 2016

I've gone and done a self review of the code after not looking at most of it for 28 days and I'm impressed that I still understand it and can walk through it clearly (especially parser.rb).

The Nokogiri Line number bug that I was originally waiting for has not been addressed for exactly a month now with no end in sight so I'm going to go ahead and ship this.

I'll still keep my eye out on that bug to see when I can remove my workaround but it is pretty small and well documented in the code.

@edward
Copy link

edward commented Jul 22, 2016

👍

@justinthec justinthec changed the title Nokogiri XML refactor Nokogiri XML/HTML refactor Jul 23, 2016
@justinthec justinthec changed the title Nokogiri XML/HTML refactor Nokogiri HTML refactor Jul 23, 2016
@jeremyhansonfinger jeremyhansonfinger merged commit 4cf4cb7 into master Sep 28, 2016
@justinthec
Copy link
Author

justinthec commented Sep 28, 2016

🎉 thanks for getting this through the last mile Jeremy!

@jeremyhansonfinger
Copy link

No problem man, thanks for all your work!

@volmer
Copy link

volmer commented Oct 5, 2016

Tried to use the gem at this revision, and now it explodes with invalid markup:

ERBLint::Parser::ParseError
File could not be successfully parsed. Ensure all tags are properly closed

Not sure if it's the expected behaviour after this PR, but I think it should have been better if we got a linter message instead of an exception.

@jeremyhansonfinger
Copy link

Thanks @volmer, I'll look into this ASAP.

@jeremyhansonfinger
Copy link

jeremyhansonfinger commented Oct 5, 2016

Trying to figure out how to best provide a linter error message. This seems pretty hacky but it’s the first thing I’ve been able to puzzle out:

Rather than raising a ParserError of class StandardError while parsing here, instead it could write to a variable that’s accessible to Runner.run.

Then I could stick an if statement between these two lines telling Runner.run to use a specific line/message combination for errors if that variable isn’t nil?

I'm sure there's a better way to do this, @justinthec . . .

@justinthec
Copy link
Author

justinthec commented Oct 6, 2016

I think throwing a ParseError is fine as long as we rescue it in a try/catch block.

So https://github.com/Shopify/erb-lint/blob/master/lib/erb_lint/runner.rb#L18 would look like:

    def run(filename, file_content)
      file_tree = begin
        Parser.parse(file_content)
      rescue Parser::ParseError
        nil
      end
      return unless file_tree

      linters_for_file = @linters.select { |linter| !linter_excludes_file?(linter, filename) }
      linters_for_file.map do |linter|
        {
          linter_name: linter.class.simple_name,
          errors: linter.lint_file(file_tree)
        }
      end
    end

This would fail silently; if we instead wanted to generate a violation for Policial.io then we could modify it to return an error from a pseudo HTMLValidity linter like:

return {
  linter_name: 'HTMLValidity`,
  errors: ['File is not HTML valid and could not be successfully parsed. Ensure all tags are properly closed']
}

@jeremyhansonfinger
Copy link

@volmer @justinthec Am testing in Policial.io. Ran into an error, am troubleshooting:

undefined method `map' for nil:NilClass Did you mean? tap

@justinthec
Copy link
Author

linters_for_file might be nil since the file couldn't be parsed?

@jeremyhansonfinger
Copy link

@justinthec Yes, that's the case. I need to figure out a way to pass the error directly to Policial as a linter message without erb-lint trying to actually lint the file.

@justinthec
Copy link
Author

You can use the snippet I provided. Return the HTMLValidity error in the rescue block.

@jeremyhansonfinger
Copy link

jeremyhansonfinger commented Oct 6, 2016

@justinthec Ah, got it! I was just missing some brackets and :line and :message.

        return [
          { linter_name: 'HTMLValidity', errors: [
            { line: 1,
              message: 'File is not HTML valid and could not be successfully parsed.
              Ensure all tags are properly closed.' }
          ] }
        ]

@jeremyhansonfinger jeremyhansonfinger mentioned this pull request Oct 6, 2016
1 task
@jeremyhansonfinger jeremyhansonfinger deleted the nokogiri-xml-refactor branch October 17, 2016 21:55
jeremyhansonfinger added a commit that referenced this pull request Oct 25, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants