Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The regex in Attribute::URL.clean_url is very slow for JavaScript bookmarklet URLs #816

Open
alexwlchan opened this issue Apr 7, 2024 · 1 comment

Comments

@alexwlchan
Copy link

alexwlchan commented Apr 7, 2024

I recently upgraded from HTML-Proofer 3 to 5, and noticed that it was taking an exorbitant length of time to run on a site that had previously been fine.

I eventually traced the issue to a page with a bookmarklet on it; a long javascript:… URL with various URL-encoded entities – trying to check that URL was taking a very long time (multiple minutes).

Steps to reproduce

Here's a short program that tries to lint an HTML file with the problematic link:

require 'html-proofer'

File.open('bookmarklet.html', 'w') do |f|
  f.puts '<a href="javascript:var%20ids%20=%20[%22omnibox%22,%20%22cards%22,%20%22welcome%22];var%20classes%20=%20[%22widget-viewcard%22,%20%22widget-zoom%22,%20%22watermark%22];var%20hidden%20=%20(window.getComputedStyle(document.getElementById(ids[0]))).getPropertyValue(%22display%22);if%20(hidden%20!==%20%22none%22)%20{var%20disp%20=%20%22none%22;}%20else%20{var%20disp%20=%20%22%22;}for%20(var%20i%20=%200;%20i%20&lt;%20ids.length;%20i++)%20{document.getElementById(ids[i]).style.display%20=%20disp;}for%20(var%20i%20=%200;%20i%20&lt;%20classes.length;%20i++)%20{var%20div%20=%20document.getElementsByClassName(classes[i]);for%20(var%20j%20=%200;%20j%20&lt;%20div.length;%20j++)%20{div[j].style.display%20=%20disp;}}">Toggle Google Maps</a>'
end

HTMLProofer.check_file(
  'bookmarklet.html', {
    checks: ['Links'],
    log_level: :debug
  }
).run

The output I see is:

$ ruby repro.rb
Running 1 check (Links) in bookmarklet.html on *.html files ...


Running Links in bookmarklet.html

but it never actually completes.

I ran this with the latest version of HTML-Proofer and Ruby:

$ bundle exec htmlproofer --version
5.0.8

$ ruby --version
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin23]

The output I see is:

Debugging analysis

I added a bunch of print statements to determine where it was going and where it was getting stuck. This stuff may be obvious to people who know the library, but this is a new codebase to me:

Here's where that log message comes from:

@logger.log(:debug, "Running #{check.short_name} in #{path}")
@current_check = check
check.run

In this case check is a link check. Follow that down a level, I found it stalled on this line:

@link = create_element(node)

Here node is the <a> tag, I think?

If you follow the chain of function calls, create_element(node) calls Element.new(@runner, node, base_url: base_url), and following that down I found the issue is somewhere in Attribute::Url – it gets stuck on this line:

@url = Attribute::Url.new(runner, link_attribute, base_url: base_url, source: @runner.current_source, filename: @runner.current_filename)

And if you go and look at how Attribute::Url works, you see that the initialiser calls clean_url!, and that turns out to be the root of the slowness:

# catch any obvious issues, like strings in port numbers
private def clean_url!
return if @url =~ /^([!#{Regexp.last_match(0)}-;=?-\[\]_a-z~]|%[0-9a-fA-F]{2})+$/
@url = Addressable::URI.parse(@url).normalize.to_s
end

We can verify this by running my URL against the regex as a standalone program, and observing that it takes a very long time:

url = "javascript:var%20ids%20=%20[%22omnibox%22,%20%22cards%22,%20%22welcome%22];var%20classes%20=%20[%22widget-viewcard%22,%20%22widget-zoom%22,%20%22watermark%22];var%20hidden%20=%20(window.getComputedStyle(document.getElementById(ids[0]))).getPropertyValue(%22display%22);if%20(hidden%20!==%20%22none%22)%20{var%20disp%20=%20%22none%22;}%20else%20{var%20disp%20=%20%22%22;}for%20(var%20i%20=%200;%20i%20&lt;%20ids.length;%20i++)%20{document.getElementById(ids[i]).style.display%20=%20disp;}for%20(var%20i%20=%200;%20i%20&lt;%20classes.length;%20i++)%20{var%20div%20=%20document.getElementsByClassName(classes[i]);for%20(var%20j%20=%200;%20j%20&lt;%20div.length;%20j++)%20{div[j].style.display%20=%20disp;}}"

puts url =~ /^([!#{Regexp.last_match(0)}-;=?-\[\]_a-z~]|%[0-9a-fA-F]{2})+$/

This immediately starts consuming 100% of my Mac's CPU, and has been since I started writing this issue. It has yet to complete.

Suggested fix

This regex first appeared in #394. I'm not sure what it's meant to be doing.

I'm also not sure what clean_url! is for; I wonder if it should be doing anything with javascript: URLs? Is there anything HTML-Proofer can do with them downstream? If not, maybe it should just skip any URL starting javascript: here?

Notes

This problem occurs even if I add the data-proofer-ignore attribute to my <a> tag – that was mildly surprising to me.

@gjtorikian
Copy link
Owner

Hi, my apologies for not seeing this sooner. That regexp was, in fact, doing nothing, and was probably vestigial. Removing it in #820 still lets the test suite pass, so I'll release this fix in the next patch release this week.

This problem occurs even if I add the data-proofer-ignore attribute to my <a> tag – that was mildly surprising to me.

Hm, I tried to recreate this and couldn't. If you have another test case after the fix above is merged, I'll take a look again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants