Bug in normalized_host in Addressable (ArgumentError: invalid byte sequence in UTF-8) #62

Closed
walro opened this issue Nov 4, 2015 · 13 comments · Fixed by #158

@walro
Contributor

walro commented Nov 4, 2015

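When parsing a URL whose host ends in the percent-encoded byte %C2, Addressable raises an ArgumentError: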

irb(main):021:0> Twingly::URL.parse("http://some_site.net%C2")
ArgumentError: invalid byte sequence in UTF-8
    from /Users/robin/.gem/ruby/2.2.3/gems/addressable-2.3.8/lib/addressable/uri.rb:1097:in `host='
    from /Users/robin/.gem/ruby/2.2.3/gems/addressable-2.3.8/lib/addressable/uri.rb:2050:in `display_uri'
    from /Users/robin/Workspace/twingly/twingly-url/lib/twingly/url.rb:30:in `internal_parse'
    from /Users/robin/Workspace/twingly/twingly-url/lib/twingly/url.rb:21:in `parse'
    from (irb):21
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/cli/console.rb:14:in `run'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/cli.rb:308:in `console'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/vendor/thor/lib/thor/invocation.rb:126:in `invoke_command'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/vendor/thor/lib/thor.rb:359:in `dispatch'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/vendor/thor/lib/thor/base.rb:440:in `start'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/cli.rb:10:in `start'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/bin/bundle:20:in `block in <top (required)>'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/lib/bundler/friendly_errors.rb:7:in `with_friendly_errors'
    from /Users/robin/.gem/ruby/2.2.3/gems/bundler-1.10.6/bin/bundle:18:in `<top (required)>'
    from /Users/robin/.gem/ruby/2.2.3/bin/bundle:23:in `load'
    from /Users/robin/.gem/ruby/2.2.3/bin/bundle:23:in `<main>'

Another one: http://+%D5d.some_site.net

@walro walro added the bug label Nov 4, 2015
@roback
Member

roback commented Jan 28, 2016

I just wanted to see what caused the error :)

Addressable first runs unencode_component on the host part. For some_site.net%C2 this results in the string some_site.net\xC2, which is not valid UTF-8, so Ruby cannot split it in Addressable::IDNA.to_ascii.

> Addressable::URI.unencode_component("some_site.net%C2")
=> "some_site.net\xC2"
> "some_site.net\xC2".split("")
ArgumentError: invalid byte sequence in UTF-8
from (pry):35:in `split'
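
To illustrate why the decoded host is rejected: 0xC2 is the lead byte of a two-byte UTF-8 sequence, so it is only valid when a continuation byte follows it. The sketch below uses plain Ruby and no Addressable code; the helper name valid_utf8? is made up for this example.

```ruby
# Hypothetical helper: build a string from raw bytes, tag it as UTF-8,
# and ask Ruby whether the byte sequence is well-formed.
def valid_utf8?(bytes)
  bytes.pack("C*").force_encoding(Encoding::UTF_8).valid_encoding?
end

# A lone lead byte is not a complete character.
puts valid_utf8?([0xC2])                          # false

# The same lead byte followed by a continuation byte is fine
# (C2 A9 encodes U+00A9).
puts valid_utf8?([0xC2, 0xA9])                    # true

# The host from this issue after percent-decoding.
puts valid_utf8?("some_site.net".bytes + [0xC2])  # false
```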

@dentarg
Collaborator

dentarg commented Jan 28, 2016

@roback what about when not using the idn gem and libidn, do we still have this error?

@roback
Member

roback commented Jan 28, 2016

Removed the idn-ruby gem and I still get the same error, but this time from gsub:

> Twingly::URL.parse("http://some_site.net%C2")
ArgumentError: invalid byte sequence in UTF-8
    from /Users/mattias/.gem/ruby/2.2.3/gems/addressable-2.4.0/lib/addressable/uri.rb:432:in `gsub'
    from /Users/mattias/.gem/ruby/2.2.3/gems/addressable-2.4.0/lib/addressable/uri.rb:432:in `unencode'
    from /Users/mattias/.gem/ruby/2.2.3/gems/addressable-2.4.0/lib/addressable/uri.rb:530:in `normalize_component'
    from /Users/mattias/.gem/ruby/2.2.3/gems/addressable-2.4.0/lib/addressable/uri.rb:1079:in `normalized_host'
    from /Users/mattias/.gem/ruby/2.2.3/gems/addressable-2.4.0/lib/addressable/uri.rb:1177:in `normalized_authority'
    from /Users/mattias/.gem/ruby/2.2.3/gems/addressable-2.4.0/lib/addressable/uri.rb:2078:in `normalize'
    from /Users/mattias/.gem/ruby/2.2.3/gems/addressable-2.4.0/lib/addressable/uri.rb:2103:in `display_uri'
    from /Users/mattias/repos/twingly-url/lib/twingly/url.rb:38:in `internal_parse'
    from /Users/mattias/repos/twingly-url/lib/twingly/url.rb:26:in `parse'
    from (irb):1

The same as above, but with the idn-ruby gem:

> Twingly::URL.parse("http://some_site.net%C2")
ArgumentError: invalid byte sequence in UTF-8
    from /Users/mattias/.gem/ruby/2.2.3/gems/addressable-2.4.0/lib/addressable/idna/native.rb:36:in `split'
    from /Users/mattias/.gem/ruby/2.2.3/gems/addressable-2.4.0/lib/addressable/idna/native.rb:36:in `to_ascii'
    from /Users/mattias/.gem/ruby/2.2.3/gems/addressable-2.4.0/lib/addressable/uri.rb:1072:in `normalized_host'
    from /Users/mattias/.gem/ruby/2.2.3/gems/addressable-2.4.0/lib/addressable/uri.rb:1177:in `normalized_authority'
    from /Users/mattias/.gem/ruby/2.2.3/gems/addressable-2.4.0/lib/addressable/uri.rb:2078:in `normalize'
    from /Users/mattias/.gem/ruby/2.2.3/gems/addressable-2.4.0/lib/addressable/uri.rb:2103:in `display_uri'
    from /Users/mattias/repos/twingly-url/lib/twingly/url.rb:38:in `internal_parse'
    from /Users/mattias/repos/twingly-url/lib/twingly/url.rb:26:in `parse'
    from (irb):1

@roback
Member

roback commented Jan 29, 2016

Perhaps we can use a combination of Addressable::URI.unencode_component and String#valid_encoding? before giving the URL to Addressable::URI.heuristic_parse.

url = "http://some_site.net%C2"
url.valid_encoding?
# => true
unencoded_url = Addressable::URI.unencode_component(url)
# => "http://some_site.net\xC2"
unencoded_url.valid_encoding?
# => false
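
A minimal sketch of the same pre-check without Addressable, assuming a simple percent-decoder is good enough for the test. The names percent_decode and decodes_to_valid_utf8? are hypothetical and not part of twingly-url or Addressable.

```ruby
# Hypothetical stand-in for Addressable::URI.unencode_component:
# replace every %XX with the raw byte it names, then tag the result as UTF-8.
def percent_decode(string)
  decoded = string.b.gsub(/%(\h\h)/) { [Regexp.last_match(1)].pack("H2") }
  decoded.force_encoding(Encoding::UTF_8)
end

# True only if the URL is valid UTF-8 both before and after decoding.
def decodes_to_valid_utf8?(url)
  url.valid_encoding? && percent_decode(url).valid_encoding?
end

puts decodes_to_valid_utf8?("http://some_site.net%C2")     # false
puts decodes_to_valid_utf8?("http://+%D5d.some_site.net")  # false
puts decodes_to_valid_utf8?("http://example.com/%C3%A5")   # true
```

Note that this checks the whole URL, so an invalid sequence in the path is rejected just like one in the host.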

@roback
Member

roback commented Jan 29, 2016

Perhaps we can use a combination of Addressable::URI.unencode_component and String#valid_encoding? before giving the URL to Addressable::URI.heuristic_parse.

It wasn't that simple :(

Failures:

  1) Twingly::URL.parse when given badly encoded input will replace badly encoded characters with unicode replacement character (U+FFFD)
     Failure/Error: let(:actual)            { described_class.parse(badly_encoded_url) }
     ArgumentError:
       invalid byte sequence in UTF-8
     # /Users/mattias/.gem/ruby/2.2.3/gems/addressable-2.4.0/lib/addressable/uri.rb:432:in `gsub'
     # /Users/mattias/.gem/ruby/2.2.3/gems/addressable-2.4.0/lib/addressable/uri.rb:432:in `unencode'
     # ./lib/twingly/url.rb:32:in `internal_parse'
     # ./lib/twingly/url.rb:26:in `parse'
     # ./spec/lib/twingly/url_spec.rb:78:in `block (4 levels) in <top (required)>'
     # ./spec/lib/twingly/url_spec.rb:81:in `block (4 levels) in <top (required)>'

  2) Twingly::URL#normalized handles URL with ] in it should eq "http://www.iwaseki.co.jp/cgi/yybbs/yybbs.cgi/%DEuropean]buy"
     Failure/Error: it { is_expected.to eq(url) }

       expected: "http://www.iwaseki.co.jp/cgi/yybbs/yybbs.cgi/%DEuropean]buy"
            got: ""

       (compared using ==)
     # ./spec/lib/twingly/url_spec.rb:342:in `block (4 levels) in <top (required)>'

Finished in 0.15815 seconds (files took 0.12399 seconds to load)
128 examples, 2 failures
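
For reference, the first failing spec expects invalid bytes to become U+FFFD. On an already-decoded string, Ruby's String#scrub does that replacement. This is only a sketch of the replacement step; the helper name replace_invalid_bytes is hypothetical and it does not touch the Addressable code path that raises above.

```ruby
# Hypothetical helper: treat the input as UTF-8 and replace every invalid
# byte sequence with the Unicode replacement character (U+FFFD).
def replace_invalid_bytes(string)
  string.dup.force_encoding(Encoding::UTF_8).scrub("\uFFFD")
end

cleaned = replace_invalid_bytes("some_site.net\xC2")
puts cleaned.valid_encoding? # true
```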

@roback
Member

roback commented Jan 29, 2016

Opened sporkmonger/addressable#224

@roback
Member

roback commented Jan 29, 2016

Not a fan of this solution, but it works. I cannot come up with a better one 😢

diff --git a/lib/twingly/url.rb b/lib/twingly/url.rb
index 97635e7..50ade75 100644
--- a/lib/twingly/url.rb
+++ b/lib/twingly/url.rb
@@ -35,6 +35,8 @@ module Twingly
         scheme = addressable_uri.scheme
         raise Twingly::URL::Error::ParseError unless scheme =~ ACCEPTED_SCHEMES

+        guard_against_addressable_bug(addressable_uri)
+
         public_suffix_domain = PublicSuffix.parse(addressable_uri.display_uri.host)
         raise Twingly::URL::Error::ParseError if public_suffix_domain.nil?

@@ -56,6 +58,18 @@ module Twingly
         end
       end

+      # Workaround for the following bug in addressable:
+      # https://github.com/sporkmonger/addressable/issues/224
+      def guard_against_addressable_bug(addressable_uri)
+        addressable_uri.display_uri
+      rescue ArgumentError => error
+        if error.message.include?("invalid byte sequence in UTF-8")
+          raise Twingly::URL::Error::ParseError
+        end
+
+        raise
+      end
+
       private :new, :internal_parse, :to_addressable_uri
     end

diff --git a/spec/lib/twingly/url_spec.rb b/spec/lib/twingly/url_spec.rb
index 729f847..ded80b2 100644
--- a/spec/lib/twingly/url_spec.rb
+++ b/spec/lib/twingly/url_spec.rb
@@ -27,6 +27,7 @@ def invalid_urls
     "http://xn--t...-/",
     "http://xn--...-",
     "leather beltsbelts for menleather beltmens beltsleather belts for menmens beltbelt bucklesblack l...",
+    "some_site.net%C2",
   ]
 end
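
The pattern in the diff above, reduced to a standalone sketch: rescue ArgumentError, translate it only when the message matches, and re-raise anything else untouched. ParseError here is a hypothetical stand-in for Twingly::URL::Error::ParseError, and translate_invalid_utf8_error is a made-up name.

```ruby
# Hypothetical stand-in for Twingly::URL::Error::ParseError.
class ParseError < StandardError; end

INVALID_UTF8_MESSAGE = "invalid byte sequence in UTF-8"

# Runs the block and converts the specific encoding error into ParseError.
# Any other ArgumentError is re-raised as-is.
def translate_invalid_utf8_error
  yield
rescue ArgumentError => error
  raise ParseError, error.message if error.message.include?(INVALID_UTF8_MESSAGE)

  raise
end

begin
  translate_invalid_utf8_error { raise ArgumentError, "invalid byte sequence in UTF-8" }
rescue ParseError => error
  puts "translated: #{error.message}"
end
```

Matching on the message text is brittle, but ArgumentError is raised for many unrelated reasons, so rescuing it without the check would hide other bugs.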

@dentarg
Collaborator

dentarg commented Sep 12, 2016

Just to make this issue more clear, the bug is in #normalized_host in Addressable:

[32] pry(main)> Addressable::VERSION::STRING
=> "2.4.0"
[33] pry(main)> Addressable::URI.heuristic_parse("http://some_site.net%C2").normalized_host
ArgumentError: invalid byte sequence in UTF-8
from /Users/dentarg/.gem/ruby/2.2.5/gems/addressable-2.4.0/lib/addressable/uri.rb:432:in `gsub'
[34] pry(main)> wtf
Exception: ArgumentError: invalid byte sequence in UTF-8
--
0: /Users/dentarg/.gem/ruby/2.2.5/gems/addressable-2.4.0/lib/addressable/uri.rb:432:in `gsub'
1: /Users/dentarg/.gem/ruby/2.2.5/gems/addressable-2.4.0/lib/addressable/uri.rb:432:in `unencode'
2: /Users/dentarg/.gem/ruby/2.2.5/gems/addressable-2.4.0/lib/addressable/uri.rb:530:in `normalize_component'
3: /Users/dentarg/.gem/ruby/2.2.5/gems/addressable-2.4.0/lib/addressable/uri.rb:1079:in `normalized_host'
4: (pry):22:in `__pry__'

@dentarg dentarg changed the title ArgumentError: invalid byte sequence in UTF-8 Bug in normalized_host in Addressable (ArgumentError: invalid byte sequence in UTF-8) Sep 12, 2016
@dentarg
Collaborator

dentarg commented Nov 5, 2016

No change for addressable 2.5.0:

$ pry
[1] pry(main)> require "addressable"
=> true
[2] pry(main)> Addressable::VERSION::STRING
=> "2.5.0"
[3] pry(main)> Addressable::URI.heuristic_parse("http://some_site.net%C2").normalized_host
ArgumentError: invalid byte sequence in UTF-8
from /Users/dentarg/.gem/ruby/2.2.5/gems/addressable-2.5.0/lib/addressable/idna/native.rb:36:in `split'
[4] pry(main)> wtf?
Exception: ArgumentError: invalid byte sequence in UTF-8
--
0: /Users/dentarg/.gem/ruby/2.2.5/gems/addressable-2.5.0/lib/addressable/idna/native.rb:36:in `split'
1: /Users/dentarg/.gem/ruby/2.2.5/gems/addressable-2.5.0/lib/addressable/idna/native.rb:36:in `to_ascii'
2: /Users/dentarg/.gem/ruby/2.2.5/gems/addressable-2.5.0/lib/addressable/uri.rb:1092:in `normalized_host'
3: (pry):3:in `__pry__'
4: /Users/dentarg/.gem/ruby/2.2.5/gems/pry-0.10.4/lib/pry/pry_instance.rb:355:in `eval'
5: /Users/dentarg/.gem/ruby/2.2.5/gems/pry-0.10.4/lib/pry/pry_instance.rb:355:in `evaluate_ruby'
6: /Users/dentarg/.gem/ruby/2.2.5/gems/pry-0.10.4/lib/pry/pry_instance.rb:323:in `handle_line'
7: /Users/dentarg/.gem/ruby/2.2.5/gems/pry-0.10.4/lib/pry/pry_instance.rb:243:in `block (2 levels) in eval'
8: /Users/dentarg/.gem/ruby/2.2.5/gems/pry-0.10.4/lib/pry/pry_instance.rb:242:in `catch'
9: /Users/dentarg/.gem/ruby/2.2.5/gems/pry-0.10.4/lib/pry/pry_instance.rb:242:in `block in eval'
$ pry
[1] pry(main)> require "addressable"
=> true
[2] pry(main)> Addressable::URI.heuristic_parse("http://some_site.net%C2").normalized_host
ArgumentError: invalid byte sequence in UTF-8
from /Users/dentarg/.gem/ruby/2.2.5/gems/addressable-2.5.0/lib/addressable/uri.rb:440:in `gsub'
[3] pry(main)> wtf?
Exception: ArgumentError: invalid byte sequence in UTF-8
--
0: /Users/dentarg/.gem/ruby/2.2.5/gems/addressable-2.5.0/lib/addressable/uri.rb:440:in `gsub'
1: /Users/dentarg/.gem/ruby/2.2.5/gems/addressable-2.5.0/lib/addressable/uri.rb:440:in `unencode'
2: /Users/dentarg/.gem/ruby/2.2.5/gems/addressable-2.5.0/lib/addressable/uri.rb:536:in `normalize_component'
3: /Users/dentarg/.gem/ruby/2.2.5/gems/addressable-2.5.0/lib/addressable/uri.rb:1099:in `normalized_host'
4: (pry):2:in `__pry__'
5: /Users/dentarg/.gem/ruby/2.2.5/gems/pry-0.10.4/lib/pry/pry_instance.rb:355:in `eval'
6: /Users/dentarg/.gem/ruby/2.2.5/gems/pry-0.10.4/lib/pry/pry_instance.rb:355:in `evaluate_ruby'
7: /Users/dentarg/.gem/ruby/2.2.5/gems/pry-0.10.4/lib/pry/pry_instance.rb:323:in `handle_line'
8: /Users/dentarg/.gem/ruby/2.2.5/gems/pry-0.10.4/lib/pry/pry_instance.rb:243:in `block (2 levels) in eval'
9: /Users/dentarg/.gem/ruby/2.2.5/gems/pry-0.10.4/lib/pry/pry_instance.rb:242:in `catch'
[4] pry(main)> Addressable::VERSION::STRING
=> "2.5.0"

@dentarg
Collaborator

dentarg commented Jun 10, 2018

Still an error upstream

irb(main):005:0> Addressable::VERSION::STRING
=> "2.5.2"
irb(main):006:0> Addressable::URI.parse("http://example.com%C2").display_uri
ArgumentError: invalid byte sequence in UTF-8
	from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/idna/native.rb:36:in `split'
	from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/idna/native.rb:36:in `to_ascii'
	from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:1092:in `normalized_host'
	from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:1210:in `normalized_authority'
	from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:2133:in `normalize'
	from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:2158:in `display_uri'
	from (irb):6
	from /Users/dentarg/.rubies/ruby-2.4.2/bin/irb:11:in `<main>'

without idn-ruby:

irb(main):001:0> Addressable::URI.parse("http://example.com%C2").display_uri
ArgumentError: invalid byte sequence in UTF-8
	from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:440:in `gsub'
	from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:440:in `unencode'
	from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:536:in `normalize_component'
	from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:1099:in `normalized_host'
	from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:1210:in `normalized_authority'
	from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:2133:in `normalize'
	from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:2158:in `display_uri'
	from (irb):1
	from /Users/dentarg/.rubies/ruby-2.4.2/bin/irb:11:in `<main>'

@dentarg
Collaborator

dentarg commented Aug 7, 2018

2.5 years later Bob has replied :-)

Pasting sporkmonger/addressable#224 (comment) for your convenience

These are some gross URIs. 😝

That said, I'm not sure I think this is a bug. Given what display_uri is supposed to do, this is legitimately an exceptional condition. There is no way to correctly render a UTF-8 string for that hostname. However, http://example.com%C2, gross as it is, I think it's actually a valid URI, so raising an invalid URI exception doesn't seem correct either. That makes me think this behavior may actually be correct, if perhaps a little surprising.

reg-name = *( unreserved / pct-encoded / sub-delims )
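
The reg-name rule quoted above can be written as a regular expression to see why these hosts count as syntactically valid. This is a sketch based on the RFC 3986 character sets for unreserved, pct-encoded and sub-delims; the constant and method names are hypothetical, and Regexp#match? assumes Ruby 2.4 or later.

```ruby
# Character sets from RFC 3986.
UNRESERVED  = /[A-Za-z0-9\-._~]/
PCT_ENCODED = /%\h\h/
SUB_DELIMS  = /[!$&'()*+,;=]/

# reg-name = *( unreserved / pct-encoded / sub-delims )
REG_NAME = /\A(?:#{UNRESERVED}|#{PCT_ENCODED}|#{SUB_DELIMS})*\z/

def reg_name?(host)
  REG_NAME.match?(host)
end

puts reg_name?("example.com%C2")       # true
puts reg_name?("+%D5d.some_site.net")  # true
puts reg_name?("example.com%C")        # false
```

The grammar only requires that each % is followed by two hex digits. It says nothing about whether the decoded bytes form valid UTF-8, which is why a host can pass this check and still fail in display_uri.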

@dentarg
Collaborator

dentarg commented Oct 5, 2022

I think this issue should be closed. The workaround (#79, tweaked in cd38e55) makes sense to always have in twingly-url to avoid exceptions.

dentarg added a commit to dentarg/twingly-url that referenced this issue Oct 5, 2022
According to the "preferred format" used by DNS.

See https://en.wikipedia.org/wiki/Domain_Name_System#Domain_name_syntax,_internationalization

Moves one invalid URL to the set of invalid URLs (if you enter
http://www..twingly..com/ in the address bar in Chrome, it does a
search, doesn't try to visit any site).

Close twingly#62
@roback
Member

roback commented Oct 6, 2022

Yes, you're right, we'll probably never be able to remove that rescue. It was less temporary than we first thought :)

@roback roback closed this as completed Oct 6, 2022
3 participants