Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix invalid byte sequence in UTF-8 exception when unencoding URLs containing non UTF-8 characters #459

Merged
merged 4 commits into from Jul 30, 2022

Conversation

jarthod
Copy link
Contributor

@jarthod jarthod commented Jul 2, 2022

Hi 👋

First of all I've been using addressable for some time in my product to deal with complex URL transformations and it's been super helpfull. Thanks 🙇

Recently I started getting one invalid byte sequence in UTF-8 exception in gsub when parsing and normalizing some weird URL containing non UTF-8 compatible characters (ISO-8859-1). So I looked at the code and found that it is supposed to change the encoding back to ASCII-8BIT during this phase to avoid any encoding issue (good idea):
https://github.com/sporkmonger/addressable/blob/addressable-2.8.0/lib/addressable/uri.rb#L576-L580

BUT a couple lines later in the unencode method it actually forces back to UTF-8 right BEFORE the gsub:
https://github.com/sporkmonger/addressable/blob/addressable-2.8.0/lib/addressable/uri.rb#L472-L480

(this is comming from this change: e4f2bd6 following this issue: #154)

So this change back to UTF-8 before the gsub is breaking again this workflow, the spec I added in this PR gives this failure using the original code:

Failures:

  1) Addressable::URI when unencoding a multibyte string should not fail with UTF-8 incompatible string (ISO-8859-1 encoding in this example)
     Failure/Error:
       result = uri.gsub(/%[0-9a-f]{2}/iu) do |sequence|
         c = sequence[1..3].to_i(16).chr
         c.force_encoding("utf-8")
         leave_encoded.include?(c) ? sequence : c
       end
     
     ArgumentError:
       invalid byte sequence in UTF-8
     # ./lib/addressable/uri.rb:480:in `gsub'
     # ./lib/addressable/uri.rb:480:in `unencode'
     # ./spec/addressable/uri_spec.rb:5997:in `block (2 levels) in <top (required)>'

My fix simply removes some of the force_encoding and changes slightly the one inside the gsub (to avoid breaking the other issue fixed before). The test suite passes entirely and I have also checked this version on the 150k+ URLs present in my product (parse + normalize) without any error. I am already using this version in production for about a week.

Let me know if you have any doubt or questions.

Suggested line for the changelog:

- fixes "invalid byte sequence in UTF-8" exception when unencoding URLs containing non UTF-8 characters

spec/addressable/uri_spec.rb Outdated Show resolved Hide resolved
spec/addressable/uri_spec.rb Outdated Show resolved Hide resolved
@dentarg
Copy link
Collaborator

dentarg commented Jul 2, 2022

Did you see #224? Sounds (a bit) like the same issue?

@jarthod
Copy link
Contributor Author

jarthod commented Jul 2, 2022

Did you see #224? Sounds (a bit) like the same issue?

Oops no I didn't see this one, it looks a bit like the same issue indeed, although it's about the hostname part in #224. My PR here does not fix #224 because there's other potential problems with encoding in the code handling the hostname part:

     Failure/Error:
       value.to_s.split('.', -1).map do |segment|
         if segment.size > 0 && segment.size < 64
           IDN::Idna.toASCII(segment, IDN::Idna::ALLOW_UNASSIGNED)
         elsif segment.size >= 64
           segment
         else
           ''
         end
       end.join('.')
     
     ArgumentError:
       invalid byte sequence in UTF-8
     # ./lib/addressable/idna/native.rb:38:in `split'
     # ./lib/addressable/idna/native.rb:38:in `to_ascii'
     # ./lib/addressable/uri.rb:1131:in `normalized_host'
     # ./lib/addressable/uri.rb:1261:in `normalized_authority'
     # ./lib/addressable/uri.rb:2191:in `normalize'
     # ./lib/addressable/uri.rb:2216:in `display_uri'
     # ./spec/addressable/uri_spec.rb:5998:in `block (2 levels) in <top (required)>'

My PR helps for the path and query parts which are more likely to contain non UTF-8 chars in my experience.
I can probably have a look at improving #224 too if you want once this PR has been vetted.

spec/addressable/uri_spec.rb Outdated Show resolved Hide resolved
spec/addressable/uri_spec.rb Outdated Show resolved Hide resolved
Copy link
Collaborator

@dentarg dentarg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, I will merge if you can please the 🐶 and rebase :)

@jarthod
Copy link
Contributor Author

jarthod commented Jul 27, 2022

Ok I think I have please the 🐶 (although I'm not happy with the result), rebased the branch on main and squashed the multiple fixes into a single commit ready to merge.

@dentarg
Copy link
Collaborator

dentarg commented Jul 27, 2022

I'm not sure what's going on with GitHub Actions, but I can't merge this as status hasn't been reported for a number of jobs (same problem in #469 but there it makes a bit sense). Do you mind opening this as a new PR?

@dentarg
Copy link
Collaborator

dentarg commented Jul 27, 2022

Maybe wait with that... I saw this now This is a scheduled macOS-10.15 brownout. ... at https://github.com/sporkmonger/addressable/actions/runs/2749162489 under Annotations

@jarthod
Copy link
Contributor Author

jarthod commented Jul 27, 2022

Indeed, according to actions/runner-images#5583 the brownout ends in 4 hours so retrying the run after that might pass. The image will have to be updated though because end of August macOS 10.15 will be completely gone ^^


Edit 10h later: I just tried but don't have the rights to re-start the build so I'll let you do it.

@sporkmonger sporkmonger merged commit 068f673 into sporkmonger:main Jul 30, 2022
@sporkmonger
Copy link
Owner

Thanks for being thorough and checking 150k extra URL parses. 😁

@jarthod
Copy link
Contributor Author

jarthod commented Jul 30, 2022

Thanks @sporkmonger & @dentarg ! Happy to do it again if you need some validation for other changes.

Pontus4 added a commit to dentarg/twingly-url that referenced this pull request Aug 23, 2022
Since `PublicSuffix` v4.0.3, it is possible to parse the URL `http://+%D5d.some_site.net`.

It is also possible to normalize the URL since `Addressable` v2.8.1, which includes this fix: sporkmonger/addressable#459.

Hence, this is now a valid URL, which means that it should be moved to the `valid_urls` array in the specs.

`PublicSuffix` 4.x is supported by `Addressable` since v2.7.0 (see https://github.com/sporkmonger/addressable/blob/main/CHANGELOG.md#addressable-270).
@jarthod jarthod deleted the iso-encoding-problem branch January 23, 2023 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants