Normalization issue with # (%23) #295

Open
matthieuprat opened this issue Mar 22, 2018 · 6 comments

@matthieuprat

matthieuprat commented Mar 22, 2018

Not sure this is an actual bug:

Addressable::URI.parse("http://example.org?foo=%E9%23").normalize.query 

This returns the string foo=%E9#. I would have expected foo=%E9%23.

Note that %E9 is the escaped version of the é character in ISO-8859-1, that is URI.escape('é'.encode('iso-8859-1')).

Is this the intended behavior?
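To illustrate why the difference matters, here is a minimal sketch using Ruby's standard `uri` library rather than Addressable (so it says nothing about Addressable's internals). It assumes the normalized query `foo=%E9#` is written back into a URI string and parsed again; the literal `#` then acts as the fragment delimiter and the query no longer carries the `%23` octet.

```ruby
require "uri"

# The URI from the report, with "#" percent-encoded as %23.
original = URI.parse("http://example.org?foo=%E9%23")

# The same URI with the reported normalized query ("foo=%E9#") written out.
# The bare "#" now starts an (empty) fragment instead of being query data.
reparsed = URI.parse("http://example.org?foo=%E9#")

original_query = original.query
reparsed_query = reparsed.query

puts original_query  # foo=%E9%23
puts reparsed_query  # foo=%E9
```

In other words, decoding `%23` inside the query changes what the URI means once it is serialized and parsed again.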

@dentarg
Collaborator

dentarg commented Jun 10, 2018

No answers, but I found that #query and #query_values don't match:

$ irb -raddressable/uri
irb(main):001:0> Addressable::VERSION::STRING
=> "2.5.2"
irb(main):002:0> Addressable::URI.parse("http://example.org?foo=%E9%23").normalize.query
=> "foo=%E9#"
irb(main):003:0> Addressable::URI.parse("http://example.org?foo=%E9%23").normalize.query_values
=> {"foo"=>"\xE9#"}
irb(main):004:0> Addressable::URI.unencode("%E9%23")
=> "\xE9#"

I think this issue is similar to #224
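For comparison, a small sketch with the standard library's `URI.decode_www_form_component` (not Addressable's `unencode`) shows the same octets coming back from `%E9%23`. The point is that fully decoding a value is reasonable for something like `query_values`, while the byte `0xE9` on its own is not valid UTF-8 and only makes sense when read as ISO-8859-1.

```ruby
require "uri"

# Fully decode the percent-encoded value, as form decoding would.
decoded = URI.decode_www_form_component("%E9%23")

decoded_bytes = decoded.bytes
utf8_valid    = decoded.valid_encoding?

p decoded_bytes  # => [233, 35]  (0xE9 followed by "#")
p utf8_valid     # => false      (a lone 0xE9 is not valid UTF-8)

# Reading the same octets as ISO-8859-1 gives the intended characters.
latin1 = URI.decode_www_form_component("%E9%23", Encoding::ISO_8859_1)
as_utf8 = latin1.encode("UTF-8")
puts as_utf8     # é#
```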

@sporkmonger
Owner

sporkmonger commented Aug 7, 2018

https://tools.ietf.org/html/rfc3986#section-2.5 applies here.

When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be percent-
encoded. For example, the character A would be represented as "A",
the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
as "%C3%80", and the character KATAKANA LETTER A would be represented
as "%E3%82%A2".
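The rule quoted above can be sketched in a few lines of plain Ruby. `percent_encode_utf8` is a hypothetical helper written for this illustration, not part of Addressable or the standard library: it encodes the text as UTF-8 and percent-encodes every octet outside the unreserved set.

```ruby
# Unreserved characters per RFC 3986, section 2.3.
UNRESERVED_CHAR = /\A[A-Za-z0-9\-._~]\z/

# Hypothetical helper: UTF-8 encode, then percent-encode all
# octets that are not unreserved characters.
def percent_encode_utf8(text)
  text.encode("UTF-8").bytes.map do |byte|
    if byte < 128 && UNRESERVED_CHAR.match?(byte.chr)
      byte.chr
    else
      format("%%%02X", byte)
    end
  end.join
end

puts percent_encode_utf8("A")   # A
puts percent_encode_utf8("À")   # %C3%80
puts percent_encode_utf8("ア")  # %E3%82%A2
puts percent_encode_utf8("é")   # %C3%A9
```

Note that `é` becomes `%C3%A9` under this rule, whereas the `%E9` in the original report is the ISO-8859-1 encoding of the same character.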

@sporkmonger
Owner

sporkmonger commented Aug 7, 2018

We should probably add a test case for uri.query_values = {"À": "ア"} since it's a cited example.
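As a rough reference for what such a test might expect, the standard library's form encoder produces the RFC's cited octets for that pair. This sketch uses `URI.encode_www_form` and `URI.decode_www_form`, not Addressable's `query_values=`, so it only shows the expected encoding and is not a test of Addressable itself.

```ruby
require "uri"

# Encode the cited example pair as a form-style query string.
query = URI.encode_www_form("À" => "ア")
puts query  # %C3%80=%E3%82%A2

# Decoding should round-trip back to the original pair.
pairs = URI.decode_www_form(query)
p pairs     # => [["À", "ア"]]
```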

@dentarg
Collaborator

dentarg commented Feb 19, 2019

Related to #334

@dentarg
Collaborator

dentarg commented Jul 19, 2023

> I found that #query and #query_values doesn't match

Probably due to #114 (comment)

Ultimately, the query_values method is attempting to emulate the application/x-www-form-urlencoded content type, poorly specified though it may be.

@dentarg
Collaborator

dentarg commented Jul 19, 2023

> This returns the string foo=%E9#. I would have expected foo=%E9%23.

I think this is another variant of #366 where addressable incorrectly decodes the percent-encoded reserved character # (%23)
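For reference, here is a minimal sketch of percent-encoding normalization as RFC 3986 section 6.2.2 describes it: hex digits are uppercased, and only escapes of unreserved characters are decoded. `normalize_percent_encoding` is a hypothetical helper for illustration and is not how Addressable implements `normalize`; under this rule `%23` stays encoded because `#` is not an unreserved character.

```ruby
# Unreserved characters per RFC 3986, section 2.3.
UNRESERVED = /\A[A-Za-z0-9\-._~]\z/

# Hypothetical helper: decode only escapes of unreserved characters,
# and uppercase the hex digits of every escape that is kept.
def normalize_percent_encoding(component)
  component.gsub(/%([0-9A-Fa-f]{2})/) do
    hex  = Regexp.last_match(1)
    byte = hex.hex
    if byte < 128 && UNRESERVED.match?(byte.chr)
      byte.chr           # e.g. %7E -> "~"
    else
      "%" + hex.upcase   # reserved or non-ASCII octet stays encoded
    end
  end
end

puts normalize_percent_encoding("foo=%e9%23")  # foo=%E9%23
puts normalize_percent_encoding("%7Efoo%2d")   # ~foo-
```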

@dentarg dentarg changed the title from "URI normalization issue with ISO-8859-1 encoding" to "Normalization issue with # (%23)" on Jul 19, 2023