Normalization of path segments should probably happen before normalization of percent escaping #8

sporkmonger · 2010-03-20T06:34:07Z

Addressable::URI.parse("/%2E/").normalize.to_str.should == "/%2E/"


sporkmonger · 2010-03-20T06:41:55Z

This issue probably requires a check-in with the IETF URI mailing list before deciding one way or the other.

kovyrin · 2013-10-18T01:03:48Z

I understand this issue is quite old, but I wanted to check in and see where it stands. We've hit this bug in a slightly different context and are not sure how to deal with it. Any chance this is going to be fixed?

sporkmonger · 2013-10-18T14:12:06Z

Could you elaborate on the issue you're hitting? A test case would be awesome.

kovyrin · 2013-10-18T16:36:03Z

Actually, now I'm not sure if our issue is related to this one. Here is our problem:

irb(main):001:0> Addressable::URI.parse(PostRank::URI.unescape("http://foo.com/blah%ef%bc%9f"))
=> #<Addressable::URI:0x5648890 URI:http://foo.com/blah？>
irb(main):002:0> Addressable::URI.parse(PostRank::URI.unescape("http://foo.com/blah%ef%bc%9f")).normalize!
=> #<Addressable::URI:0x564ed08 URI:http://foo.com/blah%3F>

The normalize call mangles a perfectly valid (AFAIU) Unicode character and replaces it with a percent-encoded ASCII question mark.

sporkmonger · 2013-10-20T14:32:18Z

It's actually doing the right thing. IRIs (Unicode-friendly URIs) use Unicode Normalization Form KC (NFKC) to limit phishing. NFKC tends to do perceptual code point conversions, like converting '？' to '?'. If this is causing a problem, the solution is either not to normalize the URI or to normalize its components piecemeal. "http://foo.com/blah%ef%bc%9f" and "http://foo.com/blah%3F" are considered equivalent.
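The NFKC step described above can be reproduced with Ruby's standard library alone. This is a minimal sketch, not Addressable's code: percent_decode is a hypothetical helper written for this illustration, and the sketch only shows why the full-width question mark (U+FF1F) ends up as an ASCII "?". A literal "?" would start the query component, which is presumably why it then appears as %3F in the normalized path.

```ruby
# Hypothetical helper (not part of Addressable): percent-decode a string
# byte by byte, then interpret the resulting bytes as UTF-8.
def percent_decode(str)
  str.b.gsub(/%(\h\h)/) { Regexp.last_match(1).hex.chr }
     .force_encoding(Encoding::UTF_8)
end

decoded = percent_decode("http://foo.com/blah%ef%bc%9f")

# The bytes EF BC 9F are the UTF-8 encoding of U+FF1F FULLWIDTH QUESTION MARK.
puts decoded.end_with?("\u{FF1F}")

# NFKC applies compatibility mappings, so U+FF1F becomes an ASCII "?".
nfkc = decoded.unicode_normalize(:nfkc)
puts nfkc

# NFC applies no compatibility mappings, so the string is left unchanged.
nfc = decoded.unicode_normalize(:nfc)
puts nfc == decoded
```

If the compatibility mapping is unwanted, the options named above still apply: skip normalization, or normalize only the components where it is safe.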

dentarg · 2023-07-19T08:02:50Z

Some more context: %2E is the percent-encoded form of a period (.)

irb(main):038:0> CGI.unescapeURIComponent "%2E"
=> "."

Addressable::URI.parse("/%2E/").normalize.to_str.should == "/%2E/"

Not sure why this should be true? If you want to compare URIs, shouldn't you normalize both before comparing?

Hmm, from https://www.rfc-editor.org/rfc/rfc3986#section-2.3

Characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include uppercase and lowercase
letters, decimal digits, hyphen, period, underscore, and tilde.
  unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
URIs that differ in the replacement of an unreserved character with
its corresponding percent-encoded US-ASCII octet are equivalent: they
identify the same resource. However, URI comparison implementations
do not always perform normalization prior to comparison (see Section 6).
For consistency, percent-encoded octets in the ranges of ALPHA
(%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
underscore (%5F), or tilde (%7E) should not be created by URI
producers and, when found in a URI, should be decoded to their
corresponding unreserved characters by URI normalizers.
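The rule quoted above can be sketched in a few lines of plain Ruby. This is an illustration under stated assumptions, not what Addressable does internally: decode_unreserved and UNRESERVED are hypothetical names introduced here, and the helper decodes only escapes whose octet falls in the unreserved set, leaving every other escape untouched.

```ruby
# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"   (RFC 3986, section 2.3)
UNRESERVED = /\A[A-Za-z0-9\-._~]\z/

# Hypothetical helper: decode percent-escapes of unreserved characters only.
def decode_unreserved(str)
  str.gsub(/%(\h\h)/) do |escape|
    code = Regexp.last_match(1).hex
    # Only ASCII octets can be unreserved; anything else stays escaped.
    char = code < 128 ? code.chr : nil
    char && UNRESERVED.match?(char) ? char : escape
  end
end

puts decode_unreserved("/%2E/")      # => /./
puts decode_unreserved("/a%2Fb%7e")  # => /a%2Fb~
```

The slash escape %2F is deliberately kept, since "/" is a reserved character and decoding it would change the path structure.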

Does this mean that Addressable::URI.parse("/%2E/") should be turned into Addressable::URI.parse("/./") directly at #parse?

Normalization removes the dot segment, so both forms end up as "/":

irb(main):042:0> Addressable::URI.parse("/%2E/").normalize.to_s
=> "/"
irb(main):044:0> Addressable::URI.parse("/./").normalize.to_s
=> "/"
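To make the ordering question in this issue concrete, here is a standalone sketch using only plain Ruby. Both helpers are hypothetical and written for this illustration (remove_dot_segments follows the algorithm in RFC 3986 section 5.2.4); neither is Addressable's implementation. It shows that the two steps do not commute for "/%2E/".

```ruby
# Hypothetical helper: decode escapes of unreserved characters only.
def decode_unreserved(path)
  path.gsub(/%(\h\h)/) do |escape|
    code = Regexp.last_match(1).hex
    code < 128 && code.chr.match?(/\A[A-Za-z0-9\-._~]\z/) ? code.chr : escape
  end
end

# Hypothetical helper: remove "." and ".." segments (RFC 3986, section 5.2.4).
def remove_dot_segments(path)
  input = path.dup
  output = +""
  until input.empty?
    if input.start_with?("../")
      input = input[3..-1]
    elsif input.start_with?("./") || input.start_with?("/./")
      input = input[2..-1]
    elsif input == "/."
      input = "/"
    elsif input.start_with?("/../")
      input = input[3..-1]
      output.sub!(%r{/?[^/]*\z}, "")   # drop the last output segment
    elsif input == "/.."
      input = "/"
      output.sub!(%r{/?[^/]*\z}, "")
    elsif input == "." || input == ".."
      input = ""
    else
      segment = input[%r{\A/?[^/]*}]   # move the first segment to the output
      output << segment
      input = input[segment.length..-1]
    end
  end
  output
end

# Percent-escapes first, then path segments: the decoded dot is removed.
decode_then_segments = remove_dot_segments(decode_unreserved("/%2E/"))
puts decode_then_segments   # => /

# Path segments first, then percent-escapes: a literal dot segment is left behind.
segments_then_decode = decode_unreserved(remove_dot_segments("/%2E/"))
puts segments_then_decode   # => /./
```

The first ordering matches the irb output above. The second leaves a result that a further normalization pass would change again, so whichever order is chosen, the outcome depends on it.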

dentarg · 2023-07-19T08:08:51Z

Does this mean that Addressable::URI.parse("/%2E/") should be turned into Addressable::URI.parse("/./") directly at #parse?

That would go against what's suggested in #477

sporkmonger added the Accepted label Mar 24, 2014

sporkmonger added this to the 3.0 Release milestone Mar 24, 2014

sporkmonger self-assigned this Mar 24, 2014

sporkmonger added a commit that referenced this issue Mar 24, 2014

Added pending test for long-standing issue #8.

fff36bb

sporkmonger added a commit that referenced this issue Mar 24, 2014

Added pending test for long-standing issue #8.

c2dba5d

unarist mentioned this issue Sep 7, 2017

Don't normalize URIs in Unicode NFKC mastodon/mastodon#4837

Closed

2 tasks

dentarg mentioned this issue Jan 14, 2020

normalized_path full-width chars japanese #375

Closed

dentarg mentioned this issue Mar 14, 2021

normalize_component changes unexpectedly UTF-8 characters #400

Closed

sync-by-unito bot mentioned this issue Dec 27, 2021

File-Editor: Can't open files with multi-byte UTF-8 characters OSC/ood_core#671

Closed
