Normalization: don't decode percent-encoded reserved characters #366

janklimo · 2019-11-04T07:26:17Z

Given the following example URL:

url = "https://i.guim.co.uk/img/media/97b07b907a75e7f1b4aecb092f8181ca63d0ad44/2_254_1183_709/master/1183.jpg?width=1200&height=630&quality=85&auto=format&fit=crop&overlay-align=bottom%2Cleft&overlay-width=100p&overlay-base64=L2ltZy9zdGF0aWMvb3ZlcmxheXMvdGctZGVmYXVsdC5wbmc&enable=upscale&s=4c9af90b3d91c2269bad342e6b78d577"
addressable_uri = Addressable::URI.parse(url)
addressable_uri.normalize.to_s
=> "https://i.guim.co.uk/img/media/97b07b907a75e7f1b4aecb092f8181ca63d0ad44/2_254_1183_709/master/1183.jpg?width=1200&height=630&quality=85&auto=format&fit=crop&overlay-align=bottom,left&overlay-width=100p&overlay-base64=L2ltZy9zdGF0aWMvb3ZlcmxheXMvdGctZGVmYXVsdC5wbmc&enable=upscale&s=4c9af90b3d91c2269bad342e6b78d577"

Normalization changes overlay-align=bottom%2Cleft to overlay-align=bottom,left.

This looks harmless, but requesting the normalized URL returns a 401 response instead of the image itself.

Looking at the RFC, I believe this deviates from the spec which (to my understanding) suggests sub-delims should not be decoded in the normalization process.

URIs that differ in the replacement of a reserved character with its
corresponding percent-encoded octet are not equivalent.

This SO post supports that. I came across #320 which touches on the same issue.

Please correct me if I'm reading this wrong 👍
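One way to spot this kind of change in a test is to compare the percent-encoded triplets before and after normalization. The sketch below uses only the Ruby standard library; `lost_triplets` is a hypothetical helper name, not part of Addressable, and the two strings are shortened from the example above.

```ruby
# Hypothetical helper (not part of Addressable): lists percent-encoded
# triplets that appear in the original string but are missing afterwards.
def lost_triplets(original, normalized)
  pattern = /%[0-9A-Fa-f]{2}/
  before = original.scan(pattern).map(&:upcase).tally
  after  = normalized.scan(pattern).map(&:upcase).tally
  # Keep only triplets that occur fewer times after normalization.
  before.reject { |triplet, count| after.fetch(triplet, 0) >= count }.keys
end

original   = "overlay-align=bottom%2Cleft&overlay-width=100p"
normalized = "overlay-align=bottom,left&overlay-width=100p"
p lost_triplets(original, normalized) # => ["%2C"]
```

A non-empty result only says that some triplet was decoded; whether that triplet was a reserved character still has to be checked against the RFC character classes.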

Duplicates of this issue:

Maintainer notes:

This Discourse PR has some test cases that seem useful.

dentarg · 2023-07-19T06:52:37Z

It sounds correct to me that percent-encoded reserved characters (in path and query) should not be decoded, based on https://www.rfc-editor.org/rfc/rfc3986#section-6.2.2.2 (linked from the SO post):

The percent-encoding mechanism (Section 2.1) is a frequent source of
variance among otherwise identical URIs. In addition to the case
normalization issue noted above, some URI producers percent-encode
octets that do not require percent-encoding, resulting in URIs that
are equivalent to their non-encoded counterparts. These URIs should
be normalized by decoding any percent-encoded octet that corresponds
to an unreserved character, as described in Section 2.3.

Looking at the RFC, I believe this deviates from the spec which (to my understanding) suggests sub-delims should not be decoded in the normalization process.

Why did you write sub-delims (and not all reserved characters) there @janklimo? Just making sure we're on the same page
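For illustration, the rule quoted from Section 6.2.2.2 could be sketched as below. This is a minimal stdlib-only sketch, not Addressable's implementation; `decode_unreserved_only` is a hypothetical name, and it works on a raw string rather than on parsed URI components.

```ruby
# Unreserved characters per RFC 3986, Section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = /\A[A-Za-z0-9\-._~]\z/

# Hypothetical helper: decodes a percent-encoded octet only when it maps to
# an unreserved character. Every other triplet is kept, with its hex digits
# uppercased (the case normalization from Section 6.2.2.1).
def decode_unreserved_only(str)
  str.gsub(/%[0-9A-Fa-f]{2}/) do |triplet|
    char = triplet[1, 2].hex.chr
    UNRESERVED.match?(char) ? char : triplet.upcase
  end
end

puts decode_unreserved_only("overlay-align=bottom%2Cleft") # comma stays encoded
puts decode_unreserved_only("name=%7Euser%2dname")         # "~" and "-" are decoded
```

Under this rule the query string from the original report would pass through normalization unchanged.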

janklimo · 2023-07-19T07:18:57Z

This was a very long time ago, but I believe I referred to sub-delims since that was the most specific group in which "," (the character behind the bug I originally stumbled upon in my example) can be found. I agree we should be looking at reserved characters in general.
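To make the distinction concrete, the following sketch classifies the characters behind the percent-encoded triplets mentioned in this issue and the linked ones, using the gen-delims and sub-delims sets from RFC 3986, Section 2.2. The constant and method names are illustrative assumptions, not an existing API.

```ruby
# Reserved character sets from RFC 3986, Section 2.2.
GEN_DELIMS = %w(: / ? # [ ] @).freeze
SUB_DELIMS = %w[! $ & ' ( ) * + , ; =].freeze

# Hypothetical helper: returns the RFC 3986 class of a single character.
def rfc3986_class(char)
  return :gen_delim if GEN_DELIMS.include?(char)
  return :sub_delim if SUB_DELIMS.include?(char)
  return :unreserved if char.match?(/\A[A-Za-z0-9\-._~]\z/)
  :other
end

# Triplets taken from this issue and the issues that reference it.
%w[%2C %2B %26 %3A %2F %23].each do |triplet|
  char = triplet[1, 2].hex.chr
  puts "#{triplet} -> #{char.inspect} (#{rfc3986_class(char)})"
end
```

Both groups are reserved, so none of these triplets should be decoded during normalization.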

janklimo mentioned this issue Nov 4, 2019

Undesirable changing of URLs by Addressable janko/down#35

Closed

sporkmonger added the Accepted label Jan 29, 2020

dentarg mentioned this issue Apr 28, 2020

Ignore %2B in normalize #386

Open

davidtaylorhq mentioned this issue Jul 28, 2021

normalized_encode incorrectly unencodes %26 to & #424

Open

dentarg changed the title ~~Normalization possibly deviating from RFC~~ Normalization: don't decode percent-encoded reserved characters Jul 19, 2023

This was referenced Jul 19, 2023

normalized_encode incorrectly replaces %3A and %2F in path #472

Open

Normalization issue with # (%23) #295

Open

dentarg mentioned this issue Jul 19, 2023

Suggestion: conservative_normalize! #475

Open

c960657 mentioned this issue Aug 3, 2023

Do more conservative URL normalization httprb/http#758

Merged
