Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undesirable changing of URLs by Addressable #35

Closed
janklimo opened this issue Nov 1, 2019 · 15 comments
Closed

Undesirable changing of URLs by Addressable #35

janklimo opened this issue Nov 1, 2019 · 15 comments

Comments

@janklimo
Copy link

janklimo commented Nov 1, 2019

Hi Janko,

came across this odd problem today. The following URL works fine in the browser, curl, HTTParty, etc but always returns a 401 when using this gem (by means of remote URL plugin in shrine), i.e.:

url = "https://i.guim.co.uk/img/media/97b07b907a75e7f1b4aecb092f8181ca63d0ad44/2_254_1183_709/master/1183.jpg?width=1200&height=630&quality=85&auto=format&fit=crop&overlay-align=bottom%2Cleft&overlay-width=100p&overlay-base64=L2ltZy9zdGF0aWMvb3ZlcmxheXMvdGctZGVmYXVsdC5wbmc&enable=upscale&s=4c9af90b3d91c2269bad342e6b78d577"
Down.download(url)

I thought this was strange because the following works:

URI(url).open

I've narrowed it down to how the URL is encoded:

down/lib/down/net_http.rb

Lines 287 to 290 in 68754ed

def addressable_normalize(url)
addressable_uri = Addressable::URI.parse(url)
addressable_uri.normalize.to_s
end

this changes the comma in the URL (from bottom%2Cleft to bottom,left, making the signature in s param invalid (works like secure URLs in Imgix, as it seems).

Would you call it a bug or is this the desired behavior?

@janko
Copy link
Owner

janko commented Nov 1, 2019

Hi Jan, thanks for the report 😃

I would consider this a bug in the Addressable gem. Normalizing the URL shouldn't change it, especially not in the way shown in your example.

@jrochkind
Copy link
Collaborator

jrochkind commented Nov 1, 2019

That makes it a bug in down too, yes? Just one we'd want to fix by getting it fixed in Addressable (unless we stop using addressable to fix it!).

@jrochkind
Copy link
Collaborator

jrochkind commented Nov 1, 2019

I wonder, is it also possible it's inappropriate to use Addressable#normalize on URLs before fetching them?

Why is down normalizing at all, instead of fetching the URL that was specified? What is the use case where we want to fetch something other than the calling code specified (a "normalized" version)?

Even without any bugs in addressable, could there are cases where an HTTP server may respond to an HTTP request for a "non-normalized" path, but not respond with the same thing to the "normalized" version of the path? I don't entirely understand what "normalized" means here or is intending to be doing...

Do other HTTP fetching gems (say http-rb) HTTP request a "normalized" version of an argument, instead of the argument as supplied?

@janklimo
Copy link
Author

janklimo commented Nov 1, 2019

It's quite likely that the answer from addressable would be that it's correct behavior - it's almost the same case as this issue. The specific method normalizing params is this one and it seems to be following the RFC spec.

Edit: having scanned the RFC, it does look like addressable normalizes percent-encoded reserved characters incorrectly by decoding them. This answer highlights the relevant bits from the RFC.

@janko you might still remember why addressable normalization was originally introduced. I'm sure it's serving some purpose but maybe we could revisit when and how it's applied given that it can lead to side-effects like the one explained above.

@janko
Copy link
Owner

janko commented Nov 1, 2019

Some more context is here: shrinerb/shrine#132

Yes, there was a reason we added this normalization, but I don’t remember it now. Feel free to look at the git blame.

Basically, it seems users would sometimes submit URLs that are not encoded properly, so I belive that what the normalization did.

@janklimo
Copy link
Author

janklimo commented Nov 4, 2019

Maybe this gets acknowledged and resolved upstream with sporkmonger/addressable#366. I'll keep an eye on it.

@jrochkind
Copy link
Collaborator

I'm still thinking it may be inappropriate to insist on normalization here.

There are HTTP servers which will respond to an un-normalized GET request differenetly than the normalized version of that same request, using the addressable normalization algorithm. (Is this a violation of standards? I don't know, probably? But I have seen it happen).

The normalization makes it impossible to make the request with the un-normalized URI/path.

May be a violation of standards, but it happens in the real world, and I don't think (could be wrong?) any other ruby HTTP client implementation insists on normalizing for you. If down really is the only ruby http client implementation that does this normalization, to me it adds evidence that it might be an unhelpful thing to do.

@janko
Copy link
Owner

janko commented Nov 4, 2019

@jrochkind Have you read the history behind the use of Addressable in Down? It was added because the URI standard library would fail parsing some URLs submitted by the user, which browsers would know how to handle (we look at it in terms of Shrine's remote_url plugin).

I don't think (could be wrong?) any other ruby HTTP client implementation insists on normalizing for you.

Down is not an HTTP client implementation, it's more high level than that. It was designed primary for Shrine's remote URL feature. We can see for example that CarrierWave and Dragonfly do a similar thing. Paperclip doesn't as it requires you to parse the URL yourself and give it a URI instance.


However, Dragonfly's implementation sounds actually like a good workaround to me. What it does is it first tries to URI.parse the URL, and if that fails it then normalizes with Addressable and tries to URI.parse the result again.

This would make sense to me. The user should already pass a correctly-encoded URL, but if they don't, we try our best effort to "fix" that. And that was the reason we introduced Addressable in the first place, to fix those cases, so it makes sense to activate it only when needed.

@janklimo that approach sounds like it would fix your particular issue. If you agree with this idea, would you like to submit a PR?

@jrochkind
Copy link
Collaborator

Yeah, as with all escaping issues, it gets really confusing.

An earlier version of your commentI got in email suggested that http.rb and net-http worked differently here... you edited it, so maybe they don't? I am curious if http.rb does forcible normalization.

I like where you ended up though, normalizing only ones that don't parse -- while it seems hacky, if it's what Dragonfly does, I like that others have agreed it's a good compromise.

And I like that you remembered now why you introduced normalization in the first place. :)

@janko
Copy link
Owner

janko commented Nov 4, 2019

@jrochkind I apologize for editing the comment, it was true, but for some reason I thought it wasn't relevant to the discussion. By "true" I mean that Addressable::URI is more lax when it comes to parsing, so it will take even URLs that are not properly encoded, e.g:

URI.parse("https://movies.com/matrix[1991].mp4")              # raises exception
Addressable::URI.parse("https://movies.com/matrix[1991].mp4") # succeeds

I don't think this does any normalization internally, but I haven't checked the source code. In that case http.rb backend should not have this problem.

I like where you ended up though, normalizing only ones that don't parse

Great, I'm glad we've reached a consensus 👌

@jrochkind
Copy link
Collaborator

OK, I'm confused what that has to do with http.rb -- somehow down worked without the normalization when using http.rb but not net-http? Does http.rb do it's own normalization then? Using addressable? [Addressable is a dependency of http.rb](https://github.com/httprb/http/blob/master/http.gemspec#L30], so maybe?

Searching in the project, it uses Addressable to parse, yep. I haven't yet figured out if it also normalizes everything with Addressable (so it would break in the case at the top of this issue too?), or not.

I just always like looking at what others are doing. But it may not matter in this case, and it may be too confusing to figure out, at least for me!

@jrochkind
Copy link
Collaborator

jrochkind commented Nov 4, 2019

Aha, it looks like http.rb does normalize most of the URI components... but NOT the URI query component! (the part after the ?).

https://github.com/httprb/http/blob/6240672bc84b23339fc9a9878040fcb45db78fb5/lib/http/uri.rb#L34-L38

It is normalizing the URI query that is particularly causing the problem mentioned in this issue.

I wonder if the bug in Addressable is it is applying a normalization routine meant for "path" and other URI components (perhaps in a scheme-agnostic way) -- to the URI query component.

URI escaping gets really confusing, but I believe maybe escaping an HTTP URI query component in the same way as you would the path component is wrong.

@coding-chimp
Copy link
Contributor

We've also hit this issue today. The URL works fine in a browser, curl or wget but we get 404s with Down.

After debugging it, we found it's because Addressable::URI is changing the %2B in the URL to +:

Addressable::URI.parse("http://example.com/2ELk8hUpTC2wqJ%2BZ%25GfTFA.jpg").normalize.to_s
=> "http://example.com/2ELk8hUpTC2wqJ+Z%25GfTFA.jpg"

Since nothing happened here in the last month I opened a PR (#37) where the normalization is only applied when it's needed, as was proposed here at some point.

@janko
Copy link
Owner

janko commented Dec 20, 2019

I've released 5.0.1 with #37, but I would like to discuss improving addressable normalization.

First, I would like to address @jrochkind's find in http.rb, which normalizes everything but the query parameters. This change was done in httprb/http@8c8486c to address httprb/http#246. However, @coding-chimp presented an example of invalid URI normalization in query path, which http.rb would handle incorrectly as well (as it does normalize the query path), so that doesn't sound like a permanent solution.

As of carrierwaveuploader/carrierwave@57a4a3b, CarrierWave uses Addressable as well, and it does so in the same way – normalizing everything but the query parameters. However, this behaviour seems to be there for backwards compatibility, as query parameters were being split from the rest of the URI before introducing Addressable. The main reason for introducing Addressable in CarrierWave was to support non-ASCII domain names, which according to carrierwaveuploader/carrierwave#2086 is important.

I was thinking what the correct normalization would be. Originally, I wanted to support non-encoded URLs, such as ones containing spaces and square brackers. The reason for not using Addressable::URI.encode instead was to prevent double encoding characters which are already encoded. I'm not sure if that can happen, that a URL is partially percent encoded, but Addressable::URI#normalize seemed to handle this. I believe the reason why it leaves characters such as :, ;, + etc. un-encoded is because those are reserved characters, which don't have to be encoded. In case of @janklimo's URL, the server handled , and %2C differently, which is correct according to the RFC:

The purpose of reserved characters is to provide a set of delimiting
characters that are distinguishable from other data within a URI.
URIs that differ in the replacement of a reserved character with its
corresponding percent-encoded octet are not equivalent.
Percent-
encoding a reserved character, or decoding a percent-encoded octet
that corresponds to a reserved character, will change how the URI is
interpreted by most applications. Thus, characters in the reserved
set are protected from normalization and are therefore safe to be
used by scheme-specific and producer-specific algorithms for
delimiting data subcomponents within a URI.

Since Addressable::URI.encode will not encode these special characters either, I believe it makes sense to continue using Addressable::URI#normalize. And I think it makes sense to continue doing so by default, as that handles unescaped and non-ASCII URLs.

However, as @jrochkind suggested in #37 (comment), I would like to add a configuration option for disabling or overriding the normalization (http.rb added one as well in httprb/http#533). That's the only change I would make for now. @jrochkind would you like to make that PR?

@jrochkind
Copy link
Collaborator

your analysis is thorough, makes sense for me.

(Escaping is always a tricky issue, especially in URLs/URIs, in part because of a long history of inconsistent/varying/technically-invalid/changing-standards legacy approaches).

@janko janko closed this as completed in c27a167 Sep 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants