Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

problematic "normalization" of paths #399

Open
shribe opened this issue Mar 29, 2017 · 12 comments
Open

problematic "normalization" of paths #399

shribe opened this issue Mar 29, 2017 · 12 comments
Labels

Comments

@shribe
Copy link

shribe commented Mar 29, 2017

Hackney should NOT attempt to encode provided URLs, at least not without being asked to. At this level the library should assume that the URL is already properly encoded. The reason for this is that attempting to mimic the "convenience" of browsers introduces all sorts of ambiguities. (That functionality could be exposed via utility functions for clients that need it.)

For just one example, from hackney_url:

%% special case, when a % is passed to the path, check if
%% it's a valid escape sequence. If the sequence is valid we
%% don't try to encode it and continue, else, we encode it.
%% the behaviour is similar to the one you find in chrome:
%% http://src.chromium.org/viewvc/chrome/trunk/src/url/url_canon_path.cc

Eh? So, it is possible that in a single URL some "%" characters will get encoded and some will not??? What if calling code depends on hackney encoding the URL, but then one day is handed an unencoded URL with the literal characters "%2A" for instance? So hackney will not encode the "%", and the URL will eventually be mangled to contain "*" where "%2A" was intended.

Also, in the attempt to clean up for poorly-coded clients, it encodes "*" as "%2a", which is unnecessary and actually breaks with some URLs on some servers (us.rd.yahoo.com use "*" to demarcate a URL to syndicated content, and does not accept "%2a").

An HTTP client library, in contrast to a browser, should do the minimal parsing to separate out the scheme, host, port, user & password then pass the entire raw path[?query][#fragment] through without mangling it.

(Likewise, it should not second-guess the client and try to encode the host.)

@benoitc
Copy link
Owner

benoitc commented Mar 29, 2017

You can pass your own encoding function to Hackney with the setting path_encode_fun if the default behaviour doesn't work with the API you're trying to call. On which URLs the current encoding function was an issue?

Hackney has different goals from others "libs". We can of course relax the way the urls are encoded or provide an option. I'm not sure this feature should not be the default though since a lot of the current user base is used to it. Let see what can be done on the next major release.

@benoitc benoitc added the api label Mar 29, 2017
@shribe
Copy link
Author

shribe commented Mar 29, 2017

Here's an example URL that had a problem:

http://us.rd.yahoo.com/finance/external/investopedia/rss/SIG=12kf3hrkl/*http://www.investopedia.com/news/microsoft-tests-chat-transcription-xbox-one-msft/?partner=YahooSA

The fundamental issue is that it is not possible to know whether a URL passed into hackney is already encoded. You can guess, and be right nearly all the time, but there will be rare occasions where the guess will be wrong--there is simply no way around this. Hackney should either expect unencoded URLs and encode them, or expect encoded URLs and leave them alone (the latter is my preference for a library). But it should definitely not encode some character sequences and leave others alone--that's just wrong.

Are you sure your users are depending on this? Or are they mostly just getting lucky?

@deadtrickster
Copy link
Contributor

related?

#246
#249
#176
#251

I feel like some people want to encode and others don't.

@benoitc
Copy link
Owner

benoitc commented Mar 29, 2017

@shribe you read the algorithm wrong. What is does is considering that the normal case is that you pass an unencoded URI end encode it following the RFC 2732. But still do its best to detect the case where some are still passing partially encoded URI.

There is nothing wrong in doing it it. This is a choice made by Hackney (and not only) that you apparently dislike that much. It actually helped in a lot case. Also considering the users of hackney it's unlikely that they can be just lucky. But anyway what will be done for the current 1.x is adding an option to not encode the URI.I will ask around what the default should be in the coming major release.

@benoitc
Copy link
Owner

benoitc commented Mar 29, 2017

i will let the ticket open until the action above are done.

@shribe
Copy link
Author

shribe commented Mar 29, 2017

I don't believe I am misreading it. You can have a URL where hackney will literally encode some "%" characters but not others. That's just flat-out wrong. (And, in my opinion, a good example of why trying to "detect" whether a URL needs encoding is a bad idea for a library.)

@shribe
Copy link
Author

shribe commented Mar 29, 2017

I should also have said that, as far as I am concerned, an option to not attempt to encode the URI solves it.

@benoitc
Copy link
Owner

benoitc commented Mar 29, 2017

@shribe yeah I heard your opinion.

The plan is:

  • add an option to disable uri encoding in config and setting for 1.x . (no breaking change)
  • ask long time users what should be the default for the next major release

@benoitc
Copy link
Owner

benoitc commented Mar 29, 2017

@deadtrickster thanks for the links, was looking at it. I will check them again.

@benoitc
Copy link
Owner

benoitc commented Mar 29, 2017

@shribe just to not be misunderstood, i'm taking your suggestion in consideration. But it needs to be done step by step.

@shribe
Copy link
Author

shribe commented Mar 29, 2017

I understand that it is the way it is right now, and that you have to be careful about changing the behavior.

@kyleboe
Copy link
Contributor

kyleboe commented Jul 19, 2023

Here's my fix as a partial first step: #720

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants