Inconsistent Behavior In Paths Starting With `//` #401

mvastola · 2020-10-10T03:36:44Z

NB: So this is really long, but I wanted to provide ample background in the hopes that this will make a decision and/or PR a lot easier. (I'm willing to do the PR if there's go-ahead.)

Background

I was just parsing a URL that was the result of HTTP redirect, and due to a bug on that server (which I don't control), the user is redirected to a URL along the lines of https://example.com//login.

This posed an issue for me because I was attempting to compare the path to the simple /login.

It looks like this was a change made several years ago (see a14e0cb / #240) in order to comply with the standard (RFC 3986) for URI syntax., due to the ambiguity that can result in a URI without an authority.

Unexpected/Incompatible Behavior

That being said, however, silent acceptance of // prefixes as if they were simply /, are exceedingly common, from web servers, to Linux filesystem paths, to Rails itself.

Furthermore, it's not at all uncommon to see such errors with errant /s in paths, precisely because they are pretty much always silently ignored and there is no impetus to fix (or even notice) them.

Standards

While -- per RFC 3986 -- // at the start of a path is not legal in the absence of an authority, the RFC is a bit fuzzy on the details.

In particular, while they say: When authority is not present, the path cannot begin with two slash characters ("//"), the ABNF listed in the RFC (which I've duplicated at the end of this issue) doesn't seem to allow such a path regardless of the presence/absence of an authority component.

On top of this, the algorithm provided in the RFC for normalizing paths (also duplicated below), converts a leading //s to a /.

The Solution

Given how software interprets paths beginning with // combined with the ambiguity of the relevant RFC on the matter, I would argue that there is strong justification for -- while continuing to consider paths with // illegal -- normalizing them into //, perhaps silently, if they are encountered.

At a bare minimum, it makes sense to de-duplicate //s in Addressable::URI::heuristic_parse for the https? schemes.

Additional arguments for doing this is that not doing so can produce some confusing artifacts/inconsistencies. Among them that I've noticed:

an initially valid URI can currently become unexpectedly invalid and throw an exception if, for instance uri.only(:path) is called.
doing something like uri.route_from(uri2) (assuming uri2 has the same authority) will throw an unexpected error.

Extra Credit (A related bug)

This may or may not be considered a different issue depending on how this issue is resolved, but in researching this issue, I also found a bug in addressable when // prefixes the path without an authority.

Namely, the code:

uri =Addressable::URI.parse('file:////root')

does not throw in any errors because the validation checks if the authority is nil rather than if it is an empty string.

If you were to then evaluate uri.authority = nil, however, the exception is thrown.

Lastly, for what it's worth, I'm not sure if file URIs along the lines of file://relative/path are officially supported in the standards (I've seen them a bit before, at least), they parse (arguably incorrectly) with 'relative' (in this case) being included in the host component.

Reference (Excerpts cited above)

ANBF for URI Paths from RFC 3986

 path = path-abempty    ; begins with "/" or is empty
      / path-absolute   ; begins with "/" but not "//"
      / path-noscheme   ; begins with a non-colon segment
      / path-rootless   ; begins with a segment
      / path-empty      ; zero characters

path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty    = 0<pchar>

segment       = *pchar
segment-nz    = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )  ; non-zero-length segment without any colon ":"
              
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

Algorithm for normalizing/resolving relative paths within a URI's path in RFC 3986

5.2.4. Remove Dot Segments

The pseudocode also refers to a "remove_dot_segments" routine for interpreting and removing the special "." and ".." complete path segments from a referenced path. This is done after the path is extracted from a reference, whether or not the path was relative, in order to remove any invalid or extraneous dot-segments prior to forming the target URI. Although there are many ways to accomplish this removal process, we describe a simple method using two string buffers.

The input buffer is initialized with the now-appended path components and the output buffer is initialized to the empty string.
While the input buffer is not empty, loop as follows:

A. If the input buffer begins with a prefix of "../" or "./", then remove that prefix from the input buffer; otherwise,

B. if the input buffer begins with a prefix of "/./" or "/.", where "." is a complete path segment, then replace that prefix with "/" in the input buffer; otherwise,

C. if the input buffer begins with a prefix of "/../" or "/..", where ".." is a complete path segment, then replace that prefix with "/" in the input buffer and remove the last segment and its preceding "/" (if any) from the output buffer; otherwise,

D. if the input buffer consists only of "." or "..", then remove that from the input buffer; otherwise,

E. move the first path segment in the input buffer to the end of the output buffer, including the initial "/" character (if any) and any subsequent characters up to, but not including, the next "/" character or the end of the input buffer.
Finally, the output buffer is returned as the result of remove_dot_segments.

The text was updated successfully, but these errors were encountered:

dentarg · 2020-10-10T09:39:44Z

silent acceptance of // prefixes as if they were simply /, are exceedingly common, from web servers, to Linux filesystem paths, to Rails itself

Do you have more concrete examples? I'm not sure those are the best software examples to compare Addressable with. A web browser would be better IMHO. I tried https://example.com//login in Chrome and Firefox, and they both make the request to https://example.com//login.

The Solution

Given how software interprets paths beginning with // combined with the ambiguity of the relevant RFC on the matter, I would argue that there is strong justification for -- while continuing to consider paths with // illegal -- normalizing them into //, perhaps silently, if they are encountered.

Can you clarify this paragraph? Perhaps show some code snippets of what is happening today and how you would like Addressable to behave?

mvastola · 2020-10-11T03:53:42Z

Do you have more concrete examples? I'm not sure those are the best software examples to compare Addressable with. A web browser would be better IMHO. I tried https://example.com//login in Chrome and Firefox, and they both make the request to https://example.com//login.

I wasn't referring to web browsers. Browsers (or at least FF and Chrome, which I've tried) appear to loyally relay the URLs entered. (One exception though is with the file:// scheme -- Chrome removes the redundant leading / while FF simply ignores it.)

Web servers, however seem to consistently overlook such double slashes. Some examples I chose at random right now (I didn't run into a single site that didn't ignore it):
NB: items followed by a * indicate I was redirected to the correct URL

https://github.com//sporkmonger//addressable/
https://www.kernel.org//category/faq.html
https://www.google.com//search?q=ruby
https://twitter.com//barackobama *
http://localhost//folder (my local Apache daemon -- standard install, with no relevant config changes)
https://www.nginx.com//pricing/ * (presumably uses nginix, though seems to use Cloudfare on top of that; headers seem to suggest redirect is from nginx though)
https://en.wikipedia.org//wiki/Ruby_(programming_language) *
https://www.amazon.com//gp/css/order-history
https://www.nytimes.com//section/us
https://www.facebook.com//help/ *
https://www.youtube.com//feed/trending *
https://www.ruby-lang.org//en/

Can you clarify this paragraph? Perhaps show some code snippets of what is happening today and how you would like Addressable to behave?

Sure. Here are several.

Assume for all:

examples = []
examples << Addressable::URI.parse('https://example.com//directory')
examples << Addressable::URI.parse('https://example.com/directory')
examples << Addressable::URI.parse('https://example.com//directory2')
examples << Addressable::URI.parse('https://example.com/directory2')
examples << Addressable::URI.parse('https://example.com//')
examples << Addressable::URI.parse('https://example.com/')
examples.each(&:freeze)
examples.freeze

Current:

examples[0].normalized_path
=> "//directory"

Preferred:

"/directory"

Current:

examples[0].omit(:authority, :scheme)

Traceback (most recent call last):
  7: from /usr/local/rvm/rubies/ruby-2.7.0/bin/irb:23:in `<main>'
  6: from /usr/local/rvm/rubies/ruby-2.7.0/bin/irb:23:in `load'
  5: from /usr/local/rvm/rubies/ruby-2.7.0/lib/ruby/gems/2.7.0/gems/irb-1.2.1/exe/irb:11:in `<top (required)>'
  4: from (irb):15
  3: from /usr/local/rvm/gems/ruby-2.7.0/gems/addressable-2.7.0/lib/addressable/uri.rb:2300:in `omit'
  2: from /usr/local/rvm/gems/ruby-2.7.0/gems/addressable-2.7.0/lib/addressable/uri.rb:2394:in `defer_validation'
  1: from /usr/local/rvm/gems/ruby-2.7.0/gems/addressable-2.7.0/lib/addressable/uri.rb:2466:in `validate'
Addressable::URI::InvalidURIError (Cannot have a path with two leading slashes without an authority set: '//directory')

Preferred:

"#<Addressable::URI:0xc8 URI:/directory>"

Current:

{ |uri| pp examples[0].route_to(uri) }
#<Addressable::URI:0x5c8 URI:#>
#<Addressable::URI:0x5dc URI:../directory>
#<Addressable::URI:0x5f0 URI:directory2>
#<Addressable::URI:0x604 URI:../directory2>
#<Addressable::URI:0x618 URI:../../>
#<Addressable::URI:0x62c URI:../../>

Preferred:

#<Addressable::URI:0x5c8 URI:#> # unsure if this is expected (if not, should be empty or '.')
#<Addressable::URI:0x5dc URI:#> # same as previous
#<Addressable::URI:0x5f0 URI:directory2>
#<Addressable::URI:0x604 URI:directory2>
#<Addressable::URI:0x618 URI:../>
#<Addressable::URI:0x62c URI:../>

Current:

examples.each { |uri| pp (examples[0].route_from(uri) rescue $!.inspect) }
#<Addressable::URI:0x960 URI:#>
#<Addressable::URI:0x974 URI:/directory>
#<Addressable::URI:0x988 URI:directory>
#<Addressable::URI:0x99c URI:/directory>
"#<Addressable::URI::InvalidURIError: Cannot have a path with two leading slashes without an authority set: '//directory'>"
"#<Addressable::URI::InvalidURIError: Cannot have a path with two leading slashes without an authority set: '//directory'>"

Preferred:

#<Addressable::URI:0x960 URI:#> # see previous
#<Addressable::URI:0x974 URI:#>
#<Addressable::URI:0x988 URI:directory>
#<Addressable::URI:0x99c URI:directory>
#<Addressable::URI:0xXXX URI:/directory>
#<Addressable::URI:0xXXX URI:/directory>

Current:

Addressable::URI.parse('file://file.txt').path
=> ""
Addressable::URI.parse('file://../file.txt').absolute?
=> true
Addressable::URI.parse('file:///../file.txt').normalized_path
=> "/file.txt"

Preferred:

=> "file.txt"
=> false
=> "../file.txt" # technically this should actually be an error, but the original value would do

Note: the preferred outputs above assume the gem is always, immediately translating the // into / when necessary. It may be preferred to only show these preferred values after a normalize, however, but errors would need to not be thrown when using the above modifications. (I'm not sure if the gem currently promises all URIs are valid in their non-normalized form in any case.)

mvastola · 2020-10-11T04:03:42Z

Do you have more concrete examples? I'm not sure those are the best software examples to compare Addressable with. A web browser would be better IMHO. I tried https://example.com//login in Chrome and Firefox, and they both make the request to https://example.com//login.

Sorry. Just re-read this. I guess I would disagree that web servers aren't a good example here, though I can look for other software.

The thing with HTTP clients is that they intentionally don't tweak much in URLs because different servers might parse things differently. (I suppose this is an argument for only making the changes I listed happen during normalization.) The consequence (or perhaps the cause) of this is that HTTP clients don't have to make the conversion because HTTP servers tend to.

This means that if we care about compatibility, this change would fix the issue on servers using this code, while not breaking it on clients who use this because the servers those clients connect to are prone to ignore the difference.

dentarg · 2020-10-11T09:05:53Z

Re: the file URI scheme, I think Addressable is doing the correct things, see https://en.wikipedia.org/wiki/File_URI_scheme#How_many_slashes? (until I looked this up, I didn't know that file: supported hostnames, so TIL 😅)

$ irb -raddressable/uri
irb(main):001:0> Addressable::VERSION::STRING
=> "2.7.0"
irb(main):002:0> Addressable::URI.parse('file://file.txt').hostname
=> "file.txt"

I guess I would disagree that web servers aren't a good example here, though I can look for other software.

Yeah, I think we should compare Addressable with other software that parse URIs. I'm sure there are popular libraries in other languages we could compare with. I don't think the web servers are great examples because it is another level of the stack. Also, visting a popular website we go even higher up the stack: it could be their web framework that decides the behaviour around //, or even their application logic. I'm not sure Addressable should make this decision. Perhaps it should be easier to use Addressable to make the decision you want.

mvastola · 2020-10-12T00:17:34Z

Re: the file URI scheme, I think Addressable is doing the correct things, see https://en.wikipedia.org/wiki/File_URI_scheme#How_many_slashes? (until I looked this up, I didn't know that file: supported hostnames, so TIL sweat_smile)

That's both interesting and really weird (the hostname thing).

I think this really comes down to how this gem ultimately prioritizes the qualities that make a library like this useful:

conformance -- how well it adheres to specifications
compatibility -- how well it works with different use cases/software
substitutability -- how invisibly this library can be swapped out with similar ones (in this case, mainly URI from stdlib)
robustness -- how tolerant and configurable it is when the first two things conflict

The file scheme issue -- which was really an aside -- is actually a perfect example of this. If we're going by conformance, file://relative_path should never be supported. But if we're going by what's actually most compatible with other software, this should be interpreted as a relative path, once it is taken into account that the file://localhost/absolute_path syntax is not often used (demonstrated by the fact that neither of us had even heard of it) compared to the "incorrect" relative path usage.

"Robustness" would suggest either the desired usage is guessed, or it were configurable.
"Substitutability" of course, would mean it functions exactly like the regular URI, which is really only competitor to this gem. I think this gem -- to a certain extent -- implicitly de-prioritizes substitutability because it's entire argument for existing is that it's better than URI, which makes no sense if they work identically.

Yeah, I think we should compare Addressable with other software that parse URIs. I'm sure there are popular libraries in other languages we could compare with.

So while I'm not even very passionate about the file issue, I think it has some crossover with this issue. In the case of file URLs, there are no web servers involved, and it simplifies the issue. There, as I noted above, both FF and Chrome handle file:////blah transparently (at least on Linux), which means they either translate the path to remove the errant / or they pass the value onto the kernel, which then ignores it. So I think that's a strong argument that -- for file URIs at least -- a fourth / should be ignored.

I don't think the web servers are great examples because it is another level of the stack. Also, visiting a popular website we go even higher up the stack: it could be their web framework that decides the behaviour around //, or even their application logic.

This really goes back to my point before about priorities and compatibility. I just tried the // think on a plain nginx install, and -- like apache2 -- it also ignores the extra /.

I would agree that when a web framework or app logic is involved, URIs are often simply passed down to that component to handle the issue. Here's the thing though: ruby is perhaps most often used is used in connection in said web frameworks and application logic. So it's hardly inconceivable that authors of such libraries would want to hand off the parsing of URIs to gems such as this and expect them to be the one to handle this nuance. It certainly would seem awkward for such authors to need to correct the output of this library before being able to use it.

My original use case involved attempting to compare two URLs when one of the values was mangled by another server to have an extra /. It's hard for me to see why this gem shouldn't have normalized away this inconsistency.

I'm not sure Addressable should make this decision.

But not making a decision (as is the case now) is itself a decision. And while various tools will leave even non-conforming URIs be and pass them on dutifully as-is, if that functionality were desired, addressable is pretty much overkill as there would be little reason to parse the whole thing.

Perhaps it should be easier to use Addressable to make the decision you want.

That I agree with; it's basically what I was suggesting when saying the #normalize method should do this.

sporkmonger · 2021-02-12T00:18:59Z

I think it's worth calling out that Addressable's heuristic_parse does aim to hew quite closely to whatever browsers are doing in the address bar. The parse function should probably not change behavior here at all, but I could be easily convinced that normalized_path should.

mvastola · 2021-02-12T01:21:24Z

@sporkmonger , good point. my initial post as actually about that function ironically, but totally forgot about it when providing the requested examples. My bad.

Unfortunately though, the output of ::heuristic_parse in the examples I gave seems to be identical to the simple ::parse (well, there is one case where the result is different, but it's still not 'correct', at least as far as heuristic goes). But yes, I totally get keeping ::parse strictly conformant to specs (or at least as much as possible).

Could ::heuristic_parse be updated to work in the cases I provided in this case?

As for #normalized_path, I actually wouldn't try to convince you to change it, as I recall some cases were pointed out (though my memory may not be 100%) where a valid, usable URL could be interpreted differently, and the normalization algorithm is actually spelled out in the specs.

KelseyDH · 2021-08-24T11:00:53Z

+1 to this issue. While conformity to RFC standards is certainly the priority, where possible libraries should strive to remain functional when the non-compliance can be safely recovered from. I encountered problems with both Addressable and Ruby's URI libraries, when attempting to parse params from this non-conforming URL:

http://localhost:4300/webapp/foo/#//controller/action?account=001-001-111&email=john%40email.com

dentarg · 2021-08-27T21:49:26Z

@KelseyDH your problem does not sound related to this issue, you have a perfectly valid URI, with a fragment.

You could do this:

irb(main):003:0> Addressable::VERSION::STRING
=> "2.8.0"
irb(main):004:0> uri = Addressable::URI.parse 'http://localhost:4300/webapp/foo/#//controller/action?account=001-001-111&email=john%40email.com'
=> #<Addressable::URI:0x140 URI:http://localhost:4300/webapp/foo/#//controller/action?account=001-001-111&email=john%40email.com>
irb(main):005:0> uri.query
=> nil
irb(main):006:0> uri.fragment
=> "//controller/action?account=001-001-111&email=john%40email.com"
irb(main):007:0> Addressable::URI.parse(uri.fragment)
=> #<Addressable::URI:0x154 URI://controller/action?account=001-001-111&email=john%40email.com>
irb(main):008:0> Addressable::URI.parse(uri.fragment).query
=> "account=001-001-111&email=john%40email.com"
irb(main):009:0> Addressable::URI.parse(uri.fragment).query_values
=> {"account"=>"001-001-111", "email"=>"john@email.com"}

dentarg mentioned this issue Jan 15, 2022

Inconsistent Addressable::URI#heuristic_parse behavior with schemaless URLs #265

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent Behavior In Paths Starting With `//` #401

Inconsistent Behavior In Paths Starting With `//` #401

mvastola commented Oct 10, 2020

dentarg commented Oct 10, 2020

The Solution

mvastola commented Oct 11, 2020

mvastola commented Oct 11, 2020

dentarg commented Oct 11, 2020

mvastola commented Oct 12, 2020

sporkmonger commented Feb 12, 2021

mvastola commented Feb 12, 2021 •

edited

KelseyDH commented Aug 24, 2021 •

edited

dentarg commented Aug 27, 2021

Inconsistent Behavior In Paths Starting With // #401

Inconsistent Behavior In Paths Starting With // #401

Comments

mvastola commented Oct 10, 2020

Background

Unexpected/Incompatible Behavior

Standards

The Solution

Extra Credit (A related bug)

Reference (Excerpts cited above)

ANBF for URI Paths from RFC 3986

Algorithm for normalizing/resolving relative paths within a URI's path in RFC 3986

dentarg commented Oct 10, 2020

The Solution

mvastola commented Oct 11, 2020

mvastola commented Oct 11, 2020

dentarg commented Oct 11, 2020

mvastola commented Oct 12, 2020

sporkmonger commented Feb 12, 2021

mvastola commented Feb 12, 2021 • edited

KelseyDH commented Aug 24, 2021 • edited

dentarg commented Aug 27, 2021

Inconsistent Behavior In Paths Starting With `//` #401

Inconsistent Behavior In Paths Starting With `//` #401

mvastola commented Feb 12, 2021 •

edited

KelseyDH commented Aug 24, 2021 •

edited