Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use rfc3986.validator.Validator for parse_url #1531

Merged
merged 16 commits into from Apr 21, 2019

Conversation

sethmlarson
Copy link
Member

@sethmlarson sethmlarson commented Jan 27, 2019

Add a set of tests to make sure URLs fail when they've got invalid characters in specific components. Closes #1529

validator = Validator()
try:
validator.check_validity_of(
*validator.COMPONENT_NAMES
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you want to validate literally everything, yes? I wonder if we could make a better API for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah validate all components. Leaving the hard work to you I can whip up a patch, tests, and docs if you can think of a name for the interface. ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validate_all_the_components_<emoji> :trollface:

@sethmlarson
Copy link
Member Author

Looks like validator is upset with IPv6 addresses in our tests. I'll add some to the URL tests and investigate why we're breaking.

@sethmlarson sethmlarson changed the title Use rfc3986.Validator for parse_url Use rfc3986.validator.Validator for parse_url Jan 28, 2019
@sigmavirus24
Copy link
Contributor

@sethmlarson the failures look related to Zone Identifiers (RFC 6874). rfc3986 added support for those in python-hyper/rfc3986#2 but I wonder if I overlooked support for them in the Validator work or if our IPv6 validation is a bit too strict.

Let me know if you need more information than that.

@sethmlarson
Copy link
Member Author

sethmlarson commented Jan 28, 2019

@sigmavirus24 Thanks for this info, I can look more tonight. I figured out the issue was with zoning and messaged you about it on Keybase. Is that an avenue I can contact you or should I stick to email?

@@ -14,12 +15,12 @@
NORMALIZABLE_SCHEMES = ('http', 'https', None)

# Regex for detecting URLs with schemes. RFC 3986 Section 3.1
SCHEME_REGEX = re.compile(r"^[a-zA-Z][a-zA-Z0-9+\-.]*://")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sigmavirus24 What do you think of taking this . out of this scheme regex? I did this because I don't think we support any scheme that has this . here but we support a lot of schemeless "URLs" where the authority section looks like a scheme (www.google.com is a valid "scheme"). Should we get even more strict and only support schemes that start with http? I'm not sure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I don't understand the purpose the . is serving. We can't, however, limit ourselves to what we think of as normal schemes because we (fortunately, or not) support http+unix://

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The . is part of the scheme spec but I couldn't find an example scheme that contained a period in it. The three schemes I know of that we support are http, https, and http+unix case insensitive.

Removing the period prevents a few issues like us thinking that google.com:433/path is scheme=google.com, host=433, path=/path and instead forcing a parse on ://google.com:433/path which gives us a correct result.

@sigmavirus24
Copy link
Contributor

I figured out the issue was with zoning and messaged you about it on Keybase. Is that an avenue I can contact you or should I stick to email?

I don't really look at or check Keybase much.

@sethmlarson
Copy link
Member Author

sethmlarson commented Jan 28, 2019

@sigmavirus24 I think I may have found the source of the issue: Our unit tests use Zone ID format from RFC 4007 which says to use a literal percent character instead of a percent-encoded percent character (% instead of %25) which causes host validation to fail. Per RFC 3986 % isn't valid within the host component.

Should we drop support for this now that RFC 6874 calls out that RFC 4007 breaks URI syntax rules? If that's the case I can fix all the unit tests that use this syntax.

One alternative is to allow Zone IDs to be parsed with RFC 4007 with just % if the next two bytes aren't 25 (In either rfc3986 or urllib3)? I don't know how prevalent RFC 4007 zone IDs are. I'd assume common based on a quick Google search?

@sigmavirus24
Copy link
Contributor

One alternative is to allow Zone IDs to be parsed with RFC 4007 with just % if the next two bytes aren't 25 (In either rfc3986 or urllib3)?

I think going from

[::1%eth0] to [::1%25eth0] and normalizing 4007 syntax to 6874 makes sense to me for rfc3986 to do. But I think that 3986 probably needs an update to accommodate 6874 zone Ids.

@sethmlarson
Copy link
Member Author

Sounds good to me, I'll create an issue and PR for that.

@codecov-io
Copy link

codecov-io commented Jan 29, 2019

Codecov Report

Merging #1531 into master will decrease coverage by 0.1%.
The diff coverage is 97.14%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1531      +/-   ##
==========================================
- Coverage     100%   99.89%   -0.11%     
==========================================
  Files          22       22              
  Lines        1857     1873      +16     
==========================================
+ Hits         1857     1871      +14     
- Misses          0        2       +2
Impacted Files Coverage Δ
src/urllib3/connectionpool.py 100% <100%> (ø) ⬆️
src/urllib3/util/url.py 98.01% <96.55%> (-1.99%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1d3e60e...15be9ca. Read the comment docs.

Copy link
Member

@theacodes theacodes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM pending @sigmavirus24's approval

@sethmlarson
Copy link
Member Author

Pushing rfc3986 master to this PR for testing purposes, will need another release of rfc3986 before we can merge this PR.

@@ -220,6 +221,10 @@ def test_parse_url(self, url, expected_url):

@pytest.mark.parametrize('url, expected_url', parse_url_host_map)
def test_unparse_url(self, url, expected_url):

if '/../' in url:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is odd

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, maybe I can split it out into a different test case about path normalization / unsplitting.

@sethmlarson
Copy link
Member Author

sethmlarson commented Jan 31, 2019

@sigmavirus24 Is it acceptable to normalize a path of /.. to /? That's what our current behavior is and it seems strange to me. I expected a result of /..

@sigmavirus24
Copy link
Contributor

99% certain the folks who wrote the RFC3986 section on Resolution of Partial references intended it to act like cd ..

@sigmavirus24
Copy link
Contributor

Nope. Mis-remembering that

    C.  if the input buffer begins with a prefix of "/../" or "/..",
       where ".." is a complete path segment, then replace that
       prefix with "/" in the input buffer and remove the last
       segment and its preceding "/" (if any) from the output
       buffer; otherwise,

Source: https://tools.ietf.org/html/rfc3986#section-5.2.4

@sethmlarson
Copy link
Member Author

So /.. -> / is expected behavior due to the "(if any)"?

@sigmavirus24
Copy link
Contributor

No, reading further, I think /.. -> `` the if any means if it looks like `/foo/` and you have `/..` then you it becomes `/foo` and if you have `/foo` and `/..` would become `/` if I remember correctly. There's examples of input and output for the "remove_dot_segments" routine they describe. I don't recall if I use those to test rfc3986 though

@sethmlarson
Copy link
Member Author

Sure, so there needs to be an update to rfc3986 here? I can open another PR if you'd like.

@sigmavirus24
Copy link
Contributor

Pretty sure that's why we need:

I think Requests already relies on idna but I'm not sure that we do here? I think we can parse a URIReference but normalization/validation is where we fall down right now.

I think we definitely intended to handle IRIs, I just forgot we handled them when helping us over onto rfc3986.

@sethmlarson
Copy link
Member Author

I guess I'll add those issues to my backlog as well.

@sethmlarson
Copy link
Member Author

@sigmavirus24 What are your thoughts on adding the following changes to this branches urllib3.packages.rfc3986 into rfc3986? I figured I try things out here before moving them to that repo.

@sigmavirus24
Copy link
Contributor

is_iri is not something I'd prefer. Instead let's have separate types IRI versus URI. And a IRI can be converted to URI via something like an encode method.

@sigmavirus24
Copy link
Contributor

That is to say "Yes! Let's add it to rfc3986, with these caveats"

@sethmlarson
Copy link
Member Author

Sounds good to me, I'll make those updates and once we get (hopefully) one more release I can update this PR.

@sethmlarson
Copy link
Member Author

I've opened python-hyper/rfc3986#50, once that is merged we can close this PR out

@sethmlarson
Copy link
Member Author

IRI support has landed in rfc3986 v1.3.0 so now this PR can continue! 🎉

@sethmlarson sethmlarson merged commit 5d52370 into urllib3:master Apr 21, 2019
@sethmlarson sethmlarson deleted the url-fixes branch April 21, 2019 01:43
@sethmlarson
Copy link
Member Author

Woo!!! :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

URLs without // don't parse correctly
4 participants