Use rfc3986.validator.Validator for parse_url #1531

sethmlarson · 2019-01-27T23:40:09Z

Add a set of tests to make sure URLs fail when they've got invalid characters in specific components. Closes #1529

sigmavirus24 · 2019-01-28T00:23:03Z

src/urllib3/util/url.py

+    validator = Validator()
+    try:
+        validator.check_validity_of(
+            *validator.COMPONENT_NAMES


So you want to validate literally everything, yes? I wonder if we could make a better API for this.

Yeah, validate all components. Leaving the hard work to you: I can whip up a patch, tests, and docs if you can think of a name for the interface. ;)

validate_all_the_components_<emoji>

sethmlarson · 2019-01-28T00:47:36Z

Looks like validator is upset with IPv6 addresses in our tests. I'll add some to the URL tests and investigate why we're breaking.

sigmavirus24 · 2019-01-28T14:53:48Z

@sethmlarson the failures look related to Zone Identifiers (RFC 6874). rfc3986 added support for those in python-hyper/rfc3986#2 but I wonder if I overlooked support for them in the Validator work or if our IPv6 validation is a bit too strict.

Let me know if you need more information than that.

sethmlarson · 2019-01-28T15:04:09Z

@sigmavirus24 Thanks for this info, I can look more tonight. I figured out the issue was with zoning and messaged you about it on Keybase. Is that an avenue I can contact you or should I stick to email?

sethmlarson · 2019-01-28T15:23:05Z

src/urllib3/util/url.py

@@ -14,12 +15,12 @@
 NORMALIZABLE_SCHEMES = ('http', 'https', None)

 # Regex for detecting URLs with schemes. RFC 3986 Section 3.1
-SCHEME_REGEX = re.compile(r"^[a-zA-Z][a-zA-Z0-9+\-.]*://")


@sigmavirus24 What do you think of taking this . out of this scheme regex? I did this because I don't think we support any scheme that has this . here but we support a lot of schemeless "URLs" where the authority section looks like a scheme (www.google.com is a valid "scheme"). Should we get even more strict and only support schemes that start with http? I'm not sure.

So, I don't understand the purpose the . is serving. We can't, however, limit ourselves to what we think of as normal schemes because we (fortunately, or not) support http+unix://

The . is part of the scheme spec, but I couldn't find an example scheme that contained a period. The three schemes I know of that we support are http, https, and http+unix, all case-insensitive.

Removing the period prevents a few issues like us thinking that google.com:433/path is scheme=google.com, host=433, path=/path and instead forcing a parse on ://google.com:433/path which gives us a correct result.
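To illustrate the trade-off discussed here, a minimal sketch compares the original pattern with a variant that drops the period. The names SCHEME_WITH_DOT, SCHEME_WITHOUT_DOT, and has_scheme are hypothetical and exist only for this illustration; the second pattern is an assumption about what the proposed change looks like, not the merged code.

```python
import re

# Pattern from the diff: "." is allowed in the scheme, as RFC 3986 Section 3.1 permits.
SCHEME_WITH_DOT = re.compile(r"^[a-zA-Z][a-zA-Z0-9+\-.]*://")

# Assumed variant with the period removed (hypothetical, for illustration only).
SCHEME_WITHOUT_DOT = re.compile(r"^[a-zA-Z][a-zA-Z0-9+\-]*://")


def has_scheme(url, pattern):
    """Return True if `url` starts with something `pattern` accepts as a scheme."""
    return pattern.match(url) is not None


# http+unix:// is accepted by both patterns, so the supported schemes keep working.
print(has_scheme("http+unix://socket", SCHEME_WITH_DOT))
print(has_scheme("http+unix://socket", SCHEME_WITHOUT_DOT))

# A hostname-like prefix is only treated as a scheme when "." is allowed.
print(has_scheme("www.google.com://path", SCHEME_WITH_DOT))
print(has_scheme("www.google.com://path", SCHEME_WITHOUT_DOT))
```

The stricter pattern gives up dotted schemes, which the RFC allows, in exchange for not mistaking a hostname for a scheme.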

sigmavirus24 · 2019-01-28T17:05:19Z

I figured out the issue was with zoning and messaged you about it on Keybase. Is that an avenue I can contact you or should I stick to email?

I don't really look at or check Keybase much.

sethmlarson · 2019-01-28T18:38:35Z

@sigmavirus24 I think I may have found the source of the issue: our unit tests use the Zone ID format from RFC 4007, which uses a literal percent character instead of a percent-encoded percent character (% instead of %25). That causes host validation to fail, because per RFC 3986 % isn't valid within the host component.

Should we drop support for this now that RFC 6874 calls out that RFC 4007 breaks URI syntax rules? If that's the case I can fix all the unit tests that use this syntax.

One alternative is to allow Zone IDs to be parsed with RFC 4007 with just % if the next two bytes aren't 25 (In either rfc3986 or urllib3)? I don't know how prevalent RFC 4007 zone IDs are; based on a quick Google search, I'd assume they're common.

sigmavirus24 · 2019-01-28T19:06:11Z

One alternative is to allow Zone IDs to be parsed with RFC 4007 with just % if the next two bytes aren't 25 (In either rfc3986 or urllib3)?

I think going from [::1%eth0] to [::1%25eth0] and normalizing RFC 4007 syntax to RFC 6874 makes sense to me for rfc3986 to do. But I think that 3986 probably needs an update to accommodate RFC 6874 zone IDs.
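A minimal sketch of that normalization, assuming the host arrives as a bracketed IPv6 literal. The helper name normalize_zone_id is hypothetical and this is not the code that went into rfc3986; it only applies the "% not followed by 25" rule suggested earlier in the thread.

```python
import re

# A "%" that is not already the start of "%25" is treated as an RFC 4007 delimiter.
_BARE_PERCENT = re.compile(r"%(?!25)")


def normalize_zone_id(host):
    """Rewrite an RFC 4007 zone delimiter ("%") to the RFC 6874 form ("%25").

    Hypothetical helper for illustration. Hosts that are not bracketed
    IPv6 literals are returned unchanged.
    """
    if host.startswith("[") and host.endswith("]"):
        # Only the first bare "%" separates the address from the zone ID.
        return _BARE_PERCENT.sub("%25", host, count=1)
    return host


print(normalize_zone_id("[::1%eth0]"))    # RFC 4007 input
print(normalize_zone_id("[::1%25eth0]"))  # already RFC 6874
```

One limitation of this rule: a zone ID that itself begins with "25" is indistinguishable from an already-encoded delimiter, so such an input is left as-is.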

sethmlarson · 2019-01-28T19:08:21Z

Sounds good to me, I'll create an issue and PR for that.

codecov-io · 2019-01-29T16:52:39Z

Codecov Report

Merging #1531 into master will decrease coverage by 0.1%.
The diff coverage is 97.14%.

@@            Coverage Diff             @@
##           master    #1531      +/-   ##
==========================================
- Coverage     100%   99.89%   -0.11%     
==========================================
  Files          22       22              
  Lines        1857     1873      +16     
==========================================
+ Hits         1857     1871      +14     
- Misses          0        2       +2

Impacted Files	Coverage Δ
src/urllib3/connectionpool.py	`100% <100%> (ø)`	⬆️
src/urllib3/util/url.py	`98.01% <96.55%> (-1.99%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1d3e60e...15be9ca.

theacodes

This LGTM pending @sigmavirus24's approval

sethmlarson · 2019-01-29T19:08:49Z

Pushing rfc3986 master to this PR for testing purposes; we'll need another release of rfc3986 before we can merge this PR.

sigmavirus24 · 2019-01-31T14:15:31Z

test/test_util.py

@@ -220,6 +221,10 @@ def test_parse_url(self, url, expected_url):

    @pytest.mark.parametrize('url, expected_url', parse_url_host_map)
    def test_unparse_url(self, url, expected_url):
+
+        if '/../' in url:


This is odd

I agree, maybe I can split it out into a different test case about path normalization / unsplitting.

sethmlarson · 2019-01-31T14:46:30Z

@sigmavirus24 Is it acceptable to normalize a path of /.. to /? That's what our current behavior is and it seems strange to me. I expected a result of /..

sigmavirus24 · 2019-01-31T14:52:55Z

99% certain the folks who wrote the RFC3986 section on Resolution of Partial references intended it to act like cd ..

sigmavirus24 · 2019-01-31T14:54:16Z

Nope. Mis-remembering that

    C.  if the input buffer begins with a prefix of "/../" or "/..",
       where ".." is a complete path segment, then replace that
       prefix with "/" in the input buffer and remove the last
       segment and its preceding "/" (if any) from the output
       buffer; otherwise,

Source: https://tools.ietf.org/html/rfc3986#section-5.2.4

sethmlarson · 2019-01-31T15:00:54Z

So /.. -> / is expected behavior due to the "(if any)"?

sigmavirus24 · 2019-01-31T15:52:31Z

No, reading further, I think /.. -> ``. The "if any" means that if the output looks like `/foo/` and you have `/..`, then it becomes `/foo`, and if you have `/foo` and `/..` it would become `/`, if I remember correctly. There are examples of input and output for the "remove_dot_segments" routine they describe. I don't recall if I use those to test rfc3986, though.
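For reference, the "remove_dot_segments" routine from RFC 3986 Section 5.2.4 can be sketched as a direct transcription of its steps A through E. This is an illustrative version written for this thread, not the implementation in rfc3986 or urllib3.

```python
def remove_dot_segments(path):
    """Transcription of RFC 3986 Section 5.2.4 (illustrative sketch)."""
    output = []  # output buffer, kept as a list of segments
    buf = path   # input buffer
    while buf:
        if buf.startswith("../"):        # step A
            buf = buf[3:]
        elif buf.startswith("./"):       # step A
            buf = buf[2:]
        elif buf.startswith("/./"):      # step B: replace "/./" with "/"
            buf = buf[2:]
        elif buf == "/.":                # step B
            buf = "/"
        elif buf.startswith("/../"):     # step C: replace "/../" with "/"
            buf = buf[3:]
            if output:                   # "(if any)": drop the last output segment
                output.pop()
        elif buf == "/..":               # step C
            buf = "/"
            if output:
                output.pop()
        elif buf in (".", ".."):         # step D
            buf = ""
        else:                            # step E: move the first segment to the output
            idx = buf.find("/", 1)
            if idx == -1:
                output.append(buf)
                buf = ""
            else:
                output.append(buf[:idx])
                buf = buf[idx:]
    return "".join(output)


# The two worked examples given in the RFC, plus the cases from this thread.
for example in ("/a/b/c/./../../g", "mid/content=5/../6", "/..", "/foo/.."):
    print(example, "->", remove_dot_segments(example))
```

Run against the inputs discussed here, this transcription turns both /.. and /foo/.. into /, since the trailing "/" that replaces the prefix is itself moved to the output in step E.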

sethmlarson · 2019-01-31T18:00:53Z

Sure, so there needs to be an update to rfc3986 here? I can open another PR if you'd like.

sigmavirus24 · 2019-02-04T16:44:48Z

Pretty sure that's why we need:

I think Requests already relies on idna but I'm not sure that we do here? I think we can parse a URIReference but normalization/validation is where we fall down right now.

I think we definitely intended to handle IRIs; I just forgot we handled them when helping move us over to rfc3986.

sethmlarson · 2019-02-04T16:50:44Z

I guess I'll add those issues to my backlog as well.

sethmlarson · 2019-03-19T03:25:11Z

@sigmavirus24 What are your thoughts on adding the following changes from this branch's urllib3.packages.rfc3986 into rfc3986? I figured I'd try things out here before moving them to that repo.

sigmavirus24 · 2019-03-19T12:48:13Z

is_iri is not something I'd prefer. Instead, let's have separate types, IRI versus URI, and an IRI can be converted to a URI via something like an encode method.
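As a rough illustration of what such a conversion involves, the sketch below uses only the standard library: the host goes through the built-in "idna" codec and the other components are percent-encoded as UTF-8. The function name iri_to_uri is hypothetical, and this is not the API that was added to rfc3986; userinfo and IPv6 literals are deliberately ignored to keep the example short.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in each component; "%" is kept so that
# already-encoded input is not encoded a second time.
_PATH_SAFE = "/%:@!$&'()*+,;="
_QUERY_SAFE = "/?%:@!$&'()*+,;="


def iri_to_uri(iri):
    """Convert an IRI string to an ASCII-only URI string (illustrative sketch)."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    # The stdlib "idna" codec encodes each non-ASCII label as Punycode.
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += ":%d" % parts.port
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=_PATH_SAFE),
        quote(parts.query, safe=_QUERY_SAFE),
        quote(parts.fragment, safe=_QUERY_SAFE),
    ))


print(iri_to_uri("http://bücher.example/päth?q=ü"))
```

Keeping the IRI and the URI as separate types, as suggested, would make this conversion an explicit step rather than a flag checked inside the parser.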

sigmavirus24 · 2019-03-19T12:48:37Z

That is to say "Yes! Let's add it to rfc3986, with these caveats"

sethmlarson · 2019-03-19T13:05:20Z

Sounds good to me, I'll make those updates and once we get (hopefully) one more release I can update this PR.

sethmlarson · 2019-03-20T01:19:16Z

I've opened python-hyper/rfc3986#50; once that is merged we can close this PR out.

sethmlarson · 2019-04-20T20:41:25Z

IRI support has landed in rfc3986 v1.3.0 so now this PR can continue! 🎉

sethmlarson · 2019-04-21T01:43:29Z

Woo!!! :D

Use the validator for parse_url

5135adf

sethmlarson requested a review from sigmavirus24 January 27, 2019 23:40

Add more URLs

40b6e20

sigmavirus24 reviewed Jan 28, 2019

sigmavirus24 approved these changes Jan 28, 2019

sethmlarson changed the title ~~Use rfc3986.Validator for parse_url~~ Use rfc3986.validator.Validator for parse_url Jan 28, 2019

sethmlarson commented Jan 28, 2019

sethmlarson mentioned this pull request Jan 28, 2019

Normalize RFC 4007 delimiter for IPv6 Zone IDs python-hyper/rfc3986#43

Closed

Use rfc3986.URIReference instead of ParseResult

637bd13

theacodes approved these changes Jan 29, 2019

Update rfc3986 to master

a05d157

Update parse_url to require components too

821c108

sigmavirus24 reviewed Jan 31, 2019

Split path test cases

134367e

sethmlarson force-pushed the url-fixes branch from cce5323 to 134367e Compare January 31, 2019 15:32

temp commit

65f6cb2

sethmlarson mentioned this pull request Mar 18, 2019

CRLF injection vulnerability #1553

Closed

sethmlarson added 4 commits March 18, 2019 19:38

Add unit tests for international hosts

cc82882

use re

323eb56

Add all international components

7866572

fix flake

0ef8b81

sethmlarson mentioned this pull request Mar 23, 2019

Handling "Location: http:///" header #1556

Closed

This was referenced Apr 8, 2019

Add support for TLS 1.3 to all HTTPSConnection implementations #1496

Merged

Add support for brotli content encoding via brotlipy package #1532

Merged

sethmlarson added 5 commits April 20, 2019 15:45

Merge branch 'master' of https://github.com/urllib3/urllib3 into url-…

1e6681e

…fixes

Upgrade rfc3986 to v1.3.0

357d701

Update parse_url and tests

0b1727e

Merge

c2367d8

Fix lint issues in tests

15be9ca

theacodes approved these changes Apr 20, 2019

sethmlarson merged commit 5d52370 into urllib3:master Apr 21, 2019

sethmlarson deleted the url-fixes branch April 21, 2019 01:43

sethmlarson mentioned this pull request Apr 21, 2019

util.parse_url doesnt parse path when schema https: or http: etc is missing #1539

Closed

BKPepe mentioned this pull request Apr 24, 2019

[OpenWrt 18.06] python-urllib3: update to 1.24.3 openwrt/packages#8765

Merged

Dobatymo pushed a commit to Dobatymo/urllib3 that referenced this pull request Mar 16, 2022

Use rfc3986.validator.Validator for parse_url (urllib3#1531)

b69956e

delroth mentioned this pull request Jun 10, 2022

Can't connect to IPv6 Address with Zone ID #1641

Closed
