Use rfc3986.validator.Validator for parse_url #1531
validator = Validator()
try:
    validator.check_validity_of(
        *validator.COMPONENT_NAMES
So you want to validate literally everything, yes? I wonder if we could make a better API for this.
Yeah, validate all components. Leaving the hard work to you: I can whip up a patch, tests, and docs if you can think of a name for the interface. ;)
validate_all_the_components_<emoji>
Looks like validator is upset with IPv6 addresses in our tests. I'll add some to the URL tests and investigate why we're breaking.
@sethmlarson the failures look related to Zone Identifiers (RFC 6874). rfc3986 added support for those in python-hyper/rfc3986#2 but I wonder if I overlooked support for them in the Validator work or if our IPv6 validation is a bit too strict. Let me know if you need more information than that.
@sigmavirus24 Thanks for this info, I can look more tonight. I figured out the issue was with zoning and messaged you about it on Keybase. Is that an avenue where I can contact you, or should I stick to email?
@@ -14,12 +15,12 @@
NORMALIZABLE_SCHEMES = ('http', 'https', None)

# Regex for detecting URLs with schemes. RFC 3986 Section 3.1
SCHEME_REGEX = re.compile(r"^[a-zA-Z][a-zA-Z0-9+\-.]*://")
@sigmavirus24 What do you think of taking this `.` out of this scheme regex? I did this because I don't think we support any scheme that has this `.` here, but we support a lot of schemeless "URLs" where the authority section looks like a scheme (`www.google.com` is a valid "scheme"). Should we get even more strict and only support schemes that start with `http`? I'm not sure.
So, I don't understand the purpose the `.` is serving. We can't, however, limit ourselves to what we think of as normal schemes because we (fortunately, or not) support `http+unix://`
The `.` is part of the scheme spec but I couldn't find an example scheme that contained a period in it. The three schemes I know of that we support are `http`, `https`, and `http+unix`, case insensitive.
Removing the period prevents a few issues like us thinking that `google.com:433/path` is scheme=`google.com`, host=`433`, path=`/path` and instead forcing a parse on `://google.com:433/path` which gives us a correct result.
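To make the ambiguity above concrete, here is a minimal sketch. Both patterns are illustrative assumptions: they match the generic `scheme ":"` form from RFC 3986 Section 3.1 rather than the exact `SCHEME_REGEX` in the diff, and `detect_scheme` is a hypothetical helper, not a urllib3 function.

```python
import re

# RFC 3986 Section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# Illustrative patterns only; neither is the regex used in this PR.
SPEC_SCHEME = re.compile(r"^([a-zA-Z][a-zA-Z0-9+\-.]*):")
NO_DOT_SCHEME = re.compile(r"^([a-zA-Z][a-zA-Z0-9+\-]*):")


def detect_scheme(pattern, url):
    """Return the scheme that `pattern` finds at the start of `url`, or None."""
    match = pattern.match(url)
    return match.group(1) if match else None


for url in ("google.com:433/path", "http+unix://socket/path", "HTTPS://example.com"):
    # With the period allowed, a schemeless host:port is mistaken for a scheme.
    print(url, detect_scheme(SPEC_SCHEME, url), detect_scheme(NO_DOT_SCHEME, url))
```

With the period allowed, `google.com` is reported as a scheme; without it, the schemeless input is left alone while `http+unix` and mixed-case `HTTPS` are still detected.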
I don't really look at or check Keybase much.
@sigmavirus24 I think I may have found the source of the issue: Our unit tests use the Zone ID format from RFC 4007 which says to use a literal percent character instead of a percent-encoded percent character ( Should we drop support for this now that RFC 6874 calls out that RFC 4007 breaks URI syntax rules? If that's the case I can fix all the unit tests that use this syntax. One alternative is to allow Zone IDs to be parsed with RFC 4007 with just
I think going from
Sounds good to me, I'll create an issue and PR for that.
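For reference, the two Zone ID spellings in this thread differ only in the delimiter: RFC 4007 text form uses a bare `%`, while RFC 6874 requires `%25` inside a URI. A rough sketch of rewriting one into the other follows. `to_rfc6874_host` and `_BARE_ZONE` are hypothetical names, not part of urllib3 or rfc3986, and the sketch assumes the host is already isolated and bracketed.

```python
import re

# Hypothetical helper, not part of urllib3 or rfc3986.
# Matches a bracketed IPv6 literal whose zone ID is introduced by a bare "%"
# (RFC 4007 style) rather than by "%25" (RFC 6874 style).
_BARE_ZONE = re.compile(r"^\[([0-9a-fA-F:.]+)%(?!25)([^\]]+)\]$")


def to_rfc6874_host(host):
    """Rewrite "[addr%zone]" as "[addr%25zone]"; return other hosts unchanged."""
    match = _BARE_ZONE.match(host)
    if match is None:
        return host
    address, zone = match.groups()
    return "[{}%25{}]".format(address, zone)


print(to_rfc6874_host("[fe80::1%eth0]"))
print(to_rfc6874_host("[fe80::1%25eth0]"))
```

One caveat with this approach: a bare-`%` zone whose name itself starts with `25` cannot be told apart from the encoded form, so the sketch leaves it untouched.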
Codecov Report
@@            Coverage Diff             @@
##           master    #1531      +/-   ##
==========================================
- Coverage     100%   99.89%   -0.11%
==========================================
  Files          22       22
  Lines        1857     1873      +16
==========================================
+ Hits         1857     1871      +14
- Misses          0        2       +2
This LGTM pending @sigmavirus24's approval
Pushing rfc3986 master to this PR for testing purposes; we will need another release of rfc3986 before we can merge this PR.
test/test_util.py
@@ -220,6 +221,10 @@ def test_parse_url(self, url, expected_url):

    @pytest.mark.parametrize('url, expected_url', parse_url_host_map)
    def test_unparse_url(self, url, expected_url):

        if '/../' in url:
This is odd
I agree, maybe I can split it out into a different test case about path normalization / unsplitting.
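As background for the path normalization mentioned here, RFC 3986 Section 5.2.4 defines a "remove dot segments" step that collapses `.` and `..`. Below is a minimal sketch of that step. It is a stand-alone illustration with a hypothetical function name, and it may differ in details from what rfc3986 or urllib3 actually do.

```python
def remove_dot_segments(path):
    """Collapse "." and ".." segments, in the spirit of RFC 3986 Section 5.2.4."""
    output = []
    for segment in path.split("/"):
        if segment == ".":
            # "." refers to the current segment, so it is dropped.
            continue
        if segment != "..":
            output.append(segment)
        elif output:
            # ".." removes the previously kept segment, if any.
            output.pop()
    # Restore the leading slash if popping segments removed it.
    if path.startswith("/") and (not output or output[0]):
        output.insert(0, "")
    # A path ending in a dot segment keeps its trailing slash.
    if path.endswith(("/.", "/..")):
        output.append("")
    return "/".join(output)


print(remove_dot_segments("/a/b/c/./../../g"))
print(remove_dot_segments("mid/content=5/../6"))
```

A dedicated normalization test could then assert on inputs like these directly, instead of special-casing `'/../'` inside `test_unparse_url`.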
@sigmavirus24 Is it acceptable to normalize a path of
99% certain the folks who wrote the RFC3986 section on Resolution of Partial references intended it to act like
Nope. Mis-remembering that
So
cce5323 to 134367e
No, reading further, I think
Sure, so there needs to be an update to rfc3986 here? I can open another PR if you'd like.
Pretty sure that's why we need:
I think Requests already relies on I think we definitely intended to handle IRIs, I just forgot we handled them when helping us over onto rfc3986.
I guess I'll add those issues to my backlog as well.
@sigmavirus24 What are your thoughts on adding the following changes to this branches
That is to say "Yes! Let's add it to rfc3986, with these caveats"
Sounds good to me, I'll make those updates and once we get (hopefully) one more release I can update this PR.
I've opened python-hyper/rfc3986#50; once that is merged we can close this PR out.
IRI support has landed in rfc3986 v1.3.0 so now this PR can continue! 🎉
Woo!!! :D
Add a set of tests to make sure URLs fail when they've got invalid characters in specific components. Closes #1529