Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement RFC 3986 URL parsing #1487

Merged
merged 13 commits into from
Dec 7, 2018
Merged

Conversation

sethmlarson
Copy link
Member

@sethmlarson sethmlarson commented Nov 29, 2018

This PR adds the rfc3986 module into urllib3.packages and replaces our current URL parser with the one from rfc3986. This brings our URL parser in-line with Python standard library parsing which all use RFC 3986.

As far as I can see the only backwards compatibility we lose here is we can no longer parse http://@. I'm hoping breaking this use case is acceptable. :) I've added a few URLs from various sources that other URL parsers had trouble with.

Hoping I can get a review on your thoughts @sigmavirus24?

Reference for one of the example URLs (maybe more here?): https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-New-Era-Of-SSRF-Exploiting-URL-Parser-In-Trending-Programming-Languages.pdf

Closes #466
Closes #859
Closes #952
Closes #1096

@sigmavirus24
Copy link
Contributor

I think there are some open bugs that should be fixed before we move over to this wholesale.

@sethmlarson
Copy link
Member Author

I can work on some of those bugs now, is there a list of bugs you'd like squashed before you're comfortable with this merge?

@sigmavirus24
Copy link
Contributor

python-hyper/rfc3986#32 is the main one I think

@sigmavirus24
Copy link
Contributor

This is probably a nice to have as well: python-hyper/rfc3986#36

@sethmlarson
Copy link
Member Author

I'll take a look at these issues, thanks!

@sigmavirus24
Copy link
Contributor

And if we ever use parsed URIs as dictionary keys, we'll likely need: python-hyper/rfc3986#35 (based on the description)

@sethmlarson
Copy link
Member Author

I might pick that one off while I have my environment open anyways. ;)

@sethmlarson
Copy link
Member Author

Thanks for reviewing that change @sigmavirus24.

Pending python-hyper/rfc3986#37 being merged if we assume that python-hyper/rfc3986#32 is a non-issue until @kennethreitz gets back to us is there anything else that needs to be done for a new release of rfc3986 being tagged?

@sigmavirus24
Copy link
Contributor

I think we need release notes and a version bump. Once merged I can run the automation for it.

@sigmavirus24
Copy link
Contributor

Once updated to 1.2.0 I'm 👍 on this

@codecov-io
Copy link

codecov-io commented Dec 5, 2018

Codecov Report

Merging #1487 into master will decrease coverage by 2.92%.
The diff coverage is 67.56%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1487      +/-   ##
==========================================
- Coverage   69.49%   66.57%   -2.93%     
==========================================
  Files          22       22              
  Lines        2685     2776      +91     
==========================================
- Hits         1866     1848      -18     
- Misses        819      928     +109
Impacted Files Coverage Δ
src/urllib3/contrib/socks.py 35.29% <100%> (-0.54%) ⬇️
src/urllib3/util/ssl_.py 36.2% <50%> (-8.82%) ⬇️
src/urllib3/util/url.py 52.76% <68.75%> (-37.15%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cbe5732...06acee2. Read the comment docs.

abnf_regexp.IPv4_RE,
abnf_regexp.IPv6_RE,
abnf_regexp.IPv6_ADDRZ_RE,
abnf_regexp.IPv_FUTURE_RE,
abnf_regexp.IP_LITERAL_RE
abnf_regexp.IPv_FUTURE_RE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if rfc3986 could just provide a regular expression for use here. 🤷‍♀️

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, something like misc.IP_ADDRESS_MATCHER?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that rfc3986 will ever need to use the matcher so I'd prefer not to add more time compiling a regexp we're not going to use. We could just have them all in one RegExp though in abnf_regexp

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, I can create that issue to track the idea.

@sethmlarson
Copy link
Member Author

Current issue is specifically Windows Python 2.7 having issues with socks5:// URLs sending the incorrect IP address. Looks like somehow getting 0.0.0.1 after deserializing socket.inet_ntoa() which implies it's reading b'\x00\x00\x00\x01' from the socket? Need to figure out why that's occurring and why only on Windows+Python 2.7, all other tests are running fine.

@sethmlarson
Copy link
Member Author

sethmlarson commented Dec 6, 2018

It looks like win_inet_pton package is still getting installed by PySocks correctly. Maybe this is an issue with the URL parser returning unicode now on Python 2.7? Will investigate this tonight.

Also according to our documentation IPv6 isn't supported by PySocks in some situations but I might want to revisit that.

@sethmlarson
Copy link
Member Author

Looks like the unicode addresses with win_inet_pton was the problem. Going to open an issue on that repo as socket.inet_pton supports unicode addresses on non-Windows Python 2.7. I'll clean up all the little commits I've made from the PR and add some test cases for ensuring that each piece of the URL is not unicode.

src/urllib3/contrib/socks.py Outdated Show resolved Hide resolved
test/contrib/test_socks.py Outdated Show resolved Hide resolved
@sethmlarson
Copy link
Member Author

Ready for a rereview @sigmavirus24, thanks for catching those artifacts.

@sethmlarson sethmlarson merged commit 0aa3e24 into urllib3:master Dec 7, 2018
@sethmlarson
Copy link
Member Author

Woo! 🎉

@sethmlarson sethmlarson deleted the url-rfc3986 branch December 7, 2018 16:19
@shazow
Copy link
Member

shazow commented Dec 7, 2018

Yay this is great!! :)

@theacodes
Copy link
Member

Yay this is great!! :)

For real. Thanks for doing this, @SethMichaelLarson and thank you for reviewing, @sigmavirus24!

@sigmavirus24
Copy link
Contributor

I did almost no hard work whatsoever. 👏 @SethMichaelLarson

@trustedsec
Copy link

There appears to be normalization of URL's occurring that removes intended functionality through GET/POST requests.

Built a scanner that looks for a directory traversal issue to help to troubleshoot missing patches and it strips out the ../ in the request.

Even with allow_redirects=False the request still gets stripped.

Example:

$ python3
Python 3.7.4 (default, Dec 13 2019, 01:02:18)
[GCC 7.3.1 20180712 (Red Hat 7.3.1-6)] on linux
Type "help", "copyright", "credits" or "license" for more information.

import requests
req = requests.get('https://x.x.x.x/vpn/../vpns/etc/smb.conf', allow_redirects=False)
req.content
b'<script type="text/javascript" src="/vpn/resources.js"></script><script type="text/javascript" language="javascript">var Resources = new ResourceManager("/logon/themes/Default/resources/{lang}", "REDIRECTION_BODY");</script><script type="text/javascript" language="javascript">Resources.Load();</script>'
req.status_code
302
req.url
'https://x.x.x.x/vpns/etc/smb.conf'

Notice the request for ../ is stripped/removed.

Dobatymo pushed a commit to Dobatymo/urllib3 that referenced this pull request Mar 16, 2022
@jbwdevries
Copy link

FYI: This broke an integration with PLC software for us, because it changes the URLs send out (via requests).

This is the changed behaviour:

python3 -m venv venv

venv/bin/pip install urllib3==1.23
venv/bin/python -c 'from urllib3.util import parse_url ; print(parse_url("http://foo.tld/?a=[]"))'
# http://foo.tld/?a=[]

venv/bin/pip install urllib3==1.26.11
venv/bin/python -c 'from urllib3.util import parse_url ; print(parse_url("http://foo.tld/?a=[]"))'
# http://foo.tld/?a=%5B%5D

Currently trying to find a workaround.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants