Hypothesis: builtins.UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte #153

Open
wsanchez opened this issue Jan 25, 2021 · 7 comments · May be fixed by #178
@wsanchez
Contributor

The Hypothesis strategies now shipping with Hyperlink are producing this error occasionally in Klein:

Traceback (most recent call last):
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/klein/test/test_request_compat.py", line 74, in test_uri
    def test_uri(self, url: DecodedURL) -> None:
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hypothesis/core.py", line 1163, in wrapped_test
    raise the_error_hypothesis_found
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/hypothesis.py", line 321, in decoded_urls
    return DecodedURL(draw(encoded_urls()))
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/_url.py", line 2046, in __init__
    self.host, self.userinfo, self.path, self.query, self.fragment
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/_url.py", line 2179, in path
    for p in self._url.path
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/_url.py", line 2179, in <listcomp>
    for p in self._url.path
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/_url.py", line 766, in _percent_decode
    return unquoted_bytes.decode(subencoding)
builtins.UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

klein.test.test_request_compat.HTTPRequestWrappingIRequestTests.test_uri
wsanchez added the bug label Jan 25, 2021
@wsanchez
Contributor Author

It would be helpful to catch this error and print the URL that produced it, so one might see what data is tripping us up.
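A minimal sketch of that idea, using only the standard library rather than Hyperlink's internals. The names `URLDecodeError` and `decode_path_segments` are hypothetical, and the decoding step only stands in for whatever `DecodedURL` does internally:

```python
from typing import List
from urllib.parse import unquote_to_bytes, urlsplit


class URLDecodeError(ValueError):
    """Hypothetical error type that carries the offending URL text."""


def decode_path_segments(url_text: str) -> List[str]:
    """Percent-decode each path segment as UTF-8, naming the URL on failure."""
    # Drop the empty string that precedes the leading "/".
    segments = urlsplit(url_text).path.split("/")[1:]
    decoded = []
    for segment in segments:
        raw = unquote_to_bytes(segment)
        try:
            decoded.append(raw.decode("utf-8"))
        except UnicodeDecodeError as e:
            # Re-raise with the bytes and the full URL in the message.
            raise URLDecodeError(
                f"cannot decode bytes {raw!r} in URL {url_text!r}: {e}"
            ) from e
    return decoded
```

With something like this, a failing Hypothesis example would report the URL text alongside the codec error instead of only the byte and position.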

@wsanchez
Contributor Author

wsanchez commented Jan 29, 2021

Here are some failing examples:

error-causing bytes: b'\x80'
URL: URL.from_text('http://0.0/%80')
error-causing bytes: b'\xe1\x8c\x84\xc3\xa9\xf1\xb1\xa9\x9d\x9b'
URL: URL.from_text('https://ɓ.ő𣫫á:26/ጄé\U00071a5d%9b')
error-causing bytes: b'\xe1\x8c\x84\xc3\xa9\xf1\xb1\xa9\x9d\x9b0'
URL: URL.from_text('https://𐎹pɓ.ő𣫫á:51159/ጄé\U00071a5d%9b0/E7*\x13𐬃\x94\x8e')
error-causing bytes: b'\xe1\x8c\x84\xc3\xa9\xf1\xb1\xa9\x9d\x9b0'
URL: URL.from_text('https://𐎹p1ɜ10貭.в.𢙑dɓ.ő𣫫á:51159/ጄé\U00071a5d%9b0/E7*\x13\U0004216a\x9d𠤈\x94\x8e')

@wsanchez
Contributor Author

wsanchez commented Jan 29, 2021

…which one can reproduce in the REPL:

>>> from hyperlink import EncodedURL, DecodedURL
>>> encodedURL = EncodedURL.from_text('http://0.0/%80')
>>> encodedURL
URL.from_text('http://0.0/%80')
>>> DecodedURL(encodedURL)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2046, in __init__
    self.host, self.userinfo, self.path, self.query, self.fragment
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2177, in path
    [
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2178, in <listcomp>
    _percent_decode(p, raise_subencoding_exc=True)
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 766, in _percent_decode
    return unquoted_bytes.decode(subencoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
>>> encodedURL = EncodedURL.from_text('https://ɓ.ő𣫫á:26/ጄé\U00071a5d%9b')
>>> encodedURL
URL.from_text('https://ɓ.ő𣫫á:26/ጄé\U00071a5d%9b')
>>> DecodedURL(encodedURL)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2046, in __init__
    self.host, self.userinfo, self.path, self.query, self.fragment
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2177, in path
    [
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2178, in <listcomp>
    _percent_decode(p, raise_subencoding_exc=True)
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 766, in _percent_decode
    return unquoted_bytes.decode(subencoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9b in position 9: invalid start byte
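The same failure can be seen without Hyperlink at all. Bytes in the range 0x80 to 0xBF are UTF-8 continuation bytes and can never start a sequence, so a percent-escape such as `%80` or `%9b` that is not part of a longer sequence cannot be decoded. The sketch below uses only `urllib.parse`; `find_undecodable` is a hypothetical helper name, not part of any library:

```python
from urllib.parse import unquote_to_bytes


def find_undecodable(segment: str):
    """Return (raw_bytes, position, reason) if the segment is not valid UTF-8
    after percent-decoding, or None if it decodes cleanly."""
    # For str input, non-ASCII characters are UTF-8 encoded before unquoting.
    raw = unquote_to_bytes(segment)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as e:
        return raw, e.start, e.reason
    return None


# Path segments taken from the failing examples above.
for segment in ("%80", "ጄé\U00071a5d%9b"):
    print(find_undecodable(segment))
```

The positions reported (0 and 9) match the tracebacks above: in the second segment the first nine bytes are the valid UTF-8 encodings of the three literal characters, and the stray `%9b` lands at index 9.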

@wsanchez
Contributor Author

@glyph @mahmoud I'm curious if you think this may suggest a bug in Hyperlink… that we have allowed the creation of an EncodedURL which cannot be decoded…?

wsanchez self-assigned this Jan 29, 2021
@glyph
Collaborator

glyph commented Jan 29, 2021

@wsanchez Yes.

@glyph
Collaborator

glyph commented Jan 29, 2021

I think DecodedURL maybe has a bit of leeway with a URL like this to mangle it, or to make it not completely round-trippable through every API. Browsers have to cope with this kind of mess, and they definitely do some mangling. For example, if you paste https://example.com/%80é into Safari or Chrome, you get https://example.com/%80%C3%A9. Granted, that's a bit more like an EncodedURL, but in that case you can deliver the percent-encoded text directly to the application: if you manually delete the %80, you'll notice that you get https://example.com/é back again, visually.
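A rough approximation of that browser behaviour, as a sketch only: it uses `urllib.parse.quote` with `%` treated as safe, so existing escapes are left alone and everything else is encoded as UTF-8. The function name `browser_like_normalize` is hypothetical, and this is not a claim about the algorithm Safari or Chrome actually implement (for one thing, it does not check that each `%` is followed by two hex digits):

```python
from urllib.parse import quote, unquote


def browser_like_normalize(path_text: str) -> str:
    """Percent-encode non-ASCII text as UTF-8 while leaving existing
    "%XX" escapes and "/" separators untouched."""
    return quote(path_text, safe="/%")


mangled = browser_like_normalize("/%80é")
print(mangled)  # the undecodable %80 survives; é becomes %C3%A9

# Deleting the stray %80 leaves something that decodes cleanly again.
print(unquote(mangled.replace("%80", "")))
```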

@glyph
Collaborator

glyph commented Jan 29, 2021

If you were to manipulate a busted URL like this, or manually create a copy via moving strings with DecodedURL, you'd get %2580%25C3%25A9 - but I think that's fine. Maybe there should be a switch about whether to raise or mangle on encoding errors when you create the object?
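One way such a switch could look, sketched with the standard library only. `decode_segment` and its `on_error` parameter are hypothetical and do not correspond to any existing Hyperlink API; "mangle" here simply means giving up on decoding and handing back the percent-encoded text as-is:

```python
from urllib.parse import quote, unquote_to_bytes


def decode_segment(segment: str, on_error: str = "raise") -> str:
    """Percent-decode a segment as UTF-8.

    on_error="raise"  -> propagate UnicodeDecodeError (the current behaviour)
    on_error="mangle" -> return the still-encoded text unchanged
    """
    if on_error not in ("raise", "mangle"):
        raise ValueError(f"unknown on_error value: {on_error!r}")
    raw = unquote_to_bytes(segment)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        if on_error == "raise":
            raise
        return segment


# The mangled value is now ordinary text that happens to contain "%".
text = decode_segment("%80%C3%A9", on_error="mangle")

# Re-encoding that text escapes each "%" as "%25", which is where the
# doubled form quoted above comes from.
print(quote(text, safe=""))
```

The trade-off is the one described above: in "mangle" mode the object no longer round-trips exactly, but constructing it never fails.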
