Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't make request with non-ascii/utf8 query params #303

Closed
jerith opened this issue Nov 11, 2020 · 2 comments · Fixed by #304
Closed

Can't make request with non-ascii/utf8 query params #303

jerith opened this issue Nov 11, 2020 · 2 comments · Fixed by #304
Labels
Projects

Comments

@jerith
Copy link
Contributor

jerith commented Nov 11, 2020

Starting from 20.4.1, I can no longer make requests with weirdly-encoded query parameters.

Here's a small example program that makes such a request:

import treq
from twisted.internet.task import react

URL = 'http://requestbin.net/r/1l2gc1o1'

def main(reactor):
    d = treq.get(URL, params={b'foo': 'Hello'.encode('utf-32')})
    return d.addCallback(print)

react(main)

When run with treq 20.3.0:

(tmp-a84c01591caa48e) [16:54|jerith@meklon] ~/.virtualenvs/tmp-a84c01591caa48e% python foo.py
<treq.response._Response 200 'text/html; charset=utf-8' unknown size>

And with treq 20.4.1:

(tmp-a84c01591caa48e) [16:56|jerith@meklon] ~/.virtualenvs/tmp-a84c01591caa48e% python foo.py
Traceback (most recent call last):
  File "foo.py", line 10, in <module>
    react(main)
  File "/Users/jerith/.virtualenvs/tmp-a84c01591caa48e/lib/python3.8/site-packages/twisted/internet/task.py", line 909, in react
    finished = main(_reactor, *argv)
  File "foo.py", line 7, in main
    d = treq.get(URL, params={b'foo': 'Hello'.encode('utf-32')})
  File "/Users/jerith/.virtualenvs/tmp-a84c01591caa48e/lib/python3.8/site-packages/treq/api.py", line 24, in get
    return _client(**kwargs).get(url, headers=headers, **kwargs)
  File "/Users/jerith/.virtualenvs/tmp-a84c01591caa48e/lib/python3.8/site-packages/treq/client.py", line 119, in get
    return self.request('GET', url, **kwargs)
  File "/Users/jerith/.virtualenvs/tmp-a84c01591caa48e/lib/python3.8/site-packages/treq/client.py", line 167, in request
    query=parsed_url.query + tuple(_coerced_query_params(params))
  File "/Users/jerith/.virtualenvs/tmp-a84c01591caa48e/lib/python3.8/site-packages/treq/client.py", line 351, in _coerced_query_params
    value = value.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

#265 (comment) seems like relevant context here.

@twm
Copy link
Contributor

twm commented Nov 30, 2020

Sorry for the slow response here.

I didn't realize that arbitrary bytes were ever supported as part of params. I don't think that's really desirable, but I didn't mean to break it. I will have to add test coverage for this.

It looks like there is a similar issue if an oddly-encoded querystring is passed as part of the url parameter. You'll get a UnicodeDecodeError when passing a URL like:

treq.get('http://requestbin.net/r/1l2gc1o1?foo=%FF%FE%00%00H%00%00%00e%00%00%00l%00%00%00l%00%00%00o%00%00%00')

It looks like we can avoid this by passing lazy=True when decoding the URL. Then Hyperlink won't try to decode the URL-encoded segments. treq should definitely do this. That doesn't help with params, though.

Unfortunately, #265 (comment) isn't the solution either. Decoding UTF-32 bytes to unicode using charmap (latin-1) and then encoding as UTF-8 gives mojibake rather than the same UTF-32 bytes.

I see two options:

  • Rewrite treq's URL-generation routines to use the low-level hyperlink.URL rather than hyperlink.DecodedURL.
  • Make hyperlink.DecodedURL accept bytes in addition to text.

Could you provide a little color on what you're trying to do to help with this?

twm added a commit that referenced this issue Nov 30, 2020
Per discussion on #303, bytes should pass through even if they aren't
UTF-8 encoded.
@jerith
Copy link
Contributor Author

jerith commented Dec 1, 2020

I work with a wide variety of third-party messaging APIs, many of which are poorly designed and require all the message fields to be in url query parameters. This is fine when we're sending English-language text with no special characters (which fits into 7-bit ASCII), but can be a problem for languages like Swahili (often written in Arabic script) or Amharic when the API we're talking to expects UTF-16 or some weirdly-packed GSM encoding.

@twm twm added this to To do in Treq 21.1.0 Dec 2, 2020
@twm twm moved this from To do to In progress in Treq 21.1.0 Dec 23, 2020
@twm twm added the bug label Dec 23, 2020
@twm twm closed this as completed in #304 Jan 15, 2021
Treq 21.1.0 automation moved this from In progress to Done Jan 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

2 participants