
Choosing Content-Length vs chunked encoding for request bodies #135

Open
njsmith opened this issue Nov 6, 2019 · 2 comments


@njsmith
Member

njsmith commented Nov 6, 2019

@RatanShreshtha pointed to this PR: urllib3/urllib3#1715, asking whether it applies to us. I looked into it and realized it involves a larger issue that I don't understand, so I'm moving it from chat into a proper issue :-)

So: Body encoding is tricky. In HTTP/1.0, the only option was to use Content-Length. Of course this caused problems for streaming uploads. So HTTP/1.1 added the option to use Transfer-Encoding: chunked. For response bodies, this is fine – we can handle whichever one the server gives us. But for request bodies, we have a problem: we have to pick which one we're using before we've talked to the server, and we don't know yet whether it's an HTTP/1.0 server or an HTTP/1.1 server!
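
For concreteness, here's a rough sketch of the two framings on the wire, hand-rolled for a single-piece body (illustrative only, not what any particular library emits):

body = b"hello world"

# Content-Length framing: the total size has to be declared before the body.
with_length = (
    b"PUT /upload HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"Content-Length: %d\r\n"
    b"\r\n" % len(body)
) + body

# Chunked framing: each piece is prefixed with its size in hex, and a
# zero-length chunk terminates the stream, so no total size is needed.
with_chunks = (
    b"PUT /upload HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"%x\r\n" % len(body)
) + body + b"\r\n0\r\n\r\n"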

So in some cases you MUST use Content-Length, because the server doesn't support anything else, and in some cases you MUST use Transfer-Encoding: chunked, because you don't know the length of the data ahead of time and can't afford to generate it all up front in-memory (either because it's too big, or because your semantics actually demand streaming, e.g. consider a Travis worker streaming logs up to the central Travis server). There's no way to automatically detect which case we're in. So, it kind of has to be exposed to the user as an option they can set.
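
As a concrete (hypothetical) example of the second case, a body like this has no knowable length until the stream ends:

def follow_logs(log_source):
    # Streaming body: the total size isn't known until the worker finishes,
    # so there's nothing sensible to put in a Content-Length header.
    for line in log_source:
        yield line.encode("utf-8")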

In upstream urllib3, this is done through the chunked=True/False kwarg. In our branch currently, I guess the only way to do it is by explicitly setting the Content-Length or Transfer-Encoding headers? And if you don't set either, then we unconditionally use Transfer-Encoding: chunked, see _add_transport_headers in connectionpool.py.

I don't think our current thing is going to work... there are a lot of cases where using Content-Length is easy and free (e.g. when the body is specified as an explicit string, or a file whose length we can query). And since Content-Length is more broadly compatible, we need to be using it in these cases; anything else will be a regression compared to other HTTP libraries. But there are a bunch of other cases that are trickier, and I'm not sure what we want to do with them.

One thing we definitely know is what the user passed as the body. If there's no body, that's always Content-Length: 0 (or possibly no header at all, for some methods like GET where Content-Length: 0 is the default). If the body is a fixed-length thing we know up front, then that's always Content-Length: <the length>. Or it could be something more complicated, like an arbitrary iterable or a file object that doesn't support seek/tell/fstat.
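
A sketch of that classification, with a hypothetical helper that returns a length when it's cheap to compute and None when only chunked encoding makes sense:

import os

def probe_content_length(body):
    """Return the body's length if it's cheap to determine, else None."""
    if body is None:
        return 0
    if isinstance(body, bytes):
        return len(body)
    if isinstance(body, str):
        return len(body.encode("utf-8"))
    # Real files: fstat() gives the size without reading anything.
    if hasattr(body, "fileno"):
        try:
            return os.fstat(body.fileno()).st_size - body.tell()
        except (OSError, ValueError):
            pass
    # Other seekable file-like objects: measure with seek/tell.
    if hasattr(body, "seek") and hasattr(body, "tell"):
        try:
            current = body.tell()
            end = body.seek(0, os.SEEK_END)
            body.seek(current)
            return end - current
        except (OSError, ValueError):
            pass
    # Arbitrary iterables/generators: no length without buffering everything.
    return None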

So one option would be to say that we default to Content-Length in the former cases, and default to Transfer-Encoding: chunked in the latter case, and if you want to override that then you can set the header explicitly.

Another option would be to do like urllib3 does now, and have an explicit chunked= kwarg. Then we have to decide:

  • If we get chunked=False + an iterable body, what do we do? I guess the options are:
    • Immediately read the whole iterable into memory, so we can figure out its length
    • Give an error, saying that the user should either use chunked=True or set a Content-Length
  • If the user doesn't specify chunked=, what's the default? I guess the options are:
    • chunked=False
    • chunked="guess"

This is a 2x2 space of possibilities.
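
To make the trade-off concrete, here's a sketch (hypothetical, loosely modeled on urllib3's chunked= kwarg) that picks one cell on each axis – default to guessing from the body type, and raise on chunked=False with a streaming body – which is exactly what's up for debate:

def resolve_framing(body, chunked=None):
    # chunked=None stands in for "guess"; True/False are explicit choices.
    body_has_known_length = body is None or isinstance(body, (bytes, str))
    if chunked is None:
        chunked = not body_has_known_length
    if chunked:
        return "transfer-encoding: chunked"
    if body_has_known_length:
        return "content-length"
    # chunked=False with an iterable body: either buffer the whole thing in
    # memory to learn its length, or refuse and make the caller decide.
    raise ValueError(
        "can't compute a Content-Length for a streaming body; "
        "pass chunked=True or set the Content-Length header explicitly"
    )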

@pquentin
Member

pquentin commented Nov 7, 2019

Thanks for the analysis! I was not aware of this.

requests currently documents a simple rule for using chunked transfer encoding: is the body an iterator? https://requests.kennethreitz.org/en/master/user/advanced/#chunk-encoded-requests

Of course the reality is much more complicated. The prepare_body function decides on the headers, based on a number of factors (you've actually mentioned some of them), see https://github.com/psf/requests/blob/428f7a275914f60a8f1e76a7d69516d617433d30/requests/models.py#L453-L520

This makes sense for a higher-level API but we're not at this level yet, so for now I think we can look at the headers and if they're not there just perform a simple auto-detection to see if we have an iterator.

See urllib3/urllib3@3e586ef: we already look at existing headers and already check whether the body exists, so we only need to check whether the body is an iterator to avoid chunking when it's not. requests uses this code:

# Mapping is collections.abc.Mapping; basestring is the Python 2 name,
# which requests' compat layer covers on Python 3.
is_stream = (
    hasattr(data, '__iter__')
    and not isinstance(data, (basestring, list, tuple, Mapping)))

And I guess that's enough for now? I don't think we want to provide the chunked parameter, because advanced users can already specify this differently, and well, urllib3 has too many parameters!
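
Put together, the auto-detection being proposed might look roughly like this (hypothetical helper, mirroring the requests check above):

from collections.abc import Mapping

def should_chunk(headers, body):
    # An explicit Content-Length or Transfer-Encoding header wins.
    names = {name.lower() for name in headers}
    if "content-length" in names or "transfer-encoding" in names:
        return False
    # No body, or a body whose length is free to compute: use Content-Length.
    if body is None or isinstance(body, (str, bytes, list, tuple, Mapping)):
        return False
    # Anything else that's iterable (generators, file objects, ...) gets
    # streamed with Transfer-Encoding: chunked.
    return hasattr(body, "__iter__")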

We need to make sure that this plays well with retries: if the first attempt uses chunked encoding, then we should continue to do so for subsequent attempts. This is the bug in urllib3 that prompted urllib3#1715 and urllib3#1734.

@sethmlarson
Contributor

I'd say the way Requests handles this is the proper technique and has worked well for them. I'll note here that I'd like to change how multipart forms are encoded so a typical use-case for them can rely on Content-Length instead of chunking.
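
For illustration, a minimal hand-rolled encoder (not the actual multipart code) that builds the whole form body in memory, so Content-Length is just its length:

import uuid

def encode_simple_form(fields):
    # Encode every field into one bytes blob up front.
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
    parts.append(f"--{boundary}--\r\n")
    body = "".join(parts).encode("utf-8")
    content_type = f"multipart/form-data; boundary={boundary}"
    return body, content_type

body, content_type = encode_simple_form({"greeting": "hello"})
headers = {"Content-Type": content_type, "Content-Length": str(len(body))}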
