Add streaming multipart/form-data encoder #2331

Open · sigmavirus24 wants to merge 7 commits into main from add-multipart-formdata-encoder
Conversation

sigmavirus24 (Contributor) commented Jul 17, 2021:

I initially implemented this for Requests in the requests-toolbelt but
this was already pretty generic because Requests just passes file-like
objects (which is how the streaming encoder behaves) directly to
urllib3. All that needed to change was what we were relying on from the
requests namespace and imports and such.

This also adds the decoder in the same breath, because it's easier to
ensure it all works together properly in one go, and it all fits
together nicely.

One thing we _could_ do is consolidate a bunch of the logic too and make
`encode_multipart_formdata` rely on the streaming encoder and call
`getall` instead so that we don't have 2 implementations of the same
logic.

Closes #51
Closes #624
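
For illustration, a minimal usage sketch of what this enables. The shape is assumed from the requests-toolbelt original; the module path, class name, and headers attribute are hypothetical until review settles them:

    from urllib3 import PoolManager
    from urllib3.multipart import MultipartEncoder  # proposed module

    encoder = MultipartEncoder(
        fields={
            "name": "value",
            # file-like values are streamed rather than read into memory
            "upload": ("report.csv", open("report.csv", "rb"), "text/csv"),
        }
    )

    # The encoder is itself file-like, so urllib3 can stream the body:
    http = PoolManager()
    resp = http.request(
        "POST",
        "https://example.com/upload",
        body=encoder,
        headers=encoder.headers,  # assumed to include the boundary Content-Type
    )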


sethmlarson (Member):
Very exciting, thanks for opening this! Let me know when you're ready for reviews.

sigmavirus24 force-pushed the add-multipart-formdata-encoder branch 3 times, most recently from 13635bb to 0e2a726 on December 30, 2021 15:14
sigmavirus24 changed the title from "WIP: Add multipart/form-data streaming encoder" to "Add streaming multipart/form-data encoder" on Dec 30, 2021
sigmavirus24 marked this pull request as ready for review on December 30, 2021 15:16
@@ -12,7 +13,7 @@
     cast,
 )

-_TYPE_FIELD_VALUE = Union[str, bytes]
+_TYPE_FIELD_VALUE = Union[str, bytes, BinaryIO]
sigmavirus24 (Contributor, Author):
The docstring for all of this shows using a file here, so this was missing. It had the knock-on effect of rewriting the header params, which were overloading the definition of a "FIELD_VALUE".

sethmlarson (Member):
Hmm, I'm not sure if files were supported. IIRC it was open(x).read()? Need to check on PC.

sigmavirus24 (Contributor, Author):
Hm, so I guess requests always papered over this: https://github.com/psf/requests/blob/d09659997cd1e3eca49a07c59ece5557071c0ab9/requests/models.py#L160

I don't know how we consolidate these APIs though in such a way that folks don't need to read the entire file into memory.
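
The difference at stake, sketched with a hypothetical file:

    # What the requests shim linked above effectively does: slurp the file
    data = open("big.bin", "rb").read()   # the whole file lands in memory

    # What accepting BinaryIO field values allows: hand the object through
    value = open("big.bin", "rb")         # consumed lazily by the encoder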

sigmavirus24 force-pushed the add-multipart-formdata-encoder branch 2 times, most recently from eaedfa1 to 8c9ec05 on December 30, 2021 20:41
sigmavirus24 force-pushed the add-multipart-formdata-encoder branch 2 times, most recently from 9b7e6e2 to 47cf7d9 on December 31, 2021 14:00
sethmlarson (Member) left a comment:

Thanks for this! Here are some initial comments from mobile:



@typing.overload
def _split_on_find(content: str, bound: str) -> typing.Tuple[str, str]:
sethmlarson (Member):
Can we use .partition() instead of this function?
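
A sketch of what that replacement could look like, reusing the signature from the snippet above:

    import typing

    def _split_on_find(content: bytes, bound: bytes) -> typing.Tuple[bytes, bytes]:
        # partition() scans once, splitting on the first occurrence of
        # bound; the middle element (bound itself) is discarded
        first, _, second = content.partition(bound)
        return first, second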

Two resolved review threads on src/urllib3/multipart/decoder.py
self.encoding = encoding
headers: typing.Dict[str, str] = {}
# Split into header section (if any) and the content
if b"\r\n\r\n" in content:
sethmlarson (Member):
Can we use partition and check the middle element of the returned 3-tuple to avoid doing two scans for this bytestring?
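
Roughly what that could look like (a sketch, not the PR's actual code):

    # Single scan over the bytestring; sep is b"" when no header block exists
    header_block, sep, body = content.partition(b"\r\n\r\n")
    if sep:
        ...  # parse header_block into the headers dict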

return part

def _parse_body(self, content: bytes) -> None:
boundary = b"".join((b"--", self.boundary))
sethmlarson (Member):
Use a bytes literal + boundary here (see the sketch after the next snippet).

and part != b"--"
)

parts = content.split(b"".join((b"\r\n", boundary)))
sethmlarson (Member):
Bytes literal + boundary here too.
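
Both suggestions amount to replacing the join calls with plain concatenation (a sketch; self.boundary is assumed to be bytes here):

    boundary = b"--" + self.boundary            # instead of b"".join((b"--", self.boundary))
    parts = content.split(b"\r\n" + boundary)   # instead of b"".join((b"\r\n", boundary))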

@@ -0,0 +1,19 @@
"""Multipart support for urllib3."""

from .decoder import (
sethmlarson (Member):
Should we hide the encoder/decoder submodules and re-export the names in this module? (i.e. from .encoder import x as x)
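
A sketch of the suggested re-export pattern, assuming the submodules were renamed to be private (the underscore module names are hypothetical):

    # urllib3/multipart/__init__.py
    from ._decoder import MultipartDecoder as MultipartDecoder
    from ._encoder import MultipartEncoder as MultipartEncoder

    __all__ = ["MultipartDecoder", "MultipartEncoder"]

The import x as x spelling marks each name as an intentional re-export for type checkers.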

from .. import _collections
from .encoder import encode_with

if typing.TYPE_CHECKING:
Member:
To make the typing consistent with other files, I would suggest not using typing.x and instead importing the names directly, i.e. from typing import x, y, ...

sigmavirus24 (Contributor, Author):
I find that importing things from the typing module pollutes namespaces. I can consolidate to that here, but I really hate the practice of importing from typing rather than importing typing or aliasing import typing to something shorter.

hramezani (Member) commented Dec 31, 2021:
I can see both ways (import typing or from typing import x) in different projects.
I am not against either of them. I would suggest choosing one and applying it to all files.

BTW, I am ok with import typing as well. I can update the rest of the files of the project in a PR.

@sethmlarson what do you think?

Member:
Let's switch to the namespaced usage with typing.X.
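
That is, the namespaced style, applied to the annotation from the diff above:

    import typing

    _TYPE_FIELD_VALUE = typing.Union[str, bytes, typing.BinaryIO]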

sigmavirus24 (Contributor, Author):
We can use libcst to do it automagically for us if we want.

sethmlarson (Member) left a comment:
More comments for you, this time from the comfort of a PC.

A general comment: if it's possible, I'd like to move away from the .len property. You likely understand the historical significance more than I do, but if there's a way we can accomplish this using only __len__ and .tell()/.seek(), that'd be best imo.

Four outdated/resolved review threads on src/urllib3/multipart/encoder.py
self.boundary_value: str = boundary or uuid.uuid4().hex

# Computed boundary
self.boundary: str = f"--{self.boundary_value}"
sethmlarson (Member):
Should we have boundary be what was passed in by the user or generated instead of boundary_value? I'm not sure why users would need --{boundary} as a public attribute.

sigmavirus24 (Contributor, Author):
Generally speaking, this isn't there for the users, but we can make these private/read-only as necessary. It's just more writing to name everything we use frequently as _{name}, because we don't trust users to be intelligent and not muck with things.

sethmlarson (Member):
Yeah, users will be users. 🤷 At least if these attributes are private we can say "told you so" if it ends up being a problem later.

)

#: Fields provided by the user
self.fields = fields
sethmlarson (Member):
Should we change the fields, encoding, and finished properties to be read-only properties?
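
For context, the read-only pattern proposed here would look roughly like this sketch (names taken from the snippet above):

    class MultipartEncoder:
        def __init__(self, fields, encoding="utf-8"):
            self._fields = fields
            self._encoding = encoding

        @property
        def fields(self):
            # Read-only: no setter is defined
            return self._fields

        @property
        def encoding(self):
            return self._encoding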

sigmavirus24 (Contributor, Author):
It's so much extra boilerplate, and to do what? What motivates wanting significantly more code to make things read-only here? Is the idea that, if these are read-only, users won't go find the private attributes storing the data and then report bugs when updating those private attributes didn't do what they expected, versus what happens when they change undocumented attributes that are "public" and writable? How adversarial a relationship do you want to have with our users?

"Content-Length": self.content_length,
}

def getall(self) -> bytes:
sethmlarson (Member):
If we have .read(), do we need this function? We can document that .read() can be used to read all the data.

sigmavirus24 (Contributor, Author):
I received complaints on toolbelt about not having something that didn't imply a need for a read size 🤷 I don't care either way

sethmlarson (Member):
Let's stick with .read() for now, thanks!
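
Assuming the encoder keeps its file-like interface, the documented usage would then be roughly:

    body = encoder.read()         # no size argument: read everything remaining
    chunk = encoder.read(8192)    # or stream the body in bounded chunks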

Three more review threads on src/urllib3/multipart/encoder.py (resolved)
sigmavirus24 (Contributor, Author):
> A general comment: if it's possible, I'd like to move away from the .len property. You likely understand the historical significance more than I do, but if there's a way we can accomplish this using only __len__ and .tell()/.seek(), that'd be best imo.

So today, urllib3 is more-or-less architecture agnostic. We know from the harassment of the cryptography project for adopting Rust that there are people still running older architectures and compiling things from scratch to make that work. On the face of it, I'd guess 90+% of our users are not on 32-bit architectures, but we had a fair number of requests + urllib3 users on 32-bit architectures who couldn't use the MultipartEncoder because __len__ had restrictions (even on Py3) due to the underlying system. I think we can definitely migrate to __len__, but I strongly suspect we should head this off by explicitly documenting the difference. The other thing is that MultipartEncoder probably doesn't need anything other than being able to calculate its own length for the headers, unless we want users to be able to pass this into Requests as well, in which case we need to provide something for them.

Also, I haven't done a deep reconsideration of the design here just the minimal effort to migrate it here. This was designed to work best within Requests and its idiosyncrasies.
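
A sketch of the .tell()/.seek() approach mentioned above. It sidesteps __len__ entirely, which matters because len() routes the result through a C ssize_t, so bodies over 2 GiB overflow on 32-bit builds:

    import os

    def _stream_length(fp) -> int:
        # Measure the remaining length without reading; restore position after
        current = fp.tell()
        end = fp.seek(0, os.SEEK_END)
        fp.seek(current)
        return end - current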

sigmavirus24 force-pushed the add-multipart-formdata-encoder branch 2 times, most recently from e352da1 to 112bcb9 on January 2, 2022 01:43
Type annotations made it clear that a dangling comma was added by
mistake, converting a value into a tuple.
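
Illustrating the kind of slip that commit message describes (hypothetical names):

    result = compute_value(),   # trailing comma: result is now a 1-tuple
    result = compute_value()    # intended: the value itself

An annotation such as result: bytes turns the first form into an immediate type-checker error rather than a latent bug.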
@pquentin pquentin dismissed a stale review via b2f8061 June 4, 2022 19:40
@pquentin pquentin removed the request for review from shazow June 4, 2022 19:41
pquentin (Member):
Regarding CI, there are two issues left:

  • 35 lines are not covered, which is against our policy of 100% coverage. We can always exclude those lines from coverage, but it would be nice to add tests.
  • The docs are complaining due to our new nitpicky settings. @hramezani, given your experience here, would you mind taking a look at the errors and telling us what you think? For reference, I copied them below.
/home/docs/checkouts/readthedocs.org/user_builds/urllib3/envs/2331/lib/python3.7/site-packages/urllib3/multipart/decoder.py:docstring of urllib3.multipart.decoder.MultipartDecoder.from_response:: WARNING: py:class reference target not found: _response.HTTPResponse
/home/docs/checkouts/readthedocs.org/user_builds/urllib3/envs/2331/lib/python3.7/site-packages/urllib3/multipart/decoder.py:docstring of urllib3.multipart.decoder.MultipartDecoder.from_response:: WARNING: py:class reference target not found: urllib3.multipart.decoder.MD
/home/docs/checkouts/readthedocs.org/user_builds/urllib3/envs/2331/lib/python3.7/site-packages/urllib3/multipart/decoder.py:docstring of urllib3.multipart.MultipartDecoder.parts:: WARNING: py:class reference target not found: urllib3.multipart.decoder.BodyPart

pquentin (Member):
The docs are fine now; coverage is left. (And also make sure all concerns are addressed.)

spacether commented Feb 28, 2023:
What's the status of this? Can it be merged? OpenAPI Generator users want this, and we use this library.

nioncode commented Oct 4, 2023:
@sigmavirus24 are you still working on this? I'm thinking of taking this over to get this over the line.

pquentin (Member) commented Oct 7, 2023:
@nioncode Please do. The main task is figuring out what is left to do.

ecerulm (Contributor) commented Jan 30, 2024:
@sigmavirus24 this has been open for almost 3 years now; shall we close it, or do you have any plan to revive this PR?

sigmavirus24 (Contributor, Author) commented Jan 30, 2024:
The project wants to land it. I don't have the time for it and others have done work intermittently to get it over the line.

As it stands, urllib3 can consume gigabytes of memory without properly warning users.

Successfully merging this pull request may close these issues:

  • Twice memory usage in encode_multipart_formdata
  • Support for posting big files/streaming objects