Add streaming multipart/form-data encoder #2331

Open · sigmavirus24 wants to merge 7 commits into main from add-multipart-formdata-encoder
Conversation

sigmavirus24 (Contributor) commented Jul 17, 2021:

I initially implemented this for Requests in the requests-toolbelt but
this was already pretty generic because Requests just passes file-like
objects (which is how the streaming encoder behaves) directly to
urllib3. All that needed to change was what we were relying on from the
requests namespace and imports and such.

This also adds the decoder in the same breath, because it's easier to
ensure it all works together properly in one go, and it all fits
together nicely.

One thing we _could_ do is consolidate a bunch of the logic too and make
`encode_multipart_formdata` rely on the streaming encoder and call
`getall` instead so that we don't have 2 implementations of the same
logic.

Closes #51
Closes #624
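
For illustration, a minimal usage sketch of what this enables. The shape is assumed from the requests-toolbelt original; the module path, class name, and headers attribute are hypothetical until review settles them:

    from urllib3 import PoolManager
    from urllib3.multipart import MultipartEncoder  # proposed module

    encoder = MultipartEncoder(
        fields={
            "name": "value",
            # file-like values are streamed rather than read into memory
            "upload": ("report.csv", open("report.csv", "rb"), "text/csv"),
        }
    )

    # The encoder is itself file-like, so urllib3 can stream the body:
    http = PoolManager()
    resp = http.request(
        "POST",
        "https://example.com/upload",
        body=encoder,
        headers=encoder.headers,  # assumed to include the boundary Content-Type
    )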


sethmlarson (Member):
Very exciting, thanks for opening this! Let me know when you're ready for reviews.

sigmavirus24 force-pushed the add-multipart-formdata-encoder branch 3 times, most recently from 13635bb to 0e2a726 on December 30, 2021 15:14
sigmavirus24 changed the title from "WIP: Add multipart/form-data streaming encoder" to "Add streaming multipart/form-data encoder" on Dec 30, 2021
sigmavirus24 marked this pull request as ready for review on December 30, 2021 15:16
@@ -12,7 +13,7 @@
     cast,
 )

-_TYPE_FIELD_VALUE = Union[str, bytes]
+_TYPE_FIELD_VALUE = Union[str, bytes, BinaryIO]
sigmavirus24 (Contributor, Author):
The docstring for all of this shows using a file here, so this was missing. It had the knock-on effect of rewriting the header params, which were overloading the definition of a "FIELD_VALUE".

sethmlarson (Member):
Hmm, I'm not sure if files were supported. IIRC it was open(x).read()? Need to check on PC.

sigmavirus24 (Contributor, Author):
Hm, so I guess requests always papered over this: https://github.com/psf/requests/blob/d09659997cd1e3eca49a07c59ece5557071c0ab9/requests/models.py#L160

I don't know how we consolidate these APIs though in such a way that folks don't need to read the entire file into memory.
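
The difference at stake, sketched with a hypothetical file:

    # What the requests shim linked above effectively does: slurp the file
    data = open("big.bin", "rb").read()   # the whole file lands in memory

    # What accepting BinaryIO field values allows: hand the object through
    value = open("big.bin", "rb")         # consumed lazily by the encoder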

sigmavirus24 force-pushed the add-multipart-formdata-encoder branch 2 times, most recently from eaedfa1 to 8c9ec05 on December 30, 2021 20:41
sigmavirus24 force-pushed the add-multipart-formdata-encoder branch 2 times, most recently from 9b7e6e2 to 47cf7d9 on December 31, 2021 14:00
sethmlarson (Member) left a comment:

Thanks for this! Here are some initial comments from mobile:



@typing.overload
def _split_on_find(content: str, bound: str) -> typing.Tuple[str, str]:
sethmlarson (Member):
Can we use .partition() instead of this function?
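
A sketch of what that replacement could look like, reusing the signature from the snippet above:

    import typing

    def _split_on_find(content: bytes, bound: bytes) -> typing.Tuple[bytes, bytes]:
        # partition() scans once, splitting on the first occurrence of
        # bound; the middle element (bound itself) is discarded
        first, _, second = content.partition(bound)
        return first, second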

Two resolved review threads on src/urllib3/multipart/decoder.py
self.encoding = encoding
headers: typing.Dict[str, str] = {}
# Split into header section (if any) and the content
if b"\r\n\r\n" in content:
sethmlarson (Member):
Can we use partition and check the middle element of the returned 3-tuple to avoid doing two scans for this bytestring?
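
Roughly what that could look like (a sketch, not the PR's actual code):

    # Single scan over the bytestring; sep is b"" when no header block exists
    header_block, sep, body = content.partition(b"\r\n\r\n")
    if sep:
        ...  # parse header_block into the headers dict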

return part

def _parse_body(self, content: bytes) -> None:
boundary = b"".join((b"--", self.boundary))
sethmlarson (Member):
Use a bytes literal + boundary here (see the sketch after the next snippet).

and part != b"--"
)

parts = content.split(b"".join((b"\r\n", boundary)))
sethmlarson (Member):
Bytes literal + boundary here too.
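
Both suggestions amount to replacing the join calls with plain concatenation (a sketch; self.boundary is assumed to be bytes here):

    boundary = b"--" + self.boundary            # instead of b"".join((b"--", self.boundary))
    parts = content.split(b"\r\n" + boundary)   # instead of b"".join((b"\r\n", boundary))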

@@ -0,0 +1,19 @@
"""Multipart support for urllib3."""

from .decoder import (
sethmlarson (Member):
Should we hide the encoder/decoder submodules and re-export the names in this module? (i.e. from .encoder import x as x)
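
A sketch of the suggested re-export pattern, assuming the submodules were renamed to be private (the underscore module names are hypothetical):

    # urllib3/multipart/__init__.py
    from ._decoder import MultipartDecoder as MultipartDecoder
    from ._encoder import MultipartEncoder as MultipartEncoder

    __all__ = ["MultipartDecoder", "MultipartEncoder"]

The import x as x spelling marks each name as an intentional re-export for type checkers.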

from .. import _collections
from .encoder import encode_with

if typing.TYPE_CHECKING:
Member:
To make the typing consistent with other files, I would suggest not using typing.x and instead importing the names directly, i.e. from typing import x, y, ...

sigmavirus24 (Contributor, Author):
I find that importing things from the typing module pollutes namespaces. I can consolidate to that here, but I really hate the practice of importing from typing rather than importing typing or aliasing import typing to something shorter.

hramezani (Member) commented Dec 31, 2021:
I can see both ways (import typing or from typing import x) in different projects.
I am not against either of them. I would suggest choosing one and applying it to all files.

BTW, I am ok with import typing as well. I can update the rest of the files of the project in a PR.

@sethmlarson what do you think?

Member:
Let's switch to the namespaced usage with typing.X.
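
That is, the namespaced style, applied to the annotation from the diff above:

    import typing

    _TYPE_FIELD_VALUE = typing.Union[str, bytes, typing.BinaryIO]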

sigmavirus24 (Contributor, Author):
We can use libcst to do it automagically for us if we want.

sethmlarson (Member) left a comment:
More comments for you, this time from the comfort of a PC.

A general comment: if it's possible, I'd like to move away from the .len property. You likely understand the historical significance more than I do, but if there's a way we can accomplish this using only __len__ and .tell()/.seek(), that'd be best imo.

Four outdated/resolved review threads on src/urllib3/multipart/encoder.py
self.boundary_value: str = boundary or uuid.uuid4().hex

# Computed boundary
self.boundary: str = f"--{self.boundary_value}"
sethmlarson (Member):
Should we have boundary be what was passed in by the user or generated instead of boundary_value? I'm not sure why users would need --{boundary} as a public attribute.

sigmavirus24 (Contributor, Author):
Generally speaking, this isn't there for the users, but we can make these private/read-only as necessary. It's just more writing to name everything we use frequently as _{name}, because we don't trust users to be intelligent and not muck with things.

sethmlarson (Member):
Yeah, users will be users. 🤷 At least if these attributes are private we can say "told you so" if it ends up being a problem later.

)

#: Fields provided by the user
self.fields = fields
sethmlarson (Member):
Should we change the fields, encoding, and finished properties to be read-only properties?
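
For context, the read-only pattern proposed here would look roughly like this sketch (names taken from the snippet above):

    class MultipartEncoder:
        def __init__(self, fields, encoding="utf-8"):
            self._fields = fields
            self._encoding = encoding

        @property
        def fields(self):
            # Read-only: no setter is defined
            return self._fields

        @property
        def encoding(self):
            return self._encoding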

sigmavirus24 (Contributor, Author):
It's so much extra boilerplate, and to do what? What motivates wanting significantly more code to make things read-only here? Is the idea that, if these are read-only, users won't go find the private attributes storing the data and then report bugs when updating those private attributes didn't do what they expected, versus what happens when they change undocumented attributes that are "public" and writable? How adversarial a relationship do you want to have with our users?

"Content-Length": self.content_length,
}

def getall(self) -> bytes:
sethmlarson (Member):
If we have .read(), do we need this function? We can document that .read() can be used to read all the data.

sigmavirus24 (Contributor, Author):
I received complaints on toolbelt about not having something that didn't imply a need for a read size 🤷 I don't care either way

sethmlarson (Member):
Let's stick with .read() for now, thanks!
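
Assuming the encoder keeps its file-like interface, the documented usage would then be roughly:

    body = encoder.read()         # no size argument: read everything remaining
    chunk = encoder.read(8192)    # or stream the body in bounded chunks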

Three more review threads on src/urllib3/multipart/encoder.py (resolved)
sigmavirus24 (Contributor, Author):
> A general comment: if it's possible, I'd like to move away from the .len property. You likely understand the historical significance more than I do, but if there's a way we can accomplish this using only __len__ and .tell()/.seek(), that'd be best imo.

So today, urllib3 is more-or-less architecture agnostic. We know from the harassment of the cryptography project for adopting Rust that there are people still running older architectures and compiling things from scratch to make that work. On the face of it, I'd guess 90+% of our users are not on 32-bit architectures, but we had a fair number of requests + urllib3 users on 32-bit architectures who couldn't use the MultipartEncoder because __len__ had restrictions (even on Py3) due to the underlying system. I think we can definitely migrate to __len__, but I strongly suspect we should head this off by explicitly documenting the difference. The other thing is that MultipartEncoder probably doesn't need anything other than being able to calculate its own length for the headers, unless we want users to be able to pass this into Requests as well, in which case we need to provide something for them.

Also, I haven't done a deep reconsideration of the design here just the minimal effort to migrate it here. This was designed to work best within Requests and its idiosyncrasies.
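
A sketch of the .tell()/.seek() approach mentioned above. It sidesteps __len__ entirely, which matters because len() routes the result through a C ssize_t, so bodies over 2 GiB overflow on 32-bit builds:

    import os

    def _stream_length(fp) -> int:
        # Measure the remaining length without reading; restore position after
        current = fp.tell()
        end = fp.seek(0, os.SEEK_END)
        fp.seek(current)
        return end - current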

sigmavirus24 force-pushed the add-multipart-formdata-encoder branch 2 times, most recently from e352da1 to 112bcb9 on January 2, 2022 01:43
Type annotations made it clear that a dangling comma was added by
mistake, converting a value into a tuple.
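
Illustrating the kind of slip that commit message describes (hypothetical names):

    result = compute_value(),   # trailing comma: result is now a 1-tuple
    result = compute_value()    # intended: the value itself

An annotation such as result: bytes turns the first form into an immediate type-checker error rather than a latent bug.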
@pquentin pquentin dismissed a stale review via b2f8061 June 4, 2022 19:40
@pquentin pquentin removed the request for review from shazow June 4, 2022 19:41
pquentin (Member):
Regarding CI, there are two issues left:

  • 35 lines are not covered, which is against our policy of 100% coverage. We can always exclude those lines from coverage, but it would be nice to add tests.
  • The docs are complaining due to our new nitpicky settings. @hramezani, given your experience here, would you mind taking a look at the errors and telling us what you think? For reference, I copied them below.
/home/docs/checkouts/readthedocs.org/user_builds/urllib3/envs/2331/lib/python3.7/site-packages/urllib3/multipart/decoder.py:docstring of urllib3.multipart.decoder.MultipartDecoder.from_response:: WARNING: py:class reference target not found: _response.HTTPResponse
/home/docs/checkouts/readthedocs.org/user_builds/urllib3/envs/2331/lib/python3.7/site-packages/urllib3/multipart/decoder.py:docstring of urllib3.multipart.decoder.MultipartDecoder.from_response:: WARNING: py:class reference target not found: urllib3.multipart.decoder.MD
/home/docs/checkouts/readthedocs.org/user_builds/urllib3/envs/2331/lib/python3.7/site-packages/urllib3/multipart/decoder.py:docstring of urllib3.multipart.MultipartDecoder.parts:: WARNING: py:class reference target not found: urllib3.multipart.decoder.BodyPart

pquentin (Member):
The docs are fine now; coverage is left. (And also make sure all concerns are addressed.)

spacether commented Feb 28, 2023:
What's the status of this? Can it be merged? OpenAPI Generator users want this, and we use this library.

nioncode commented Oct 4, 2023:
@sigmavirus24 are you still working on this? I'm thinking of taking this over to get this over the line.

pquentin (Member) commented Oct 7, 2023:
@nioncode Please do. The main task is figuring out what is left to do.

ecerulm (Contributor) commented Jan 30, 2024:
@sigmavirus24 this has been open for almost 3 years now; shall we close it, or do you have any plan to revive this PR?

sigmavirus24 (Contributor, Author) commented Jan 30, 2024:
The project wants to land it. I don't have the time for it and others have done work intermittently to get it over the line.

As it stands, urllib3 can consume gigabytes of memory without properly warning users.

Successfully merging this pull request may close these issues:

  • Twice memory usage in encode_multipart_formdata
  • Support for posting big files/streaming objects