-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider HTML 5 draft for multipart/form-data #303
Comments
Oh boy. First, my sincerest thanks for your thorough research and presenting of the situation. :) As I work through my backlog of emails, please excuse and remind me if I miss an important detail. The questions I noticed:
Did I miss anything? I would be more than happy to continue having you lead this if you're up for it. :) |
This passes the request object along for field rendering. Two attributes of the request object control that rendering: * `field_encoding_style` which may be `HTML5` or `RFC2231` controls the encoding of non-ascii field names and file names. * `form_data_encoding` can be set to encode the form with something other than UTF-8. See the current (as of this commit) version of the HTML 5 standard for details on their idea of [how to format multipart/form-data][1]. This addresses urllib3#303. [1]: http://www.w3.org/TR/2013/WD-html51-20130528/forms.html#multipart-form-data
Have a look at my current work in progress, and let me know whether it's going in the right direction. I've managed to pass the request object along, so that things can be configured there. The new attributes which will control behavior still need to be documented somewhere, please let me know where you'd like to see them. Should I file a pull request to discuss the code, or wait till I consider it ready for merging (i.e. docs added)? With respect to that encoding issue: I've also introduced a request parameter to control encoding, set to UTF-8 by default. The main reason why someone might want to change that is because some HTML 5 web form without accept-charset attribute was part of a HTML file with charset other than UTF-8. In that case, HTML 5 requires using that charset, so urllib3 should provide some means to do so. Manually encoding field bodies would be part of the solution, but would not work for headers like file names which have to be Unicode for our mechanisms to work. Now everything is encoded using a configurable encoding, with xml character references as substitution, just like HTML 5 specifies. |
@gagern Yes, please open a PR so we can continue the discussion there and I can leave comments on the diff. :) |
The above summary and links were very helpful in understading the situation, thanks! Unfortunately, there still isn't a single, clear, widely adopted way to encode non-ascii file names, and web servers don't agree on what format they accept. It's a fairly simple monkeypatch to workaround if you're interacting with a single web server, but there is still a need to be able to change encoding formats per-request. @shazow @gagern Is any one working on this (or even wants to)? Also, here is an unbroken link to the relevant section in the current HTML5 draft: multipart-form-data |
I'm not aware of anyone working on it, so you're welcome to claim it. :) |
I just submitted a pull request, #856, which is a proposed solution (with some code from #304) for just one aspect of this issue: encoding non-ASCII file names the HTML5 way by default. I organized this so that the concept of a file name encoding style was kept within the These are the things which aren't included:
|
Will this ever be fixed? Over 4 years have passed and HTML 5 isn't a draft anymore but an official standard. All major browsers now do it this way and most servers cannot handle the But urllib3 is still using the RFC2231 encoding which not only doesn't work in the real world but is also clearly a violation of current standards. |
@sigmavirus24, would like your thoughts on this. |
Some thoughts on this:
|
RFC2231 is the completely wrong standard and does not apply here. The HTML5 spec has a description of how to properly encode multipart/form-data: https://www.w3.org/TR/html52/sec-forms.html#multipart-form-data RFC7578 (which is referenced by the HTML5 spec) also very clearly states that the current way urllib3 does it (using the
Many - if not most - server implementations in the real world will NOT work with the current way urllib3 does it (e.g. any PHP applications).
As this is only a problem if one tries to send non-ASCII characters in a non-file field of a multipart encoded form many of your users wont notice this problem in the first place and the rest has resigned to using a workaround or a different library. But still you will find a huge number of people discussing this problem of urllib3 with just a quick search. So this is not a valid argument. As I understand the standards this should be a rather easy fix as well. Just using |
This is not actually our experience, otherwise we wouldn't have gone through the trouble to implement RFC2231. @zigarrre you seem passionate about this and seem to have time on your hands. We'd be open to using a library that you would produce that would do the correct thing per the HTML5 spec. |
Workaround psf/requests#4652 (urllib3/urllib3#303), where uploaded files in multipart/form-data are improperly given RFC 2231 encoded filenames. That format is not accepted by Mailgun's API (and is prohibited by RFC 7578), resulting in the attachments being silently dropped. Fix is to patch up the multipart/form-data before posting to remove the RFC 2231 encoding. Fixes #125
In several past posts, I've tried to make urllib3 more standards-compliant, the way I saw things. In particular, #120 aimed at adding a proper
Content-Type
header for every form field, and #119 / #223 introduced RFC 2231 format for file name encoding.One consequence of that last change was my noticing that tornado doesn't decode that format, causing some test cases to fail. e11e036 monkey-patched tornado to deal with that, and I filed tornadoweb/tornado#868 to take this upstream. There bdarnell pointed out the following:
The current HTML 5 draft has a section on multipart/form-data which explicitely forbids the use of
Content-Type
for non-file fields, and also the use of RFC 2231 format for file names. It also provides its own mechanism for encoding field names.Although I'm reluctant to abandon conformance with Standards Track RFCs for the sake of some lines in a draft, it appears that the HTML 5 draft might better reflect what current web servers actually implement, or are going to implement in the intermediate future. Perople have already reported problems with our format in the wild. For this reason, I wonder whether the current choice of always using RFC 2231 might have to be revised.
I can imagine two approaches, one would be dropping RFC 2231 support completely, the other would be introducing some switch to distinguish cases. I guess (although I don't like it) that the switch should default to the HTML 5 way. I guess I'd implement the switch as a global variable, since every field needs acces to it and concurrent use of multiple standards should be very unlikely. Do you have a better suggestion?
So what would be needed for HTML 5 conforming handling of encodings? We'd need to properly distinguish between files and non-files. The
RequestField
class could probably check whether both file name and content type areNone
to recognize non-file fields. That choice could be used to control the addition of theContent-Type
header.For file names, we'd have to encode them to UTF-8 (or some user-configurable request encoding?), after making sure the result does not contain any invalid data: newlines must be stripped (or converted to spaces), quotation marks and backslashes quoted by
\
or substituted. I'm not sure which of these, and how much control we want to give our users over that process.If we allow for encodings other than UTF-8, which are not able to represent all Unicode codepoints, then we might need some more work. File name characters which are not representable must then be substituted or approximated somehow. For filed names we'd have to make sure we replace non-representable characters as XML decimal character entities instead. Other non-ASCII header fields should usually not occur, but if they do, I guess they should be treated the same way file names are.
I notice that I made a mistake in reading RFC 2388: its section 5.4 states that field names are subject to a different encoding than file names. Grrr! Whoever though of that?!? In any case, to be conforming to that as well, we might even want to revise the RFC2231-using mode if we decide to keep it at all.
I might find the time to write some of this myself, but before I start, I'd like to discuss the open questions above as to how this should get implemented.
The text was updated successfully, but these errors were encountered: