Consider HTML 5 draft for multipart/form-data #303

gagern · 2014-01-04T23:59:35Z

In several past posts, I've tried to make urllib3 more standards-compliant, the way I saw things. In particular, #120 aimed at adding a proper Content-Type header for every form field, and #119 / #223 introduced RFC 2231 format for file name encoding.

One consequence of that last change was my noticing that tornado doesn't decode that format, causing some test cases to fail. e11e036 monkey-patched tornado to deal with that, and I filed tornadoweb/tornado#868 to take this upstream. There bdarnell pointed out the following:

The current HTML 5 draft has a section on multipart/form-data which explicitely forbids the use of Content-Type for non-file fields, and also the use of RFC 2231 format for file names. It also provides its own mechanism for encoding field names.

Although I'm reluctant to abandon conformance with Standards Track RFCs for the sake of some lines in a draft, it appears that the HTML 5 draft might better reflect what current web servers actually implement, or are going to implement in the intermediate future. Perople have already reported problems with our format in the wild. For this reason, I wonder whether the current choice of always using RFC 2231 might have to be revised.

I can imagine two approaches, one would be dropping RFC 2231 support completely, the other would be introducing some switch to distinguish cases. I guess (although I don't like it) that the switch should default to the HTML 5 way. I guess I'd implement the switch as a global variable, since every field needs acces to it and concurrent use of multiple standards should be very unlikely. Do you have a better suggestion?

So what would be needed for HTML 5 conforming handling of encodings? We'd need to properly distinguish between files and non-files. The RequestField class could probably check whether both file name and content type are None to recognize non-file fields. That choice could be used to control the addition of the Content-Type header.

For file names, we'd have to encode them to UTF-8 (or some user-configurable request encoding?), after making sure the result does not contain any invalid data: newlines must be stripped (or converted to spaces), quotation marks and backslashes quoted by \ or substituted. I'm not sure which of these, and how much control we want to give our users over that process.

If we allow for encodings other than UTF-8, which are not able to represent all Unicode codepoints, then we might need some more work. File name characters which are not representable must then be substituted or approximated somehow. For filed names we'd have to make sure we replace non-representable characters as XML decimal character entities instead. Other non-ASCII header fields should usually not occur, but if they do, I guess they should be treated the same way file names are.

I notice that I made a mistake in reading RFC 2388: its section 5.4 states that field names are subject to a different encoding than file names. Grrr! Whoever though of that?!? In any case, to be conforming to that as well, we might even want to revise the RFC2231-using mode if we decide to keep it at all.

I might find the time to write some of this myself, but before I start, I'd like to discuss the open questions above as to how this should get implemented.

The text was updated successfully, but these errors were encountered:

shazow · 2014-01-05T02:54:21Z

Oh boy. First, my sincerest thanks for your thorough research and presenting of the situation. :) As I work through my backlog of emails, please excuse and remind me if I miss an important detail. The questions I noticed:

Should we use a global variable to switch between "RFC 2231" and "HTML 5" modes?

-1 on this. Whenever possible, urllib3 tries to be configurable per-request so that you could execute multiple different libraries using different features/states of urllib3 without a conflicting global state.
Should we abandon RFC 2231 altogether in favour of HTML 5's spec?

-1 on this too. I'd like to at least continue to support as much customizability of the request as possible, even if the functionality is a superset of an HTTP spec. That said, it may be wise to assume a conservative default (seems to be HTML 5 in this case).
What about distinguishing between files and non-files in the RequestField class?

+1, that sounds perfect.
<Stuff about encoding that I don't quite understand>

Honestly I'm not sure what's the Right Thing to do here. UTF8 does sound like the most sensible Python thing to do, but about other normalization... I'm inclined to start with a helper that lives in urllib3.util and see where we want to go from there. I'm reluctant to support anything other than UTF8 by default, but we should not get in the way if people want to encode things themselves.

Did I miss anything?

I would be more than happy to continue having you lead this if you're up for it. :)

This passes the request object along for field rendering. Two attributes of the request object control that rendering: * `field_encoding_style` which may be `HTML5` or `RFC2231` controls the encoding of non-ascii field names and file names. * `form_data_encoding` can be set to encode the form with something other than UTF-8. See the current (as of this commit) version of the HTML 5 standard for details on their idea of [how to format multipart/form-data][1]. This addresses urllib3#303. [1]: http://www.w3.org/TR/2013/WD-html51-20130528/forms.html#multipart-form-data

gagern · 2014-01-05T11:28:28Z

Have a look at my current work in progress, and let me know whether it's going in the right direction. I've managed to pass the request object along, so that things can be configured there. The new attributes which will control behavior still need to be documented somewhere, please let me know where you'd like to see them. Should I file a pull request to discuss the code, or wait till I consider it ready for merging (i.e. docs added)?

With respect to that encoding issue: I've also introduced a request parameter to control encoding, set to UTF-8 by default. The main reason why someone might want to change that is because some HTML 5 web form without accept-charset attribute was part of a HTML file with charset other than UTF-8. In that case, HTML 5 requires using that charset, so urllib3 should provide some means to do so. Manually encoding field bodies would be part of the solution, but would not work for headers like file names which have to be Unicode for our mechanisms to work. Now everything is encoded using a configurable encoding, with xml character references as substitution, just like HTML 5 specifies.

shazow · 2014-01-05T20:48:54Z

@gagern Yes, please open a PR so we can continue the discussion there and I can leave comments on the diff. :)

mocmex-pollen · 2016-04-28T17:01:59Z

The above summary and links were very helpful in understading the situation, thanks! Unfortunately, there still isn't a single, clear, widely adopted way to encode non-ascii file names, and web servers don't agree on what format they accept. It's a fairly simple monkeypatch to workaround if you're interacting with a single web server, but there is still a need to be able to change encoding formats per-request. @shazow @gagern Is any one working on this (or even wants to)?

Also, here is an unbroken link to the relevant section in the current HTML5 draft: multipart-form-data

shazow · 2016-04-28T20:16:25Z

I'm not aware of anyone working on it, so you're welcome to claim it. :)

mocmex-pollen · 2016-05-04T20:47:52Z

I just submitted a pull request, #856, which is a proposed solution (with some code from #304) for just one aspect of this issue: encoding non-ASCII file names the HTML5 way by default. I organized this so that the concept of a file name encoding style was kept within the RequestField class, which means that it can only be chosen at the time of construction of a RequestField (i.e. __init__ or fromtuples).

These are the things which aren't included:

Encoding a non-ASCII field name parameter

This StackOverflow answer has good collected information on this, but there's no clear answer in theory or practice, so I don't think I'm going to work on this. But if it was decided RFC 2047 is a good choice, email.header.Header.encode will encode it.
Allowing for form-data encodings other than utf-8

I'm not going to submit code for this either (in this PR), but I thought the latest version of Follow HTML 5 draft and avoid RFC 2231 by default. #304 had a good solution: controlling this by a keyword parameter of filepost.encode_multipart_formdata and moving the codecs lookup to within the function.
Only setting the Content-Type header for non-file fields

I think this is different enough from the other points to be fixed in a separate PR.This will make it easier to code and review too.

zigarrre · 2018-02-26T20:50:21Z

Will this ever be fixed?

Over 4 years have passed and HTML 5 isn't a draft anymore but an official standard.

All major browsers now do it this way and most servers cannot handle the filename* field with RFC2231 encoding.

But urllib3 is still using the RFC2231 encoding which not only doesn't work in the real world but is also clearly a violation of current standards.

haikuginger · 2018-02-27T13:35:25Z

@sigmavirus24, would like your thoughts on this.

sigmavirus24 · 2018-03-24T14:54:36Z

Some thoughts on this:

RFC2231 does work in the real world, but in spite of being a standard, it's not implemented in every http server
We have built into this library a whole separate library for dealing with multipart/form-data, instead any future improvements should be developed as an external library (in my opinion) and (maybe) vendored in.
Unfortunately, a quick search of pypi indicates no one has implemented this yet
Provided the fact that for probably 90% of our users the status quo just works, I'm not sure how urgent this issue is, regardless of its age

zigarrre · 2018-03-24T15:37:40Z

RFC2231 does work in the real world, but in spite of being a standard, it's not implemented in every http server

RFC2231 is the completely wrong standard and does not apply here. The HTML5 spec has a description of how to properly encode multipart/form-data: https://www.w3.org/TR/html52/sec-forms.html#multipart-form-data

RFC7578 (which is referenced by the HTML5 spec) also very clearly states that the current way urllib3 does it (using the filename* field) is wrong:

NOTE: The encoding method described in [RFC5987], which would add a
"filename*" parameter to the Content-Disposition header field, MUST
NOT be used.

Many - if not most - server implementations in the real world will NOT work with the current way urllib3 does it (e.g. any PHP applications).

Provided the fact that for probably 90% of our users the status quo just works, I'm not sure how urgent this issue is, regardless of its age

As this is only a problem if one tries to send non-ASCII characters in a non-file field of a multipart encoded form many of your users wont notice this problem in the first place and the rest has resigned to using a workaround or a different library. But still you will find a huge number of people discussing this problem of urllib3 with just a quick search. So this is not a valid argument.

As I understand the standards this should be a rather easy fix as well. Just using filename instead of filename* and UTF-8 instead of RFC2231 should be enough in most cases.

sigmavirus24 · 2018-03-24T19:11:09Z

Just using filename instead of filename* and UTF-8 instead of RFC2231 should be enough in most cases.

This is not actually our experience, otherwise we wouldn't have gone through the trouble to implement RFC2231.

@zigarrre you seem passionate about this and seem to have time on your hands. We'd be open to using a library that you would produce that would do the correct thing per the HTML5 spec.

Workaround psf/requests#4652 (urllib3/urllib3#303), where uploaded files in multipart/form-data are improperly given RFC 2231 encoded filenames. That format is not accepted by Mailgun's API (and is prohibited by RFC 7578), resulting in the attachments being silently dropped. Fix is to patch up the multipart/form-data before posting to remove the RFC 2231 encoding. Fixes #125

gagern mentioned this issue Jan 6, 2014

Follow HTML 5 draft and avoid RFC 2231 by default. #304

Closed

mocmex-pollen mentioned this issue May 4, 2016

HTML5 non-ASCII File Names by Default #856

Closed

bdarnell mentioned this issue May 21, 2018

Non-ASCII filename uploads don't comply with RFC 7578 psf/requests#4652

Closed

jsmnbom mentioned this issue Sep 2, 2018

Could not send file with Chinese character python-telegram-bot/python-telegram-bot#1202

Closed

medmunds mentioned this issue Oct 6, 2018

Mailgun backend loses attachments with non-ASCII filenames anymail/django-anymail#125

Closed

Robbt mentioned this issue Jan 22, 2019

Change fields.py to use HTML5 non-ASCII File Names by Default instead of RFC2231 for encoding #1492

Merged

sethmlarson closed this as completed in #1492 Mar 23, 2019

Datamance mentioned this issue Apr 19, 2019

build function gives broken pipe error docker/docker-py#1824

Closed

medmunds mentioned this issue Aug 21, 2019

Mailgun: disable workaround for Unicode attachment filenames if fixed urllib3 in use anymail/django-anymail#157

Closed

eikek mentioned this issue Nov 7, 2021

multiipart file upload doesn't support unicode file names http4s/http4s#5053

Closed

mkurz mentioned this issue Jan 12, 2023

Escape Content-Disposition params according to WHATWG HTML living standard playframework/playframework#11571

Merged

PixlRainbow mentioned this issue Jun 2, 2023

post using FormData yhirose/cpp-httplib#1557

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider HTML 5 draft for multipart/form-data #303

Consider HTML 5 draft for multipart/form-data #303

gagern commented Jan 4, 2014 •

edited by shazow

shazow commented Jan 5, 2014

gagern commented Jan 5, 2014

shazow commented Jan 5, 2014

mocmex-pollen commented Apr 28, 2016

shazow commented Apr 28, 2016

mocmex-pollen commented May 4, 2016

zigarrre commented Feb 26, 2018 •

edited

haikuginger commented Feb 27, 2018

sigmavirus24 commented Mar 24, 2018

zigarrre commented Mar 24, 2018 •

edited

sigmavirus24 commented Mar 24, 2018

Consider HTML 5 draft for multipart/form-data #303

Consider HTML 5 draft for multipart/form-data #303

Comments

gagern commented Jan 4, 2014 • edited by shazow

shazow commented Jan 5, 2014

gagern commented Jan 5, 2014

shazow commented Jan 5, 2014

mocmex-pollen commented Apr 28, 2016

shazow commented Apr 28, 2016

mocmex-pollen commented May 4, 2016

zigarrre commented Feb 26, 2018 • edited

haikuginger commented Feb 27, 2018

sigmavirus24 commented Mar 24, 2018

zigarrre commented Mar 24, 2018 • edited

sigmavirus24 commented Mar 24, 2018

gagern commented Jan 4, 2014 •

edited by shazow

zigarrre commented Feb 26, 2018 •

edited

zigarrre commented Mar 24, 2018 •

edited