Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider HTML 5 draft for multipart/form-data #303

Closed
gagern opened this issue Jan 4, 2014 · 11 comments · Fixed by #1492
Closed

Consider HTML 5 draft for multipart/form-data #303

gagern opened this issue Jan 4, 2014 · 11 comments · Fixed by #1492

Comments

@gagern
Copy link
Contributor

gagern commented Jan 4, 2014

In several past posts, I've tried to make urllib3 more standards-compliant, the way I saw things. In particular, #120 aimed at adding a proper Content-Type header for every form field, and #119 / #223 introduced RFC 2231 format for file name encoding.

One consequence of that last change was my noticing that tornado doesn't decode that format, causing some test cases to fail. e11e036 monkey-patched tornado to deal with that, and I filed tornadoweb/tornado#868 to take this upstream. There bdarnell pointed out the following:

The current HTML 5 draft has a section on multipart/form-data which explicitely forbids the use of Content-Type for non-file fields, and also the use of RFC 2231 format for file names. It also provides its own mechanism for encoding field names.

Although I'm reluctant to abandon conformance with Standards Track RFCs for the sake of some lines in a draft, it appears that the HTML 5 draft might better reflect what current web servers actually implement, or are going to implement in the intermediate future. Perople have already reported problems with our format in the wild. For this reason, I wonder whether the current choice of always using RFC 2231 might have to be revised.

I can imagine two approaches, one would be dropping RFC 2231 support completely, the other would be introducing some switch to distinguish cases. I guess (although I don't like it) that the switch should default to the HTML 5 way. I guess I'd implement the switch as a global variable, since every field needs acces to it and concurrent use of multiple standards should be very unlikely. Do you have a better suggestion?

So what would be needed for HTML 5 conforming handling of encodings? We'd need to properly distinguish between files and non-files. The RequestField class could probably check whether both file name and content type are None to recognize non-file fields. That choice could be used to control the addition of the Content-Type header.

For file names, we'd have to encode them to UTF-8 (or some user-configurable request encoding?), after making sure the result does not contain any invalid data: newlines must be stripped (or converted to spaces), quotation marks and backslashes quoted by \ or substituted. I'm not sure which of these, and how much control we want to give our users over that process.

If we allow for encodings other than UTF-8, which are not able to represent all Unicode codepoints, then we might need some more work. File name characters which are not representable must then be substituted or approximated somehow. For filed names we'd have to make sure we replace non-representable characters as XML decimal character entities instead. Other non-ASCII header fields should usually not occur, but if they do, I guess they should be treated the same way file names are.

I notice that I made a mistake in reading RFC 2388: its section 5.4 states that field names are subject to a different encoding than file names. Grrr! Whoever though of that?!? In any case, to be conforming to that as well, we might even want to revise the RFC2231-using mode if we decide to keep it at all.

I might find the time to write some of this myself, but before I start, I'd like to discuss the open questions above as to how this should get implemented.

@shazow
Copy link
Member

shazow commented Jan 5, 2014

Oh boy. First, my sincerest thanks for your thorough research and presenting of the situation. :) As I work through my backlog of emails, please excuse and remind me if I miss an important detail. The questions I noticed:

  • Should we use a global variable to switch between "RFC 2231" and "HTML 5" modes?

    -1 on this. Whenever possible, urllib3 tries to be configurable per-request so that you could execute multiple different libraries using different features/states of urllib3 without a conflicting global state.

  • Should we abandon RFC 2231 altogether in favour of HTML 5's spec?

    -1 on this too. I'd like to at least continue to support as much customizability of the request as possible, even if the functionality is a superset of an HTTP spec. That said, it may be wise to assume a conservative default (seems to be HTML 5 in this case).

  • What about distinguishing between files and non-files in the RequestField class?

    +1, that sounds perfect.

  • <Stuff about encoding that I don't quite understand>

    Honestly I'm not sure what's the Right Thing to do here. UTF8 does sound like the most sensible Python thing to do, but about other normalization... I'm inclined to start with a helper that lives in urllib3.util and see where we want to go from there. I'm reluctant to support anything other than UTF8 by default, but we should not get in the way if people want to encode things themselves.

Did I miss anything?

I would be more than happy to continue having you lead this if you're up for it. :)

gagern added a commit to gagern/urllib3 that referenced this issue Jan 5, 2014
This passes the request object along for field rendering.
Two attributes of the request object control that rendering:

* `field_encoding_style` which may be `HTML5` or `RFC2231` controls
  the encoding of non-ascii field names and file names.
* `form_data_encoding` can be set to encode the form with something other than UTF-8.

See the current (as of this commit) version of the HTML 5 standard
for details on their idea of [how to format multipart/form-data][1].
This addresses urllib3#303.

[1]: http://www.w3.org/TR/2013/WD-html51-20130528/forms.html#multipart-form-data
@gagern
Copy link
Contributor Author

gagern commented Jan 5, 2014

Have a look at my current work in progress, and let me know whether it's going in the right direction. I've managed to pass the request object along, so that things can be configured there. The new attributes which will control behavior still need to be documented somewhere, please let me know where you'd like to see them. Should I file a pull request to discuss the code, or wait till I consider it ready for merging (i.e. docs added)?

With respect to that encoding issue: I've also introduced a request parameter to control encoding, set to UTF-8 by default. The main reason why someone might want to change that is because some HTML 5 web form without accept-charset attribute was part of a HTML file with charset other than UTF-8. In that case, HTML 5 requires using that charset, so urllib3 should provide some means to do so. Manually encoding field bodies would be part of the solution, but would not work for headers like file names which have to be Unicode for our mechanisms to work. Now everything is encoded using a configurable encoding, with xml character references as substitution, just like HTML 5 specifies.

@shazow
Copy link
Member

shazow commented Jan 5, 2014

@gagern Yes, please open a PR so we can continue the discussion there and I can leave comments on the diff. :)

@mocmex-pollen
Copy link

The above summary and links were very helpful in understading the situation, thanks! Unfortunately, there still isn't a single, clear, widely adopted way to encode non-ascii file names, and web servers don't agree on what format they accept. It's a fairly simple monkeypatch to workaround if you're interacting with a single web server, but there is still a need to be able to change encoding formats per-request. @shazow @gagern Is any one working on this (or even wants to)?

Also, here is an unbroken link to the relevant section in the current HTML5 draft: multipart-form-data

@shazow
Copy link
Member

shazow commented Apr 28, 2016

I'm not aware of anyone working on it, so you're welcome to claim it. :)

@mocmex-pollen
Copy link

I just submitted a pull request, #856, which is a proposed solution (with some code from #304) for just one aspect of this issue: encoding non-ASCII file names the HTML5 way by default. I organized this so that the concept of a file name encoding style was kept within the RequestField class, which means that it can only be chosen at the time of construction of a RequestField (i.e. __init__ or fromtuples).

These are the things which aren't included:

  • Encoding a non-ASCII field name parameter

    This StackOverflow answer has good collected information on this, but there's no clear answer in theory or practice, so I don't think I'm going to work on this. But if it was decided RFC 2047 is a good choice, email.header.Header.encode will encode it.

  • Allowing for form-data encodings other than utf-8

    I'm not going to submit code for this either (in this PR), but I thought the latest version of Follow HTML 5 draft and avoid RFC 2231 by default. #304 had a good solution: controlling this by a keyword parameter of filepost.encode_multipart_formdata and moving the codecs lookup to within the function.

  • Only setting the Content-Type header for non-file fields

    I think this is different enough from the other points to be fixed in a separate PR.This will make it easier to code and review too.

@zigarrre
Copy link

zigarrre commented Feb 26, 2018

Will this ever be fixed?

Over 4 years have passed and HTML 5 isn't a draft anymore but an official standard.

All major browsers now do it this way and most servers cannot handle the filename* field with RFC2231 encoding.

But urllib3 is still using the RFC2231 encoding which not only doesn't work in the real world but is also clearly a violation of current standards.

@haikuginger
Copy link
Contributor

@sigmavirus24, would like your thoughts on this.

@sigmavirus24
Copy link
Contributor

Some thoughts on this:

  1. RFC2231 does work in the real world, but in spite of being a standard, it's not implemented in every http server
  2. We have built into this library a whole separate library for dealing with multipart/form-data, instead any future improvements should be developed as an external library (in my opinion) and (maybe) vendored in.
  3. Unfortunately, a quick search of pypi indicates no one has implemented this yet
  4. Provided the fact that for probably 90% of our users the status quo just works, I'm not sure how urgent this issue is, regardless of its age

@zigarrre
Copy link

zigarrre commented Mar 24, 2018

  1. RFC2231 does work in the real world, but in spite of being a standard, it's not implemented in every http server

RFC2231 is the completely wrong standard and does not apply here. The HTML5 spec has a description of how to properly encode multipart/form-data: https://www.w3.org/TR/html52/sec-forms.html#multipart-form-data

RFC7578 (which is referenced by the HTML5 spec) also very clearly states that the current way urllib3 does it (using the filename* field) is wrong:

NOTE: The encoding method described in [RFC5987], which would add a
"filename*" parameter to the Content-Disposition header field, MUST
NOT be used.

Many - if not most - server implementations in the real world will NOT work with the current way urllib3 does it (e.g. any PHP applications).

  1. Provided the fact that for probably 90% of our users the status quo just works, I'm not sure how urgent this issue is, regardless of its age

As this is only a problem if one tries to send non-ASCII characters in a non-file field of a multipart encoded form many of your users wont notice this problem in the first place and the rest has resigned to using a workaround or a different library. But still you will find a huge number of people discussing this problem of urllib3 with just a quick search. So this is not a valid argument.

As I understand the standards this should be a rather easy fix as well. Just using filename instead of filename* and UTF-8 instead of RFC2231 should be enough in most cases.

@sigmavirus24
Copy link
Contributor

Just using filename instead of filename* and UTF-8 instead of RFC2231 should be enough in most cases.

This is not actually our experience, otherwise we wouldn't have gone through the trouble to implement RFC2231.

@zigarrre you seem passionate about this and seem to have time on your hands. We'd be open to using a library that you would produce that would do the correct thing per the HTML5 spec.

medmunds added a commit to anymail/django-anymail that referenced this issue Oct 10, 2018
Workaround psf/requests#4652 (urllib3/urllib3#303), where
uploaded files in multipart/form-data are improperly given RFC 2231
encoded filenames. That format is not accepted by Mailgun's API (and is
prohibited by RFC 7578), resulting in the attachments being silently
dropped.

Fix is to patch up the multipart/form-data before posting to remove
the RFC 2231 encoding.

Fixes #125
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants