Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop disk buffering and support for rewindable inputs. #1148

Closed
thesmart opened this issue Mar 7, 2017 · 27 comments
Closed

Drop disk buffering and support for rewindable inputs. #1148

thesmart opened this issue Mar 7, 2017 · 27 comments
Milestone

Comments

@thesmart
Copy link

thesmart commented Mar 7, 2017

I'm curious about Rack's requirement for rewindable input, and because of this requirement, the Rack::Multipart class will buffer a read-once stream to disk presumably so rewind is available. Modern cloud-based NAS throughput imposes limitation on how well this feature scales. Also, RAM is wildly cheap these days. Considering how the entire Rack ecosystem could depend on rewindable streams, I was wondering...

Is it a good time for future versions of Rack 2.x to drop disk buffering as a default behavior, and move it to an optional Rack middleware specifically for handling "large" bodies / file uploads?

I'd like to propose an alternative strategy for handling "large" bodies.

(0) Remove any use of Tempfile in Rack::Multipart or other default body parsers.

(1) Provide a configurable post body buffer limit max_body_buffer_length. If content_length.present? && content_length <= max_body_buffer_length then read it ASAP (ideally non-blocking) into rack.input as StringIO or other. Else, if content_length.nil? || content_length > max_body_buffer_length, then do not read the stream. This would mean that no chunked encodings are handled by default, and no buffering to disk for streams larger than the configured limit.

(2) When request.params or request.body is accessed, if body was not parsed due to length or chunked encoding, report a warning message that can be disabled.

(3) Provide automatic "large body" parsing as a middleware module plugin that can be optionally installed.

@tenderlove
Copy link
Member

Is it a good time for future versions of Rack 2.x to drop disk buffering as a default behavior, and move it to an optional Rack middleware specifically for handling "large" bodies / file uploads?

Yes. I plan on revising the spec for Rack 3 (master should be 3). It will remove the requirement for a rewindable body. If you have time to do that work, then please start sending PRs (starting with updates to the SPEC file). Thanks!

@ioquatix
Copy link
Member

ioquatix commented Apr 9, 2018

I recently ran into this limitation when implementing streaming requests/responses with async-http and the rack-compatible server falcon.

With HTTP/2, one can stream a request/response indefinitely, and the requirement for supporting #rewind is a bit of a pain. In addition, read(length) also requires buffering within individual HTTP2 DATA frames which makes things slightly more complicated.

I'd suggest that the rack.input API is good, but 1/ drop #rewind and limit #read to having no arguments, i.e. it just returns the next available chunk of data from the stream. This would suit HTTP/2 perfectly.

@ioquatix
Copy link
Member

ioquatix commented Apr 9, 2018

I would be happy to work on the SPEC file but do you want to have a discussion here first? What, if any, progress/decisions have already been made?

@ioquatix
Copy link
Member

ioquatix commented Apr 9, 2018

Just as an addendum to the first point, chunks of data returned by #read are managed by HTTP2 settings, which limits the amount of data per chunk (AFAIK) to 16MB.

@ioquatix
Copy link
Member

ioquatix commented Apr 9, 2018

I noticed that here

parser.on_read io.read(bufsize), io.eof?
rack.input must also respond to eof? which is not documented AFAIK.

@ioquatix
Copy link
Member

Here is what I implemented to handle reading from a chunked data source (e.g. Transfer-Encoding: chunked, HTTP2 data frames, etc) to what is required by rack:

https://github.com/socketry/falcon/blob/master/lib/falcon/input.rb

It works reasonably well but the requirement for rewinding means that all data must be cached...

@ioquatix
Copy link
Member

After implementing a client and server over HTTP/1.0, HTTP/1.1 and HTTP/2.0, I'd like to suggest that having body.empty? is a particularly useful method and works for like 99% of existing code.

When writing the response, it's useful to terminate the stream after the headers are written in HTTP/2.0 and in order to do this you need to ask body.empty? The alternative is to call #each and buffer the chunks which is inefficient for streaming responses.

@ioquatix
Copy link
Member

I'd propose that the response body supports both #each and #empty? which make for efficient implementation of server response handling especially for HTTP/2.

https://github.com/socketry/async-http/blob/b466350580e1ab0c9eca27a056fb0062e43b6294/lib/async/http/body/readable.rb#L31-L43

Just for completeness, but outside the scope of rack, for writable bodies, the following definition is sufficient:

https://github.com/socketry/async-http/blob/b466350580e1ab0c9eca27a056fb0062e43b6294/lib/async/http/body/writable.rb#L70-L84

@ioquatix
Copy link
Member

ioquatix commented May 28, 2018

I've been studying more about the various options available. I found some interesting information about WebSocket/XMLHTTPRequest. Both are essentially considered deprecated according to w3c/ServiceWorker#947 (comment) and the surrounding discussion.

It looks like the forward looking API is centred around streams (fetch and ReadableStream). It's apparently possible to implement bi-directional communication using this approach, but it's something I need to continue exploring. I just wanted to provide some updates based on my own research.

@ioquatix
Copy link
Member

I wanted to add two more links for reference:

1/ Here is a proposal for WebSockets over HTTP/2 which may or may not see practical usage: https://github.com/mcmanus/draft-h2ws/blob/master/draft-mcmanus-httpbis-h2-websockets-03.mkd

2/ Here is an interesting discussion about the state of bi-directional streaming, specifically with clients implemented within the browser (e.g. using fetch): whatwg/fetch#229

It will be interesting to see how these issues pan out.

@ioquatix
Copy link
Member

ioquatix commented Jun 8, 2018

@tenderlove How can we move forward with this?

@janko
Copy link
Contributor

janko commented Jun 8, 2018

I'd suggest that the rack.input API is good, but 1/ drop #rewind and limit #read to having no arguments, i.e. it just returns the next available chunk of data from the stream. This would suit HTTP/2 perfectly.

I think we all agree on dropping the #rewind requirement, but let's hold off on #read and discuss it separately, as that would be a breaking change. The original issue is related to disk buffering, which is caused by Rack inputs having to be #rewind-able.

Is there any place where Rack needs to read the request body more than once? If it reads it the first time and parses it, I don't see why it would need to read it again. I also think it's fine that it continues to automatically parse multipart form data encoded request bodies (handled by Rack::Multipart), at least for now.

The pain point that I'm personally interested in are web servers needing to implement rewindability. The existing Rack::RewindableInput wrapper is far from ideal, as it eagerly reads the Rack input and writes all content to the tempfile, in which case the Rack application cannot utilize potential request-streaming capability that the web server might have (Unicorn, Passenger, and Falcon have it).

I noticed that here rack.input must also respond to eof? which is not documented AFAIK.

I opened a pull request a while ago to eliminate the #eof? calls, as they're not necessary, and the Rack input is not guaranteed to respond to #eof?. Puma uses a StringIO/Tempfile object for the Rackinput, which happens to respond to #eof?, but Unicorn's and Phusion Passenger's TeeInput doesn't, so multipart uploads won't work on chunked requests (in which Rack::Multipart::Parser::BoundedIO isn't used).

@ioquatix
Copy link
Member

ioquatix commented Jun 8, 2018

Is there any place where Rack needs to read the request body more than once?

  • rack/lib/rack/request.rb

    Lines 332 to 355 in b72bfc9

    def POST
    if get_header(RACK_INPUT).nil?
    raise "Missing rack.input"
    elsif get_header(RACK_REQUEST_FORM_INPUT) == get_header(RACK_INPUT)
    get_header(RACK_REQUEST_FORM_HASH)
    elsif form_data? || parseable_data?
    unless set_header(RACK_REQUEST_FORM_HASH, parse_multipart)
    form_vars = get_header(RACK_INPUT).read
    # Fix for Safari Ajax postings that always append \0
    # form_vars.sub!(/\0\z/, '') # performance replacement:
    form_vars.slice!(-1) if form_vars[-1] == ?\0
    set_header RACK_REQUEST_FORM_VARS, form_vars
    set_header RACK_REQUEST_FORM_HASH, parse_query(form_vars, '&')
    get_header(RACK_INPUT).rewind
    end
    set_header RACK_REQUEST_FORM_INPUT, get_header(RACK_INPUT)
    get_header RACK_REQUEST_FORM_HASH
    else
    {}
    end
    end
  • def extract_multipart(req, params = Rack::Utils.default_query_parser)
    io = req.get_header(RACK_INPUT)
    io.rewind
    content_length = req.content_length
    content_length = content_length.to_i if content_length
    tempfile = req.get_header(RACK_MULTIPART_TEMPFILE_FACTORY) || Parser::TEMPFILE_FACTORY
    bufsize = req.get_header(RACK_MULTIPART_BUFFER_SIZE) || Parser::BUFSIZE
    info = Parser.parse io, content_length, req.get_header('CONTENT_TYPE'), tempfile, bufsize, params
    req.set_header(RACK_TEMPFILES, info.tmp_files)
    info.params
    end
    def build_multipart(params, first = true)
    Generator.new(params, first).dump
    end
    end
    (and a bunch of places in https://github.com/rack/rack/blob/b72bfc9435c118c54019efae1fedd119521b76df/lib/rack/multipart/parser.rb)

@ioquatix
Copy link
Member

ioquatix commented Jun 8, 2018

I think we all agree on dropping the #rewind requirement, but let's hold off on #read and discuss it separately, as that would be a breaking change. The original issue is related to disk buffering, which is caused by Rack inputs having to be #rewind-able.

Agreed.

Is there any place where Rack needs to read the request body more than once? If it reads it the first time and parses it, I don't see why it would need to read it again. I also think it's fine that it continues to automatically parse multipart form data encoded request bodies (handled by Rack::Multipart), at least for now.

Agreed. I also think it would make more sense for the request body to be parsed in a similar fashion to rack middleware. e.g. a stack of parsers which transform the input body into something more useful.

The pain point that I'm personally interested in are web servers needing to implement rewindability. The existing Rack::RewindableInput wrapper is far from ideal, as it eagerly reads the Rack input and writes all content to the tempfile, in which case the Rack application cannot utilize potential request-streaming capability that the web server might have (Unicorn, Passenger, and Falcon have it).

Agreed.

I noticed that here rack.input must also respond to eof? which is not documented AFAIK.

Yes, that's unfortunately underspecified. +1 on your PR to fix this issue.

@ioquatix
Copy link
Member

ioquatix commented Jun 8, 2018

Something like:

def call(env)
    if body = env['rack.body'] and multipart?(env)
        env['rack.body'] = Multipart.new(body)
    end

    return @app.call(env)
end

The same could be done for POST requests, e.g. detect and parse www-form-urlencoded, JSON.

Middleware would obviously have to be aware of the input body format some how. I've deliberately used rack.body as a new thing as changing the semantics of rack.input would break a lot of stuff.

@ioquatix
Copy link
Member

ioquatix commented Jun 8, 2018

I guess I should mention that if we went down the route of adding rack.body this would be the chance to change the semantics of #read (and drop #gets) to be better in alignment with HTTP buffering. Handling a length argument ensures that any conforming implementation of #read must maintain an internal buffer at some point.

@janko
Copy link
Contributor

janko commented Jun 8, 2018

@ioquatix I can see where Rack calls #rewind on the Rack input object, what I was wondering was where does Rack read the Rack input again after it has been rewinded. For all I know, it could call #rewind just so that the user of Rack has the request body rewinded if they want to read it themselves. And this a feature that would officially be removed in Rack v3 if we stop demanding #rewind to be implemented.

I also think it would make more sense for the request body to be parsed in a similar fashion to rack middleware. e.g. a stack of parsers which transform the input body into something more useful.

Yeah, an opt-in middleware sounds nice.

@ioquatix
Copy link
Member

ioquatix commented Jun 8, 2018

read the Rack input again after it has been rewinded.

Ha, still looking for that :D

@ioquatix
Copy link
Member

ioquatix commented Jun 26, 2018

Actually I found one place where we use it (but I gate the functionality around respond_to?(:rewind)), is our error/exception mailer. It rewinds the input body and attaches it to the email.

I'm not sure I want to find out what would happen if someone was uploading a multi-GB file.

@ioquatix
Copy link
Member

I've been doing some more work in this area.

Another problem that becomes apparent is the mismatch of high level response and low level transport.

There are at least two examples of this: Rack::ContentLength and Rack::Chunked. These two middleware are very HTTP/1 specific. HTTP/2 has no need for either of them, and in some cases it appears that Rack::Chunked could actually damage the response body.

The way I've been thinking about it, there are transport level headers such as the above, which really should be the responsibility of the web server.

Rack should be providing the basic infrastructure on which to handle requests and responses, but the actual protocol/transport level headers/behaviour should be specified by the underlying server/protocol. In this case, there is a bit of a mash up between the layers.

@ioquatix
Copy link
Member

ioquatix commented Aug 7, 2018

I worked around this problem.

Essentially, I use a memory buffer to cache the input, and provide support for rewind, but only if it matches a content-type which rack internally calls rewind on.

https://github.com/socketry/falcon/blob/master/lib/falcon/adapters/rewindable.rb

I know this is a hack but it was the simplest way to work around this issue which has been outstanding for almost a year. I still believe that removing the need for rewind is ultimately a good idea. Legacy apps that need it could use a middleware to implement the above logic.

@ioquatix
Copy link
Member

@jeremyevans are we in agreement to remove the requirement for input to be rewindable?

@ioquatix ioquatix changed the title Proposal for dealing with body streams differently in 2017 Drop disk buffering and support for rewindable inputs. Jan 25, 2022
@jeremyevans
Copy link
Contributor

I'm OK with removing the requirement if we ship a middleware that will take a non-rewindable input and make it rewindable by buffering it. That way users depending on the previous behavior can use the middleware for backwards compatibility.

@ioquatix
Copy link
Member

Can we move such middleware to rack-contrib?

@jeremyevans
Copy link
Contributor

I think we should ship such a middleware in rack itself. We have Rack::RewindableInput already, so maybe we can just add a middleware to that file that wraps rack.input with Rack::RewindableInput.

jeremyevans added a commit to jeremyevans/rack that referenced this issue Feb 4, 2022
This will automatically wrap rack.input with Rack::RewindableInput,
for compatibility with middleware and applications that expect
rewindable input.

Related to rack#1148, but this does not contain any SPEC changes. It's
possible for servers targetting Rack 2 compatibility to use this
middleware to implement the compatibility.
jeremyevans added a commit to jeremyevans/rack that referenced this issue Feb 4, 2022
This will automatically wrap rack.input with Rack::RewindableInput,
for compatibility with middleware and applications that expect
rewindable input.

Related to rack#1148, but this does not contain any SPEC changes. It's
possible for servers targetting Rack 2 compatibility to use this
middleware to implement the compatibility.
@ioquatix
Copy link
Member

I've just been thinking about what streaming multi-part might look like:

input = env['rack.input']
multipart = Multipart.new(input)

# Streaming
multipart.each do |part|
  part.name
  contents = part.read
end

ioquatix pushed a commit that referenced this issue Feb 12, 2022
This will automatically wrap rack.input with Rack::RewindableInput,
for compatibility with middleware and applications that expect
rewindable input.

Related to #1148, but this does not contain any SPEC changes. It's
possible for servers targetting Rack 2 compatibility to use this
middleware to implement the compatibility.
@jeremyevans
Copy link
Contributor

This is basically implemented by #1804, so I'll close this now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants