Drop disk buffering and support for rewindable inputs. #1148

thesmart · 2017-03-07T20:59:15Z

I'm curious about Rack's requirement for rewindable input, and because of this requirement, the Rack::Multipart class will buffer a read-once stream to disk presumably so rewind is available. Modern cloud-based NAS throughput imposes limitation on how well this feature scales. Also, RAM is wildly cheap these days. Considering how the entire Rack ecosystem could depend on rewindable streams, I was wondering...

Is it a good time for future versions of Rack 2.x to drop disk buffering as a default behavior, and move it to an optional Rack middleware specifically for handling "large" bodies / file uploads?

I'd like to propose an alternative strategy for handling "large" bodies.

(0) Remove any use of Tempfile in Rack::Multipart or other default body parsers.

(1) Provide a configurable post body buffer limit max_body_buffer_length. If content_length.present? && content_length <= max_body_buffer_length then read it ASAP (ideally non-blocking) into rack.input as StringIO or other. Else, if content_length.nil? || content_length > max_body_buffer_length, then do not read the stream. This would mean that no chunked encodings are handled by default, and no buffering to disk for streams larger than the configured limit.

(2) When request.params or request.body is accessed, if body was not parsed due to length or chunked encoding, report a warning message that can be disabled.

(3) Provide automatic "large body" parsing as a middleware module plugin that can be optionally installed.

The text was updated successfully, but these errors were encountered:

tenderlove · 2017-05-15T18:14:51Z

Is it a good time for future versions of Rack 2.x to drop disk buffering as a default behavior, and move it to an optional Rack middleware specifically for handling "large" bodies / file uploads?

Yes. I plan on revising the spec for Rack 3 (master should be 3). It will remove the requirement for a rewindable body. If you have time to do that work, then please start sending PRs (starting with updates to the SPEC file). Thanks!

ioquatix · 2018-04-09T01:10:12Z

I recently ran into this limitation when implementing streaming requests/responses with async-http and the rack-compatible server falcon.

With HTTP/2, one can stream a request/response indefinitely, and the requirement for supporting #rewind is a bit of a pain. In addition, read(length) also requires buffering within individual HTTP2 DATA frames which makes things slightly more complicated.

I'd suggest that the rack.input API is good, but 1/ drop #rewind and limit #read to having no arguments, i.e. it just returns the next available chunk of data from the stream. This would suit HTTP/2 perfectly.

ioquatix · 2018-04-09T01:10:50Z

I would be happy to work on the SPEC file but do you want to have a discussion here first? What, if any, progress/decisions have already been made?

ioquatix · 2018-04-09T01:52:25Z

Just as an addendum to the first point, chunks of data returned by #read are managed by HTTP2 settings, which limits the amount of data per chunk (AFAIK) to 16MB.

ioquatix · 2018-04-09T12:20:56Z

I noticed that here

rack/lib/rack/multipart/parser.rb

Line 68 in 4b33af1

parser.on_read io.read(bufsize), io.eof?

rack.input must also respond to eof? which is not documented AFAIK.

ioquatix · 2018-04-10T23:01:18Z

Here is what I implemented to handle reading from a chunked data source (e.g. Transfer-Encoding: chunked, HTTP2 data frames, etc) to what is required by rack:

https://github.com/socketry/falcon/blob/master/lib/falcon/input.rb

It works reasonably well but the requirement for rewinding means that all data must be cached...

ioquatix · 2018-04-15T21:43:16Z

After implementing a client and server over HTTP/1.0, HTTP/1.1 and HTTP/2.0, I'd like to suggest that having body.empty? is a particularly useful method and works for like 99% of existing code.

When writing the response, it's useful to terminate the stream after the headers are written in HTTP/2.0 and in order to do this you need to ask body.empty? The alternative is to call #each and buffer the chunks which is inefficient for streaming responses.

ioquatix · 2018-04-15T21:49:52Z

I'd propose that the response body supports both #each and #empty? which make for efficient implementation of server response handling especially for HTTP/2.

https://github.com/socketry/async-http/blob/b466350580e1ab0c9eca27a056fb0062e43b6294/lib/async/http/body/readable.rb#L31-L43

Just for completeness, but outside the scope of rack, for writable bodies, the following definition is sufficient:

https://github.com/socketry/async-http/blob/b466350580e1ab0c9eca27a056fb0062e43b6294/lib/async/http/body/writable.rb#L70-L84

ioquatix · 2018-05-28T05:31:57Z

I've been studying more about the various options available. I found some interesting information about WebSocket/XMLHTTPRequest. Both are essentially considered deprecated according to w3c/ServiceWorker#947 (comment) and the surrounding discussion.

It looks like the forward looking API is centred around streams (fetch and ReadableStream). It's apparently possible to implement bi-directional communication using this approach, but it's something I need to continue exploring. I just wanted to provide some updates based on my own research.

ioquatix · 2018-05-28T08:02:19Z

I wanted to add two more links for reference:

1/ Here is a proposal for WebSockets over HTTP/2 which may or may not see practical usage: https://github.com/mcmanus/draft-h2ws/blob/master/draft-mcmanus-httpbis-h2-websockets-03.mkd

2/ Here is an interesting discussion about the state of bi-directional streaming, specifically with clients implemented within the browser (e.g. using fetch): whatwg/fetch#229

It will be interesting to see how these issues pan out.

ioquatix · 2018-06-08T12:45:50Z

@tenderlove How can we move forward with this?

janko · 2018-06-08T23:08:39Z

I'd suggest that the rack.input API is good, but 1/ drop #rewind and limit #read to having no arguments, i.e. it just returns the next available chunk of data from the stream. This would suit HTTP/2 perfectly.

I think we all agree on dropping the #rewind requirement, but let's hold off on #read and discuss it separately, as that would be a breaking change. The original issue is related to disk buffering, which is caused by Rack inputs having to be #rewind-able.

Is there any place where Rack needs to read the request body more than once? If it reads it the first time and parses it, I don't see why it would need to read it again. I also think it's fine that it continues to automatically parse multipart form data encoded request bodies (handled by Rack::Multipart), at least for now.

The pain point that I'm personally interested in are web servers needing to implement rewindability. The existing Rack::RewindableInput wrapper is far from ideal, as it eagerly reads the Rack input and writes all content to the tempfile, in which case the Rack application cannot utilize potential request-streaming capability that the web server might have (Unicorn, Passenger, and Falcon have it).

I noticed that here rack.input must also respond to eof? which is not documented AFAIK.

I opened a pull request a while ago to eliminate the #eof? calls, as they're not necessary, and the Rack input is not guaranteed to respond to #eof?. Puma uses a StringIO/Tempfile object for the Rackinput, which happens to respond to #eof?, but Unicorn's and Phusion Passenger's TeeInput doesn't, so multipart uploads won't work on chunked requests (in which Rack::Multipart::Parser::BoundedIO isn't used).

ioquatix · 2018-06-08T23:11:57Z

Is there any place where Rack needs to read the request body more than once?

rack/lib/rack/request.rb

Lines 332 to 355 in b72bfc9

    
           def POST 
        
             if get_header(RACK_INPUT).nil? 
        
               raise "Missing rack.input" 
        
             elsif get_header(RACK_REQUEST_FORM_INPUT) == get_header(RACK_INPUT) 
        
               get_header(RACK_REQUEST_FORM_HASH) 
        
             elsif form_data? || parseable_data? 
        
               unless set_header(RACK_REQUEST_FORM_HASH, parse_multipart) 
        
                 form_vars = get_header(RACK_INPUT).read 
        
                 # Fix for Safari Ajax postings that always append \0 
        
                 # form_vars.sub!(/\0\z/, '') # performance replacement: 
        
                 form_vars.slice!(-1) if form_vars[-1] == ?\0 
        
                 set_header RACK_REQUEST_FORM_VARS, form_vars 
        
                 set_header RACK_REQUEST_FORM_HASH, parse_query(form_vars, '&') 
        
                 get_header(RACK_INPUT).rewind 
        
               end 
        
               set_header RACK_REQUEST_FORM_INPUT, get_header(RACK_INPUT) 
        
               get_header RACK_REQUEST_FORM_HASH 
        
             else 
        
               {} 
        
             end 
        
           end

rack/lib/rack/multipart.rb

Lines 45 to 62 in dc204b1

    
             def extract_multipart(req, params = Rack::Utils.default_query_parser) 
        
               io = req.get_header(RACK_INPUT) 
        
               io.rewind 
        
               content_length = req.content_length 
        
               content_length = content_length.to_i if content_length 
        
               tempfile = req.get_header(RACK_MULTIPART_TEMPFILE_FACTORY) || Parser::TEMPFILE_FACTORY 
        
               bufsize = req.get_header(RACK_MULTIPART_BUFFER_SIZE) || Parser::BUFSIZE 
        
               info = Parser.parse io, content_length, req.get_header('CONTENT_TYPE'), tempfile, bufsize, params 
        
               req.set_header(RACK_TEMPFILES, info.tmp_files) 
        
               info.params 
        
             end 
        
             def build_multipart(params, first = true) 
        
               Generator.new(params, first).dump 
        
             end 
        
           end

(and a bunch of places in https://github.com/rack/rack/blob/b72bfc9435c118c54019efae1fedd119521b76df/lib/rack/multipart/parser.rb)

ioquatix · 2018-06-08T23:16:01Z

I think we all agree on dropping the #rewind requirement, but let's hold off on #read and discuss it separately, as that would be a breaking change. The original issue is related to disk buffering, which is caused by Rack inputs having to be #rewind-able.

Agreed.

Is there any place where Rack needs to read the request body more than once? If it reads it the first time and parses it, I don't see why it would need to read it again. I also think it's fine that it continues to automatically parse multipart form data encoded request bodies (handled by Rack::Multipart), at least for now.

Agreed. I also think it would make more sense for the request body to be parsed in a similar fashion to rack middleware. e.g. a stack of parsers which transform the input body into something more useful.

The pain point that I'm personally interested in are web servers needing to implement rewindability. The existing Rack::RewindableInput wrapper is far from ideal, as it eagerly reads the Rack input and writes all content to the tempfile, in which case the Rack application cannot utilize potential request-streaming capability that the web server might have (Unicorn, Passenger, and Falcon have it).

Agreed.

I noticed that here rack.input must also respond to eof? which is not documented AFAIK.

Yes, that's unfortunately underspecified. +1 on your PR to fix this issue.

ioquatix · 2018-06-08T23:18:01Z

Something like:

def call(env)
    if body = env['rack.body'] and multipart?(env)
        env['rack.body'] = Multipart.new(body)
    end

    return @app.call(env)
end

The same could be done for POST requests, e.g. detect and parse www-form-urlencoded, JSON.

Middleware would obviously have to be aware of the input body format some how. I've deliberately used rack.body as a new thing as changing the semantics of rack.input would break a lot of stuff.

ioquatix · 2018-06-08T23:27:29Z

I guess I should mention that if we went down the route of adding rack.body this would be the chance to change the semantics of #read (and drop #gets) to be better in alignment with HTTP buffering. Handling a length argument ensures that any conforming implementation of #read must maintain an internal buffer at some point.

janko · 2018-06-08T23:47:10Z

@ioquatix I can see where Rack calls #rewind on the Rack input object, what I was wondering was where does Rack read the Rack input again after it has been rewinded. For all I know, it could call #rewind just so that the user of Rack has the request body rewinded if they want to read it themselves. And this a feature that would officially be removed in Rack v3 if we stop demanding #rewind to be implemented.

I also think it would make more sense for the request body to be parsed in a similar fashion to rack middleware. e.g. a stack of parsers which transform the input body into something more useful.

Yeah, an opt-in middleware sounds nice.

ioquatix · 2018-06-08T23:48:14Z

read the Rack input again after it has been rewinded.

Ha, still looking for that :D

ioquatix · 2018-06-26T02:16:51Z

Actually I found one place where we use it (but I gate the functionality around respond_to?(:rewind)), is our error/exception mailer. It rewinds the input body and attaches it to the email.

I'm not sure I want to find out what would happen if someone was uploading a multi-GB file.

ioquatix · 2018-06-26T22:43:23Z

I've been doing some more work in this area.

Another problem that becomes apparent is the mismatch of high level response and low level transport.

There are at least two examples of this: Rack::ContentLength and Rack::Chunked. These two middleware are very HTTP/1 specific. HTTP/2 has no need for either of them, and in some cases it appears that Rack::Chunked could actually damage the response body.

The way I've been thinking about it, there are transport level headers such as the above, which really should be the responsibility of the web server.

Rack should be providing the basic infrastructure on which to handle requests and responses, but the actual protocol/transport level headers/behaviour should be specified by the underlying server/protocol. In this case, there is a bit of a mash up between the layers.

ioquatix · 2018-08-07T09:19:56Z

I worked around this problem.

Essentially, I use a memory buffer to cache the input, and provide support for rewind, but only if it matches a content-type which rack internally calls rewind on.

https://github.com/socketry/falcon/blob/master/lib/falcon/adapters/rewindable.rb

I know this is a hack but it was the simplest way to work around this issue which has been outstanding for almost a year. I still believe that removing the need for rewind is ultimately a good idea. Legacy apps that need it could use a middleware to implement the above logic.

ioquatix · 2022-01-25T06:55:07Z

@jeremyevans are we in agreement to remove the requirement for input to be rewindable?

jeremyevans · 2022-01-25T16:35:16Z

I'm OK with removing the requirement if we ship a middleware that will take a non-rewindable input and make it rewindable by buffering it. That way users depending on the previous behavior can use the middleware for backwards compatibility.

ioquatix · 2022-01-26T02:29:35Z

Can we move such middleware to rack-contrib?

jeremyevans · 2022-02-04T03:24:17Z

I think we should ship such a middleware in rack itself. We have Rack::RewindableInput already, so maybe we can just add a middleware to that file that wraps rack.input with Rack::RewindableInput.

This will automatically wrap rack.input with Rack::RewindableInput, for compatibility with middleware and applications that expect rewindable input. Related to rack#1148, but this does not contain any SPEC changes. It's possible for servers targetting Rack 2 compatibility to use this middleware to implement the compatibility.

ioquatix · 2022-02-12T23:11:03Z

I've just been thinking about what streaming multi-part might look like:

input = env['rack.input']
multipart = Multipart.new(input)

# Streaming
multipart.each do |part|
  part.name
  contents = part.read
end

This will automatically wrap rack.input with Rack::RewindableInput, for compatibility with middleware and applications that expect rewindable input. Related to #1148, but this does not contain any SPEC changes. It's possible for servers targetting Rack 2 compatibility to use this middleware to implement the compatibility.

jeremyevans · 2022-02-18T20:49:41Z

This is basically implemented by #1804, so I'll close this now.

ioquatix mentioned this issue May 1, 2018

Add WebSocket and SSE options #1272

Closed

ioquatix mentioned this issue Jun 8, 2018

Caching retrieved content into a file socketry/falcon#6

Closed

ioquatix added this to the v3.0.0 milestone Nov 14, 2019

This was referenced Dec 22, 2019

getting Malformed JSON in the request body when request.body is not rewindable mwpastore/sinja#19

Open

Falcon::Adapters::Input doesn't have #empty? method socketry/falcon#94

Closed

ioquatix mentioned this issue Feb 4, 2020

Considerations for response body implementing #call. #1550

Closed

ajvondrak mentioned this issue Feb 24, 2020

Twirp::Service fails to rewind the rack.input arthurnn/twirp-ruby#50

Closed

jeremyevans mentioned this issue Jul 13, 2020

Proxy all (or at least more) methods on Rack::RewindableInput? #1668

Closed

ioquatix changed the title ~~Proposal for dealing with body streams differently in 2017~~ Drop disk buffering and support for rewindable inputs. Jan 25, 2022

This was referenced Feb 5, 2022

Add Rack::RewindableInput::Middleware #1803

Merged

Remove rewindable rack.input from SPEC #1804

Merged

jeremyevans closed this as completed Feb 18, 2022

waltjones mentioned this issue Oct 7, 2022

Fix error with body.rewind with rack 3.0 rollbar/rollbar-gem#1107

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Drop disk buffering and support for rewindable inputs. #1148

Drop disk buffering and support for rewindable inputs. #1148

thesmart commented Mar 7, 2017 •

edited

tenderlove commented May 15, 2017

ioquatix commented Apr 9, 2018

ioquatix commented Apr 9, 2018

ioquatix commented Apr 9, 2018 •

edited

ioquatix commented Apr 9, 2018 •

edited

ioquatix commented Apr 10, 2018

ioquatix commented Apr 15, 2018

ioquatix commented Apr 15, 2018

ioquatix commented May 28, 2018 •

edited

ioquatix commented May 28, 2018

ioquatix commented Jun 8, 2018

janko commented Jun 8, 2018 •

edited

ioquatix commented Jun 8, 2018

ioquatix commented Jun 8, 2018

ioquatix commented Jun 8, 2018 •

edited

ioquatix commented Jun 8, 2018

janko commented Jun 8, 2018

ioquatix commented Jun 8, 2018

ioquatix commented Jun 26, 2018 •

edited

ioquatix commented Jun 26, 2018

ioquatix commented Aug 7, 2018

ioquatix commented Jan 25, 2022

jeremyevans commented Jan 25, 2022

ioquatix commented Jan 26, 2022

jeremyevans commented Feb 4, 2022

ioquatix commented Feb 12, 2022

jeremyevans commented Feb 18, 2022

Drop disk buffering and support for rewindable inputs. #1148

Drop disk buffering and support for rewindable inputs. #1148

Comments

thesmart commented Mar 7, 2017 • edited

tenderlove commented May 15, 2017

ioquatix commented Apr 9, 2018

ioquatix commented Apr 9, 2018

ioquatix commented Apr 9, 2018 • edited

ioquatix commented Apr 9, 2018 • edited

ioquatix commented Apr 10, 2018

ioquatix commented Apr 15, 2018

ioquatix commented Apr 15, 2018

ioquatix commented May 28, 2018 • edited

ioquatix commented May 28, 2018

ioquatix commented Jun 8, 2018

janko commented Jun 8, 2018 • edited

ioquatix commented Jun 8, 2018

ioquatix commented Jun 8, 2018

ioquatix commented Jun 8, 2018 • edited

ioquatix commented Jun 8, 2018

janko commented Jun 8, 2018

ioquatix commented Jun 8, 2018

ioquatix commented Jun 26, 2018 • edited

ioquatix commented Jun 26, 2018

ioquatix commented Aug 7, 2018

ioquatix commented Jan 25, 2022

jeremyevans commented Jan 25, 2022

ioquatix commented Jan 26, 2022

jeremyevans commented Feb 4, 2022

ioquatix commented Feb 12, 2022

jeremyevans commented Feb 18, 2022

thesmart commented Mar 7, 2017 •

edited

ioquatix commented Apr 9, 2018 •

edited

ioquatix commented Apr 9, 2018 •

edited

ioquatix commented May 28, 2018 •

edited

janko commented Jun 8, 2018 •

edited

ioquatix commented Jun 8, 2018 •

edited

ioquatix commented Jun 26, 2018 •

edited