
Import maps and performance (HTTP/2) #2697

Open
nateberkopec opened this issue Sep 13, 2021 · 54 comments

@nateberkopec
Member

Rails is moving to support import maps by default as its new JS solution. This means that Rails apps can make 100+ requests to the application server as they traverse the import map.

To make this fast in Puma, I have a few concerns:

  1. HTTP/2. This is almost certainly a starting requirement. So, we have to decide what approach we're going to take here: graft helper libraries onto the existing native extensions, scrap the parser entirely and use something in Ruby, etc.
  2. Benchmark. It should be easy to create a before/after benchmark here.
  3. Rack. I remember that the last time we evaluated HTTP/2 ("HTTP 2.0", #454), there were a lot of concerns about how this would all map to Rack, which were basically never resolved.

Other things will probably come up as we're spiking this. Right now, I want to know:

  1. Who's interested in getting involved
  2. What technical approach we're going to take for HTTP/2 support
  3. What other technical barriers might exist (I/O?) for making fast import maps a reality.
@nateberkopec
Member Author

> What technical approach we're going to take for HTTP/2 support

Here are the options as I see them:

  1. nghttp2. We would need a similar library for Java. This has the disadvantage of pushing more logic into native extensions, and our contributor base in that area is much more limited, so this would end up being a lot of hero-coding by a small set of people.
  2. http2 gem by Ilya Grigorik. Advantage: maintained by someone very knowledgeable. Disadvantage: not sure how this would integrate into the existing code around HTTP/1.1.
  3. protocol-http2. Advantage of also supporting HTTP/1 via protocol-http1, which means we would get to rip out all the HTTP-related native extensions.

@MSP-Greg
Member

MSP-Greg commented Sep 14, 2021

> This means that Rails apps can make 100+ requests to the application server as they traverse the import map.

A bit confused. Are 'import maps' kind of like RubyGems for JS in the client/browser? IOW, they're not generating requests to the application server, but to web repositories of the JS 'packages'?

Regardless, I'm interested in HTTP/2 and HTTP/3. Currently (with HTTP/1.1), one 'conversation/request' is communicated to the Rack app per socket. With HTTP/2, one socket can have many streams, and hence multiple 'conversations/requests' happen on one socket. How does Rack handle that? I must be missing something; I probably need to look at some examples of Ruby HTTP/2 servers...

See https://github.com/rails/importmap-rails#what-if-i-dont-like-to-use-a-javascript-cdn

To me that implies that JS files/packages can be served either from the 'application server' (Puma, etc.) or from CDNs...

@brenogazzola

brenogazzola commented Sep 16, 2021

I’m willing to help. Don’t have any experience with app servers, but not afraid of reading code and debugging stuff until I figure out what’s happening.

Worst case, I can at least provide a production app running rails master to test a real workload (not necessarily through importmaps, but by having webpack chunk as much as possible)

@brenogazzola

nghttp2 sounds like the least ideal of the three. It feels like it would be the fastest, but a few extra milliseconds per request sound better than bugs staying open longer and burned-out maintainers.

protocol-http2 sounds like a fallback option, unless the native extensions are a maintenance problem right now. If they are stable then I’d say there’s no reason to change a winning team.

http2 is probably the best option. Full Ruby, and it will add the least amount of unknowns to the current code, even if it requires a bit of extra effort to get it integrated. Silly question from someone who has never touched Puma code: does it need to integrate with the HTTP/1 code? Can’t it be an XOR deal? As in “the client supports HTTP/2, so let’s route all its requests through the http2 gem code”?

@ioquatix
Contributor

ioquatix commented Sep 16, 2021

Don't underestimate the complexity of HTTP/2.

I'm not sure I'd call the http2 gem well maintained.

The biggest limitation Puma has in this regard is simply the limited number of threads w.r.t. simultaneous connections.

There is also the reality that most load balancers don't yet support HTTP/2 on the backend.

Falcon supports HTTP/2 with Rack with no changes to applications. Rack essentially implements CGI (HTTP/1.1), and there is a reasonably well-defined mapping from HTTP/2 -> HTTP/1.1, which we effectively implement. However, I wanted to extend Rack to embrace concurrency (rack/rack#1745). To me, this is one of the biggest advantages of HTTP/2 that we can leverage within the user-facing code.

Supporting multiple streams essentially means you'd be implementing an HTTP/2 -> HTTP/1 (application) gateway. You might as well just use Falcon in the connection acceptor thread, and this is only half suggested as a joke; I'm at least somewhat serious. 99% of Falcon is Async::HTTP. You could literally pull in Async::HTTP, buffer the entire request, and shoot it off to the Puma worker pool, and you'd be able to implement this in about a day and about 100 lines of code.
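A minimal sketch of that idea, assuming the async and async-http gems; `run_in_puma_pool` is a hypothetical helper standing in for the handoff to Puma's existing thread pool:

```ruby
require 'async'
require 'async/http/server'
require 'async/http/endpoint'
require 'protocol/http/response'

endpoint = Async::HTTP::Endpoint.parse('http://0.0.0.0:9292')

Async do
  server = Async::HTTP::Server.for(endpoint) do |request|
    # Buffer the entire request body before handing off.
    body = request.body&.join

    # Hypothetical handoff: block this fiber (not the reactor) until
    # Puma's worker pool returns a Rack-style status/headers/chunks triple,
    # where chunks is an array of body strings.
    status, headers, chunks = run_in_puma_pool(request, body)

    Protocol::HTTP::Response[status, headers, chunks]
  end

  server.run
end
```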

@ioquatix
Contributor

By the way, I've often thought of making a "threaded" adaptor for Falcon which executes requests in a thread pool, which would match Rails' expectations about how the execution of web requests works, rather than Falcon/Async, which gives you a well-defined concurrency model that unfortunately still bumps up against Rails' assumptions about the world ActiveRecord works in. But the reality is, even this is slowly changing.

Puma is an incredibly important piece of technology. This probably forms part of a larger conversation, but I'm not sure Puma needs HTTP/2. I think there are still improvements to be made to work scheduling and the thread pool implementation, TBH. I don't see Falcon and Puma as being in competition with each other except on the most friendly terms. They serve different purposes.

@nateberkopec
Member Author

nateberkopec commented Sep 16, 2021

So it sounds like a benchmark would be a good first step, then, to get an idea of the extent of the issue. That will let us try different configurations, such as putting h2o or nginx in front of Puma, which will be basically the same perf-wise as Puma supporting HTTP/2 proper, because as @ioquatix mentioned, without a concurrency model change we're basically just implementing an HTTP/2 -> HTTP/1 gateway.

I want to be clear that the motivation here is not "Puma must support HTTP/2" but "Puma should make the new import-map-driven experience in Rails 7 as fast as possible"; HTTP/2 is just a strong hunch about how we accomplish that. Maybe there are other things we can do. I'm also wondering whether, because import maps are primarily about serving files from disk, there's a shortcut we can take here that makes import maps fast by avoiding Rack entirely.

I think this weekend I'll take some time to make a benchmark and that should provide some insight into next steps, then.

@byroot
Contributor

byroot commented Sep 16, 2021

"Puma should make the new import-map-driven experience in Rails 7 as fast as possible"

You mean as a development server, or as a production one? Because I can see the appeal for the former, but for the latter, if your asset requests hit your Ruby server, you've kind of already lost.

And even then, operationally speaking, it doesn't make much sense to multiplex requests all the way to the application server. If you have multiple requests in the same connection, it's much better if your reverse proxy dispatches them to distinct workers. Hence why I don't really see the appeal of HTTP/2 for Ruby application servers.

@nateberkopec
Member Author

Definitely mostly focusing on the former, but I don't think we can ignore the prod experience either. Puma has always been an app server that you can just throw up on MY_FAV_VPS and have a good experience, and if Rails 7 makes that untrue, that would be a loss.

@ioquatix
Contributor

Leveraging HTTP/1 with splice/sendfile might be sufficient, and you could totally build a lightweight fiber scheduler for file IO, which seems like it would solve any static-file-serving overhead.
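For context, Ruby already exposes that fast path: IO.copy_stream will use sendfile(2) on platforms that support it when copying from a file to a socket. A minimal sketch (`client_socket` is a hypothetical connected socket):

```ruby
# Copy the file straight to the client; on Linux this can be a single
# in-kernel sendfile(2) copy with no Ruby string allocation.
File.open('public/assets/application.js', 'rb') do |file|
  IO.copy_stream(file, client_socket)
end
```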

@nateberkopec
Member Author

So, there appear to be two reasons why this isn't fast today:

  1. Server response times for JS in Rails 7 are actually kind of bad. It takes about 60ms to get the first byte down the pipe.
  2. The six-connections-per-domain limit (a limit imposed by browsers themselves, not by the protocol).

If you take either of those things away, Puma rips just as fast as anything else. HTTP/2 removes the latter limit. I can maybe work with Rails on #1, since 60ms to put a file down the pipe seems kind of bad, but I don't know enough about how that works in Rails to know whether it will be hard or not.

So that means this problem isn't significantly I/O bound.

Here is my benchmark setup using k6. I'm still working on the nginx reverse proxy there to see if an HTTP/2 -> 1 gateway solves the problem.

@nateberkopec
Member Author

Also, re: TTFB, that will definitely be worse in production because after you set up the initial connection, you still need to make a full roundtrip to ask for the next file you want. So maybe fixing Rails' TTFB here won't do much.

@nateberkopec
Member Author

(Screenshot: example of what the HOL-blocking looks like today.)

@brenogazzola

brenogazzola commented Sep 19, 2021

If you do decide that Rails needs fixing, and the problem is Sprockets (which is responsible for serving the files), I can help, as I’ve just started spending some time there to fix a couple of bugs.

We will also have to give some care to the new Propshaft gem, which will replace Sprockets:
https://github.com/rails/propshaft

@brenogazzola

brenogazzola commented Sep 19, 2021

That said, JS and CSS files are requested often enough that they should have close to a 100% cache hit rate from CDNs.

@nateberkopec
Member Author

@brenogazzola Not anymore. Browser caches are partitioned now. That essentially means the cache key of a request includes the domain name of the current window. Third-party CDNs will not hit any more often than first-party requests.

@brenogazzola

What I get from the article is that the old “use jQuery from a CDN because if the user visited another website that has it, it will already be cached” is no longer valid, is that right? 🤔

What I meant is: if your app is using Cloudflare, and your users are in the US, and you deploy a new JS file, Puma will only need to serve it 75 times (25 PoPs × 3 requests until cache hit) before the requests stop reaching Puma.

Other CDNs have their own rules, but it seems to me the motivation here is “there are going to be many JS files now, instead of one, so let’s make sure Puma can serve them fast”, and it will only matter for those initial 75 requests.

@ioquatix
Contributor

By only supporting HTTP/1, Puma is in a unique position to serve static files very efficiently using splice/sendfile. I don't know the current implementation, but we should definitely take advantage of it if possible.

@brenogazzola

Nate will know this better than me, but AFAIK Puma is getting the CSS/JS file from Sprockets, and both implement the Rack spec, so Sprockets is reading the content of the file and returning it to Puma through the call method. No send_file involved?

@byroot
Contributor

byroot commented Sep 20, 2021

@brenogazzola there's no more sprockets, that's the point.

@nateberkopec
Member Author

@ioquatix I don't think that's important right now, because as my benchmark shows, HOL-blocking is the bottleneck and not I/O speed.

@brenogazzola Yes, in production, CDNs will alleviate a lot of the load from Puma. However, Puma should be a good and fast experience without any external dependencies, as it has always been. Also, we don't have CDNs in development, and in dev, Puma currently takes about 8 seconds to fully satisfy the downloads for an import-mapped app with 150 dependencies, which is not a great experience.

@MSP-Greg
Member

MSP-Greg commented Sep 20, 2021

Re the 150 dependencies, what type of response bodies are being returned by Rack? Are they enums/arrays, chunked, or do they respond to to_path, etc.?

If they're enums/arrays or chunked, depending on the 'length' and byte size, there may be an improvement with #2696.
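For context, these body shapes differ in how much the server can optimize them. An illustrative sketch (these are not Puma's or Rack's actual classes):

```ruby
# Array body: everything is already in memory; the server just writes it.
array_body = ['<html>...</html>']

# Streaming/chunked body: chunks are produced lazily via #each.
class StreamingBody
  def each
    10.times { |i| yield "chunk #{i}\n" }
  end
end

# File-backed body: #to_path lets the server bypass #each entirely and
# use an optimized copy (sendfile / IO.copy_stream) on the file itself.
class FileBody
  def initialize(path)
    @path = path
  end

  def each
    File.open(@path, 'rb') do |f|
      yield f.read(16_384) until f.eof?
    end
  end

  def to_path
    @path
  end
end
```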

@brenogazzola

> Puma currently takes about 8 seconds to fully satisfy the downloads for an import-mapped app with 150 dependencies, which is not a great experience.

😱. Ok, I'm convinced, haha

@byroot
Contributor

byroot commented Sep 20, 2021

> I don't think that's important right now, because as my benchmark shows, HOL-blocking is the bottleneck and not I/O speed.

Well, if you serve each individual static file request, say, 2x faster, it would have a major impact on that HOL-blocking.

I dug a bit into the Rails & Rack code (ActionDispatch::Static and Rack::Files), and they do write(read()). Leveraging IO.copy_stream (sendfile / splice, etc.) would require quite a bit of work though.
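To make the difference concrete, a simplified sketch (`file` and `socket` stand for already-open IO objects):

```ruby
# write(read()): every chunk is pulled into a Ruby string on the heap,
# then pushed back out through another syscall.
while (chunk = file.read(16_384))
  socket.write(chunk)
end

# IO.copy_stream: the copy can happen in the kernel (sendfile(2) or
# splice(2) on Linux), never materializing the bytes as Ruby strings.
IO.copy_stream(file, socket)
```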

@nateberkopec
Member Author

nateberkopec commented Sep 20, 2021

@byroot If you make 150 concurrent requests for a ~70kb file against Puma running Rack::Static (see my benchmark), it completes in less than 100 milliseconds. So I think the latency Rails is adding is coming from somewhere else in the request, not I/O. I can open up stackprof and look at it later.

casperisfine pushed a commit to casperisfine/puma that referenced this issue Sep 20, 2021
Ref: https://github.com/puma/puma/issues/2697

```
$ benchmarks/wrk/big_response.sh
Puma starting in single mode...
* Puma version: 5.5.0 (ruby 3.0.2-p107) ("Zawgyi")
*  Min threads: 4
*  Max threads: 4
*  Environment: development
*          PID: 17879
* Listening on http://0.0.0.0:9292
Use Ctrl-C to stop
Running 1m test @ http://localhost:9292
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.37ms    5.89ms  48.28ms   94.46%
    Req/Sec     0.88k   148.97     1.07k    82.08%
  Latency Distribution
     50%    2.21ms
     75%    2.78ms
     90%    4.09ms
     99%   35.75ms
  105651 requests in 1.00m, 108.24GB read
Requests/sec:   1758.39
Transfer/sec:      1.80GB
- Gracefully stopping, waiting for requests to finish
```

```
$ benchmarks/wrk/big_file.sh
Puma starting in single mode...
* Puma version: 5.5.0 (ruby 3.0.2-p107) ("Zawgyi")
*  Min threads: 4
*  Max threads: 4
*  Environment: development
*          PID: 18034
* Listening on http://0.0.0.0:9292
Use Ctrl-C to stop
Running 1m test @ http://localhost:9292
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.06ms    1.09ms  20.98ms   97.94%
    Req/Sec     1.85k   150.69     2.03k    89.92%
  Latency Distribution
     50%    0.94ms
     75%    1.03ms
     90%    1.21ms
     99%    4.91ms
  221380 requests in 1.00m, 226.81GB read
Requests/sec:   3689.18
Transfer/sec:      3.78GB
- Gracefully stopping, waiting for requests to finish
```
@byroot
Contributor

byroot commented Sep 20, 2021

I did a quick proof of concept: #2703

The current write(read()):

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.37ms    5.89ms  48.28ms   94.46%
    Req/Sec     0.88k   148.97     1.07k    82.08%
  Latency Distribution
     50%    2.21ms
     75%    2.78ms
     90%    4.09ms
     99%   35.75ms
  105651 requests in 1.00m, 108.24GB read
Requests/sec:   1758.39
Transfer/sec:      1.80GB

Using IO.copy_stream:

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.06ms    1.09ms  20.98ms   97.94%
    Req/Sec     1.85k   150.69     2.03k    89.92%
  Latency Distribution
     50%    0.94ms
     75%    1.03ms
     90%    1.21ms
     99%    4.91ms
  221380 requests in 1.00m, 226.81GB read
Requests/sec:   3689.18
Transfer/sec:      3.78GB

@byroot
Contributor

byroot commented Sep 20, 2021

> IO.copy_stream will be a fast path in many cases.

Yeah, one big downside though is that it doesn't have a proper timeout API. You can wrap it with Timeout.timeout and that will work, but it's far from ideal. There's likely an opportunity to improve this upstream.
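The stopgap sketched out (`WRITE_TIMEOUT`, `file`, and `socket` are hypothetical):

```ruby
require 'timeout'

# Coarse wall-clock guard around the whole copy. It works, but it can
# interrupt the copy at an arbitrary point via Timeout's watchdog
# thread, which is why it's far from ideal.
Timeout.timeout(WRITE_TIMEOUT) do
  IO.copy_stream(file, socket)
end
```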

> Regarding rack, response.to_path I believe is an "official" interface for "sendfile"-like responses.

For delegating to the reverse proxy, yes: https://github.com/rack/rack/blob/d15dd728440710cfc35ed155d66a98dc2c07ae42/lib/rack/sendfile.rb

For the purpose of using IO.copy_stream, we could look at either to_io, or respond_to?(:read) || respond_to?(:readpartial).
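A sketch of that dispatch, assuming `body` is the Rack response body and `socket` the client connection:

```ruby
# Prefer a raw IO when the body exposes one; fall back to #each otherwise.
io =
  if body.respond_to?(:to_io)
    body.to_io
  elsif body.respond_to?(:read) || body.respond_to?(:readpartial)
    body
  end

if io
  IO.copy_stream(io, socket)                # kernel-assisted copy where possible
else
  body.each { |chunk| socket.write(chunk) } # generic Rack body path
end
```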

Also, I just realized my benchmark isn't quite perfect: big_response sends a string directly from memory, so it doesn't have to open and read a file, so the difference might be even bigger.

@MSP-Greg
Member

I added big_file.ru to some of the code that's in various PRs, and updated the code in #2696 to use IO.copy_stream. Results below with the wrk code, with Puma running -w4 -t5:5, with a 1,074 kB body size:

Master

────────wrk────────  ─Request─time─distribution─(ms)─  Worker─requests  ─wrk─requests─
 -t    -c   req/sec   50%    75%    90%    99%   100%  spread   total     total   bad
 20   100     1164     21    383    606    738    816   3.12    17701     17701     0
 26   130     1168     22    531    832   1010   1100   5.26    17730     17730     0
 35   175     1161     22    750   1150   1410   1510   2.29    17675     17675     0
 46   230     1164     23   1010   1550   1900   1990   1.03    17804     17804     0
 60   300     1169     18     20     26   1530   2000   1.96    17926     17926     0
              1165                                    Totals    88836     88836
══════════════════════════════════════════════════════════════════════════════════════

Modified PR 2696

────────wrk────────  ─Request─time─distribution─(ms)─  Worker─requests  ─wrk─requests─
 -t    -c   req/sec   50%    75%    90%    99%   100%  spread   total     total   bad
 20   100    21178    1.3   21.6   34.6   41.2  263.6   0.21   319947    319947     0
 26   130    21188    1.4   30.4   47.5   55.8  283.8   0.59   320099    320099     0
 35   175    21052    1.6   43.3   67.1   78.2  137.6   0.32   318098    318098     0
 46   230    21201    1.6   59.2   90.6  104.8  168.6   0.45   320772    320772     0
 60   300    21236    1.6   79.8  121.0  138.6  196.6   0.28   321555    321555     0
             21171                                    Totals  1600471   1600471
══════════════════════════════════════════════════════════════════════════════════════

Note that the master run also had errors on the last wrk run (-t60 -c300) and the smem data was 'odd'.

Regardless, using IO.copy_stream (when the response body supports it) results in an RPS increase from 1,165 to 21,171. Pretty big increase. wrk was processing about 21.7 GB per second.

Lastly, I changed the body size to 50 kB: master ran 11,284, PR 2696 ran 23,138.

@MSP-Greg
Member

@byroot

> one big downside though is that it doesn't have a proper timeout API

Is it any different than what the current code uses, which is IO#each (same as IO#each_line)?

Also, re the data I listed above, IO.copy_stream should always be equal to or faster than IO#each, but the speed difference is related to how many iterations #each does. That is somewhat indeterminate for most static files, as one would assume they're all compressed?

@ioquatix
Contributor

Timeout.timeout is event-driven if you use the fiber scheduler in Ruby 3.1.

@byroot
Contributor

byroot commented Sep 26, 2021

> Is it any different than what the current code uses, which is IO#each (same as IO#each_line)?

I think you are talking about reading the IO the application returned. That one isn't a big concern, because we can assume the application is responsible for returning an IO that won't block forever.

The timeout that concerns me is the write one. Right now the fast_write helper uses a timeout; if we were to use copy_stream, it would expose us to DoS attacks. E.g., a client downloads a 1MB response and reads it at 1B/s, tying up the Puma thread for a million seconds.

So for a development-only feature, or if we're behind a reverse proxy that buffers responses, it would be OK, but other than that it would be a big security risk.
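To illustrate the distinction, a sketch of a per-write deadline in the style of a fast_write helper (names are hypothetical; Puma's real implementation differs):

```ruby
require 'io/wait'

# Each write syscall gets at most `timeout` seconds of waiting, so a
# deliberately slow reader cannot pin the thread indefinitely.
# IO.copy_stream exposes no equivalent knob.
def write_with_timeout(socket, data, timeout)
  until data.empty?
    begin
      written = socket.write_nonblock(data)
      data = data.byteslice(written..)
    rescue IO::WaitWritable
      raise IOError, 'write timed out' unless socket.wait_writable(timeout)
      retry
    end
  end
end
```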

I started looking at adding a timeout to copy_stream, and so far I think it's possible, but it might take me a long time, or I might just not manage to do it. And at best it would be available in 3.1.

So I'm not sure how Puma could make use of it given these constraints.

@ioquatix
Contributor

We will definitely support timeout with copy_stream on the fiber scheduler if that's any help. However, this might not make it into 3.1.

@nateberkopec
Member Author

I learned that Gusto's ~7-year-old Rails monolith has 1,276 JavaScript assets, with a total size of 31.2 MB (an average size of about 24kb). I've modified my benchmark to look similar (it downloads the ES shim, at 30kb, 1200 times).

I've done some more testing using my benchmark, and here's what I've learned:

  1. Average time per request is down to 7ms. Did something change in Rails here? If so, 👏 I was getting way worse perf before.
  2. Puma 5.5.2 with 5 threads can download 1200 JS files of ~30kb in size in 5.7 seconds. Increasing thread count had no measurable effect.
  3. Falcon over HTTP/2 with falcon serve --threaded -n 1 completes in 7.5 seconds (30% slower). Results with falcon serve --hybrid --forks 1 --threads 25 were the same.
  4. Time spent receiving in both servers was about 200µs per request, suggesting little difference (or room to improve) in terms of I/O performance, since just 0.25 seconds is spent actually reading bytes over the entire benchmark.
  5. 96% of the benchmark time is spent waiting on Rails for a response.
  6. Running additional workers helps quite a bit. With puma -t 5 -w 4, the benchmark completes in just 2.5 seconds.

So, locally, Puma seems to do just fine. It looks like any slowness here is caused by Rails' response times, and is not alleviated with HTTP/2. For local performance, we should focus on improving Rails' response times.

For production, it's a different story. I suspect Rails doesn't really care about our story here, because David looks like he's expecting everyone to just use a CDN in prod. That's fine for Rails but like I said above, I would like for Puma to "just work" and provide a decent experience where possible. In production, a single request might take 100-200ms round-trip, which is going to balloon the total benchmark time here into unsustainable territory of like 40 seconds or more.

My benchmark could be improved by actually providing a list of 1200 JS files to download of roughly 30MB total size, rather than just downloading the same file over and over. I think that would show any problems in Rails better, particularly in a cold boot scenario.

@ioquatix
Contributor

> Falcon over HTTP/2 with falcon serve --threaded -n 1 completes in 7.5 seconds (30% slower). Results with falcon serve --hybrid --forks 1 --threads 25 were the same.

Use multi-process for the best throughput (the default). Both the modes you tested are almost the same: lots of threads.

@nateberkopec
Member Author

I'm trying to duplicate Rails' default: 1 process. If Rails wants to increase throughput by increasing process count, they can go ahead and do that.

@nateberkopec
Member Author

It almost looks like HTTP/2 just makes this benchmark slower rather than faster; I don't think it's anything specific to Falcon. I just tried nginx-with-HTTP/2-fronting-Puma and got 8.7 seconds. I don't know anything about HTTP/2 connection tuning, but it just appears to be doing poorly in this case.

@ioquatix
Contributor

When I have a moment, I'll try out your benchmark and report back.

@MSP-Greg
Member

Before a few things sidetracked me, I was working on perf testing against string, array, chunked, and file bodies, using wrk and Ruby code. OS file caching and all sorts of things come into play.

I noticed some odd things with varying file sizes, which may be specific to WSL2/Ubuntu. Or, I suspect, IO.copy_stream is optimized for file IO, and slow socket clients (slow relative to file IO) may not realize any benefit. Not sure. Need to jump back into it...

@nateberkopec
Member Author

nateberkopec commented Oct 13, 2021

@ioquatix It might be worth trying other benchmarking tools (against a running Rails app). I'd like to see this duplicated with something like h2load, for example. There aren't any h2 configuration settings available in k6, so I'm wondering if maybe it's setting its concurrent streams too low or something? Worth ruling out.
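For example, something like the following hypothetical invocation, where -n is the total request count, -c the number of clients, and -m the maximum concurrent streams per connection, which is the knob HTTP/2 multiplexing depends on:

```
h2load -n 1200 -c 1 -m 100 https://localhost:9292/index.js
```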

@ioquatix
Contributor

ioquatix commented Oct 13, 2021

I actually have my own benchmarking tools in async-http which, despite not being as fast as native C code, are pretty decent at assessing differences in concurrency. You can try the benchmark-http gem if you are interested. It's more for measuring real-world latency and concurrency, but it can still work in micro-benchmarks.

@nateberkopec
Member Author

I just tried h2load: a little bit better. 7.3 seconds on NGINX-fronting-Puma, still slower than HTTP/1.

@ioquatix
Contributor

I compared HTTP/1.1 & HTTP/2:

samuel@Fukurou ~/D/s/benchmark-http (master)> bin/benchmark-http hammer -k 16 -c 1200 --alpn-protocols "http/1.1" https://localhost:9292/index.js
I am going to benchmark https://localhost:9292/index.js...
I am running 16 asynchronous tasks that will each make 1200 sequential requests...
5705 samples: 5705x 200. 5749.15 requests per second. S/D: 1.58ms.
11615 samples: 11615x 200. 5825.28 requests per second. S/D: 1.16ms.
17599 samples: 17599x 200. 5876.32 requests per second. S/D: 957.470µs.
I made 19200 requests in 3.3s. The per-request latency was 2.73ms. That's 5870.590963580636 asynchronous requests/second.
	          Variance: 0.860µs
	Standard Deviation: 927.553µs
	    Standard Error: 6.694µs
19200 samples: 19200x 200. 5870.59 requests per second. S/D: 927.553µs.


samuel@Fukurou ~/D/s/benchmark-http (master)> bin/benchmark-http hammer -k 16 -c 1200 --alpn-protocols "h2" https://localhost:9292/index.js
I am going to benchmark https://localhost:9292/index.js...
I am running 16 asynchronous tasks that will each make 1200 sequential requests...
2978 samples: 2978x 200. 3000.86 requests per second. S/D: 2.91ms.
6160 samples: 6160x 200. 3089.05 requests per second. S/D: 2.17ms.
9328 samples: 9328x 200. 3114.76 requests per second. S/D: 1.88ms.
12490 samples: 12490x 200. 3124.4 requests per second. S/D: 1.73ms.
15698 samples: 15698x 200. 3139.27 requests per second. S/D: 1.62ms.
18772 samples: 18772x 200. 3157.08 requests per second. S/D: 1.60ms.
I made 19200 requests in 6.0s. The per-request latency was 5.04ms. That's 3175.8445265040596 asynchronous requests/second.
	          Variance: 2.573µs
	Standard Deviation: 1.60ms
	    Standard Error: 11.577µs
19200 samples: 19200x 200. 3175.84 requests per second. S/D: 1.60ms.

TLS adds quite a bit of overhead, and HTTP/2 is slower in every way. However, where HTTP/2 has an advantage is when you have lots of simultaneous requests. But in terms of raw throughput, it doesn't have a clear advantage, because the protocol is much more complex in user space.

@schneems
Contributor

schneems commented Mar 9, 2022

Hello world. I’m looking at this (but don’t have immediate answers).

At a high level, it looks like if the problem is the Rails/Rack response time, and Rack degrades HTTP/2 to HTTP/1, then we are limited in any fixes to Puma (or any other web server).

Even though supporting HTTP/2 would mean that we only have to handle one connection instead of 1200 connections, based on the info here it sounds like the bottleneck isn’t TCP slow start and friends, but ActionDispatch::Static middleware or somewhere before it.

I am curious where the bulk of the time is spent between Puma having a parsed request for an asset and writing a response.

I’m curious how well Passenger fares here, as it serves static assets itself.

It seems like to get maximum performance we would need both something that understands HTTP/2 and doesn’t need to continually open new connections, as well as low-latency logic to find those assets on disk and serve them.

Which sounds a lot like putting an HTTP/2 server in front of your Rails app and teaching it how to serve assets, which sounds a lot like what Passenger does: https://www.phusionpassenger.com/library/dev/ruby/rails_integration.html#static-assets-serving.

Has anyone benched Passenger?

@ioquatix
Contributor

ioquatix commented Mar 9, 2022

I will revisit this some time later this year and will have updated benchmarks for HTTP/2. Right now the overhead of Async::IO is quite decent, but using IO::Buffer and direct IO via the fiber scheduler should help a bit (hopefully a lot). I don't see why we can't make this a lot better.

@ain

ain commented Jun 9, 2022

Valid question by @schneems around Passenger. Perhaps some of these questions have already been answered by the Passenger team?

nateberkopec pushed a commit that referenced this issue Sep 9, 2022
nateberkopec added a commit that referenced this issue Sep 9, 2022
* Proof of Concept: Use `IO.copy_stream` to serve files

Ref: https://github.com/puma/puma/issues/2697

* Ruby 2.2 compat

* test_puma_server.rb - fixup test_file_body

Co-authored-by: Jean Boussier <jean.boussier@gmail.com>
Co-authored-by: MSP-Greg <Greg.mpls@gmail.com>
@dkniffin

dkniffin commented Feb 16, 2023

Any updates here? I'm migrating a Rails 7 app to use importmap, and I ran into slowness in my test and dev environments when fetching the many assets. It's so slow that my Capybara tests are "timing out", meaning the assets take too long to load and cause other things to fail as a result. Is there a solution to this problem yet?

@nateberkopec
Member Author

Thanks for reporting, @dkniffin. Asset loading is a fairly complex process, and Puma may or may not be your bottleneck. Since this issue is difficult to reproduce on a simple app and the path forward to improving performance isn't clear, if you have a half hour and can book some time with me (see CONTRIBUTING.md for the link), I'd love to peek at your screen while this is happening so I can get some ideas as to what's going on.

@BastienL

@dkniffin as we were running into the same issue as yours, we set up Caddy in front of Puma to serve the assets in development and test modes. As Caddy supports HTTP/2, import maps are no longer a problem.

@dkniffin

@nateberkopec Thank you for the offer. I think I found the issue in my slow tests: config.assets.digest = false. This config was causing the test environment to not have digests on files, which in turn meant that the files weren't being served as static files from public. That improved a majority of the requests quite a bit.

Another thing was changing relative imports to absolute imports, so import Foo from "./foo" becomes import Foo from "components/foo" or whatever. That change also allowed the correct static files to be used.

I still think there would definitely be a benefit from having http/2 support in Rack and puma, but it's less of a concern for me now. (I think... still gotta finish getting my test suite passing)

@BastienL Thanks for the tip! I'll definitely check that out.

@nateberkopec
Member Author

Great, and if anyone else wants to take me up on that offer I'm happy to do it.

You should not need a reverse proxy in front of Puma to enjoy good performance in development, so I'm not happy with that as a solution, only as a workaround.

@navidemad

To make it work with https://localhost, as mentioned by @BastienL, use the Caddy v2 reverse proxy:

curl -sS https://webi.sh/caddy | sh
caddy reverse-proxy --from localhost --to :3000

Unfortunately, I couldn't find a way to make it work in my GitHub Actions CI.
It would be awesome to have this out of the box in Puma 🥇 I look forward to seeing that 😃

nateberkopec mentioned this issue Aug 14, 2023
@darinwilson

For others like me who stumbled onto this issue while looking into using import maps and Puma: it sounds like David might be planning on releasing something to help bridge the current gap: https://twitter.com/FORSBERGtwo/status/1736766444485099794

@justinko

https://github.com/oesmith/puffing-billy can also aid in test env
