Import maps and performance (HTTP/2) #2697
Here are the options as I see them: |
A bit confused. Are import maps kind of like RubyGems for JS in the client/browser? In other words, they're not generating requests to the application server, but to web repositories of the JS packages? Regardless, I'm interested in HTTP/2 and HTTP/3. Currently (with HTTP/1.1), one conversation/request is communicated to the Rack app per socket. With HTTP/2, one socket can carry many streams, so multiple conversations/requests happen on one socket. How does Rack handle that? I must be missing something; I probably need to look at some examples of Ruby HTTP/2 servers. See https://github.com/rails/importmap-rails#what-if-i-dont-like-to-use-a-javascript-cdn - to me that implies that JS files/packages can be served either from the application server (Puma, etc.) or from CDNs. |
I'm willing to help. I don't have any experience with app servers, but I'm not afraid of reading code and debugging until I figure out what's happening. Worst case, I can at least provide a production app running Rails master to test a real workload (not necessarily through import maps, but by having webpack chunk as much as possible). |
nghttp2 sounds like the least ideal of the three. It feels like it would be the fastest, but a few extra milliseconds per request sound better than bugs staying open longer and burned-out maintainers. protocol2 sounds like a fallback option unless the native extensions are a maintenance problem right now; if they're stable, then I'd say there's no reason to change a winning team. http2 is probably the best option: full Ruby, and it will add the least amount of unknowns to the current code, even if it requires a bit of extra effort to get it integrated. Silly question from someone who has never touched Puma code: does it need to integrate with the HTTP/1 code? Can't it be an XOR deal, as in "the client supports HTTP/2, so let's route all requests through the http2 gem code"? |
Don't underestimate the complexity of HTTP/2. The biggest limitation Puma has in this regard is simply the limited number of threads w.r.t. simultaneous connections. There is also the reality that most load balancers don't yet support HTTP/2 on the backend. Falcon supports HTTP/2 with Rack with no changes to applications. Rack essentially implements CGI (HTTP/1.1), and there is a reasonably well-defined mapping from HTTP/2 -> HTTP/1.1 which we effectively implement. However, I wanted to extend Rack to embrace concurrency (rack/rack#1745); to me, this is one of the biggest advantages of HTTP/2 that we can leverage within user-facing code. Supporting multiple streams essentially means you'd be implementing an HTTP/2 -> HTTP/1 (application) gateway. You might as well just use Falcon in the connection acceptor thread - and this is only half suggested as a joke; I'm at least somewhat serious. |
By the way, I've often thought of making a "threaded" adaptor for Falcon which executes requests in a thread pool, matching Rails' expectations about how the execution of web requests works, rather than Falcon/Async, which gives you a well-defined concurrency model that unfortunately still bumps up against Rails' assumptions about the world ActiveRecord works in. But the reality is, even this is slowly changing. Puma is an incredibly important piece of technology. This probably forms part of a larger conversation, but I'm not sure Puma needs HTTP/2. I think there are still improvements to be made to work scheduling and the thread pool implementation, TBH. I don't see Falcon and Puma as being in competition with each other except on the most friendly terms; they serve different purposes. |
So it sounds like a benchmark would be a good first step, then, to get an idea of the extent of the issue. That will let us try different configurations. I want to be clear that the motivation here is not "Puma must support HTTP/2" but "Puma should make the new import-map-driven experience in Rails 7 as fast as possible" - HTTP/2 is just a strong hunch about how we accomplish that. Maybe there are other things we can do. I'm also wondering if, because import maps are primarily about serving files from disk, there's a shortcut we can take here that makes import maps fast by avoiding Rack entirely. I think this weekend I'll take some time to make a benchmark, and that should provide some insight into next steps. |
You mean as a development server, or as a production one? I can see the appeal for the former, but for the latter, if your asset requests hit your Ruby server, you've kind of already lost. And even then, operationally speaking, it doesn't make much sense to multiplex requests all the way to the application server: if you have multiple requests on the same connection, it's much better if your reverse proxy dispatches them to distinct workers. Hence why I don't really see the appeal of HTTP/2 for Ruby application servers. |
Definitely mostly focusing on the former, but I don't think we can ignore the prod experience either. Puma has always been an app server that you can just throw up on its own. |
Leveraging HTTP/1 with splice/sendfile might be sufficient, and you could totally build a lightweight fiber scheduler for file IO, which seems like it would solve any static file serving overhead. |
So, there appear to be two reasons why this isn't fast today: (1) Rails takes a long time (~60ms) to serve each file, and (2) browsers only open a limited number of parallel HTTP/1.1 connections per origin, so the remaining requests queue up (head-of-line blocking).
If you take either of those things away, Puma rips just as fast as anything else. HTTP/2 removes the latter limit. I can maybe work with Rails on #1, since 60ms to put a file down the pipe seems kind of bad, but I don't know enough about how that works in Rails to know whether it will be hard. So this problem isn't significantly I/O bound. Here is my benchmark setup using k6. I'm still working on the nginx reverse proxy there to see if an HTTP/2 -> 1 gateway solves the problem. |
Also, re: TTFB, that will definitely be worse in production, because after you set up the initial connection you still need a full round trip to ask for the next file you want. So maybe fixing Rails' TTFB here won't do much. |
Example of what the HOL-blocking looks like today. |
If you do decide that Rails needs fixing, and the problem is Sprockets (which is responsible for serving the files), I can help, as I've just started spending some time there to fix a couple of bugs. We will also have to give some care to the new Propshaft gem, which will replace Sprockets. |
That said, JS and CSS files are requested often enough that they should have close to a 100% cache hit rate from CDNs. |
@brenogazzola Not anymore. Browser caches are partitioned now; it essentially means the cache key of a request includes the domain name of the current window. Third-party CDNs won't hit the cache any more often than first-party requests. |
What I get from the article is that the old "use jQuery from a CDN because if the user visited another website that has it, it will already be cached" is no longer valid, is that right? 🤔 What I meant is: if your app is using Cloudflare, and your users are in the US, and you deploy a new JS file, Puma will only need to serve it 75 times (25 PoPs * 3 requests until cache hit) before the requests stop reaching Puma. Other CDNs have their own rules, but it seems to me the motivation here is "there are going to be many JS files now, instead of one, so let's make sure Puma can serve them fast", and it will only matter for those initial 75 requests. |
By only supporting HTTP/1, Puma is in a unique position to serve static files very efficiently using splice/sendfile. I don't know the current implementation, but we should definitely take advantage of it if possible. |
Nate will know this better than me, but AFAIK Puma is getting the CSS/JS file from Sprockets, and both implement the Rack spec, so Sprockets reads the content of the file and returns it to Puma as a Rack response body. |
@brenogazzola there's no more sprockets, that's the point. |
@ioquatix I don't think that's important right now, because as my benchmark shows, HOL-blocking is the bottleneck, not I/O speed. @brenogazzola Yes, in production, CDNs will alleviate a lot of the load from Puma. However, Puma should be a good and fast experience without any external dependencies, as it has always been. Also, we don't have CDNs in development, and in dev, Puma currently takes ~8 seconds to fully satisfy the downloads for an import-mapped app with 150 dependencies, which is not a great experience. |
Re the 150 dependencies: what type of response bodies are being returned by Rack? Are they enums/arrays, chunked, or do they respond to `to_path`? If they're enums/arrays or chunked, depending on the 'length' and byte size, there may be an improvement with #2696. |
😱. Ok, I'm convinced, haha |
Well, if you serve each individual static file request, say, 2x faster, it would have a major impact on that HOL blocking. I dug a bit into the Rails & Rack code. |
@byroot If you make 150 concurrent requests for a ~70kb file against Puma running Rack::Static (see my benchmark), it completes in less than 100 milliseconds. So I think the latency Rails is adding is coming from somewhere else in the request, not I/O. |
Ref: https://github.com/puma/puma/issues/2697

```
$ benchmarks/wrk/big_response.sh
Puma starting in single mode...
* Puma version: 5.5.0 (ruby 3.0.2-p107) ("Zawgyi")
* Min threads: 4
* Max threads: 4
* Environment: development
* PID: 17879
* Listening on http://0.0.0.0:9292
Use Ctrl-C to stop
Running 1m test @ http://localhost:9292
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.37ms    5.89ms   48.28ms   94.46%
    Req/Sec     0.88k   148.97     1.07k    82.08%
  Latency Distribution
     50%    2.21ms
     75%    2.78ms
     90%    4.09ms
     99%   35.75ms
  105651 requests in 1.00m, 108.24GB read
Requests/sec:   1758.39
Transfer/sec:      1.80GB
- Gracefully stopping, waiting for requests to finish
```

```
$ benchmarks/wrk/big_file.sh
Puma starting in single mode...
* Puma version: 5.5.0 (ruby 3.0.2-p107) ("Zawgyi")
* Min threads: 4
* Max threads: 4
* Environment: development
* PID: 18034
* Listening on http://0.0.0.0:9292
Use Ctrl-C to stop
Running 1m test @ http://localhost:9292
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.06ms    1.09ms   20.98ms   97.94%
    Req/Sec     1.85k   150.69     2.03k    89.92%
  Latency Distribution
     50%    0.94ms
     75%    1.03ms
     90%    1.21ms
     99%    4.91ms
  221380 requests in 1.00m, 226.81GB read
Requests/sec:   3689.18
Transfer/sec:      3.78GB
- Gracefully stopping, waiting for requests to finish
```
I did a quick proof of concept: #2703. |
Yeah, one big downside, though, is that it doesn't have a proper timeout API. |
For delegating to the reverse proxy, yes: https://github.com/rack/rack/blob/d15dd728440710cfc35ed155d66a98dc2c07ae42/lib/rack/sendfile.rb Also, I just realized my benchmark isn't quite perfect. |
I added big_file.ru to some of the code that's in various PRs, and updated the code in #2696 to use IO.copy_stream. Results below with the wrk code, comparing Puma running master against the modified PR 2696. |
Note that the master run also had errors on the last wrk run (-t60 -c300), and the smem data was 'odd'. Lastly, I changed the body size to 50 kB: master ran 11,284, PR 2696 ran 23,138. |
Is it any different than what the current code uses? |
I think you are talking about reading the IO the application returned. That one isn't a big concern, because we can assume the application is responsible for returning an IO that won't block forever. The timeout that concerns me is the write one. So for a development-only feature, or if we're behind a reverse proxy that buffers responses, it would be OK, but other than that it would be a big security risk. I started looking at adding a timeout to IO.copy_stream, so I'm not sure how Puma could make use of it given these constraints. |
We will definitely support timeouts with copy_stream on the fiber scheduler, if that's any help. However, this might not make it into 3.1. |
I learned that Gusto's ~7-year-old Rails monolith has 1276 JavaScript assets, with a total size of 31.2 MB (an average size of about 24 kB). I've modified my benchmark to look similar (it downloads the ES shim, at 30 kB, 1200 times). I've done some more testing using my benchmark, and here's what I've learned: |
So, locally, Puma seems to do just fine. It looks like any slowness here is caused by Rails' response times, and it is not alleviated by HTTP/2. For local performance, we should focus on improving Rails' response times. Production is a different story. I suspect Rails doesn't really care about our story here, because David seems to expect everyone to just use a CDN in prod. That's fine for Rails, but like I said above, I would like Puma to "just work" and provide a decent experience where possible. In production, a single request might take 100-200ms round-trip, which is going to balloon the total benchmark time here into unsustainable territory, like 40 seconds or more. My benchmark could be improved by providing a list of 1200 JS files to download, roughly 30 MB in total, rather than downloading the same file over and over. I think that would expose any problems in Rails better, particularly in a cold-boot scenario. |
Use multi-process for the best throughput (the default). Both the modes you tested are almost the same - lots of threads. |
I'm trying to duplicate Rails' default - 1 process. If Rails wants to increase throughput by increasing process count, they can go ahead and do that. |
It almost looks like HTTP/2 just makes this benchmark slower rather than faster, and I don't think it's anything specific to Falcon. I just tried nginx-with-HTTP/2-fronting-Puma and got 8.7 seconds. I don't know anything about HTTP/2 connection tuning, but it just appears to do poorly in this case. |
When I have a moment, I'll try out your benchmark and report back. |
Before a few things sidetracked me, I was working on perf testing against string, array, chunked, and file bodies, using wrk and Ruby code. OS file caching and all sorts of things come into play. I noticed some odd things with varying file sizes, which may be specific to WSL2/Ubuntu. |
@ioquatix It might be worth trying other benchmarking tools (against a running Rails app). I'd like to see this duplicated with another tool. |
I actually have my own benchmarking tools. |
I just tried h2load - a little bit better: 7.3 seconds on nginx-fronting-Puma, still slower than HTTP/1. |
I compared HTTP/1.1 & HTTP/2:
TLS adds quite a bit of overhead, and HTTP/2 is slower in every way. Where HTTP/2 has an advantage is when you have lots of simultaneous requests; in terms of raw throughput it doesn't have a clear edge, because the protocol is much more complex in user space. |
Hello world. I'm looking at this (but don't have immediate answers). At a high level, it looks like if the problem is the Rails/Rack response time, and Rack degrades HTTP/2 to HTTP/1, then we are limited in any fixes to Puma (or any other web server). Even though supporting HTTP/2 would mean we only have to handle one connection instead of 1200, based on the info here it sounds like the bottleneck isn't TCP slow start and friends, but the ActionDispatch static middleware or somewhere before it. I am curious where the bulk of the time is spent between Puma having a parsed request for an asset and writing a response. I'm curious how well Passenger fares here. It seems like to get maximum performance we would need both something that understands HTTP/2 and doesn't need to continually open new connections, as well as low-latency logic to find those assets on disk and serve them. Which sounds a lot like putting an HTTP/2 server in front of your Rails app and teaching it how to serve assets, which sounds a lot like what Passenger does: https://www.phusionpassenger.com/library/dev/ruby/rails_integration.html#static-assets-serving. Has anyone benched Passenger? |
I will revisit this some time later this year and have updated benchmarks for HTTP/2 - right now the overhead of Async::IO is quite decent but using IO::Buffer and direct IO using the fiber scheduler should help a bit (hopefully a lot). I don't see why we can't make this a lot better. |
Valid question by @schneems around Passenger. Perhaps some of these questions have already been answered by the Passenger team? |
* Proof of Concept: Use `IO.copy_stream` to serve files (ref: https://github.com/puma/puma/issues/2697; wrk output identical to the benchmark above)
* Ruby 2.2 compat
* test_puma_server.rb - fixup test_file_body

Co-authored-by: Jean Boussier <jean.boussier@gmail.com>
Co-authored-by: MSP-Greg <Greg.mpls@gmail.com>
Any updates here? I'm migrating a Rails 7 app to use importmap and I ran into slowness in my test and dev environments, when fetching the many assets. It's so slow that my capybara tests are "timing out", meaning the assets take too long to load, and cause other things to fail as a result. Is there a solution to this problem yet? |
Thanks for reporting @dkniffin. Asset loading is a fairly complex process, and Puma may or may not be your bottleneck. Since this issue is difficult to reproduce on a simple app and the path forward to improving performance isn't clear here, if you have a half hour and can book some time with me (see CONTRIBUTING.md for link) I'd love to peek at your screen while this is happening so I can get some ideas as to what's going on. |
@nateberkopec Thank you for the offer. I think I found the issue in my slow tests. Another thing that helped was changing relative imports to absolute imports. So I still think there would definitely be a benefit to having HTTP/2 support in Rack and Puma, but it's less of a concern for me now (I think... still gotta finish getting my test suite passing). @BastienL Thanks for the tip! I'll definitely check that out. |
Great, and if anyone else wants to take me up on that offer I'm happy to do it. You should not need a reverse proxy in front of Puma to enjoy good performance in development, so I'm not happy with that as a solution, only as a workaround. |
To make it work with https://localhost, as mentioned by @BastienL: a Caddy v2 reverse proxy. |
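For anyone else trying this, a minimal Caddy v2 config along those lines might look like the following. This is a sketch, not taken from the thread: Puma listening on port 3000 is an assumption, and exact behavior depends on your Caddy version (`localhost` sites get locally-trusted TLS certificates automatically in Caddy v2).

```
# Caddyfile (Caddy v2): terminate TLS + HTTP/2 at the proxy,
# forward plain HTTP/1.1 to Puma (assumed to be on port 3000).
localhost {
    reverse_proxy localhost:3000
}
```

Run `caddy run` in the directory containing the Caddyfile, then browse to https://localhost.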
Unfortunately, I couldn't find a way to make it work in my GitHub Actions CI. |
For others like me who stumbled onto this issue while looking into using import maps and puma: it sounds like David might be planning on releasing something to help bridge the current gap: https://twitter.com/FORSBERGtwo/status/1736766444485099794 |
https://github.com/oesmith/puffing-billy can also help in the test environment. |
Rails is going the way of supporting import maps by default as its new JS solution. This means that Rails apps can make 100+ requests to the application server as they traverse the import map.
To make this fast in Puma, I have a few concerns:
Other things will probably come up as we're spiking this. Right now, I want to know: |