Support for non-blocking fibers #2601
Conversation
Also add test coverage for drain_on_shutdown option.
Puma::FiberPool uses non-blocking fibers instead of threads to process connections. A new `fiber_scheduler` configuration option sets a custom fiber scheduler and enables the use of the FiberPool. For debugging/testing, use the `SCHEDULER=1` env variable to enable the `libev_scheduler` backend.
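Based on the description above, a hypothetical `config/puma.rb` might enable the feature like this (the `fiber_scheduler` DSL option and the `Libev::Scheduler` constant are assumptions taken from this draft PR, not released Puma API):

```ruby
# config/puma.rb -- hypothetical sketch based on this PR's draft API
require 'libev_scheduler'

# Assumption: `fiber_scheduler` takes a block returning a scheduler object,
# which enables Puma::FiberPool instead of the usual thread pool.
fiber_scheduler { Libev::Scheduler.new }
```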
Good to hear about the implementation of non-blocking fibers in Puma. For benchmarking, I've just updated the
Note that it is valuable to execute application requests in different threads on implementations other than CRuby, because they (at least JRuby and TruffleRuby) actually execute Ruby threads in parallel. So I would think that for non-CRuby, a mix of Fibers (for IO concurrency) and Threads (for CPU parallelism) is best.
This is definitely the "less controversial" option from my perspective. I'm surprised by the performance impact. I sort of assumed that any additional overhead generated by threads would be more present on the request-buffering side rather than the "app.call" side. Your benchmark results show that this is basically a straight overhead/latency removal rather than any increased concurrency (to be expected, I guess).
We definitely have to be mindful here, as "infinite concurrency" can remove backpressure, and backpressure ensures optimal load balancing between processes.
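As an illustration of the backpressure point, a bounded handoff (sketched here with Ruby's stdlib `SizedQueue`, not Puma code) blocks producers once the buffer is full instead of accepting unbounded work:

```ruby
# SizedQueue provides backpressure: push blocks when the queue is full,
# so a fast producer is throttled to the consumer's pace.
queue = SizedQueue.new(2)        # at most 2 pending units of work
processed = []

consumer = Thread.new do
  while (job = queue.pop)        # pop returns nil once the queue is closed
    sleep 0.001                  # simulate doing some work
    processed << job
  end
end

5.times { |i| queue.push(i) }    # blocks whenever 2 jobs are already pending
queue.close                      # wakes the consumer with nil once drained
consumer.join
p processed                      # => [0, 1, 2, 3, 4]
```

An unbounded fiber pool behaves like `Queue.new` with no size limit: the producer never blocks, so overload is only discovered later.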
One more concern I have re: fiber pools: there is no interrupt/fallback if a unit of work does not yield back to the caller. With threads and the GVL, we have a 100-millisecond limit. Not a concern for implementations without GVLs, obviously.
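To make that concern concrete: plain fibers are purely cooperative, so a unit of work that never yields runs to completion no matter what else is waiting (stdlib-only sketch, not Puma's scheduler):

```ruby
# Fibers are cooperative: without an explicit yield (or a scheduler-aware
# blocking call), nothing can interrupt a busy fiber.
log = []

busy = Fiber.new do
  3.times { |i| log << "busy #{i}" }   # CPU-bound loop, never yields
end

polite = Fiber.new do
  log << "polite 0"
  Fiber.yield                          # hands control back to the caller
  log << "polite 1"
end

busy.resume     # runs all three iterations before returning
polite.resume   # runs only until its first yield
p log           # => ["busy 0", "busy 1", "busy 2", "polite 0"]
```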
Looks great to me!
After thinking about the different approaches discussed here, one option to consider is an intermediate buffer. I'm not sure how common this is, but having one reactor at the top level handling incoming requests and outgoing responses is fine; however, as has been said, the entire request/response must be buffered, which isn't ideal in some cases. Therefore, why not use a unix pipe for streaming requests and responses? You can do this pretty efficiently from both ends: at the top level you are just reading from the network and writing to a buffered pipe (backpressure still exists); on the request side, you are reading the body (from the pipe) and writing the response (to a pipe); then at the top level, you read the response body from the pipe and write it back out to the network (still has backpressure). While this can pose some memory overhead, apparently it can be done fairly efficiently using
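The pipe idea can be sketched with stdlib `IO.pipe`: the writer blocks once the kernel buffer fills, which is exactly the backpressure property described above (illustrative only, not this PR's code):

```ruby
# Stream a "response body" through a unix pipe. The pipe's bounded kernel
# buffer gives natural backpressure: write blocks when the reader lags.
reader, writer = IO.pipe

producer = Thread.new do
  3.times { |i| writer.write("chunk-#{i};") }
  writer.close                    # EOF signals end of the body
end

body = +""
while (data = reader.read(4))     # read in small pieces, as a reactor would
  body << data
end
reader.close
producer.join
p body   # => "chunk-0;chunk-1;chunk-2;"
```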
…ead pool (for performance testing/comparison)
I've added a commit 1ec4c58 which shows the "fibers x threads" approach, and updated the "Performance" section of the description with some microbenchmark comparisons. Some observations:
Theoretically yes I agree, however Fibers are still implemented on top of Threads in JRuby (and TruffleRuby), correct? Until JRuby/TruffleRuby integrates an optimized coroutine-based implementation of Fibers, I doubt this feature will offer any advantage for those Ruby runtimes over just using the existing thread pool. (Please correct me if I'm wrong on this!)
Yeah, I was a bit surprised too; it's possible using a Queue to pass the request/response between the ThreadPool and the FiberPool is creating more overhead than expected somewhere? Maybe there's another approach to this I haven't thought of that wouldn't have as much of a performance impact.
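For reference, the handoff pattern being discussed, where one side enqueues a request and a worker on the other side dequeues it and enqueues the response, looks roughly like this with two stdlib Queues (the request/response shapes here are hypothetical, not Puma's internals):

```ruby
# Two queues hand a request from an acceptor to a worker and the response
# back. Each hop costs a lock acquisition plus a thread wakeup, which may
# account for some of the overhead seen in the fibers-x-threads benchmark.
requests  = Queue.new
responses = Queue.new

worker = Thread.new do
  while (env = requests.pop)     # nil after close ends the loop
    # pretend to call the Rack app with env
    responses << [200, { "content-type" => "text/plain" }, ["hello"]]
  end
end

requests << { "PATH_INFO" => "/" }   # acceptor side enqueues the request
status, _headers, body = responses.pop
requests.close                       # let the worker exit its loop
worker.join
p [status, body.first]               # => [200, "hello"]
```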
Agreed, I wouldn't consider the feature complete until the 'max' concurrency setting is supported for reasons like that.
Yeah, this will be a new challenge, with a few possible approaches:
Passing the request/response from a connection-handling fiber to a thread pool doesn't involve any copying (both exist in the same process to begin with, so they just share the same Rack env). However, more generally, streaming requests/responses through a pipe is definitely an interesting idea: it could be useful for efficiently balancing individual requests across processes (instead of just balancing the connections, which is what Puma does currently). And the idea of using
Everything that's being discussed here makes me super excited. Thanks!

Event loops naturally degrade as long as the processing is represented in the event loop. Basically, if you can saturate the event loop with work, it will stop having time to accept new work. Using an event loop for the front end and a thread pool for the backend makes a lot of sense, until your backend becomes largely I/O bound. This would happen when you were dealing with WebSockets.

That being said, Rails makes a lot of assumptions about thread state, so I don't think Puma should go in that direction generally. A more narrow scope for this might be in streaming responses. In that case, you could run that block of code in the event loop, and you'd get 90% of the use cases (i.e. WebSockets) at the expense of a few of the issues (ActiveRecord thread-local scope, blocking I/O stalling the front half of the server). This is more on AR in terms of how they want to fix that problem, but until they have a use case that they (GitHub?) care about, I don't see much movement in this area. But if it's not supported by the server, it's definitely not going to move at all.

I'm not a big fan of preemption in event-driven concurrency. I think it over-complicates the model, and the assumption is you shouldn't be doing heavy work in the event loop if you care about latency. This is the same as doing CPU-heavy work in a NodeJS callback... we just don't conceptually have the same model in Ruby, i.e. developers need to be educated.

With Ruby 3, ordinarily blocking operations (`sleep`, socket reads and writes, and so on) become event driven too:
So splitting work into a background work pool becomes trivial, and there is nothing wrong with doing that. We can't solve all "concurrency" and "parallelism" problems; there are really just a bunch of different design trade-offs. It's just that now we have "event driven" as something we can use as a design parameter.
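To illustrate how blocking calls become event driven under a Ruby 3 fiber scheduler, here is a deliberately tiny, sleep-only scheduler. This is a toy for illustration; real backends like `libev_scheduler` or `async` implement the full `Fiber::Scheduler` interface (I/O waiting, blocking/unblocking, etc.) properly:

```ruby
# A toy Fiber scheduler that only understands Kernel#sleep. With it
# installed, `sleep` inside Fiber.schedule suspends the fiber and yields
# to the event loop instead of blocking the whole thread.
class TinySleepScheduler
  Waiter = Struct.new(:fiber, :wake_at)

  def initialize
    @waiting = []
  end

  # Called by Fiber.schedule: run the block in a non-blocking fiber.
  def fiber(&block)
    f = Fiber.new(blocking: false, &block)
    f.resume
    f
  end

  # Called when a non-blocking fiber calls Kernel#sleep.
  def kernel_sleep(duration = nil)
    wake_at = Process.clock_gettime(Process::CLOCK_MONOTONIC) + (duration || 0)
    @waiting << Waiter.new(Fiber.current, wake_at)
    Fiber.yield    # suspend; the event loop resumes us later
  end

  # Called automatically when the thread exits: drain the event loop.
  def close
    until @waiting.empty?
      now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      ready, @waiting = @waiting.partition { |w| w.wake_at <= now }
      ready.each { |w| w.fiber.resume }
    end
  end

  # Interface hooks unused in this sketch, stubbed out.
  def io_wait(io, events, timeout); events; end
  def block(blocker, timeout = nil); true; end
  def unblock(blocker, fiber); end
end

order = []
Thread.new do
  Fiber.set_scheduler(TinySleepScheduler.new)
  Fiber.schedule { sleep 0.03; order << :slow }  # suspends, doesn't block
  Fiber.schedule { sleep 0.01; order << :fast }
  order << :main  # reached immediately; both fibers are still "sleeping"
end.join          # thread exit runs the scheduler until fibers finish
p order           # => [:main, :fast, :slow]
```

The key point is that neither `sleep` blocks the thread: the main flow continues at once, and the fibers wake in deadline order, which is exactly the behavior Puma's FiberPool relies on for socket I/O.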
Here is Linus talking about how to use splice; it was a bit of an eye-opener for me: https://yarchive.net/comp/linux/splice.html
Correct, Fibers on TruffleRuby and JRuby currently use native threads, but Project Loom is making good progress, and that will bring coroutines to the JVM and thus to TruffleRuby and JRuby.
Just pinging on this. Would be wonderful to have especially in apps that have a lot of blocking I/O 👍 |
Closing as inactive; please bump once you are able to make the necessary changes 😄
Is anything happening here? It would be amazing for Puma to support this, and it seems quite a few developments have arisen in this space since this issue was closed.
Description
This PR adds support for non-blocking Fibers to the `Server` class. `Puma::FiberPool` uses non-blocking fibers instead of threads to process connections. A new `fiber_scheduler` configuration option sets a custom fiber scheduler and enables the use of the FiberPool. For debugging/testing, use the `SCHEDULER=1` env variable, which enables the `libev_scheduler` backend.

Some of the code is still pretty rough around the edges, but it's working well enough for basic benchmarks and passes most of the server tests at this point (some of the force-shutdown-related behavior is not quite working perfectly yet). Sharing this as an early draft PR so anyone else interested can take a look and start testing/experimenting with it!
(Resolves #2517)
Performance
`benchmarks/wrk/hello.sh` (single-process, 4 concurrent keepalive connections):

- Before: 7362.54 req/sec
- After: 24432.62 req/sec (3.3x faster)
- Fibers x Threads: 15428.31 req/sec (2.1x faster)
Comparisons to other Rack application servers:

- Falcon: 17351.79 req/sec
- Agoo: 37469.55 req/sec
- Iodine: 90763.71 req/sec
- Tipi: 62627.34 req/sec
`hello.sh` with `-H "Connection: close"` (NON-keepalive connections):

- Before: 6362.43 req/sec
- After: 15379.49 req/sec (2.4x faster)
- Fibers x Threads: 9643.85 req/sec (1.5x faster)
Comparisons:

- Falcon: 6412.59 req/sec
- Agoo: 2181.67 req/sec
- Iodine: 41072.78 req/sec
- Tipi: 17297.16 req/sec
```
Running 30s test @ http://localhost:9292
  2 threads and 4 connections
  connection 0: 130266 requests completed
  connection 1: 130286 requests completed
  connection 0: 130120 requests completed
  connection 1: 129980 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   219.16us  196.05us   3.77ms   90.12%
    Req/Sec     8.71k     0.87k    11.02k    69.55%
  520652 requests in 30.10s, 36.25MB read
  Socket errors: connect 0, read 520648, write 0, timeout 0
Requests/sec: 17297.16
Transfer/sec: 1.20MB
```

[more to come]
Notes

- I've pinned the `libev_scheduler` backend to the GitHub version in the `Gemfile` for now for testing, since it includes some fixes not included in the latest Rubygems release.
- I pulled `libev_scheduler` into this draft PR for now because it seemed the fastest in initial tests, but more detailed comparisons and benchmarks against the other existing implementations (such as `async`, `evt`, or the plain Ruby implementation in its test suite) would be helpful.
- [Update: the "fibers x threads" approach can be enabled with the `FIBERS_THREADS` env variable for testing. I've added microbenchmark results to the performance section for comparison.]

Your checklist for this pull request

- [ ] If my change doesn't need a CHANGELOG entry, I have added `[ci skip]` to the title of the PR.
- [ ] I have added a reference to related issues ("#issue") to the PR description or my commit messages.