
Support for non-blocking fibers #2601

Closed · wants to merge 6 commits

Conversation

wjordan
Contributor

@wjordan wjordan commented Apr 17, 2021

Description

This PR adds support for Non-blocking Fibers to the Server class.
Puma::FiberPool uses non-blocking fibers instead of threads to process connections.
A new fiber_scheduler configuration option sets a custom fiber scheduler and enables the use of the FiberPool.
For debugging/testing, use the SCHEDULER=1 env variable which enables the libev_scheduler backend.
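For illustration, a minimal config sketch of how this might be wired up (assuming the libev_scheduler gem exposes Libev::Scheduler and that the option accepts a scheduler class; the exact form this draft accepts may differ):

# config/puma.rb (illustrative only; names and argument form are assumptions)
require 'libev_scheduler'

port 9292
threads 4, 4

# Hypothetical usage of the new option: supplying a scheduler enables Puma::FiberPool,
# so connections are processed on non-blocking fibers instead of threads.
fiber_scheduler Libev::Scheduler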

Some of the code is still pretty rough around the edges, but it's working well enough for basic benchmarks and passes most of the server tests at this point (some of the force-shutdown-related behavior is not quite working perfectly yet). Sharing this as an early draft PR so anyone else interested can take a look and start testing/experimenting with it!

(Resolves #2517)

Performance

  • benchmarks/wrk/hello.sh (Single-process, 4 concurrent keepalive connections):
Before: 7362.54 req/sec
$ benchmarks/wrk/hello.sh
Puma starting in single mode...
* Puma version: 5.2.2 (ruby 3.0.0-p0) ("Fettisdagsbulle")
*  Min threads: 4
*  Max threads: 4
*  Environment: development
*          PID: 2089795
* Listening on http://0.0.0.0:9292
Use Ctrl-C to stop
Running 30s test @ http://localhost:9292
  2 threads and 4 connections
connection 0: 55213 requests completed
connection 1: 55628 requests completed
connection 0: 55170 requests completed
connection 1: 55606 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   554.64us  289.19us   7.26ms   77.99%
    Req/Sec     3.70k   306.21     5.30k    87.87%
  Latency Distribution
     50%  511.00us
     75%  677.00us
     90%    0.90ms
     99%    1.47ms
  221617 requests in 30.10s, 16.06MB read
Requests/sec:   7362.54
Transfer/sec:    546.44KB
After: 24432.62 req/sec (3.3x faster)
$ SCHEDULER=1 benchmarks/wrk/hello.sh
Puma starting in single mode...
* Puma version: 5.2.2 (ruby 3.0.0-p0) ("Fettisdagsbulle")
*  Min threads: 4
*  Max threads: 4
*  Environment: development
*          PID: 2090586
* Listening on http://0.0.0.0:9292
Use Ctrl-C to stop
Running 30s test @ http://localhost:9292
  2 threads and 4 connections
connection 0: 183960 requests completed
connection 1: 183999 requests completed
connection 0: 183747 requests completed
connection 1: 183726 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   167.23us   80.32us   4.43ms   93.20%
    Req/Sec    12.28k   817.19    12.93k    96.18%
  Latency Distribution
     50%  152.00us
     75%  157.00us
     90%  195.00us
     99%  438.00us
  735432 requests in 30.10s, 53.30MB read
Requests/sec:  24432.62
Transfer/sec:      1.77MB
- Gracefully stopping, waiting for requests to finish
Fibers x Threads: 15428.31 req/sec (2.1x faster)
$ PUMA_DEBUG=1 SCHEDULER=1 FIBERS_THREADS=1 benchmarks/wrk/hello.sh 
Puma starting in single mode...
* Puma version: 5.2.2 (ruby 3.0.0-p0) ("Fettisdagsbulle")
*  Min threads: 4
*  Max threads: 4
*  Environment: development
*          PID: 2477704
* Listening on http://0.0.0.0:9292
% Using fibers + threads
Use Ctrl-C to stop
% Using fiber scheduler
Running 30s test @ http://localhost:9292
  2 threads and 4 connections
connection 0: 116712 requests completed
connection 1: 117022 requests completed
connection 0: 115425 requests completed
connection 1: 115239 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   352.48us  673.99us  16.55ms   96.71%
    Req/Sec     7.75k     1.05k    9.49k    72.59%
  Latency Distribution
     50%  221.00us
     75%  293.00us
     90%  435.00us
     99%    3.62ms
  464398 requests in 30.10s, 33.66MB read
Requests/sec:  15428.31
Transfer/sec:      1.12MB
- Gracefully stopping, waiting for requests to finish
% Drained 0 additional connections.

Comparisons to other Rack application servers:

Falcon: 17351.79 req/sec
Running 30s test @ http://localhost:9292
  2 threads and 4 connections
connection 0: 130575 requests completed
connection 1: 130579 requests completed
connection 0: 130577 requests completed
connection 1: 130566 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   231.18us   60.40us   3.07ms   92.73%
    Req/Sec     8.72k   571.45     9.29k    98.17%
  522297 requests in 30.10s, 49.31MB read
Requests/sec:  17351.79
Transfer/sec:      1.64MB
Agoo: 37469.55 req/sec
Running 30s test @ http://localhost:9292
  2 threads and 4 connections
connection 0: 288072 requests completed
connection 1: 300080 requests completed
connection 0: 264534 requests completed
connection 1: 271597 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.22ms    9.79ms 100.78ms   97.48%
    Req/Sec    18.83k     5.97k   32.71k    72.17%
  1124283 requests in 30.01s, 81.49MB read
Requests/sec:  37469.55
Transfer/sec:      2.72MB
Iodine: 90763.71 req/sec
Running 30s test @ http://localhost:9292
  2 threads and 4 connections
connection 0: 680118 requests completed
connection 1: 680111 requests completed
connection 0: 685893 requests completed
connection 1: 685896 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    48.59us   38.29us   1.29ms   94.16%
    Req/Sec    45.62k     4.26k   56.84k    86.71%
  2732018 requests in 30.10s, 463.77MB read
Requests/sec:  90763.71
Transfer/sec:     15.41MB
Tipi: 62627.34 req/sec
Running 30s test @ http://localhost:9292
  2 threads and 4 connections
connection 0: 471343 requests completed
connection 1: 471323 requests completed
connection 0: 471172 requests completed
connection 1: 471266 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    72.93us  116.81us   9.10ms   96.97%
    Req/Sec    31.47k     3.23k   40.86k    77.24%
  1885104 requests in 30.10s, 131.24MB read
Requests/sec:  62627.34
Transfer/sec:      4.36MB

  • hello.sh with "-H Connection: close" (NON-keepalive connections):
Before: 6362.43 req/sec
$ wrk http://localhost:9292 -c 4 -d 30 -H "Connection: close"
Running 30s test @ http://localhost:9292
  2 threads and 4 connections
connection 0: 47881 requests completed
connection 1: 47880 requests completed
connection 0: 47876 requests completed
connection 1: 47876 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   604.18us  160.92us   5.15ms   73.62%
    Req/Sec     3.20k   443.62     5.59k    83.19%
  191513 requests in 30.10s, 17.35MB read
Requests/sec:   6362.43
Transfer/sec:    590.26KB
After: 15379.49 req/sec (2.4x faster)
$ wrk http://localhost:9292 -c 4 -d 30 -H "Connection: close"
Running 30s test @ http://localhost:9292
  2 threads and 4 connections
connection 0: 115722 requests completed
connection 1: 115722 requests completed
connection 0: 115743 requests completed
connection 1: 115743 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   246.48us  104.18us   3.35ms   91.47%
    Req/Sec     7.73k   669.82     8.42k    89.20%
  462930 requests in 30.10s, 41.94MB read
Requests/sec:  15379.49
Transfer/sec:      1.39MB
Fibers x Threads: 9643.85 req/sec (1.5x faster)
$ PUMA_DEBUG=1 SCHEDULER=1 FIBERS_THREADS=1 benchmarks/wrk/hello.sh -H "Connection: close"
Puma starting in single mode...
* Puma version: 5.2.2 (ruby 3.0.0-p0) ("Fettisdagsbulle")
*  Min threads: 4
*  Max threads: 4
*  Environment: development
*          PID: 2478095
* Listening on http://0.0.0.0:9292
% Using fibers + threads
Use Ctrl-C to stop
% Using fiber scheduler
Running 30s test @ http://localhost:9292
  2 threads and 4 connections
connection 0: 72571 requests completed
connection 1: 72571 requests completed
connection 0: 72572 requests completed
connection 1: 72571 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   398.24us  141.27us   4.23ms   88.16%
    Req/Sec     4.85k   495.18     5.65k    77.24%
  Latency Distribution
     50%  345.00us
     75%  397.00us
     90%  599.00us
     99%    0.91ms
  290285 requests in 30.10s, 26.30MB read
Requests/sec:   9643.85
Transfer/sec:      0.87MB
- Gracefully stopping, waiting for requests to finish
% Drained 0 additional connections.

Comparisons:

Falcon: 6412.59 req/sec
$ wrk http://localhost:9292 -c 4 -d 30 -H "Connection: close"
Running 30s test @ http://localhost:9292
  2 threads and 4 connections
connection 0: 48412 requests completed
connection 1: 48109 requests completed
connection 0: 47919 requests completed
connection 1: 48146 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.49ms   24.54ms 430.51ms   93.36%
    Req/Sec     3.25k     1.38k    5.57k    68.69%
  192586 requests in 30.03s, 21.67MB read
Requests/sec:   6412.59
Transfer/sec:    738.95KB
Agoo: 2181.67 req/sec
$ wrk http://localhost:9292 -c 4 -d 30 -H "Connection: close"
Running 30s test @ http://localhost:9292
  2 threads and 4 connections
connection 0: 16361 requests completed
connection 1: 16273 requests completed
connection 0: 16441 requests completed
connection 1: 16411 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.25ms    3.85ms  20.55ms   81.69%
    Req/Sec     1.10k   455.37     2.64k    67.67%
  65486 requests in 30.02s, 4.75MB read
  Socket errors: connect 0, read 65482, write 0, timeout 0
Requests/sec:   2181.67
Transfer/sec:    161.92KB
Iodine: 41072.78 req/sec
$ wrk http://localhost:9292 -c 4 -d 30 -H "Connection: close"
Running 30s test @ http://localhost:9292
  2 threads and 4 connections
connection 0: 308778 requests completed
connection 1: 308777 requests completed
connection 0: 309375 requests completed
connection 1: 309373 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    91.63us   53.56us   1.15ms   91.83%
    Req/Sec    20.64k     1.96k   25.15k    80.40%
  1236303 requests in 30.10s, 203.97MB read
Requests/sec:  41072.78
Transfer/sec:      6.78MB
Tipi: 17297.16 req/sec
Running 30s test @ http://localhost:9292
  2 threads and 4 connections
connection 0: 130266 requests completed
connection 1: 130286 requests completed
connection 0: 130120 requests completed
connection 1: 129980 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   219.16us  196.05us   3.77ms   90.12%
    Req/Sec     8.71k     0.87k   11.02k    69.55%
  520652 requests in 30.10s, 36.25MB read
  Socket errors: connect 0, read 520648, write 0, timeout 0
Requests/sec:  17297.16
Transfer/sec:      1.20MB

[more to come]

Notes

  • Based on Refactor drain on shutdown #2600 (which cleans up the connection-draining code) to make this implementation a bit easier to accomplish.
  • I've pointed the libev_scheduler backend at the GitHub version in the Gemfile for now for testing, since it includes some fixes not in the latest RubyGems release.
  • The scheduler implementation is swappable. I added libev_scheduler into this draft PR for now because it seemed the fastest in initial tests, but more detailed comparisons and benchmarks against the other existing implementations (such as async, evt, or the plain Ruby implementation in its test suite) would be helpful.
  • There is currently no limit on the number of fibers in the FiberPool, though there are probably cases where setting a per-process concurrency limit could still be helpful, so I'll look into adding this at some point.
  • This PR is currently written so that the Rack applications themselves also run on non-blocking fibers, which could have some undesired impact on the behavior of existing apps. More testing and experimentation would be helpful to see how compatible existing applications are with the fiber scheduler, as well as how their performance compares to threads.
  • An alternate implementation would pass the complete buffered HTTP request from the Fiber pool to a Thread pool, and then pass the Rack response back to the Fiber to be written to the client. This would give the Rack app the same multi-threaded environment it has today, so it shouldn't change any behavior. However, it had a performance impact in some initial tests. I still think the option is promising and could use further investigation (it may offer better performance than Puma's existing reactor, even if it's not as fast as the all-fiber approach); see the sketch after this list.
    • Update: commit 1ec4c58 demonstrates an alternate implementation along these lines, enabled with the FIBERS_THREADS env variable for testing. I've added microbenchmark results to the performance section for comparison.
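A minimal sketch of that handoff, assuming a Queue between the two sides (all names here are illustrative, not the actual classes in this PR):

# Illustrative only: hand a buffered Rack env from a connection-handling fiber to a
# thread pool, then pass the response back to the fiber.
app = ->(env) { [200, { 'Content-Type' => 'text/plain' }, ['Hello']] } # placeholder Rack app

request_queue = Queue.new

# Thread pool: runs the Rack app with the ordinary blocking, multi-threaded semantics.
workers = 4.times.map do
  Thread.new do
    while (job = request_queue.pop) # returns nil once the queue is closed and drained
      env, response_queue = job
      response_queue << app.call(env)
    end
  end
end

# Connection side: called from a non-blocking fiber on a thread that has a scheduler set.
handle_request = lambda do |env|
  response_queue = Queue.new
  request_queue << [env, response_queue]
  # Queue#pop cooperates with the fiber scheduler on Ruby 3.x, so this wait yields
  # the fiber instead of blocking the whole thread.
  response_queue.pop
end

The extra push/pop and cross-thread wakeup per request is exactly the kind of overhead that could explain the gap between the all-fiber and fibers x threads numbers above.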

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added (or updated) appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

@dsh0416

dsh0416 commented Apr 18, 2021

Good to hear about the implementation of non-blocking fibers in Puma. For benchmarking, I've just updated the evt library to fix some issues in its implementation: dsh0416/evt@6de1d0e

@eregon
Contributor

eregon commented Apr 18, 2021

An alternate implementation would pass the complete buffered HTTP request from the Fiber pool to a Thread pool. [...]

Note that it is valuable to execute application requests in different threads on other implementations than CRuby, because they typically (at least JRuby and TruffleRuby) actually execute Ruby threads in parallel. So I would think for non-CRuby a mix of Fibers (for IO concurrency) and Threads (for CPU parallelism) is the best.

@nateberkopec added the feature, perf, and waiting-for-changes (Waiting on changes from the requestor) labels on Apr 18, 2021
@nateberkopec
Member

An alternate implementation would pass the complete buffered HTTP request from the Fiber pool to a Thread pool, and then pass the Rack response back to the Fiber to be written to the client.

This is definitely the "less controversial" option from my perspective. I'm surprised by the performance impact. I sort of assumed that any additional overhead generated by threads would be more present on the request-buffering side rather than the "app.call" side.

Your benchmark results show that this is basically a straight overhead/latency removal rather than any increased concurrency (to be expected I guess).

There is currently no limit on the number of fibers in the FiberPool, though there are probably cases where setting a per-process concurrency limit could still be helpful so I'll look into adding this at some point.

We definitely have to be mindful here, as "infinite concurrency" can remove backpressure. Backpressure ensures optimal load balancing between processes.

@nateberkopec
Member

One more concern I have re: fiber pools: there is no interrupt/fallback if a unit of work does not yield back to the caller. With threads and the GVL, we have a 100-millisecond limit on holding the GVL (see the TIME_QUANTUM... constants), which prevents some tail latency.

Not a concern for implementations without GVLs obviously.

@ioquatix
Contributor

Looks great to me!

@ioquatix
Contributor

After thinking about the different approaches discussed here, one option is to consider using an intermediate buffer. I'm not sure if this is a common option, but having one reactor at the top level handling incoming requests and outgoing responses is fine, though as has been said, the entire request/response must be buffered, which isn't ideal in some cases.

Therefore, why not use a unix pipe for streaming requests and responses? You can do this pretty efficiently from both ends, i.e. at the top level you are just reading from the network and writing to a buffered pipe (back pressure still exists).

On the request side, you read the body from the pipe and write the response to a pipe. At the top level, you read the response body from the pipe and write it back out to the network (back pressure still exists).

While this can pose some memory overheads, apparently this can be done fairly efficiently using splice IIRC. If you do it correctly, you should get close to zero-copy I/O.
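A rough sketch of the plumbing (Ruby's stdlib doesn't expose splice directly, so this uses IO.copy_stream, which can take efficient copy paths where available; the socket pair and length here are stand-ins for a real connection):

require 'socket'

# Stand-in for an accepted client connection (in Puma this would be the real socket).
client_socket, remote = UNIXSocket.pair
remote.write('hello body')
remote.close

content_length = 10
reader, writer = IO.pipe

# Top level: stream the request body from the network into the pipe instead of
# buffering it all in memory. The pipe's buffer provides back pressure. In the PR's
# context this copier would run on a non-blocking fiber rather than a thread.
copier = Thread.new do
  IO.copy_stream(client_socket, writer, content_length)
  writer.close
end

# Application side: read the body as it arrives, e.g. exposed as env['rack.input'].
puts reader.read # => "hello body"
copier.join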

@dentarg
Member

dentarg commented Apr 19, 2021

@ioquatix That sounds exactly like what @cjlarose wrote in #puma-contrib:matrix.org ~3 weeks ago, hehe :) I can't find a way to link to those messages, but @cjlarose will probably chime in here if he feels the need. EDIT: Maybe it was @wjordan who mentioned it first in the Matrix chat.

@wjordan
Contributor Author

wjordan commented Apr 19, 2021

I've added a commit 1ec4c58 which shows the "fibers x threads" approach, and updated the "Performance" section of the description with some microbenchmark comparisons. Some observations:

  • Iodine is way ahead of the pack on this single-process microbenchmark; I take it as an upper bound on how well an evented-IO Rack server implemented directly in C can perform.
  • Agoo uses multiple native threads (no GVL) for handling connections, so it's technically using multiple cores even in this single-process benchmark. I don't know why it performed remarkably poorly on the non-keepalive test (concurrency issues?).
  • Running fibers x threads (non-blocking fibers for request/response handling, threads for invoking the Rack application) performs somewhere in between the all-fiber and the original approach. It's possible my implementation (using a Queue to pass the request/response between the fiber and thread pools) isn't optimal; if anyone has improvements on this, please try them out!

Note that it is valuable to execute application requests in different threads on other implementations than CRuby, because they typically (at least JRuby and TruffleRuby) actually execute Ruby threads in parallel. So I would think for non-CRuby a mix of Fibers (for IO concurrency) and Threads (for CPU parallelism) is the best.

Theoretically yes I agree, however Fibers are still implemented on top of Threads in JRuby (and TruffleRuby), correct? Until JRuby/TruffleRuby integrates an optimized coroutine-based implementation of Fibers, I doubt this feature will offer any advantage for those Ruby runtimes over just using the existing thread pool. (Please correct me if I'm wrong on this!)

This is definitely the "less controversial" option from my perspective. I'm surprised by the performance impact. I sort of assumed that any additional overhead generated by threads would more present on the request buffering side rather than the "app.call" side.

Yeah, I was a bit surprised too. It's possible that using a Queue to pass the request/response between the ThreadPool and the FiberPool is creating more overhead than expected somewhere. Maybe there's another approach I haven't thought of that wouldn't have as much of a performance impact.

We definitely have to be mindful here, as "infinite concurrency" can remove backpressure. Backpressure ensures optimal load balancing between processes.

Agreed, I wouldn't consider the feature complete until the 'max' concurrency setting is supported for reasons like that.

One more concern I have re: fiber pools: there is no interrupt/fallback if a unit of work does not yield back to the caller. With threads and the GVL, we have a 100-millisecond limit (see TIME_QUANTUM... constants) to hold the GVL which prevents some tail latency.

Yeah, this will be a new challenge, with a few possible approaches:

  • Educate developers looking to run in 'non-blocking fiber mode' to design CPU-heavy parts of Rack applications to periodically yield back to the scheduler, as in the sketch after this list (aside: is sleep 0 the best way to do this currently, or are scheduler implementations expected to automatically re-schedule fibers paused by Fiber.yield directly?)
  • Use Threads instead of Fibers for the Rack-application handling itself (better compatibility, but applications can't leverage the advantages of non-blocking fibers in their own code)
  • Integrate pre-emptive scheduling into the Fiber scheduler (@WJWH has been doing some very promising work in this direction- see "Pre-emptive fiber-based concurrency in MRI Ruby")
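A sketch of the first option above (whether sleep(0) actually reschedules depends on the scheduler's kernel_sleep implementation, so this is an assumption worth verifying per backend):

# Illustrative only: a CPU-heavy section of an app yielding back to the scheduler
# periodically so other connection fibers can make progress.
def transform_rows(rows)
  rows.each_slice(1_000).flat_map do |slice|
    result = slice.map { |row| expensive_transform(row) } # expensive_transform is a placeholder
    sleep(0) if Fiber.scheduler # routes through the scheduler's kernel_sleep hook
    result
  end
end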

Therefore, why not use a unix pipe for streaming requests and responses? You can do this pretty efficiently from both ends, i.e. at the top level you are just reading from the network and writing to a buffered pipe (back pressure still exists).

Passing the request/response from a connection-handling fiber to a thread pool doesn't involve any copying (both exist in the same process to begin with, so they just share the same Rack env object), so I don't think a pipe would be helpful there (unless a pipe somehow has lower latency than a Queue's mutex?). Hopefully the code in 1ec4c58 makes what I was talking about here more clear.

More generally though, streaming requests/responses through a pipe is definitely an interesting idea: it could be useful for efficiently balancing individual requests across processes (instead of just balancing the connections, which is what Puma does currently). And the idea of using splice to make the extra hop zero-copy is a great suggestion.

@ioquatix
Contributor

ioquatix commented Apr 20, 2021

Everything that's being discussed here makes me super excited. Thanks!

Event loops naturally degrade as long as the processing is represented in the event loop. Basically, if you can saturate the event loop with work, it will stop having time for accept. When that point occurs depends a lot on the workload.

Using an event loop for the front end and a thread pool for the back end makes a lot of sense, until your back end becomes largely I/O-bound. This would happen when you were dealing with WebSockets. That being said, Rails makes a lot of assumptions about thread state, so I don't think Puma should go in that direction generally.

A more narrow scope for this might be in streaming responses. In that case, you could run that block of code in the event loop, and you'd get 90% of the use cases (i.e. WebSockets) at the expense of a few of the issues (ActiveRecord thread local scope, blocking I/O stalling the front half of the server). This is more on AR in terms of how they want to fix that problem, but until they have a use case that they (GitHub?) care about, I don't see much movement in this area. But if it's not supported by the server, it's definitely not going to move at all.

I'm not a big fan of preemption in event driven concurrency. I think it over-complicates the model, and the assumption is you shouldn't be doing heavy work in the event loop if you care about latency. This is the same as doing CPU heavy work in a NodeJS callback... we just don't conceptually have the same model in Ruby - i.e. developers need to be educated.

With Ruby 3, the following becomes event driven too:

Thread.new{...}.value

So splitting work into a background work pool becomes trivial, and there is nothing wrong with doing that. We can't solve all "concurrency" and "parallelism" problems; there are really just a bunch of different design trade-offs. It's just that now we have "event driven" as something we can use as a design parameter.
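A tiny sketch of that pattern (expensive_work and payload are placeholders; assumes a fiber scheduler is set on the current thread):

# Inside a non-blocking fiber: push CPU-bound work onto a background thread and wait
# for the result. With a scheduler set, waiting on Thread#value goes through the
# scheduler's block/unblock hooks, so the fiber yields instead of stalling the loop.
result = Thread.new { expensive_work(payload) }.value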

@ioquatix
Contributor

ioquatix commented Apr 20, 2021

Here is Linus talking about how to use splice, it was a bit of an eye opener for me: https://yarchive.net/comp/linux/splice.html

@eregon
Contributor

eregon commented Apr 20, 2021

Theoretically yes I agree, however Fibers are still implemented on top of Threads in JRuby (and TruffleRuby), correct? Until JRuby/TruffleRuby integrates an optimized coroutine-based implementation of Fibers, I doubt this feature will offer any advantage for those Ruby runtimes over just using the existing thread pool. (Please correct me if I'm wrong on this!)

Correct, Fibers on TruffleRuby and JRuby currently use native threads, but project Loom is making good progress and that will bring coroutines to the JVM and so to TruffleRuby and JRuby.
Also TruffleRuby/JRuby need to update to Ruby 3 to implement the new Ruby 3 scheduler interface to take advantage of this PR.
I'm not sure it's so clear cut, though; I could imagine an advantage to having more Fibers/threads to handle IO, even if they incur a thread context switch instead of a fiber context switch.

@johnnyshields
Contributor

Just pinging on this. It would be wonderful to have, especially in apps that have a lot of blocking I/O 👍


@nateberkopec
Member

Closing as inactive; please bump once you are able to make the necessary changes 😄

@reeganviljoen

reeganviljoen commented Feb 15, 2024

Is anything happening here? It would be amazing for Puma to support this; it seems quite a few developments have arisen in this space since this was closed.
