
Introduce support for fiber-per-request. #3101

Open · wants to merge 3 commits into base: master from fiber-per-request

Conversation

@ioquatix (Contributor) commented Mar 17, 2023

This enables the correct scope of Fiber.storage per request.

Fiber[] and Fiber[]= were recently introduced in Ruby 3.2. They are designed to provide per-operation or per-request state handling.

With a thread pool, Fiber.storage is retained beyond a single request, which can leak information from one request into the next. The simplest way to avoid this is to wrap each request in its own fiber. The overhead should be minimal because these fibers are cached and reused; once warm, the cost is a single Ruby VALUE allocation per request.

The benefit is that code which uses Fiber[]= will be scoped correctly to a single request.
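As a rough sketch of the idea (illustrative only, not the actual patch; handle_request and the header name are made up), the worker thread is reused across requests, but the request body runs in a fresh fiber so Fiber[] state cannot outlive the request:

def handle_request(app, env)
  response = nil
  Fiber.new do
    # Anything stored with Fiber[]= here is scoped to this fiber, and
    # therefore to this single request.
    Fiber[:request_id] = env["HTTP_X_REQUEST_ID"]
    response = app.call(env)
  end.resume
  response
end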

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added (or updated) appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

@nateberkopec added the labels feature and waiting-for-review (Waiting on review from anyone) on Mar 17, 2023
@nateberkopec (Member)

Very interesting!

I assume we don't need a Ruby version gate here - people just won't use Fiber[]= in their Ruby 3.1 and earlier apps.

  1. We should have a test for 3.2+ that tries to read Fiber[] across requests.
  2. We should run this by our benchmark suite locally to double-check.

@ioquatix (Contributor, Author)

@nateberkopec There are two parts to this PR.

One might be considered a feature and one might be considered a bug.

Between requests, fiber locals are not reset, so even without supporting Fiber.storage, just using Thread.current[:foo] can leak between requests (may or may not be a bad thing???).

The other part of this is correctly supporting Fiber.storage assignment.
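To make the fiber-local leak described above concrete, here is a small self-contained demonstration (hypothetical, not Puma code) of state set during one "request" surviving into the next when the same pool thread is reused:

worker = Thread.new do
  2.times do |i|
    # Simulate two consecutive requests handled by the same pool thread.
    Thread.current[:current_user] ||= "user-from-request-#{i}"
    puts "request #{i} sees #{Thread.current[:current_user]}"
  end
end
worker.join
# => request 0 sees user-from-request-0
# => request 1 sees user-from-request-0   (leaked from the previous request)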

@ioquatix (Contributor, Author)

@nateberkopec I've just added tests; I think you will find them interesting to review because they show the current behaviour vs. what I'm proposing (just comment out the first commit to see the behavioural differences).

Also 100% we should test for performance regressions. Is there documentation you can point me at for the benchmark test suite?

@nateberkopec (Member)

Right, thanks for calling that out - since Thread.current[:foo] uses fiber-locals underneath, this will be a big behavior change for people using that, right?

(may or may not be a bad thing???).

It's not good or bad, it's just Behavior That People Inevitably Depend On 😆

Is there documentation you can point me at for the benchmark test suite?

Not really, consider this just an action item on the maintainer's part.

@nateberkopec (Member)

We also have failures on main, yikes.

@ioquatix (Contributor, Author)

I think leaking state between requests is probably a bad thing. I don't disagree that people might be depending on it, but if so it sounds flaky at best (order dependent and thread dependent).

@ioquatix (Contributor, Author)

I might need to fix the exception handling when I’m back home.

@ioquatix (Contributor, Author)

I observed Windows segfaulting: https://github.com/puma/puma/actions/runs/4445249381/jobs/7804200284

@MSP-Greg (Member) commented Mar 17, 2023

@ioquatix Thanks for working on this. I ran these commits on top of revisions to the test system; Windows Ruby 2.5 and all the JRuby jobs failed. Everything else was ok.

Some performance data using WSL2/Ubuntu 22.04 is below. It does seem to slow things down...

Puma repo branch pr/3101
Body    ────────── req/sec ──────────   ─────── req 50% times ───────
 KB     array   chunk  string      io   array   chunk  string      io
1       16279   16019   15739   11057   0.195   0.194   0.206   0.311
10      15737   13432   13397   10976   0.209   0.302   0.312   0.381
─────────────────────────────────────────────────────────────────────

Puma repo branch master
Body    ────────── req/sec ──────────   ─────── req 50% times ───────
 KB     array   chunk  string      io   array   chunk  string      io
1       18484   18486   18641   12733   0.195   0.178   0.191   0.235
10      18297   17096   18560   12368   0.170   0.240   0.196   0.268
─────────────────────────────────────────────────────────────────────

Above was run with benchmarks/local/response_time_wrk.sh -w2 -t5:5 -s tcp6 -Y -b 1,10, using your wrk.

@ioquatix (Contributor, Author)

I will check the performance too, that seems a little too costly to me. Maybe we can improve on it, or maybe there is something else going on.

In any case, there are a couple of options to consider:

  1. This could be a middleware. But that behaviour would be inconsistent across different servers, and the middleware would have no effect on servers like Falcon that are already fiber-per-request.
  2. It could be an optional feature of Puma.

@MSP-Greg (Member)

Maybe we can improve on it

Interested in helping with that. Right now, the Fiber is wrapping nothing more than app.call, which in the benchmarks is very quickly sending back a few dozen headers and various body configurations (size, array length, enum, file, etc).

Or, we've got three main 'processes':

A. Receive and process the request
B. App.call
C. Assemble & transmit the response

Current code is just wrapping B. I suspect that should be expanded?
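Schematically (method names illustrative, not Puma's actual request loop), the current PR wraps only B; expanding it would mean moving A and/or C inside the fiber as well:

def process_request(client, app)
  env = parse_request(client)                   # A: receive and parse (outside the fiber)
  response = nil
  Fiber.new { response = app.call(env) }.resume # B: app.call (inside the fiber)
  write_response(client, response)              # C: assemble and transmit (outside the fiber)
end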

@ioquatix (Contributor, Author)

Feel free to modify this PR as you see fit. I will try to look at it a bit later tonight or tomorrow.

@ioquatix (Contributor, Author) commented Mar 18, 2023

Regarding your question, I'm only concerned about the app code being exposed to prior state. Well, we can wrap more, but I don't think there is a huge advantage... oh but streaming responses probably should be in the same context so... yeah we might want to expand it for that reason and add some more tests.

@dentarg (Member) commented Mar 18, 2023

Re Thread.current and leaking state, I suspect most apps are clearing it themselves (example)

@ioquatix (Contributor, Author)

Ah, nice example.

I guess ideally with the newer "fiber per request" style interface, we shouldn't expect users to do this (one of the authors of request state was actually very excited by this). The semantics at the language level should be clear.

@nateberkopec (Member)

This could be a middleware.
It could be optional feature of Puma.

Hmm... when you put it this way, it reminds me of the old "thread_safe" config setting in Rails. Aaron eventually removed it because who would ever say, "give me the non-thread-safe version, please!"

12% performance hit is rough, but we're still doing a response in far less than a tenth of a millisecond. I think the trade could be worth it.

@ioquatix (Contributor, Author)

12% performance hit is rough, but we're still doing a response in far less than a tenth of a millisecond. I think the trade could be worth it.

With a little bit of finagling, we might be able to introduce a more streamlined interface.

For example, if you only care about Fiber storage, it's possible to assign to it directly.

Let me do a bit more analysis on the performance cost. It seems 1/10th of a ms is quite a bit, more than I expected.

@ioquatix (Contributor, Author)

I just noticed there is some provision for clean_thread_locals which is actually "cleaning" fiber locals.
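For context on that naming: Thread.current[:key] is fiber-local in Ruby, while the true thread-locals are thread_variable_get/set, which is why an option that resets Thread.current[] entries is really clearing fiber locals. A small demonstration:

Thread.new do
  Thread.current[:foo] = 1                      # fiber-local
  Thread.current.thread_variable_set(:bar, 2)   # thread-local

  Fiber.new do
    p Thread.current[:foo]                      # => nil (a new fiber gets fresh "thread" locals)
    p Thread.current.thread_variable_get(:bar)  # => 2   (shared by all fibers on this thread)
  end.resume
end.join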

@ioquatix (Contributor, Author)

I wouldn't call this particularly comprehensive (I was seeing a margin of error of around ±5%), but here are the results from my laptop:

Normal

> benchmarks/local/response_time_wrk.sh -T2 -w2 -t8:8 -b c1,10,100

Size   req/sec    50%     75%     90%     99%    100%  Resp Size
───────────────────────────────────────────────────────────────── chunk
1       27301   0.136   0.154   0.189   0.439  16.780      1040
10      21446   0.174   0.195   0.235   0.446  15.030     10318
100      6037   0.646   0.711   0.806   1.010   7.270    103108

Fiber

> benchmarks/local/response_time_wrk.sh -T2 -w2 -t8:8 -b c1,10,100

Size   req/sec    50%     75%     90%     99%    100%  Resp Size
───────────────────────────────────────────────────────────────── chunk
1       25492    0.14    0.16    0.21    0.44    6.67      1040
10      20082    0.18    0.21    0.27    0.52    6.98     10318
100      5302    0.70    0.82    1.06  199.18  306.25    103107

> RUBY_SHARED_FIBER_POOL_FREE_STACKS=0 benchmarks/local/response_time_wrk.sh -T2 -w2 -t8:8 -b c1,10,100

Size   req/sec    50%     75%     90%     99%    100%  Resp Size
───────────────────────────────────────────────────────────────── chunk
1       26900   0.138   0.156   0.196   0.412   6.580      1040
10      21256   0.173   0.195   0.250   0.479   6.770     10318
100      5841   0.660   0.741   0.850   1.080   7.430    103107

@ioquatix (Contributor, Author) commented Mar 20, 2023

(Also, as an aside, the benchmark by default uses too many wrk threads/connections and will over-saturate Puma, giving less than robust results, since a whole ton of wrk connections will be idle.)

@ioquatix (Contributor, Author) commented Mar 20, 2023

RUBY_SHARED_FIBER_POOL_FREE_STACKS=0 is a secret escape hatch which uses a little more memory but increases performance, since it doesn't return the fiber stacks back to the OS. What this tells us is that the overhead of memory management is non-trivial, since using it appears to give an advantage.

@ioquatix (Contributor, Author)

The Linux results are also a bit weird.

Normal

> benchmarks/local/response_time_wrk.sh -T2 -w2 -t8:8 -b c1,10,100

Size   req/sec    50%     75%     90%     99%    100%  Resp Size
───────────────────────────────────────────────────────────────── chunk
1       25380   0.146   0.154   0.185   0.415   5.330      1040
10      20640   0.176   0.188   0.270   0.473   5.780     10318
100      9585   0.386   0.488   0.592   0.811   7.850    103185

Fiber

> PUMA_FIBER_PER_REQUEST=1 benchmarks/local/response_time_wrk.sh -T2 -w2 -t8:8 -b c1,10,100

Size   req/sec    50%     75%     90%     99%    100%  Resp Size
───────────────────────────────────────────────────────────────── chunk
1       21598   0.171   0.182   0.226   0.472   5.650      1040
10      18058   0.200   0.212   0.308   0.530   5.740     10318
100      8603   0.474   0.516   0.669   0.860   6.350    103108

> PUMA_FIBER_PER_REQUEST=1 RUBY_SHARED_FIBER_POOL_FREE_STACKS=0 benchmarks/local/response_time_wrk.sh -T2 -w2 -t8:8 -b c1,10,100

Size   req/sec    50%     75%     90%     99%    100%  Resp Size
───────────────────────────────────────────────────────────────── chunk
1       37997   0.102   0.130   0.140   0.272   5.520      1040
10      24752   0.154   0.189   0.219   0.397   5.800     10318
100      8672   0.440   0.503   0.612   0.827   7.950    103109

@ioquatix (Contributor, Author)

I tried it several times. It looks like on Linux, in some situations, this can be a performance advantage. But I don't know why. Maybe it's a bug in the benchmark.

@MSP-Greg (Member) commented Mar 20, 2023

@ioquatix Thanks for looking into a bunch of things. I've been putting off getting a native Ubuntu host system (time, money, etc), so I use WSL2/Ubuntu, which runs on Windows. It is somewhat inconsistent, and often slower than a native system.

So, I tried to get the benchmark defaults to hit Puma hard, but not 'hammered', or something like that. If you find different defaults would work better, we can change them.
I suspect you may have found it in the code or notes, but array (a) & chunked (c) bodies use an element size of 1kB, so c100 is sending 100 1kB chunks for the body. For String bodies (s) the body is a single-element array...

@ioquatix (Contributor, Author)

@MSP-Greg do you accept GitHub Sponsors?

@nateberkopec (Member)

While I think - if this is a Good Idea and It Works - we should just make it the default, it is a breaking change and likely to Break Something Somewhere. I wonder if, in the meantime, before Puma 7, we could ship this as a Rack middleware to give people time to try it and see what happens?

The middleware would just be a temporary thing intended to test this out, not a permanent fixture.

@ioquatix (Contributor, Author)

Here are my thoughts in no particular order.

The performance overhead will be more complex on TruffleRuby and JRuby until they adopt Loom/Virtual Threads.

It's possible in the single process/thread case that users are always expecting the same thread locals to work as a kind of global state.

Since connection keep-alive is bound to a worker, there is a kind of natural affinity that comes with every request being processed by the same thread. I could imagine some code depending on that (it's going to be janky but 🤷🏼). Additionally, if they are using thread-local caches, performance/locality might be affected.

This feature can't quite be implemented as a middleware, as it doesn't correctly cover streaming responses. Granted, there is NO specification that states a streaming response should be handled on the same fiber as the request, but I imagine some poorly designed code that takes a lock during app.call and releases it in response[2].close could blow up.

I believe the performance overhead can and should be reduced to almost zero, but it will require more investigation on my part. I will definitely consider this use case closely, but I can't make any guarantees before Ruby 3.3. It's possible I'll have time and motivation to investigate it. An overhead of 2-3% or better should be attainable, IMHO.

Maybe it's better to start off as a feature of Puma that is disabled by default in v6. It could be controlled by the existing option, clean_thread_locals, which is... actually cleaning fiber locals (according to the implementation). Or we could introduce a new option.
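
For illustration, a middleware version of the idea might look like the sketch below (hypothetical class, not part of this PR). It also shows why streaming isn't covered: only app.call runs inside the fiber, while the body may be iterated and closed later by the server, outside it:

class FiberPerRequest
  def initialize(app)
    @app = app
  end

  def call(env)
    response = nil
    Fiber.new { response = @app.call(env) }.resume
    # Limitation: if the body is streamed, the server calls each/close on it
    # later, outside the fiber created above, so that code sees different
    # fiber-local state than app.call did.
    response
  end
end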

@ioquatix force-pushed the fiber-per-request branch 6 times, most recently from 78ec81c to e7b2ac4 on April 3, 2024 08:48
@ioquatix (Contributor, Author) commented Apr 3, 2024

It might be good to address #3360 before merging this.

@ioquatix (Contributor, Author) commented Apr 3, 2024

@MSP-Greg (Member) commented Apr 3, 2024

Still need a bit more coffee. I'll test it locally soon (today). I've got to switch out OpenSSL 3.2 in my MSYS2...

BTW, thanks for your work on this.

@MSP-Greg (Member) commented Apr 3, 2024

@ioquatix

Looked at the Windows 2.5 issue, and I could verify it locally. Ruby 2.6 works fine; I didn't check 2.4.

I thought I'd try to write some code without Puma to see if I can get a better idea what's happening. It's just stopping, with no indication why...

@ioquatix (Contributor, Author) commented Apr 3, 2024

If you add print statements to every line, can we see how far it got in executing the test?

@ioquatix (Contributor, Author) commented Apr 3, 2024

@MSP-Greg do you have some time now to show me? https://whereby.com/ioquatix

@MSP-Greg (Member) commented Apr 4, 2024

@ioquatix

We can 'whereby' if you'd like. Sorry for the delay...

@ioquatix (Contributor, Author) commented Apr 4, 2024

@MSP-Greg I have some time now if that suits you.

@MSP-Greg (Member) commented Apr 4, 2024

Knocking

@ioquatix (Contributor, Author) commented Apr 4, 2024

We came to the conclusion that Fibers on Windows with Ruby 2.5 do not work correctly.

Commits:
  • This enables the correct scope of `Fiber.storage` per request.
  • Unify `clean_thread_locals` and `fiber_per_request` configuration options.
@ioquatix (Contributor, Author)

I've rebased this PR so that it includes the fixes for Puma::Server.current that were discovered during my work here: #3360.

This PR introduces a new configuration option, fiber_per_request, and aliases it to the old one, clean_thread_locals. In fact, the old name is also incorrect, since it only cleans fiber locals. So we can consider deprecating and removing clean_thread_locals in v7, or fixing it so that it actually uses a new thread for each request.
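
Presumably (not verified against the final patch) enabling it from a Puma config file would look something like this:

# config/puma.rb -- sketch, assuming the DSL method mirrors the option name.
fiber_per_request true
# The aliased legacy name should keep working:
# clean_thread_locals true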

While I have not measured it recently, CRuby does cache state relating to thread allocations and fiber allocations, so the overhead of doing this in Ruby/Puma may not make sense. In other words, a thread pool (reusing threads) constructed in Ruby-land may not be an advantage over just writing Thread.new {...} per request.

Regarding the naming, I am okay with it, but don't have a strong opinion. It seems to me to clearly represent the behaviour, and that seems good enough to me. There is one environment variable I was using for ease of testing, but I'm not sure if we want to keep that in the default configuration.

In terms of performance, I think the impact will depend on what you are doing, but it's minimal (somewhere between "no impact" and "< 10% of a no-op web request"). In addition, the feature is opt-in; whether it remains that way in the future is open. I'm convinced that there are potentially a lot of bugs relating to this - i.e. sharing state between requests. We have known use cases for this at my company, for example, and so we want to enable this feature for additional safety.

@MSP-Greg any thoughts on what is required to get this merged?

@MSP-Greg (Member)

@ioquatix

I'm AFK for the rest of the evening. I'll have look tomorrow. Thanks.

@MSP-Greg (Member)

@ioquatix

Approved. Maybe a day or two for others to comment.

CRuby... a thread pool (reusing threads) constructed in Ruby-land may not be an advantage over just writing Thread.new {...} per request.

Interesting. Wonder how JRuby and TruffleRuby would be affected...

@ioquatix (Contributor, Author) commented Apr 26, 2024

Interesting. Wonder how JRuby and TruffleRuby would be affected...

I don't want to make assertions without looking at the code, but my general feeling is that a lot of these platforms internally cache the allocations where possible. It's not always possible or semantically correct, so sometimes things don't get reused, but I can attest to the fact that fibers are heavily reused. In a server like Puma, the implementation above will only allocate one "internal" fiber per thread, and they will be reused over and over again - the Fiber.new instances are different, but the stacks will all be reused, so there should be very little overhead once warm.

As a consequence, I think in languages like Ruby we should focus on correctness and simplicity, and push performance issues further down the stack. Of course, Puma has a well-defined execution model (workers/threads/etc.) and that's part of the contract, but rather than using thread pools and keeping a list of threads, it would be much better, from a simplicity and correctness point of view, if Puma could just do Thread.new {...} per request without worrying about performance, because CRuby is already capable of pooling them more efficiently (and across all of your application, not just Puma).

@MSP-Greg (Member)

@ioquatix

Thanks for the info. Given that Puma has on_thread_start & on_thread_exit hooks, I suspect using "Thread.new{...} per request" might cause some issues. Maybe Puma 7...

Aside from the things you mentioned, using it might make it easier to deal with a timeout 'waiting for an app response'. Not sure.

@ioquatix (Contributor, Author) commented May 7, 2024

I've changed Fiber.new do to Fiber.new(storage: nil) do, which increases per-request isolation by preventing inheritance of any existing state. This is probably the correct model.
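
A small illustration of the difference on Ruby 3.2+ (the key and value are made up):

Fiber[:tenant] = "acme"

Fiber.new { p Fiber[:tenant] }.resume                # => "acme" (inherits a copy of the current storage)
Fiber.new(storage: nil) { p Fiber[:tenant] }.resume  # => nil    (starts with empty storage)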
