
Introduce support for fiber-per-request. #3101

Open · wants to merge 3 commits into base: master from fiber-per-request

Conversation

@ioquatix (Contributor) commented Mar 17, 2023

This enables the correct scope of Fiber.storage per request.

Fiber[] and Fiber[]= were recently introduced in Ruby 3.2. They are designed to provide per-operation or per-request state handling.

With a thread pool, Fiber.storage is retained beyond a single request, which can leak information from one request into the next. The simplest way to avoid this is to wrap each request in its own fiber. The overhead should be minimal because these fibers are cached and reused; once warm, the cost is a single Ruby VALUE allocation per request.

The benefit is that code which uses Fiber[]= will be scoped correctly to a single request.
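As a rough sketch of the idea (illustrative only, not the actual patch; handle_request and the header name are made up), the worker thread is reused across requests, but the request body runs in a fresh fiber so Fiber[] state cannot outlive the request:

def handle_request(app, env)
  response = nil
  Fiber.new do
    # Anything stored with Fiber[]= here is scoped to this fiber, and
    # therefore to this single request.
    Fiber[:request_id] = env["HTTP_X_REQUEST_ID"]
    response = app.call(env)
  end.resume
  response
end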

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added (or updated) appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

@nateberkopec added the labels feature and waiting-for-review (Waiting on review from anyone) on Mar 17, 2023
@nateberkopec (Member)

Very interesting!

I assume we don't need a Ruby version gate here - people just won't use Fiber[]= in their Ruby 3.1 and earlier apps.

  1. We should have a test for 3.2+ that tries to read Fiber[] across requests.
  2. We should run this by our benchmark suite locally to double-check.

@ioquatix (Contributor, Author)

@nateberkopec There are two parts to this PR.

One might be considered a feature and one might be considered a bug.

Between requests, fiber locals are not reset, so even without supporting Fiber.storage, just using Thread.current[:foo] can leak between requests (may or may not be a bad thing???).

The other part of this is correctly supporting Fiber.storage assignment.
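To make the fiber-local leak described above concrete, here is a small self-contained demonstration (hypothetical, not Puma code) of state set during one "request" surviving into the next when the same pool thread is reused:

worker = Thread.new do
  2.times do |i|
    # Simulate two consecutive requests handled by the same pool thread.
    Thread.current[:current_user] ||= "user-from-request-#{i}"
    puts "request #{i} sees #{Thread.current[:current_user]}"
  end
end
worker.join
# => request 0 sees user-from-request-0
# => request 1 sees user-from-request-0   (leaked from the previous request)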

@ioquatix (Contributor, Author)

@nateberkopec I've just added tests; I think you will find them interesting to review because they show the current behaviour vs. what I'm proposing (just comment out the first commit to see the behavioural differences).

Also 100% we should test for performance regressions. Is there documentation you can point me at for the benchmark test suite?

@nateberkopec (Member)

Right, thanks for calling that out - since Thread.current[:foo] uses fiber-locals underneath, this will be a big behavior change for people using that, right?

(may or may not be a bad thing???).

It's not good or bad, it's just Behavior That People Inevitably Depend On 😆

Is there documentation you can point me at for the benchmark test suite?

Not really, consider this just an action item on the maintainer's part.

@nateberkopec (Member)

We also have failures on main, yikes.

@ioquatix (Contributor, Author)

I think leaking state between requests is probably a bad thing. I don't disagree that people might be depending on it, but if so it sounds flaky at best (order dependent and thread dependent).

@ioquatix (Contributor, Author)

I might need to fix the exception handling when I’m back home.

@ioquatix (Contributor, Author)

I observed Windows segfaulting: https://github.com/puma/puma/actions/runs/4445249381/jobs/7804200284

@MSP-Greg (Member) commented Mar 17, 2023

@ioquatix Thanks for working on this. I ran these commits on top of revisions to the test system; Windows Ruby 2.5 and all the JRuby jobs failed. Everything else was ok.

Some performance data using WSL2/Ubuntu 22.04 is below. It does seem to slow things down...

Puma repo branch pr/3101
Body    ────────── req/sec ──────────   ─────── req 50% times ───────
 KB     array   chunk  string      io   array   chunk  string      io
1       16279   16019   15739   11057   0.195   0.194   0.206   0.311
10      15737   13432   13397   10976   0.209   0.302   0.312   0.381
─────────────────────────────────────────────────────────────────────

Puma repo branch master
Body    ────────── req/sec ──────────   ─────── req 50% times ───────
 KB     array   chunk  string      io   array   chunk  string      io
1       18484   18486   18641   12733   0.195   0.178   0.191   0.235
10      18297   17096   18560   12368   0.170   0.240   0.196   0.268
─────────────────────────────────────────────────────────────────────

Above was run with benchmarks/local/response_time_wrk.sh -w2 -t5:5 -s tcp6 -Y -b 1,10, using your wrk.

@ioquatix (Contributor, Author)

I will check the performance too, that seems a little too costly to me. Maybe we can improve on it, or maybe there is something else going on.

In any case, there are a couple of options to consider:

  1. This could be a middleware. But that behaviour would be inconsistent across different servers, and the middleware would have no effect on servers like Falcon that are already fiber-per-request.
  2. It could be an optional feature of Puma.

@MSP-Greg (Member)

Maybe we can improve on it

Interested in helping with that. Right now, the Fiber is wrapping nothing more than app.call, which in the benchmarks is very quickly sending back a few dozen headers and various body configurations (size, array length, enum, file, etc).

Or, we've got three main 'processes':

A. Receive and process the request
B. App.call
C. Assemble & transmit the response

Current code is just wrapping B. I suspect that should be expanded?
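Schematically (method names illustrative, not Puma's actual request loop), the current PR wraps only B; expanding it would mean moving A and/or C inside the fiber as well:

def process_request(client, app)
  env = parse_request(client)                   # A: receive and parse (outside the fiber)
  response = nil
  Fiber.new { response = app.call(env) }.resume # B: app.call (inside the fiber)
  write_response(client, response)              # C: assemble and transmit (outside the fiber)
end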

@ioquatix (Contributor, Author)

Feel free to modify this PR as you see fit. I will try to look at it a bit later tonight or tomorrow.

@ioquatix (Contributor, Author) commented Mar 18, 2023

Regarding your question, I'm only concerned about the app code being exposed to prior state. Well, we can wrap more, but I don't think there is a huge advantage... oh but streaming responses probably should be in the same context so... yeah we might want to expand it for that reason and add some more tests.

@dentarg (Member) commented Mar 18, 2023

Re Thread.current and leaking state, I suspect most apps are clearing it themselves (example)

@ioquatix (Contributor, Author)

Ah, nice example.

I guess ideally with the newer "fiber per request" style interface, we shouldn't expect users to do this (one of the authors of request state was actually very excited by this). The semantics at the language level should be clear.

@nateberkopec (Member)

This could be a middleware.
It could be optional feature of Puma.

Hmm... when you put it this way, it reminds me of the old "thread_safe" config setting in Rails. Aaron eventually removed it because who would ever say, "give me the non-thread-safe version, please!"

12% performance hit is rough, but we're still doing a response in far less than a tenth of a millisecond. I think the trade could be worth it.

@ioquatix (Contributor, Author)

12% performance hit is rough, but we're still doing a response in far less than a tenth of a millisecond. I think the trade could be worth it.

With a little bit of finagling, we might be able to introduce a more streamlined interface.

For example, if you only care about Fiber storage, it's possible to assign to it directly.

Let me do a bit more analysis on the performance cost. It seems 1/10th of a ms is quite a bit, more than I expected.

@ioquatix (Contributor, Author)

I just noticed there is some provision for clean_thread_locals which is actually "cleaning" fiber locals.
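For context on that naming: Thread.current[:key] is fiber-local in Ruby, while the true thread-locals are thread_variable_get/set, which is why an option that resets Thread.current[] entries is really clearing fiber locals. A small demonstration:

Thread.new do
  Thread.current[:foo] = 1                      # fiber-local
  Thread.current.thread_variable_set(:bar, 2)   # thread-local

  Fiber.new do
    p Thread.current[:foo]                      # => nil (a new fiber gets fresh "thread" locals)
    p Thread.current.thread_variable_get(:bar)  # => 2   (shared by all fibers on this thread)
  end.resume
end.join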

@ioquatix (Contributor, Author)

I wouldn't call this particularly comprehensive (I was seeing a margin of error of around ±5%), but here are the results from my laptop:

Normal

> benchmarks/local/response_time_wrk.sh -T2 -w2 -t8:8 -b c1,10,100

Size   req/sec    50%     75%     90%     99%    100%  Resp Size
───────────────────────────────────────────────────────────────── chunk
1       27301   0.136   0.154   0.189   0.439  16.780      1040
10      21446   0.174   0.195   0.235   0.446  15.030     10318
100      6037   0.646   0.711   0.806   1.010   7.270    103108

Fiber

> benchmarks/local/response_time_wrk.sh -T2 -w2 -t8:8 -b c1,10,100

Size   req/sec    50%     75%     90%     99%    100%  Resp Size
───────────────────────────────────────────────────────────────── chunk
1       25492    0.14    0.16    0.21    0.44    6.67      1040
10      20082    0.18    0.21    0.27    0.52    6.98     10318
100      5302    0.70    0.82    1.06  199.18  306.25    103107

> RUBY_SHARED_FIBER_POOL_FREE_STACKS=0 benchmarks/local/response_time_wrk.sh -T2 -w2 -t8:8 -b c1,10,100

Size   req/sec    50%     75%     90%     99%    100%  Resp Size
───────────────────────────────────────────────────────────────── chunk
1       26900   0.138   0.156   0.196   0.412   6.580      1040
10      21256   0.173   0.195   0.250   0.479   6.770     10318
100      5841   0.660   0.741   0.850   1.080   7.430    103107

@ioquatix (Contributor, Author) commented Mar 20, 2023

(Also, as an aside, the benchmark by default uses too many wrk threads/connections and will over-saturate Puma, giving less than robust results, since a whole ton of wrk connections will be idle.)

@ioquatix (Contributor, Author) commented Mar 20, 2023

RUBY_SHARED_FIBER_POOL_FREE_STACKS=0 is a secret escape hatch which uses a little more memory but increases performance, since it doesn't return the fiber stacks back to the OS. What this tells us is that the overhead of memory management is non-trivial, since using it appears to give an advantage.

@ioquatix (Contributor, Author)

The Linux results are also a bit weird.

Normal

> benchmarks/local/response_time_wrk.sh -T2 -w2 -t8:8 -b c1,10,100

Size   req/sec    50%     75%     90%     99%    100%  Resp Size
───────────────────────────────────────────────────────────────── chunk
1       25380   0.146   0.154   0.185   0.415   5.330      1040
10      20640   0.176   0.188   0.270   0.473   5.780     10318
100      9585   0.386   0.488   0.592   0.811   7.850    103185

Fiber

> PUMA_FIBER_PER_REQUEST=1 benchmarks/local/response_time_wrk.sh -T2 -w2 -t8:8 -b c1,10,100

Size   req/sec    50%     75%     90%     99%    100%  Resp Size
───────────────────────────────────────────────────────────────── chunk
1       21598   0.171   0.182   0.226   0.472   5.650      1040
10      18058   0.200   0.212   0.308   0.530   5.740     10318
100      8603   0.474   0.516   0.669   0.860   6.350    103108

> PUMA_FIBER_PER_REQUEST=1 RUBY_SHARED_FIBER_POOL_FREE_STACKS=0 benchmarks/local/response_time_wrk.sh -T2 -w2 -t8:8 -b c1,10,100

Size   req/sec    50%     75%     90%     99%    100%  Resp Size
───────────────────────────────────────────────────────────────── chunk
1       37997   0.102   0.130   0.140   0.272   5.520      1040
10      24752   0.154   0.189   0.219   0.397   5.800     10318
100      8672   0.440   0.503   0.612   0.827   7.950    103109

@ioquatix (Contributor, Author)

I tried it several times. It looks like on Linux, in some situations, this can be a performance advantage. But I don't know why. Maybe it's a bug in the benchmark.

@MSP-Greg (Member) commented Mar 20, 2023

@ioquatix Thanks for looking into a bunch of things. I've been putting off getting a native Ubuntu host system (time, money, etc), so I use WSL2/Ubuntu, which runs on Windows. It is somewhat inconsistent, and often slower than a native system.

So, I tried to get the benchmark defaults to hit Puma hard, but not 'hammered', or something like that. If you find different defaults would work better, we can change them.
I suspect you may have found it in the code or notes, but array (a) & chunked (c) bodies use an element size of 1kB, so c100 is sending 100 1kB chunks for the body. For String bodies (s) the body is a single-element array...

@ioquatix (Contributor, Author)

@MSP-Greg do you accept GitHub Sponsors?

@nateberkopec (Member)

While I think - if this is a Good Idea and It Works - we should just make it the default, it is a breaking change and likely to Break Something Somewhere. I wonder if, in the meantime, before Puma 7, we could ship this as a Rack middleware to give people time to try it and see what happens?

The middleware would just be a temporary thing intended to test this out, not a permanent fixture.

@ioquatix (Contributor, Author)

Here are my thoughts in no particular order.

The performance overhead will be more complex on TruffleRuby and JRuby until they adopt Loom/Virtual Threads.

It's possible in the single process/thread case that users are always expecting the same thread locals to work as a kind of global state.

Since connection keep-alive is bound to a worker, there is a kind of natural affinity that comes with every request being processed by the same thread. I could imagine some code depending on that (it's going to be janky but 🤷🏼). Additionally, if they are using thread-local caches, performance/locality might be affected.

This feature can't quite be implemented as a middleware, as it doesn't correctly cover streaming responses. Granted, there is NO specification that states a streaming response should be handled on the same fiber as the request, but I imagine some poorly designed code that takes a lock during app.call and releases it in response[2].close could blow up.

I believe the performance overhead can and should be reduced to almost zero, but it will require more investigation on my part. I will definitely consider this use case closely, but I can't make any guarantees before Ruby 3.3. It's possible I'll have time and motivation to investigate it. An overhead of 2-3% or better should be attainable, IMHO.

Maybe it's better to start off as a feature of Puma that is disabled by default in v6. It could be controlled by the existing option, clean_thread_locals, which is... actually cleaning fiber locals (according to the implementation). Or we could introduce a new option.
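
For illustration, a middleware version of the idea might look like the sketch below (hypothetical class, not part of this PR). It also shows why streaming isn't covered: only app.call runs inside the fiber, while the body may be iterated and closed later by the server, outside it:

class FiberPerRequest
  def initialize(app)
    @app = app
  end

  def call(env)
    response = nil
    Fiber.new { response = @app.call(env) }.resume
    # Limitation: if the body is streamed, the server calls each/close on it
    # later, outside the fiber created above, so that code sees different
    # fiber-local state than app.call did.
    response
  end
end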

@ioquatix force-pushed the fiber-per-request branch 6 times, most recently from 78ec81c to e7b2ac4 on April 3, 2024 08:48
@ioquatix (Contributor, Author) commented Apr 3, 2024

It might be good to address #3360 before merging this.

@ioquatix (Contributor, Author) commented Apr 3, 2024

@MSP-Greg (Member) commented Apr 3, 2024

Still need a bit more coffee. I'll test it locally soon (today). I've got to switch out OpenSSL 3.2 in my MSYS2...

BTW, thanks for your work on this.

@MSP-Greg (Member) commented Apr 3, 2024

@ioquatix

Looked at the Windows 2.5 issue, and I could verify it locally. Ruby 2.6 works fine; I didn't check 2.4.

I thought I'd try to write some code without Puma to see if I can get a better idea what's happening. It's just stopping, with no indication why...

@ioquatix (Contributor, Author) commented Apr 3, 2024

If you add print statements to every line, can we see how far it got in executing the test?

@ioquatix (Contributor, Author) commented Apr 3, 2024

@MSP-Greg do you have some time now to show me? https://whereby.com/ioquatix

@MSP-Greg (Member) commented Apr 4, 2024

@ioquatix

We can 'whereby' if you'd like. Sorry for the delay...

@ioquatix (Contributor, Author) commented Apr 4, 2024

@MSP-Greg I have some time now if that suits you.

@MSP-Greg (Member) commented Apr 4, 2024

Knocking

@ioquatix (Contributor, Author) commented Apr 4, 2024

We came to the conclusion that Fibers on Windows with Ruby 2.5 do not work correctly.

Commits:
  • This enables the correct scope of `Fiber.storage` per request.
  • Unify `clean_thread_locals` and `fiber_per_request` configuration options.
@ioquatix (Contributor, Author)

I've rebased this PR so that it includes the fixes for Puma::Server.current that were discovered during my work here: #3360.

This PR introduces a new configuration option, fiber_per_request, and aliases it to the old one, clean_thread_locals. In fact, the old name is also incorrect, since it only cleans fiber locals. So we can consider deprecating and removing clean_thread_locals in v7, or fixing it so that it actually uses a new thread for each request.
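
Presumably (not verified against the final patch) enabling it from a Puma config file would look something like this:

# config/puma.rb -- sketch, assuming the DSL method mirrors the option name.
fiber_per_request true
# The aliased legacy name should keep working:
# clean_thread_locals true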

While I have not measured it recently, CRuby does cache state relating to thread allocations and fiber allocations, so the overhead of doing this in Ruby/Puma may not make sense. In other words, a thread pool (reusing threads) constructed in Ruby-land may not be an advantage over just writing Thread.new {...} per request.

Regarding the naming, I am okay with it, but don't have a strong opinion. It seems to me to clearly represent the behaviour, and that seems good enough to me. There is one environment variable I was using for ease of testing, but I'm not sure if we want to keep that in the default configuration.

In terms of performance, I think the impact will depend on what you are doing, but it's minimal (somewhere between "no impact" and "< 10% of a no-op web request"). In addition, the feature is opt-in; whether it remains that way in the future is open. I'm convinced that there are potentially a lot of bugs relating to this - i.e. sharing state between requests. We have known use cases for this at my company, for example, and so we want to enable this feature for additional safety.

@MSP-Greg any thoughts on what is required to get this merged?

@MSP-Greg (Member)

@ioquatix

I'm AFK for the rest of the evening. I'll have look tomorrow. Thanks.

@MSP-Greg (Member)

@ioquatix

Approved. Maybe a day or two for others to comment.

CRuby... a thread pool (reusing threads) constructed in Ruby-land may not be an advantage over just writing Thread.new {...} per request.

Interesting. Wonder how JRuby and TruffleRuby would be affected...

@ioquatix (Contributor, Author) commented Apr 26, 2024

Interesting. Wonder how JRuby and TruffleRuby would be affected...

I don't want to make assertions without looking at the code, but my general feeling is that a lot of these platforms internally cache the allocations where possible. It's not always possible or semantically correct, so sometimes things don't get reused, but I can attest to the fact that fibers are heavily reused. In a server like Puma, the implementation above will only allocate one "internal" fiber per thread, and they will be reused over and over again - the Fiber.new instances are different, but the stacks will all be reused, so there should be very little overhead once warm.

As a consequence, I think in languages like Ruby we should focus on correctness and simplicity, and push performance issues further down the stack. Of course, Puma has a well-defined execution model (workers/threads/etc.) and that's part of the contract, but rather than using thread pools and keeping a list of threads, it would be much better, from a simplicity and correctness point of view, if Puma could just do Thread.new {...} per request without worrying about performance, because CRuby is already capable of pooling them more efficiently (and across all of your application, not just Puma).

@MSP-Greg (Member)

@ioquatix

Thanks for the info. Given that Puma has on_thread_start & on_thread_exit hooks, I suspect using "Thread.new{...} per request" might cause some issues. Maybe Puma 7...

Aside from the things you mentioned, using it might make it easier to deal with a timeout 'waiting for an app response'. Not sure.

@ioquatix (Contributor, Author) commented May 7, 2024

I've changed Fiber.new do to Fiber.new(storage: nil) do, which increases per-request isolation by preventing inheritance of any existing state. This is probably the correct model.
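
A small illustration of the difference on Ruby 3.2+ (the key and value are made up):

Fiber[:tenant] = "acme"

Fiber.new { p Fiber[:tenant] }.resume                # => "acme" (inherits a copy of the current storage)
Fiber.new(storage: nil) { p Fiber[:tenant] }.resume  # => nil    (starts with empty storage)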
