
🐛 Allow exit code for dependencies with yield to always execute, by removing capacity limiter for them, to e.g. allow closing DB connections without deadlocks #5122

Merged
merged 8 commits into master on Sep 4, 2022

Conversation

adriangb
Contributor

@adriangb adriangb commented Jul 11, 2022

@codecov

codecov bot commented Jul 11, 2022

Codecov Report

Merging #5122 (c7e3263) into master (e35df68) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##            master     #5122    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files          532       539     +7     
  Lines        13684     13906   +222     
==========================================
+ Hits         13684     13906   +222     
Impacted Files Coverage Δ
fastapi/__init__.py 100.00% <100.00%> (ø)
fastapi/concurrency.py 100.00% <100.00%> (ø)
tests/main.py 100.00% <0.00%> (ø)
fastapi/utils.py 100.00% <0.00%> (ø)
fastapi/routing.py 100.00% <0.00%> (ø)
fastapi/encoders.py 100.00% <0.00%> (ø)
tests/test_query.py 100.00% <0.00%> (ø)
fastapi/websockets.py 100.00% <0.00%> (ø)
fastapi/openapi/docs.py 100.00% <0.00%> (ø)
tests/test_ws_router.py 100.00% <0.00%> (ø)
... and 16 more


@github-actions
Contributor

📝 Docs preview for commit c507e76 at: https://62cb8e56fb08263a0b2a8795--fastapi.netlify.app

@Kludex
Sponsor Collaborator

Kludex commented Jul 11, 2022

If a new CapacityLimiter is created every time, then the token count can be 1... can't it?

The inf would make sense for a global limiter... isn't that the case?

@adriangb
Contributor Author

Yeah I guess it can be 1. Or we can make it a global. Either way.
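
To illustrate the idea being discussed, here's a rough sketch (illustrative names, not the exact diff in this PR): the sync exit code of the dependency's context manager runs in the threadpool with its own fresh single-token CapacityLimiter, so it no longer has to wait for the default threadpool tokens that other requests may be holding:

from contextlib import AbstractContextManager
from typing import Any

import anyio
from anyio import CapacityLimiter

async def run_exit_in_thread(cm: AbstractContextManager[Any]) -> None:
    # A fresh limiter per call: a single token is enough, since nothing else
    # shares this limiter. Creating it here, inside a running async context,
    # also avoids the issue that AnyIO needs to know the async backend when
    # the limiter is created (discussed later in this thread).
    exit_limiter = CapacityLimiter(1)
    await anyio.to_thread.run_sync(cm.__exit__, None, None, None, limiter=exit_limiter)

The alternative would be a single shared limiter with a large (or infinite) token count, which is what the rest of this thread weighs against the per-call version.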

@github-actions
Contributor

📝 Docs preview for commit 0892b02 at: https://62cbaec8c51dc34d83b10e22--fastapi.netlify.app

@github-actions
Contributor

📝 Docs preview for commit d822171 at: https://62ccd4afa2dc031bf1f8e7af--fastapi.netlify.app

@github-actions
Contributor

📝 Docs preview for commit 3557911 at: https://62cf1d1ed35ee72b45e8b61f--fastapi.netlify.app

@github-actions
Contributor

📝 Docs preview for commit c53e0f6 at: https://62cf1e6ca2b9a32bbabee427--fastapi.netlify.app

@github-actions
Contributor

📝 Docs preview for commit 3b73873 at: https://62d43b8f94f05f7127e81a5e--fastapi.netlify.app

@tiangolo
Owner

tiangolo commented Sep 4, 2022

Niceee! 🤓 🚀

I tested this with the example code in: #3205

I used wrk and ran it a few times in each case:

$ wrk -c 100 -t 100 --timeout 60 "http://127.0.0.1:8000/"

Without the fix

Without the fix, I first got:

Running 10s test @ http://127.0.0.1:8000/
  100 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.40s     2.04s    8.05s    63.64%
    Req/Sec     0.00      0.00     0.00    100.00%
  22 requests in 10.09s, 3.05KB read
Requests/sec:      2.18
Transfer/sec:     309.54B

And then the second run I got:

Running 10s test @ http://127.0.0.1:8000/
  100 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.08s, 0.00B read
Requests/sec:      0.00
Transfer/sec:       0.00B

...this seems to reflect the deadlock. 🎉 (I'm celebrating being able to see something broken, it's sometimes hard to see the broken part 😅 )

After that, I Ctrl+C'd the Uvicorn process, and it hung there, at:

^C
INFO:     Shutting down
INFO:     Waiting for background tasks to complete. (CTRL+C to force quit)

After 7 minutes I force-killed the process. So, indeed, deadlock.

With the fix, outside

With the fix, putting a single global CapacityLimiter outside of the function.

First I got:

Running 10s test @ http://127.0.0.1:8000/
  100 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.97s     3.07s   10.02s    67.50%
    Req/Sec     0.00      0.00     0.00    100.00%
  40 requests in 10.09s, 5.55KB read
Requests/sec:      3.96
Transfer/sec:     562.78B

For the second run I got:

Running 10s test @ http://127.0.0.1:8000/
  100 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.31s     2.45s    9.64s    73.08%
    Req/Sec     0.00      0.00     0.00    100.00%
  26 requests in 10.09s, 3.61KB read
Requests/sec:      2.58
Transfer/sec:     365.96B

Third run:

Running 10s test @ http://127.0.0.1:8000/
  100 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.90s     2.76s    9.58s    83.33%
    Req/Sec     0.00      0.00     0.00    100.00%
  12 requests in 10.08s, 1.66KB read
Requests/sec:      1.19
Transfer/sec:     169.06B

Not sure why it keeps reducing the number of requests. It seems something might still be wrong, but I'm not sure what. 😔

After that, and after I wrote this, I Ctrl+C'd the Uvicorn process and it finished immediately. So it seems it handled the deadlock.

I ran the whole thing again and Ctrl+C'd the Uvicorn process immediately after the 3 runs, and then it waited for a bit at:

^C
INFO:     Shutting down
INFO:     Waiting for background tasks to complete. (CTRL+C to force quit)

But after 50 seconds it terminated normally.

With the fix, local

Keeping the local CapacityLimiter created every time, as in this PR.

The first run I got:

Running 10s test @ http://127.0.0.1:8000/
  100 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.71s     2.90s   10.06s    65.00%
    Req/Sec     0.00      0.00     0.00    100.00%
  40 requests in 10.08s, 5.55KB read
Requests/sec:      3.97
Transfer/sec:     563.34B

The second run I got:

Running 10s test @ http://127.0.0.1:8000/
  100 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.26s     2.50s    8.48s    80.00%
    Req/Sec     0.00      0.00     0.00    100.00%
  5 requests in 10.09s, 710.00B read
Requests/sec:      0.50
Transfer/sec:      70.37B

And the third run I got:

Running 10s test @ http://127.0.0.1:8000/
  100 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.52s     2.99s    9.27s    50.00%
    Req/Sec     0.00      0.00     0.00    100.00%
  4 requests in 10.10s, 568.00B read
Requests/sec:      0.40
Transfer/sec:      56.25B

The results with the local CapacityLimiter are similar to the ones with the global one. The 5 or 4 requests are comparable to some of the other tests I ran with the global CapacityLimiter. It's in the same ballpark.

Terminating the process also worked as expected instead of hanging there in a deadlock.

For some reason, it still handles fewer requests on each run. Maybe it's just that it tries to read the file every time and there's a bottleneck in that, just in the file system. I'm not sure about that.


Checking the AnyIO docs I see that:

Capacity limiters are like semaphores except that a single borrower (the current task by default) can only hold a single token at a time.

Ref: https://anyio.readthedocs.io/en/stable/synchronization.html

So I'm not sure if it could be problematic if the same thread worker tried to acquire the limiter again, and if that could block against the other thing running on the same thread that already holds the limiter token.

Until we figure this out, I'll leave the local CapacityLimiter (as in this PR) to be safe (although in my local tests it didn't seem to make a difference, I'm not sure why).

But anyway, it seems that this clearly improves the situation.

I wish there were a way to test this in pytest. But I'm not sure how to replicate the whole problem and situation in a feasible way (not involving a whole setup for "benchmarks" with ab, Locust, or wrk).

For now, I'll merge this, but if anyone has any idea about how this could be tested, to have checks in CI making sure the problem doesn't happen again, it would be awesome.

Thanks @adriangb for all the investigation and work! 🙇 🤓 🍰

@tiangolo tiangolo changed the title remove thread capacity limit for running sync context manager __exit__ 🐛 Allow exit code for dependencies with yield to always execute, by removing capacity limiter for them, to e.g. allow closing DB connections without deadlocks Sep 4, 2022
@adriangb
Contributor Author

adriangb commented Sep 4, 2022

I wish there were a way to test this in pytest. But I'm not sure how to replicate the whole problem and situation in a feasible way (not involving a whole setup for "benchmarks" with ab, Locust, or wrk)

It might be reproducible by talking pure-ASGI to the FastAPI app and launching multiple tasks using a TaskGroup? Might be hard to get the timing right though.

You could try setting the capacity limit to float("inf") instead of 1.

I think the slowdowns might come from acquiring database connections. There's likely some semaphore there as well.
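
For reference on that last point, SQLAlchemy's default QueuePool already behaves like a semaphore. A rough sketch of the knobs involved (the values shown are the library defaults; the URL is just a placeholder):

from sqlalchemy import create_engine

# QueuePool caps concurrent connections: requests beyond
# pool_size + max_overflow wait up to pool_timeout seconds for a connection.
engine = create_engine(
    "postgresql://user:password@localhost/db",  # placeholder URL
    pool_size=5,       # default number of persistent connections
    max_overflow=10,   # default number of extra burst connections
    pool_timeout=30,   # default seconds to wait before raising a timeout error
)

So under heavy concurrency the benchmark could be queuing on the connection pool even when the threadpool itself is fine.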

@tiangolo
Owner

tiangolo commented Sep 4, 2022

Hehe I was writing this and you commented before me 😅


Ah, wait, I didn't read @Kludex's comment properly. I didn't realize it could be an inf float. That probably makes sense, and we could avoid creating a new object every time.


I just did that and updated the branch. 🚀


And yeah, maybe multiple ASGI calls in a task group could work. But I'm not sure how to make it fail instead of hang; if it just hangs, it's going to be more difficult to test. 🤔

@tiangolo
Owner

tiangolo commented Sep 4, 2022

Hehehe, the tests have spoken... it seems we need to use the local variable after all. Otherwise AnyIO gets confused about what the async backend is. 😂

…nc context to create the object

This reverts commit 09dc61a.
@github-actions
Contributor

github-actions bot commented Sep 4, 2022

📝 Docs preview for commit 09dc61a at: https://6314efcd67c58a208eaa7d3e--fastapi.netlify.app

@Kludex
Sponsor Collaborator

Kludex commented Sep 4, 2022

wait

@adriangb
Contributor Author

adriangb commented Sep 4, 2022

But I'm not sure how to make it fail instead of hang; if it just hangs, it's going to be more difficult to test

I think anyio.fail_after() should do the trick, right? But if we were to put this in a test, it might be good to figure out how to reproduce it without SQLAlchemy.
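
A rough sketch of what such a test could look like, combining the pure-ASGI + TaskGroup idea with anyio.fail_after so a deadlock turns into a test failure instead of a hang (everything here is hypothetical: the app, the dependency, and the use of httpx and the AnyIO pytest plugin are assumptions, not part of this PR):

import time

import anyio
import httpx
import pytest
from fastapi import Depends, FastAPI

app = FastAPI()

def blocking_dependency():
    time.sleep(0.05)  # keep a threadpool worker busy on the way in
    yield "value"
    time.sleep(0.05)  # the exit code also needs a worker; this is where the old code could deadlock

@app.get("/")
def read_root(value: str = Depends(blocking_dependency)):
    return {"value": value}

@pytest.mark.anyio
async def test_exit_code_does_not_deadlock():
    transport = httpx.ASGITransport(app=app)
    async with httpx.AsyncClient(transport=transport, base_url="http://test") as client:
        with anyio.fail_after(30):  # a deadlock becomes a TimeoutError instead of hanging forever
            async with anyio.create_task_group() as tg:
                for _ in range(100):  # more concurrent requests than the 40 default threadpool tokens
                    tg.start_soon(client.get, "/")

Getting the timing right so it reliably hits the deadlock window would probably still be the hard part.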

@github-actions
Contributor

github-actions bot commented Sep 4, 2022

📝 Docs preview for commit c7e3263 at: https://6314f0dd00ed39272fd435c8--fastapi.netlify.app

@Kludex
Sponsor Collaborator

Kludex commented Sep 4, 2022

Hehehe, the tests have spoken... it seems we need to use the local variable after all. Otherwise AnyIO gets confused about what the async backend is. 😂

https://github.com/agronholm/anyio/blob/07b49f76597f353185c825e2acd1b1b9fe274727/src/anyio/_backends/_asyncio.py#L2068

https://anyio.readthedocs.io/en/stable/api.html?highlight=runvar#anyio.lowlevel.RunVar
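
For reference, the pattern those links point at is roughly this: a RunVar holds one limiter per event loop, created lazily the first time it's needed, so AnyIO already knows which backend is running (a sketch of the pattern, not something in this PR; the names are illustrative):

from anyio import CapacityLimiter
from anyio.lowlevel import RunVar

_exit_limiter: RunVar[CapacityLimiter] = RunVar("_exit_limiter")

def get_exit_limiter() -> CapacityLimiter:
    # Must be called from within a running event loop; each loop gets its own
    # limiter, mirroring how AnyIO manages its default thread limiter.
    try:
        return _exit_limiter.get()
    except LookupError:
        limiter = CapacityLimiter(1)
        _exit_limiter.set(limiter)
        return limiter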

@tiangolo
Owner

tiangolo commented Sep 4, 2022

@adriangb

I think anyio.fail_after() should do the trick, right? But if we were to put this in a test, it might be good to figure out how to reproduce it without SQLAlchemy.

Totally agree with that! Not sure how, but that would be great indeed.

@Kludex

Thanks for the links and pointers! (also on Discord)

Copying here for completeness, the issue about CapacityLimiters in Starlette: encode/starlette#1724

The snippet to update the token count of the default CapacityLimiter is in encode/starlette#1724 (comment), and a version for FastAPI is in: https://gist.github.com/Kludex/f889f02bb81adf3a8d63176ad4d8bcb1
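
For completeness, I believe the kind of snippet those links describe boils down to raising the token count of the default threadpool limiter at startup, something like this (a sketch; 100 is an arbitrary value and the exact snippet in the gist may differ):

import anyio
from fastapi import FastAPI

app = FastAPI()

@app.on_event("startup")
async def adjust_default_thread_limiter() -> None:
    # The default AnyIO threadpool limiter has 40 tokens; raise it so more
    # sync dependencies and endpoints can run in threads at the same time.
    anyio.to_thread.current_default_thread_limiter().total_tokens = 100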


In conclusion, for now I'll merge this with the local capacity limiter. Maybe at some point Starlette or FastAPI can have a way to customize different capacity limiters for different things (not sure, I haven't thought much about it) and then this could be set there.

And about the option of using a RunVar (a sort of pseudo-contextvar) for a per-event-loop capacity limiter: for now I'll leave the local variable, as it's simpler and we can be more certain the implementation works well. The RunVar has some cost involved in dealing with these pseudo-contextvars, which might offset the cost of creating the local object in the function, and the implementation would certainly be more complex. So I'd rather keep this one, to be a bit more certain, and avoid the risk of introducing other complex errors (contextvars are my nemesis 😂).

Thanks again you both for all the help! 🙇 🍰

@nickswiss

This has been closed, but I am not really sure about the final solution for running synchronous SQLAlchemy with FastAPI. In my case, I am using FastAPI with SQLAlchemy and Oracle, which does not have a trusted async library.
If it is possible to use SQLAlchemy synchronously, what is the correct approach?

@tiangolo
Owner

@nickswiss I'm not sure I understand the question, or whether there's something you're not able to do. If you still have questions or problems, it's probably better to ask in a new GitHub issue; the template will guide you to structure the question in a way that will make it easier to help you, with an example I can run, etc.
