Using dependency injection to get SQLAlchemy session can lead to deadlock #3205

Closed · 9 tasks done
Blue9 opened this issue May 10, 2021 · 23 comments

@Blue9
Blue9 commented May 10, 2021

First check

  • I added a very descriptive title to this issue.
  • I used the GitHub search to find a similar issue and didn't find it.
  • I searched the FastAPI documentation, with the integrated search.
  • I already searched in Google "How to X in FastAPI" and didn't find any information.
  • I already read and followed all the tutorials in the docs and didn't find an answer.
  • I already checked if it is not related to FastAPI but to Pydantic.
  • I already checked if it is not related to FastAPI but to Swagger UI.
  • I already checked if it is not related to FastAPI but to ReDoc.
  • After submitting this, I commit to one of:
    • Read open issues with questions until I find 2 issues where I can help someone and add a comment to help there.
    • I already hit the "watch" button in this repository to receive notifications and I commit to help at least 2 people that ask questions in the future.
    • Implement a Pull Request for a confirmed bug.

I've noticed that when using dependency injection with SQLAlchemy, a large number of concurrent requests can leave the app in a deadlocked state. This is especially noticeable with a small SQLAlchemy connection pool size (relative to the FastAPI thread pool size). Below is a self-contained example (you might have to tweak the pool size and the request body's sleep length but this should be a good starting point).

app.py:
"""
Setup: pip install fastapi sqlalchemy uvicorn
Run: python app.py
"""
import time
import uvicorn

from fastapi import Depends, FastAPI, Request
from sqlalchemy import create_engine
from sqlalchemy.orm import Session, sessionmaker
from sqlalchemy.pool import QueuePool


# SQLAlchemy setup
engine = create_engine(
    'sqlite:///test.db',
    connect_args={'check_same_thread': False},
    poolclass=QueuePool,
    pool_size=4,
    max_overflow=0,
    pool_timeout=None,  # Wait forever for a connection
)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)

# FastAPI
app = FastAPI()

def get_db(request: Request):
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

@app.get('/')
def index(db: Session = Depends(get_db)):
    # Some blocking work
    _ = db.execute('select 1')
    time.sleep(1)
    return {'hello': 'world'}

# Run
if __name__ == '__main__':
    uvicorn.run('app:app', reload=True, host='0.0.0.0', port=80)

When running the above with 100 concurrent requests (I used locust), I noticed that only around 5 requests are served, and then the app freezes and is unable to serve any more requests. Below is the locustfile.

locustfile.py:
"""
Setup: pip install locust
Run: Save as locustfile.py and run locust in terminal. Open http://localhost:8089 and run with 100, 100, http://localhost.
"""
from locust import HttpUser, task


class User(HttpUser):
    @task
    def index(self):
        self.client.get("/")

I suspect the following is happening. (Note that SessionLocal() is lazy, so db = SessionLocal() will return immediately even if no connections are available.)

  1. The first N requests come in (where N >= thread pool size). Their get_db dependencies run and yield, and we start executing their path operation functions. At this point, the entire thread pool is full. Only pool_size (4) requests are able to get a connection, and the remaining requests wait (in their path operation functions).
  2. The path operation functions that were able to get a connection return, opening up pool_size (4) spots in the thread pool. Because dependencies and requests run in separate threads, the finally blocks for these requests' get_db dependencies have not run yet, so the connections for these requests have not returned to the SQLAlchemy pool.
  3. More requests come in, and like step 1, their get_db dependencies run, and we start executing their path operation functions. No connections have returned to the SQLAlchemy pool, so these requests wait. At this point, the entire thread pool is full, and every thread is waiting for a connection.
  4. For the requests that finished in step 2, we try to schedule the finally blocks for their get_db dependencies in a new thread so we can free the connections, but all of the threads are busy, so we end up waiting.
  5. None of the threads will ever finish because they are waiting for a connection, and no connections will be released because the thread pool is full, leaving the app in a deadlocked state.

This doesn't really seem like a bug in FastAPI or in SQLAlchemy, but it suggests that we should not use dependency injection like this when using synchronous database libraries. The only workaround I've found for this is to use a context manager to handle the session in the endpoint itself instead of injecting the database session directly.

Another thing I've noticed is that changing get_db to be an async function prevents deadlock (as does using the middleware approach), but only if the endpoint does not have a response_model. If it has a response_model then the app will still lock up. I believe this is because if response_model is defined, then when we run serialize_response, field will be non-None, and we will attempt to run field.validate in a separate thread. If the thread pool is full with requests waiting for connections, we won't be able to serialize the response and won't be able to close the database connection. Maybe we could serialize the response in the same thread as the path operation function; I'm not sure what the benefit of serializing in a separate thread is.
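For reference, the async variant mentioned here is just the same generator dependency declared as a coroutine, so both its body and its finally block run on the event loop instead of in the threadpool (a sketch based on the get_db above; note that db.execute in the endpoint is still a blocking call):

async def get_db(request: Request):
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()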

There is similar discussion in tiangolo/full-stack-fastapi-template#104 and many others came to the conclusion that using a context manager is the right approach, but nothing really came of it. If others can validate that my suspicion is correct, then maybe we should change the docs to recommend using a context manager within the endpoint itself until a better solution is available.

Environment

  • OS: macOS Big Sur (11.3)
  • FastAPI Version: 0.63.0
  • SQLAlchemy Version: 1.4.13
  • Python version: 3.9.4
@magic-element-k

I've run into this problem as well. Requests end up blocked in 100% of cases: regardless of request frequency, after about 7-10 requests everything locks up, and it stays that way until the app is restarted. I've tried it both in an IDE on Windows 10 and in Docker, on Python 3.8 and 3.9; nothing changes.

@magic-element-k

I solved this problem by using this method - tiangolo/full-stack-fastapi-template#104 (comment).

@Blue9
Author

Blue9 commented May 16, 2021

I'm not sure how background tasks are scheduled, but it doesn't look like that would always prevent the above situation. For example, running the above app using that solution still leads to deadlock with 100 concurrent users.

Another possible solution is to keep get_db() as-is, and change all endpoints to something like:

@app.get('/')
def index(db: Session = Depends(get_db)):
    with close_at_end(db) as db:
        data = db.query(...)...
        return {"data": data}

where close_at_end is a simple context manager that yields db and closes it after. This doesn't account for sub-dependencies and is a little tedious, but it can work as a temporary workaround.
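For reference, a minimal sketch of that close_at_end helper, built with contextlib:

from contextlib import contextmanager

@contextmanager
def close_at_end(db: Session):
    # Yield the injected session, then close it in the same thread
    # that ran the path operation body.
    try:
        yield db
    finally:
        db.close()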

It seems like if you want to keep using dependencies, the real solution is to migrate to an async database library, and if you're already using SQLAlchemy the switch should be relatively easy with version 1.4.

@mgtwf

mgtwf commented Jun 3, 2021

Hi everyone,

During a locust test of my FastAPI APIs, running 1000 users with a spawn rate of 10 users, after a while I hit a QueuePool limit error.
First I increased my create_engine configuration (my SQL database allows 75 concurrent connections):
engine = create_engine(SQLALCHEMY_DATABASE_URL, pool_recycle=3600, pool_size=16, max_overflow=48)
But the problem persisted. After a lot of investigation, I found this topic: tiangolo/full-stack-fastapi-template#104 (comment) (the same one as @Blue9). I simply removed the Depends, as recommended, in favor of the following session manager:

from contextlib import contextmanager

@contextmanager
def get_db():
    session = SessionLocal()
    try:
        yield session
    finally:
        session.close()

@app.get("/users/get-mail/{mail}", response_model=schemas.User)
def check_user_by_email(mail: str):
    with get_db() as db:
        db_user = crud.get_user_by_email(db, email=mail)
        return db_user
I think there is a critical problem with Depends that @tiangolo needs to investigate, because the documentation advises using Depends, yet Depends is not ready for production use! Thanks to the community for the session manager solution that saved my production deployment!

@brantley44

We're seeing this issue too.

I'm working on tracking down the bug, but @Blue9 seems to be correct: the locus of the problem is the operation code being run in a different thread than the dependency.

Temporary workarounds are to use a context manager, or to use an async function with sqlalchemy.ext.asyncio; either one ensures it all happens in the same thread, so you do not have to do both.

I would highly suggest to the FastAPI maintainers that they add a note in the documentation and on https://github.com/tiangolo/full-stack-fastapi-postgresql that there is a potential issue, as right now it is not production ready for a load greater than 30 or so concurrent users.

@BradleyKirton

We have this same issue when making use of a resource pool which acquires a resource in a generator.

The example below illustrates the issue; it can be run using Apache Bench to simulate some activity.

ab -n 100 -c 20 http://localhost:8000/test
import itertools
import asyncio
from concurrent.futures import ThreadPoolExecutor
from queue import Queue
from threading import currentThread

from fastapi.applications import FastAPI
from fastapi import Depends, Request


app = FastAPI()
pool: Queue[int] = Queue()


counter = itertools.count()


def get_data():
    trace_id = next(counter)
    ct = currentThread()
    print(f"get trace-id-{trace_id}", ct.name, ct.native_id)

    value = pool.get()
    yield value

    print(f"got trace-id-{trace_id}", ct.name, ct.native_id)
    pool.put_nowait(value)
    print(f"release trace-id-{trace_id}", ct.name, ct.native_id)


@app.get("/test")
def index(
    req: Request,
    data: dict[str, str] = Depends(get_data),
) -> dict:
    return data


@app.on_event("startup")
def on_startup() -> None:
    pool.put_nowait(1)
    pool.put_nowait(2)

    loop = asyncio.get_running_loop()
    executor = ThreadPoolExecutor(1)
    loop.set_default_executor(executor)

From the output in the logs it seems like the same thread is acquiring resources from the pool and creating a deadlock. This seems to be caused by the cm = contextmanager_in_threadpool(contextmanager(call)(**sub_values)) call in solve_generator. If we decorate our dependency function with @contextlib.contextmanager the issue does not occur.

get - resource trace-0 ThreadPoolExecutor-0_0 13479
got - resource trace-0 ThreadPoolExecutor-0_0
release - resource trace-0 ThreadPoolExecutor-0_0 13479
get - resource trace-1 ThreadPoolExecutor-0_0 13479
get - resource trace-2 ThreadPoolExecutor-0_1 13480
got - resource trace-1 ThreadPoolExecutor-0_0
got - resource trace-2 ThreadPoolExecutor-0_1
get - resource trace-3 ThreadPoolExecutor-0_0 13479
get - resource trace-4 ThreadPoolExecutor-0_1 13480

@alexiri
Contributor

alexiri commented Nov 6, 2021

I've just been bitten by this issue as well. I had SQLAlchemy configured with a pool of 10 and no overflow, and with 11 simultaneous connections the FastAPI server deadlocks completely and doesn't serve any requests.

I removed the DB dependency injection and started using the context-manager method, and now it works as it should: 10 requests are served immediately and the 11th gets handled afterwards.

@tiangolo, could you clarify why dependency injection doesn't work in this case? Could you fix the examples in the documentation so more users aren't led astray?

@simondale00
Contributor

simondale00 commented Nov 19, 2021

Hey all. I've been digging into this as well and I don't think this is a FastAPI issue per se. There are a few things going on that seem to be leading to the issue.

tl;dr

Dependencies and path operations defined as plain functions (def, not async def) are run in an anyio threadpool. db.execute is a blocking call, causing all the workers in the pool to block in the path operation function. This prevents dependency generators from invoking their finally block, thereby preventing SQLAlchemy connections from being released.

Workarounds?

The suggested workaround to use a context manager within your path operation is by far the easiest solution. You're scoping the SQLAlchemy session lifecycle within the same coroutine, guaranteeing that connections can be acquired and released within that coroutine. This prevents the resource contention described in the "deep dive" section.

While I haven't tried it yet.... It may be worth using the SQLAlchemy 1.4 asyncio beta. Part of the core issue here is related to db.execute being a blocking call. If SQLAlchemy natively supports async operations, it may resolve the issue.
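For anyone exploring that route, here is a minimal sketch of an async session dependency on the SQLAlchemy 1.4 asyncio extension (the asyncpg URL and the names async_engine/get_async_db are illustrative assumptions, not code from this thread):

from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine
from sqlalchemy.orm import sessionmaker

async_engine = create_async_engine('postgresql+asyncpg://user:pass@localhost/test')
AsyncSessionLocal = sessionmaker(async_engine, class_=AsyncSession, expire_on_commit=False)

async def get_async_db():
    # Both the yield and the cleanup run on the event loop, never in the
    # anyio threadpool, so cleanup cannot be starved of a worker thread.
    async with AsyncSessionLocal() as session:
        yield session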

Deep dive

def get_db(request: Request):
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

^ First, it should be noted that db = SessionLocal() is a non-blocking operation. Prior to the deadlock, all N requests will be able to create a session object and yield it to the path operation.

@app.get('/')
def index(db: Session = Depends(get_db)):
    # Some blocking work
    _ = db.execute('select 1')

^ db.execute is a blocking operation. Per the SQLAlchemy docs, the session requests a connection from the connection pool once queries are issued.

In the example code above, the SQLAlchemy connection pool is of size 4. This means that 4 requests will be able to check out a connection, while (N-4) connections will block on this line waiting for a connection to become available. (Side note, it's good practice to define connection timeouts on your engine object to avoid waiting interminably for a connection. Though that alone won't solve the problem).
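For example, a sketch of the engine from the original report with a finite timeout (the 30-second value is illustrative):

engine = create_engine(
    'sqlite:///test.db',
    connect_args={'check_same_thread': False},
    pool_size=4,
    max_overflow=0,
    pool_timeout=30,  # raise sqlalchemy.exc.TimeoutError after 30s instead of waiting forever
)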

def get_db(request: Request):
def index(db: Session = Depends(get_db)):

^ Also keep in mind that neither the dependency function nor the path operation are defined as coroutines. Both FastAPI and Starlette seem to want everything to run asynchronously, so these functions are invoked in a threadpool. See Starlette code, FastAPI doc 1, and FastAPI doc 2 for more information.

Here is where things break. anyio manages its own threadpool of ~40 workers, each with its own job queue. There are (N-4) workers blocked waiting for a SQLAlchemy connection to be relinquished. The 4 path operations that secured a connection can complete. FastAPI will then attempt to invoke the __exit__ method of your dependency generator to clean up the session (which implicitly checks the connection back into the connection pool). However, the dependency generator is not a coroutine, so it's passed off to the threadpool.

And here is our deadlock. All anyio worker threads are blocked waiting for SQLAlchemy connections. The code to release the connections is blocked waiting for a worker to become available.

But what about native coroutines?

While the root issue with the example code is related to the use of anyio thread pools, we have observed the deadlock when using native coroutines in both path operations and dependency generators. This case is a little more straightforward... (Note, the FastAPI documentation already recommends against this pattern)

@app.get('/')
async def index(db: Session = Depends(get_db)):
    # Some blocking work
    _ = db.execute('select 1')

^ As noted in the earlier section, db.execute is a blocking operation, as it implicitly requests a connection from the SQLAlchemy connection pool. However, the connection can't be retrieved because the connection pool has been exhausted. SQLAlchemy blocks waiting for a connection to become available.

Herein lies the issue. The event loop doesn't have the power to interrupt this coroutine; there is no await expression indicating that the coroutine can be interrupted to yield control back to the event loop. The event loop is now fully blocked.

I'd still argue this isn't really a FastAPI issue. The coroutine is executing a blocking call; whether it's SQLAlchemy or another third-party library, blocking calls pose this risk.

However, I would argue the FastAPI documentation needs to be updated. The "Dependencies with yield" documentation and the "FastAPI & SQLAlchemy" example can both lead to deadlocks.

@sorasful

I've read a lot of issues about this problem, but there's one thing I can't really figure out yet.

We're now using get_db as a context manager, as recommended by a lot of people, and it seems to work much better than using Depends. The second problem is when you have a dependency like get_current_active_user. I've tried to refactor it like this:

async def get_current_user(token: str = Depends(oauth2_scheme)) -> User:
    with get_session() as db:
        decoded_token = jwt.decode(token, SECRET_KEY, [ALGORITHM])
        user = get_user(decoded_token['login'], db)
        if not user:
            raise HTTPException(400, 'User in token does not exist')
        return user

And my route looks like this (classic):

@playlists_router.get('/me', response_model=List[PlaylistShow], tags=['playlists'])
def get_my_playlists(current_user: User = Depends(get_current_user)):
    with get_session() as db:
        playlists = current_user.playlists
        for p in playlists:
            p.new_entries_count = len(get_new_entries_for_user(current_user.id, p.id, db))

        return playlists

I get this error:
sqlalchemy.orm.exc.DetachedInstanceError: Parent instance <User at 0x1fa8d5e2d90> is not bound to a Session; lazy load operation of attribute 'playlists' cannot proceed (Background on this error at: https://sqlalche.me/e/14/bhk3)

I've read that maybe we should do something with pydantic and the .from_orm method, but what if I need to update the current user or access its attributes? I don't want to have to re-fetch the user from the DB inside the new context manager.

Thanks for your time !

@OpetherMB

(Quoting @sorasful's comment above in full.)

Did you find a solution to this problem?

@sorasful

sorasful commented Mar 9, 2022

@OpetherMB Unfortunately no, I decided to switch to Ormar which is an async ORM. It works very nicely now, but it required some refactoring.

@happy-toro

I've noticed that if we declare the dependency function get_session as an async function, the connection pool issue is gone!

So in short:

  • no issue with SQLAlchemy async mode (with asyncpg) and FastAPI with async def get_session()
  • the connection pool runs out with SQLAlchemy sync mode (with psycopg2) and FastAPI with def get_session(); the issue could be resolved by declaring async def get_session()

@harsh8398

@happy-toro Declaring get_session() as async with SQLAlchemy sync mode did not change the behaviour for me.

Also if it helps, I have compiled results from a list of load tests I performed with different implementations for get_db() with SQLAlchemy sync mode here: https://github.com/harsh8398/fastapi-dependency-issue.

@JarroVGIT
Contributor

That is some comprehensive testing right there, kudos!

Indeed, changing your dependency function into a coroutine itself (using async def) won't fix it: the DB call later on will still block while waiting for a free connection, which will never become available because the event loop is blocked by that same DB call. Someone explained it above in a deep dive; a good read!
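As an aside, one way to keep a blocking driver call off the event loop inside a coroutine is to hand it to a worker thread explicitly. Here is a sketch using anyio (which FastAPI itself builds on; the route name is illustrative); note this still consumes a threadpool token, so it does not remove the starvation risk described above:

import anyio.to_thread
from sqlalchemy import text

@app.get('/offloaded')
async def offloaded(db: Session = Depends(get_db)):
    # Run the blocking call in a worker thread so the event loop stays free.
    result = await anyio.to_thread.run_sync(db.execute, text('select 1'))
    return {'hello': 'world'}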

@adriangb
Contributor

adriangb commented Jul 7, 2022

Can you folks try this "hack" (let's be clear: I am not recommending this as a solution) and see if it solves or is related to your issues?

import anyio
from fastapi import FastAPI

app = FastAPI()

@app.on_event("startup")
def startup():
    # Raise the anyio threadpool capacity far above the default (~40 tokens).
    limiter = anyio.to_thread.current_default_thread_limiter()
    limiter.total_tokens = 4096

@odiseo0

odiseo0 commented Jul 11, 2022

I'll try @adriangb

@wcmoosa

wcmoosa commented Jul 17, 2022

(Quoting @harsh8398's comment above.)

I've also been facing this issue on a production server. This seems like a major problem, as FastAPI dependencies with sync database connections are currently causing deadlocks. @tiangolo

@wcmoosa

wcmoosa commented Aug 27, 2022

@adriangb Thanks for the snippet. The limiter increase does seem to help in the short term, although 4096 is quite high and not feasible for some production systems.

Is there any way to get this PR more attention for review? Your concurrency fixes seem to be a good way to resolve this.

@adriangb
Contributor

Comment on it saying it fixed your issue (if you tested it and it did).

@wcmoosa

wcmoosa commented Aug 28, 2022

@adriangb Thanks for the speedy response, it's very much appreciated.

I tested this open PR with locust, 5 endpoints, and 40 concurrent users, and I'm still hitting the QueuePool error:

Engine

engine = create_engine(SQLALCHEMY_DATABASE_URL, pool_recycle=600)

SessionLocal = sessionmaker(bind=engine)

deps:

from typing import Generator

def get_db() -> Generator:
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

Errors:

greenfields-api |   File "/usr/local/lib/python3.9/site-packages/h11/_state.py", line 251, in _fire_event_triggered_transitions
greenfields-api |     raise LocalProtocolError(
greenfields-api | h11._util.LocalProtocolError: can't handle event type ConnectionClosed when role=SERVER and state=SEND_RESPONSE
greenfields-api | ERROR:sqlalchemy.pool.impl.QueuePool:Exception during reset or similar
greenfields-api | Traceback (most recent call last):
greenfields-api |   File "/usr/src/app/./app/api/dependencies.py", line 17, in get_db
greenfields-api |     yield db
greenfields-api | GeneratorExit

    raise exc.TimeoutError(
sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30.00
(Background on this error at: https://sqlalche.me/e/14/3o7r)

Uvicorn command:
uvicorn app.main:app --reload --workers 1 --host 0.0.0.0 --port 8000

Main Packages:

@adriangb
Contributor

I'm not sure what's going on with your test. Can you post the whole thing publicly in a reproducible format? These sorts of concurrency things are tricky, sometimes there's even more than one issue at play. For example, if you have a lot of clients and/or slow queries you'll still time out waiting for a database connection (even if there is no deadlock).

@tiangolo
Owner

tiangolo commented Sep 4, 2022

Thanks for the report @Blue9 and thanks everyone for the discussion! ☕

This was most probably solved by @adriangb in #5122, released as part of FastAPI 0.82.0 🎉

If this solves it for you, you could close the issue. 🤓

@github-actions
Contributor

Assuming the original need was handled, this will be automatically closed now. But feel free to add more comments or create new issues or PRs.

@tiangolo tiangolo reopened this Feb 27, 2023
Repository owner locked and limited conversation to collaborators Feb 27, 2023
@tiangolo tiangolo converted this issue into discussion #6628 Feb 27, 2023
