Requests randomly fail to receive data #3137

Closed
TimOrme opened this issue Apr 28, 2021 · 9 comments
Labels: question (Question or problem), question-migrate

Comments


TimOrme commented Apr 28, 2021

Example

I'm having a very hard time reproducing this issue consistently, and I've exhausted all other avenues I could think of. I'll do my best to describe the setup here, but unfortunately I couldn't come up with a code sample that reproduces it consistently.

The gist of it is this, however:

import functools
from fastapi import FastAPI, Depends, HTTPException
from pydantic import BaseModel
from sqlalchemy import create_engine
from sqlalchemy.engine import Engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import Session, sessionmaker
import ddtrace

app = FastAPI()

@functools.lru_cache(maxsize=1)
def get_database_engine() -> Engine:
    """Retrieve the database engine.  There should only be one of these per application."""
    engine = create_engine("postgres://")
    return engine


@functools.lru_cache(maxsize=1)
def get_database_session(engine: Engine = Depends(get_database_engine)) -> sessionmaker:
    """Retrieve the database session maker.  There should only be one of these per application."""
    return sessionmaker(autocommit=True, autoflush=False, bind=engine)


def get_db(session_local: sessionmaker = Depends(get_database_session)):
    """Get a database connection."""
    with ddtrace.tracer.trace("db_connection_acquire"):
        db = session_local()
        try:
            yield db
        finally:
            db.close()

class MappingRequest(BaseModel):
    """Request class to create new mappings."""

    original_id: str
    new_id: str

@app.post("/mapping")
def create_mapping(
    upload: MappingRequest, db: Session = Depends(get_db)
):
    """Create a new dicom deid mapping for a series."""
    with ddtrace.tracer.trace("mapping_start"):
        try:
            with db.begin():
                with ddtrace.tracer.trace("mapping_transaction"):
                    mapping = crud.create_mapping(
                        db, upload.original_id, upload.new_id
                    )
                    return {"mapping_id": mapping.id}
        except IntegrityError:
            raise HTTPException(status_code=409, detail="Mapping exists")

We have a simple sync endpoint which inserts keys into the database. The DB operation is very quick. We've also added span traces with Datadog to debug the behavior.

Description

Under moderate load, about 10 requests per second, the endpoint mostly responds very fast, in the 10ms range. However, sometimes there are extreme outliers in response time, where a request runs for 3 minutes before the connection is killed by our ALB.

Traces during the long operation show that the route code is never hit, nor are the dependencies. This seems to indicate that something within FastAPI is failing to properly schedule or submit the sync jobs into the thread pool. [URL details obfuscated to remove company info]

[screenshot: trace of a slow request]

Compared with normal requests:

[screenshot: trace of a normal request]

This seems to be a similar issue to the one described here: https://stackoverflow.com/questions/61466243/uvicorn-not-processing-some-requests-randamly

And similar behavior here: #1195, but we are not on Windows.
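
For context on the thread-pool point above: sync (def) path operations are run on a worker thread. The sketch below is a simplified illustration using Starlette's run_in_threadpool, not FastAPI's actual internals; the point is just that if this dispatch (or the request-body read that precedes it) stalls, the route function is never reached.

from starlette.concurrency import run_in_threadpool


async def dispatch_sync_endpoint(func, **params):
    """Illustrative only: roughly how a sync route function ends up on a
    worker thread. If scheduling here (or the body read before it) hangs,
    the route body never executes."""
    return await run_in_threadpool(func, **params)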

Environment

  • OS: Docker with tiangolo/uvicorn-gunicorn-fastapi-docker:python:3.8; the host is Linux.
  • FastAPI version: 0.63.0
  • Python version: 3.8.2
  • Gunicorn version: 20.1.0
  • Uvicorn version: 0.13.4

Note that gunicorn and uvicorn were upgraded manually in the Dockerfile in an attempt to resolve the issue. However, the default versions shipped with the image exhibited the same behavior.

Additional context

We've tried a number of additional ways to reproduce this, but have been unsuccessful. When testing locally, even when introducing 10x the load, we can't reproduce the issue.

I should also note this is running in AWS ECS, behind an ALB. We have toyed with the timeout settings in gunicorn and uvicorn to try to address this as well, but none of them seem to solve it.
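
For illustration, the gunicorn-level knobs in question are typically set in a Python config module along these lines (example values only, not the exact configuration used here):

# gunicorn_conf.py -- illustrative values only, not the settings used in this report
worker_class = "uvicorn.workers.UvicornWorker"
timeout = 120          # kill a worker that stays silent for this many seconds
graceful_timeout = 30  # time allowed for a graceful worker shutdown
keepalive = 5          # seconds to hold an idle keep-alive connection open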

TimOrme added the question label Apr 28, 2021

TimOrme commented Apr 30, 2021

Some additional details here. We added some middleware to see if we could further debug where it was hanging:

"""Custom asgi middleware."""
import ddtrace


class TraceSendReceiveAsgiMiddleware:
    """Middleware to trace each message sent and received in a connection."""

    def __init__(self, app):
        """Initialize the middleware."""
        self.app = app

    async def __call__(self, scope, receive, send):
        """Invoke the middleware."""

        async def wrapped_receive():
            with ddtrace.tracer.trace("receive"):
                return await receive()

        async def wrapped_send(message):
            with ddtrace.tracer.trace("send"):
                return await send(message)

        return await self.app(scope, wrapped_receive, wrapped_send)

With this in place, we see that in these cases it does seem to be stuck in the await receive() call.

[screenshot: trace stuck in receive]
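
For background on why a handler can sit in await receive(): the ASGI server delivers the request body as a series of http.request messages, and the application keeps awaiting receive() until it sees a message with more_body set to False (or an http.disconnect). A minimal sketch of that loop, written against the ASGI spec rather than Starlette's actual implementation:

async def read_body(receive) -> bytes:
    """Drain an ASGI request body; each iteration blocks in receive()
    until the server passes along the next chunk or a disconnect."""
    body = b""
    while True:
        message = await receive()
        if message["type"] == "http.request":
            body += message.get("body", b"")
            if not message.get("more_body", False):
                break
        elif message["type"] == "http.disconnect":
            break
    return body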

I also attempted to convert fully to asynchronous endpoints: I upgraded to SQLAlchemy 1.4, leveraged its new async support, and made the endpoints themselves async. However, I see the same behavior even with async endpoints, so I'll update the ticket title to reflect this; it doesn't seem isolated to sync endpoints.

I should also note that the current client for this is using httpx. I initially thought that perhaps the client was simply slow, but I ensured that client timeouts are set and verified that they are at 5 seconds. It would appear, then, that the server is stuck waiting for data even after the client has presumably dropped the request.
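
For reference, the client-side timeouts mentioned above were set roughly like this (an illustrative sketch; the real client code is not part of this issue):

import httpx

# Illustrative only: a long-lived client with keep-alive connections and a
# 5-second timeout on all operations, matching the setup described above.
client = httpx.AsyncClient(timeout=httpx.Timeout(5.0))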

TimOrme changed the title from "Requests randomly fail to process with sync endpoints" to "Requests randomly fail to receive data" Apr 30, 2021

TimOrme commented Apr 30, 2021

I also tried to take gunicorn/uvicorn out of the equation and replaced them with hypercorn. The behavior is similar, though with some interesting differences. First, it didn't fail at the 3-minute mark, but instead took over 11 minutes to complete, presumably because hypercorn doesn't axe the request on connection close.

[screenshot: hypercorn trace of a slow request]

It also doesn't return a response code, however, and I actually see an error in the logs, which may be helpful:


ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending coro=<worker_serve.<locals>._server_callback() running at /usr/local/lib/python3.7/site-packages/hypercorn/asyncio/run.py:103> wait_for=<_GatheringFuture pending cb=[<TaskWakeupMethWrapper object at 0x7fe568153550>()]>>
Exception ignored in: <coroutine object worker_serve.<locals>._server_callback at 0x7fe5683c33b0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/hypercorn/asyncio/run.py", line 103, in _server_callback
    await TCPServer(app, loop, config, reader, writer)
RuntimeError: coroutine ignored GeneratorExit
ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending coro=<_handle() running at /usr/local/lib/python3.7/site-packages/hypercorn/asyncio/context.py:39> wait_for=<Future cancelled> cb=[gather.<locals>._done_callback() at /usr/local/lib/python3.7/asyncio/tasks.py:691]>
Exception ignored in: <coroutine object _handle at 0x7fe56822bf80>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/hypercorn/asyncio/context.py", line 45, in _handle
    await send(None)
  File "/usr/local/lib/python3.7/site-packages/hypercorn/protocol/http_stream.py", line 104, in app_send
    await self._send_error_response(500)
  File "/usr/local/lib/python3.7/site-packages/hypercorn/protocol/http_stream.py", line 168, in _send_error_response
    status_code=status_code,
  File "/usr/local/lib/python3.7/site-packages/hypercorn/protocol/h11.py", line 117, in stream_send
    status_code=event.status_code,
  File "/usr/local/lib/python3.7/site-packages/hypercorn/protocol/h11.py", line 229, in _send_h11_event
    await self.send(RawData(data=data))
  File "/usr/local/lib/python3.7/site-packages/hypercorn/asyncio/tcp_server.py", line 89, in protocol_send
    self.writer.write(event.data)
  File "/usr/local/lib/python3.7/asyncio/streams.py", line 305, in write
    self._transport.write(data)
  File "uvloop/handles/stream.pyx", line 676, in uvloop.loop.UVStream.write
RuntimeError: Cannot call write() after write_eof()
ERROR:asyncio:Task exception was never retrieved
future: <Task finished coro=<<async_generator_athrow without __name__>()> exception=RuntimeError('cannot reuse already awaited coroutine')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/starlette/requests.py", line 204, in stream
    message = await self._receive()
RuntimeError: cannot reuse already awaited coroutine


xpayn commented May 6, 2021

I'm having a similar issue.
It gets stuck in a middleware when calling down the middleware stack, but it never enters the route handler.
My environment is an AWS Lambda function called from API Gateway, so for the moment I've used logs to narrow down the problem.
I'll try to reproduce it on my machine and use pdb to see where it's stuck.
It feels like some kind of deadlock.


TimOrme commented May 6, 2021

I think I've figured this out, though I'm not sure if this is really a problem with FastAPI or with httpx, which is the client I'm using. It appears that if an httpx client with keep-alive connections hits a timeout error, it doesn't complete the request. As such, FastAPI just waits patiently for more data to come in, even though the client request is dead. FastAPI doesn't notice that the client request is done until the connection itself is closed.

In short, the requests themselves aren't actually taking this long; it's just that the client has bailed and FastAPI keeps waiting.

I have a (mostly) reproducible example now:

Dockerfile:

FROM tiangolo/uvicorn-gunicorn-fastapi:python3.7

COPY ./app /app

/app/main.py:

from fastapi import FastAPI
from pydantic import BaseModel
import time



class TimingMiddleware:
    """Middleware to track request time."""

    def __init__(self, app):
        """Initialize the middleware."""
        self.app = app

    async def __call__(self, scope, receive, send):
        """Invoke the middleware."""

        async def wrapped_receive():
            start = time.time()
            data = await receive()
            end = time.time()
            print(f"RECEIVED IN {end-start}")
            return data

        return await self.app(scope, wrapped_receive, send)


app = FastAPI()
app.add_middleware(TimingMiddleware)


class Item(BaseModel):
    data: str


@app.post("/endpoint")
def endpoint(item: Item):
    return {"Hello": "World"}

test_client.py:

import asyncio
import httpx
import time


async def make_request(client):
    # post a big chunk of data
    await client.post("http://localhost/endpoint", json={"data": "A" * 10000000})
    return "XXX"


async def main():
    async with httpx.AsyncClient(timeout=httpx.Timeout(read=5, connect=5, pool=5, write=0.00000000000000000000000000000000000000001)) as client:
        tasks = []
        for x in range(5):
            tasks.append(asyncio.create_task(make_request(client)))

        # Gather all tasks
        done = await asyncio.gather(*tasks, return_exceptions=True)
        print(done)  # Should always print WriteTimeout.

        # Just sleep for 10 seconds.  Connections won't be released until the end of this.
        time.sleep(10)

asyncio.run(main())

Start the FastAPI server:

docker build -t timeout .
docker run --rm --name timeout -p 80:80 timeout

Run the test client:

python test_client.py

You'll observe that test_client.py immediately issues timeouts:

%python test_client.py
[WriteTimeout(''), WriteTimeout(''), WriteTimeout(''), WriteTimeout(''), WriteTimeout('')]

But FastAPI doesn't register any output until 10 seconds later, when the client closes the connections. If you extend the wait time to 3 minutes, the same behavior occurs.

I'm not sure what the expected behavior really should be. It does seem that httpx could potentially be better about notifying the server that nothing else is coming, but it also seems that something is awry with how keep-alives are working with gunicorn/uvicorn/FastAPI. Ideally, the server should be configurable to time out more proactively on these dead connections.

A simple workaround, in my case at least, was to avoid keeping long-lived httpx clients and instead create a new client and connection for each request, as in the sketch below. This is probably worse from a performance perspective, but it does prevent this sort of confusing timing.
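
A sketch of that workaround (the helper name and URL here are just for illustration):

import httpx


async def post_mapping(payload: dict) -> httpx.Response:
    # Workaround sketch: open a fresh client (and therefore a fresh connection)
    # per request instead of reusing a long-lived keep-alive client.
    async with httpx.AsyncClient(timeout=httpx.Timeout(5.0)) as client:
        return await client.post("http://localhost/endpoint", json=payload)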


xpayn commented May 7, 2021

In my case I use httpx too, but to call another webservice from my FastAPI app; my clients are browsers, and the request sometimes hangs (indefinitely for Firefox, canceled after 4s by Chrome).
My runtime environment looks strangely like yours (a Lambda is a container), so there's maybe something to dig into there.
I'll try to run my code in Docker and see if it hangs with a browser.

@magic-element-k

@TimOrme I also had a problem with blocking requests. Look here for solutions to this problem:
#3205
tiangolo/full-stack-fastapi-template#104


tiangolo commented Sep 4, 2022

Thanks for the report and discussion! ☕

This was possibly solved in #5122, released as part of FastAPI 0.82.0 🎉

Could you check it? If this solves it for you, you could close the issue. 🤓


github-actions bot commented Dec 9, 2022

Assuming the original need was handled, this will be automatically closed now. But feel free to add more comments or create new issues or PRs.

github-actions bot closed this as completed Dec 9, 2022
@adarshpunj

@tiangolo as of 0.82.0 it doesn't seem to be resolved. I'm having a hard time reproducing this, since it happens randomly:

Environment

  • OS: Docker image of python 3.8 on a Linux host
  • FastAPI v0.82.0
  • uvicorn v0.20.0
  • pg8000 v1.29.4
  • nginx/1.18.0 (Ubuntu)

As @TimOrme has mentioned, this isn't a load issue: the requests never reach the server.

tiangolo reopened this Feb 27, 2023
github-actions bot removed the answered label Feb 27, 2023
Repository owner locked and limited conversation to collaborators Feb 27, 2023
tiangolo converted this issue into discussion #6806 Feb 27, 2023

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
