Replace task cancellation in BaseHTTPMiddleware with http.disconnect+recv_stream.close #1715

Merged
merged 8 commits into from Sep 24, 2022

Conversation

jhominal
Member

@jhominal jhominal commented Jun 29, 2022

There have been various issues related to the way that BaseHTTPMiddleware works, and in particular, related to the way that it uses task cancellation in order to force the downstream ASGI app to shut down.

In particular, I would like to highlight that:

  • In a series of issues related to BaseHTTPMiddleware, cancellation is used to get out of a deadlock between a call to send_stream.send that never finishes (and so blocks the downstream app from completing) and a call to recv_stream.receive that is never made, because the body_stream iterator has been orphaned by the dispatch method implementation;
  • In allow yielding a new response #1697, @florimondmanca expresses his opinion that "AFAIK there’s no way in the ASGI spec to just discard a streaming body even though more_body is True. In fact I believe the spec says servers MUST pull until that becomes False." - in other words, using task cancellation on a downstream app may not be permitted by ASGI. I would note that, as ASGI is agnostic to the async framework in use, and as e.g. native asyncio and trio have very different cancellation semantics, that may be the only practical position that ASGI can take;

I believe that this pull request outlines a solution that avoids both using task cancellation on the downstream ASGI application, and also avoids deadlocking the downstream ASGI application. It does that by replacing task cancellation with the following features:

  • Instead of triggering task cancellation on the task group that runs the downstream ASGI application, an anyio.Event app_disconnected is set, that triggers the following consequences:
    • It hooks the receive function that is passed to the downstream ASGI application so that, if app_disconnected is set, a message with type http.disconnect will be returned (instead of waiting for the next message from request.receive);
    • It closes the recv_stream, which has the effect of removing the root cause of the known deadlock issue. However, as send_stream.send raises an anyio.BrokenResourceError in that case, we need to wrap send_stream.send before passing it to the downstream ASGI application;
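
For illustration only (this is not the diff in this PR), the two hooks described above could be sketched as follows. The sketch uses asyncio.Event from the standard library in place of anyio.Event so that it runs without third-party packages; `BrokenResource` is a hypothetical stand-in for anyio.BrokenResourceError, and the helper names `wrap_receive` and `wrap_send` are invented for this example.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

Message = Dict[str, Any]
Receive = Callable[[], Awaitable[Message]]
Send = Callable[[Message], Awaitable[None]]


class BrokenResource(Exception):
    """Hypothetical stand-in for anyio.BrokenResourceError."""


def wrap_receive(receive: Receive, app_disconnected: asyncio.Event) -> Receive:
    """Build the receive callable handed to the downstream app."""

    async def receive_or_disconnect() -> Message:
        # Once the event is set, the upstream receive is no longer consulted.
        if app_disconnected.is_set():
            return {"type": "http.disconnect"}
        return await receive()

    return receive_or_disconnect


def wrap_send(stream_send: Send) -> Send:
    """Build the send callable handed to the downstream app."""

    async def send_no_error(message: Message) -> None:
        try:
            await stream_send(message)
        except BrokenResource:
            # The receiving end of the stream was closed, so nobody will
            # read this message: drop it instead of failing the app.
            return None

    return send_no_error
```

With these wrappers the downstream app is never cancelled: it either observes http.disconnect on its next receive call, or its remaining sends are silently discarded.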

What are the consequences of that proposed change?

  1. We cannot rely on task cancellation to force the downstream ASGI application to stop its processing. However, as task cancellation is a behavior that only happens when a BaseHTTPMiddleware is found in an ASGI middleware chain, and that is completely unexpected by ASGI applications (even Starlette's own response classes do not take that possibility into account), that may not be an actual loss of functionality;
  2. By sending a {"type":"http.disconnect"} message, we allow ASGI applications that listen to that event (such as Starlette's StreamingResponse) to potentially wrap up their processing once it becomes clear that their work will be discarded and start cleanup and finalization (e.g. something may be added to FileResponse);
  3. By closing the recv_stream once the response has been completely sent, we remove the root cause of the deadlock, and ensure that it cannot happen even e.g. in the face of a fully cancellation-shielded application;
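
To make point 2 concrete, here is a minimal sketch of an application that listens for http.disconnect while streaming and stops producing chunks once it arrives. It is written against plain asyncio, and the function name `stream_until_disconnect` is hypothetical; it is not how StreamingResponse is implemented, only an illustration of the same idea.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Message = Dict[str, Any]
Receive = Callable[[], Awaitable[Message]]
Send = Callable[[Message], Awaitable[None]]


async def stream_until_disconnect(
    receive: Receive, send: Send, chunks: List[bytes]
) -> bool:
    """Stream chunks; return False if a disconnect made us stop early."""
    disconnected = asyncio.Event()

    async def listen_for_disconnect() -> None:
        while True:
            message = await receive()
            if message["type"] == "http.disconnect":
                disconnected.set()
                return

    listener = asyncio.ensure_future(listen_for_disconnect())
    try:
        await send({"type": "http.response.start", "status": 200, "headers": []})
        for chunk in chunks:
            if disconnected.is_set():
                # Further output would be discarded: stop and clean up.
                return False
            await send(
                {"type": "http.response.body", "body": chunk, "more_body": True}
            )
        await send({"type": "http.response.body", "body": b"", "more_body": False})
        return True
    finally:
        # The listener is internal to this app, so stopping it is safe.
        listener.cancel()
        await asyncio.gather(listener, return_exceptions=True)
```

The application decides for itself when to stop, so its own cleanup code always runs.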

There are two larger points that concern conformance to the ASGI specification / expected behavior (however, I do not know even with whom we could raise these points):

  • Should we confirm whether task cancellation can be used (or not) to interrupt a running ASGI application?
  • Should we confirm whether the usage of {"type":"http.disconnect"}, in order to signal to the application "Your link with the HTTP client has been severed, there is no need to continue producing http.response.body messages", is semantically compatible with what ASGI specifies?

This PR is a draft because, as I did not have validation from the maintainers to propose such a change, I did not add the tests that would prove that various issues (some of them the same as those fixed by the many other PRs on the same subject) are actually fixed by my proposal. In case any maintainer expresses interest in actually trying to integrate this proposal, I will gladly work to complete this PR with the necessary tests.

This PR is not a draft anymore, and:

@jhominal jhominal force-pushed the base-http-middleware-no-cancellation branch 4 times, most recently from a287b8d to c98203e Compare June 29, 2022 23:27
@florimondmanca
Member

florimondmanca commented Jul 1, 2022

@jhominal Hey, thanks a lot for this. I think I would be very interested in seeing what the tests look like. Might help me better grok what practical situations this PR would help address.

About:

@florimondmanca expresses his opinion that "AFAIK there’s no way in the ASGI spec to just discard a streaming body even though more_body is True. In fact I believe the spec says servers MUST pull until that becomes False." - in other words, using task cancellation on a downstream app may not be permitted by ASGI.

This came from what I would call an interpretation of the spec, which reads (emphasis mine):

https://asgi.readthedocs.io/en/latest/specs/www.html#request-receive-event

more_body (bool) – Signifies if there is additional content to come (as part of a Request message). If True, the consuming application should wait until it gets a chunk with this set to False. If False, the request is complete and should be processed. Optional; if missing defaults to False.

I interpret "you should consume as long as more_body is True" as implying that "you should not interrupt the application until more_body becomes False", with the sense that "cancellation" is a form of "interruption".
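
As a concrete reading of the quoted more_body text, a consumer of http.request events would keep calling receive until it sees a chunk with more_body set to False (or missing). A minimal sketch, where the helper name `read_body` is hypothetical:

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

Message = Dict[str, Any]
Receive = Callable[[], Awaitable[Message]]


async def read_body(receive: Receive) -> bytes:
    """Collect the request body, chunk by chunk, until more_body is False."""
    body = b""
    while True:
        message = await receive()
        if message["type"] == "http.disconnect":
            raise ConnectionError("client disconnected before the body was complete")
        body += message.get("body", b"")
        # more_body is optional and defaults to False per the spec text above.
        if not message.get("more_body", False):
            return body
```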

So, as for...

  • Should we confirm whether task cancellation can be used (or not) to interrupt a running ASGI application?

My interpretation (although the spec doesn't mention any notion of "cancellation") would be, no.

As for http.disconnect, the spec reads:

https://asgi.readthedocs.io/en/latest/specs/www.html#disconnect-receive-event

Sent to the application when a HTTP connection is closed or if receive is called after a response has been sent. This is mainly useful for long-polling, where you may want to trigger cleanup code if the connection closes early.

Clearly, we are in the second situation ("if receive is called after a response has been sent"), because we now have this:

await response(scope, receive, send)
app_disconnected.set()

So the answer to...

  • Should we confirm whether the usage of {"type":"http.disconnect"}, in order to signal to the application "Your link with the HTTP client has been severed, there is no need to continue producing http.response.body messages", is semantically compatible with what ASGI specifies?

Seems to be: yes, the http.disconnect usage here seems very appropriate. Promising!

We might want to make this even clearer by renaming the app_disconnected event to response_sent.

@florimondmanca
Member

@jhominal Am I correct saying this would be an alternative to #1710?

@jhominal
Member Author

jhominal commented Jul 1, 2022

@florimondmanca Thank you for your reply! I am going to work on the test cases - at least a few of them will be similar to existing issues/pull requests.

But yes, it is my belief that this proposal has the potential to be an alternative to #1710, #1700 (which fixes the "background tasks are canceled" issue by pushing them outside of the response processing), #1699, #1441 (which was closed yesterday).

I believe it would fix at least #1438, which is the issue I raised a little time ago.

@jhominal jhominal force-pushed the base-http-middleware-no-cancellation branch 3 times, most recently from f1d1e21 to c77c5eb Compare July 2, 2022 05:21
@adriangb
Member

adriangb commented Jul 2, 2022

This does look very nice, I do think removing the cancellation is a step in the right direction.

this proposal has the potential to be an alternative to #1700 (which fixes the "background tasks are canceled" issue by pushing them outside of the response processing)

Did you identify any issues with that proposal so that we can have them in a comment for posterity / we can close that PR?

@jhominal
Member Author

jhominal commented Jul 2, 2022

I will just comment on the two issues that I believe are still open and would be directly affected by this PR:

@adriangb
Member

adriangb commented Jul 2, 2022

I tried to add a test case for this in

def test_background_tasks(test_client_factory: Callable[[ASGIApp], TestClient]) -> None:
    # test for https://github.com/encode/starlette/issues/919

    container: List[str] = []

    async def slow_task() -> None:
        container.append("started")
        # small delay to give BaseHTTPMiddleware a chance to cancel us
        # this is required to make the test fail prior to fixing the issue
        # so do not be surprised if you remove it and the test still passes
        await anyio.sleep(0.1)
        container.append("finished")

    async def dispatch(
        request: Request, call_next: Callable[[Request], Awaitable[Response]]
    ) -> Response:
        return await call_next(request)

    async def endpoint(request: Request) -> Response:
        return Response(background=BackgroundTask(slow_task))

    app = Starlette(
        routes=[Route("/", endpoint)],
        middleware=[Middleware(BaseHTTPMiddleware, dispatch=dispatch)],
    )

    client = test_client_factory(app)
    response = client.get("/")
    assert response.status_code == 200, response.content
    assert container == ["started", "finished"]

The original issue says "subsequent ones are not processed until the 10 second sleep has finished (the first request returns before then though)". I was not able to reproduce that specific outcome, but I was able to reproduce the BackgroundTask not running when there was a BaseHTTPMiddleware present. I think it is the same bug, the behavior has just changed slightly over time.

@jhominal
Member Author

jhominal commented Jul 2, 2022

@adriangb: I have looked at your test cases while implementing mine. However, after porting and reviewing a copy of the #919 test, I thought that the only substantial difference between my test cases for #1438 and #919 was that one test was expressed in pure ASGI terms, while the other was expressed using the TestClient (and relies on the fact that TestClient sends http.disconnect immediately after receiving the response, although I am of the opinion that TestClient should not be relied on for ASGI-level details). Because of that, I took out the test for #919.

As some of the issues reported on #919 were at a time when Starlette was implemented directly on top of asyncio, and given the difference that using anyio makes to the usable primitives, it is extremely likely that these issues cannot be reproduced anymore.
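
To illustrate what "expressed in pure ASGI terms" means here, a test can call the application directly with a scope and hand-written receive/send callables, so that the test (and not TestClient) decides exactly which messages the app sees. A minimal sketch; `hello_app` and `call_asgi_app` are hypothetical names and are not taken from the PR's test suite:

```python
import asyncio
from typing import Any, Dict, List

Message = Dict[str, Any]


async def hello_app(scope, receive, send) -> None:
    """Hypothetical minimal ASGI app used only for this illustration."""
    assert scope["type"] == "http"
    await send(
        {
            "type": "http.response.start",
            "status": 200,
            "headers": [(b"content-type", b"text/plain")],
        }
    )
    await send({"type": "http.response.body", "body": b"hello", "more_body": False})


async def call_asgi_app(app, scope, request_messages: List[Message]) -> List[Message]:
    """Drive an ASGI app with scripted receive messages; return what it sent."""
    scripted = list(request_messages)
    sent: List[Message] = []

    async def receive() -> Message:
        if not scripted:
            # The script is explicit: an unplanned receive is a test failure.
            raise RuntimeError("app called receive more often than scripted")
        return scripted.pop(0)

    async def send(message: Message) -> None:
        sent.append(message)

    await app(scope, receive, send)
    return sent
```

A test written this way can, for example, withhold http.disconnect entirely, which is something a TestClient-based test cannot express.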

@jhominal jhominal force-pushed the base-http-middleware-no-cancellation branch from 1722e94 to 243d2ce Compare July 2, 2022 06:37
@jhominal jhominal marked this pull request as ready for review July 2, 2022 06:39
@adriangb
Member

adriangb commented Jul 2, 2022

Hmm good point, looking at my test now I think that may be the case. I'll dig deeper tomorrow, thank you for combing through them.

@jhominal
Member Author

jhominal commented Jul 2, 2022

Hmm good point, looking at my test now I think that may be the case. I'll dig deeper tomorrow, thank you for combing through them.

I just reread your test for #1438 and I realize that actually, it has a difference (which is that the background task waits on the disconnected event). I will admit that, as I had already implemented the test for #1438 before reading yours, I did not read that test case that closely - I mostly compared my version of #1438 and my version of #919 (based on yours).

@jhominal jhominal changed the title Draft: Replace task cancellation in BaseHTTPMiddleware with http.disconnect+recv_stream.close Replace task cancellation in BaseHTTPMiddleware with http.disconnect+recv_stream.close Jul 2, 2022
@jhominal jhominal force-pushed the base-http-middleware-no-cancellation branch from 243d2ce to 137776b Compare July 2, 2022 19:42
@jhominal
Member Author

jhominal commented Jul 6, 2022

I have just added a test to check for the issue reported by @kissgyorgy on #1678 (comment) - about the way that BaseHTTPMiddleware cancellation prevents the context manager defined in the middleware from running its exit function.

Fixing this test did not require any modification to the PR's modifications of base.py.

@jhominal
Member Author

jhominal commented Jul 7, 2022

@florimondmanca As you previously expressed interest in this PR, I chose to request you as a reviewer. Please feel free to suggest anyone else who you think should also take a look at this.

@jhominal
Member Author

jhominal commented Aug 16, 2022

I agree that this PR is languishing a bit, so I have decided to request review from all encode maintainers, in the hope that more people will take a look at this.

@jhominal jhominal force-pushed the base-http-middleware-no-cancellation branch from 1062843 to 4c50738 Compare August 16, 2022 11:08
@jhominal jhominal force-pushed the base-http-middleware-no-cancellation branch from 4c50738 to 274bdeb Compare September 3, 2022 16:51
Member

@adriangb adriangb left a comment

This seems sensible to me. @Kludex and @florimondmanca I would like you two to take a look as well to see if you can spot any issues. Thank you @jhominal !

Comment on lines +38 to +46
async with anyio.create_task_group() as task_group:

    async def wrap(func: typing.Callable[[], typing.Awaitable[T]]) -> T:
        result = await func()
        task_group.cancel_scope.cancel()
        return result

    task_group.start_soon(wrap, response_sent.wait)
    message = await wrap(request.receive)
Member

What this is doing is saying "wait for a message from the client but if response_sent gets set in the meantime then stop waiting/reading from the client and move on"

Member Author

@jhominal jhominal Sep 6, 2022

Yes. The issue I want to solve here is this: if the downstream app is waiting on receive when the response is sent (likely by the middleware), the downstream app can no longer send anything meaningful, so there is no point in letting it wait for another message from upstream.

We could also choose to rely on upstream receive returning a http.disconnect message when the response is sent (which should happen), but when I wrote that bit, I thought that a belt-and-braces approach would be better.

However, that approach does mean that every call to receive from an app gets an intermediary anyio.TaskGroup for each BaseHTTPMiddleware in the middleware chain.
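
For readers who want to experiment with the "whichever finishes first" behavior discussed in this thread, here is a rough standard-library analogue. It is a sketch under assumptions, not the code under review: it uses asyncio.wait instead of an anyio task group, and the function name `receive_or_disconnect` and its signature are chosen for this example.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

Message = Dict[str, Any]
Receive = Callable[[], Awaitable[Message]]


async def receive_or_disconnect(
    receive: Receive, response_sent: asyncio.Event
) -> Message:
    """Return the next upstream message, or http.disconnect once the response is sent."""
    if response_sent.is_set():
        return {"type": "http.disconnect"}

    # Race the upstream receive against the response_sent event.
    receive_task = asyncio.ensure_future(receive())
    sent_task = asyncio.ensure_future(response_sent.wait())
    try:
        await asyncio.wait(
            {receive_task, sent_task}, return_when=asyncio.FIRST_COMPLETED
        )
    finally:
        # Stop whichever side lost the race, then wait for both to settle.
        for task in (receive_task, sent_task):
            if not task.done():
                task.cancel()
        await asyncio.gather(receive_task, sent_task, return_exceptions=True)

    if response_sent.is_set():
        return {"type": "http.disconnect"}
    return receive_task.result()
```

Note that this version creates two tasks on every call, which mirrors the per-call cost mentioned above.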

Member Author

In other words, I would be open to modifying that part to remove the receive wrapper to avoid that cost if it is thought to be too much.

@Kludex Kludex added this to the Version 0.21.0 milestone Sep 6, 2022
@Kludex Kludex mentioned this pull request Sep 6, 2022
8 tasks
@Kludex
Sponsor Member

Kludex commented Sep 6, 2022

I'll check tomorrow. I've added this to the checklist for the next release.

@jhominal
Member Author

jhominal commented Sep 6, 2022

@adriangb @Kludex Thank you very much for taking the time to look at it! (I have not been in Gitter much recently so I do not know if you have talked about this PR over there)

Sponsor Member

@Kludex Kludex left a comment

🙏

@Kludex
Sponsor Member

Kludex commented Sep 21, 2022

I'll let you merge @jhominal 👍

Thanks for this, and sorry for taking long to review.

Beautiful code. Well thought. 👍

@Kludex
Sponsor Member

Kludex commented Sep 24, 2022

I'll merge this, since it's the only PR missing for the release.

@Kludex Kludex merged commit 040d8c8 into encode:master Sep 24, 2022
mmcfarland added a commit to microsoft/planetary-computer-apis that referenced this pull request Oct 24, 2022
The 0.21 release resolves a frequent error on our fastapi version.

See:
encode/starlette#1710
encode/starlette#1715
mmcfarland added a commit to microsoft/planetary-computer-apis that referenced this pull request Oct 25, 2022
* Temporarily use fork for starlette 0.21 release

The 0.21 release resolves a frequent error on our fastapi version.

See:
encode/starlette#1710
encode/starlette#1715

* Disable FTP as function app deploy option

Security controls

* Trace request attributes before invoking middleware

If an exception is raised in subsequent middlewares, added trace
attributes will still be logged to Azure. This allows us to find
requests that fail in the logs.

* Make config cache thread safe

cachetools cache is not thread safe and there were frequent exceptions
logged indicating that cache updates during async calls were failing
with key errors similar to those described in:

tkem/cachetools#80

Adding a lock per table instance synchronizes cache updates across threads.

* Lint

* Changelog
aminalaee pushed a commit that referenced this pull request Feb 13, 2023
…ct`+`recv_stream.close` (#1715)

* replace BaseMiddleware cancellation after request send with closing recv_stream + http.disconnect in receive

fixes #1438

* Add no cover pragma on pytest.fail in tests/middleware/test_base.py

Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>

* make http_disconnect_while_sending test more robust in the face of scheduling issues

* Fix issue with running middleware context manager

Reported in #1678 (comment)

Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
Co-authored-by: Marcelo Trylesinski <marcelotryle@gmail.com>