Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Task exception was never retrieved on v3.8.3 #6978

Open
1 task done
w0rmr1d3r opened this issue Sep 27, 2022 · 42 comments
Open
1 task done

Task exception was never retrieved on v3.8.3 #6978

w0rmr1d3r opened this issue Sep 27, 2022 · 42 comments
Labels

Comments

@w0rmr1d3r
Copy link

w0rmr1d3r commented Sep 27, 2022

Describe the bug

We bumped from v3.8.1 to v3.8.3 this morning and started seeing the following errors. No other changes were done to our project:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/aiohttp/web_protocol.py", line 505, in start
    request = self._request_factory(message, payload, self, writer, handler)
  File "/usr/local/lib/python3.9/site-packages/aiohttp/web_app.py", line 446, in _make_request
    return _cls(
  File "/usr/local/lib/python3.9/site-packages/aiohttp/web_request.py", line 811, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/aiohttp/web_request.py", line 188, in __init__
    assert transport is not None
AssertionError

And

Task exception was never retrieved
future: <Task finished name='Task-30396' coro=<RequestHandler.start() done, defined at /usr/local/lib/python3.9/site-packages/aiohttp/web_protocol.py:462> exception=AssertionError()>

To Reproduce

  1. We were using v3.8.1
  2. Bump to v3.8.3

No other changes introduced from our side, those errors started appearing.

Expected behavior

For that exception to be handled.

Logs/tracebacks

Added them on the description, adding them here as well:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/aiohttp/web_protocol.py", line 505, in start
    request = self._request_factory(message, payload, self, writer, handler)
  File "/usr/local/lib/python3.9/site-packages/aiohttp/web_app.py", line 446, in _make_request
    return _cls(
  File "/usr/local/lib/python3.9/site-packages/aiohttp/web_request.py", line 811, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/aiohttp/web_request.py", line 188, in __init__
    assert transport is not None
AssertionError

And

Task exception was never retrieved
future: <Task finished name='Task-30396' coro=<RequestHandler.start() done, defined at /usr/local/lib/python3.9/site-packages/aiohttp/web_protocol.py:462> exception=AssertionError()>

Python Version

$ python --version
3.9.13

aiohttp Version

$ python -m pip show aiohttp
3.8.3

multidict Version

$ python -m pip show multidict
6.0.2

yarl Version

$ python -m pip show yarl
1.8.1

OS

Docker image

Related component

Server

Additional context

I have added as much context as I had, if I left out anything relevant, happy to look for it and add it to the issue.

Code of Conduct

  • I agree to follow the aio-libs Code of Conduct
@w0rmr1d3r w0rmr1d3r added the bug label Sep 27, 2022
@Dreamsorcerer
Copy link
Member

Initial investigation of the exception not retrieved:
The task is created here: https://github.com/aio-libs/aiohttp/blob/master/aiohttp/web_protocol.py#L305
Then it looks like it should get awaited at: https://github.com/aio-libs/aiohttp/blob/master/aiohttp/web_protocol.py#L288

If an exception is raised, then .done() would be True, so the await doesn't happen. My initial thought is that there should not be a .done() check on that code.

Not investigated anything about the assertion error though.

@nrothGIT
Copy link

nrothGIT commented Oct 5, 2022

I've also just recently seen the exact error described by @w0rmr1d3r pop up, seemingly at random. Will report back if we're able to get any more clarity or fixes on our end. Thanks everyone for looking :)

@doublex
Copy link

doublex commented Oct 18, 2022

Same issue after updating to v3.8.3

@w0rmr1d3r
Copy link
Author

w0rmr1d3r commented Oct 19, 2022

Same issue after updating to v3.8.3

Indeed @doublex . This started appearing in our services when upgrading to v3.8.3
It doesn't happen with v3.8.1

@Dreamsorcerer
Copy link
Member

Dreamsorcerer commented Oct 19, 2022

There aren't too many changes (if you ignore formatting/typing changes), so if anyone can take a look and figure out which change is the issue, that'd be great.

These 2 removed lines are the only ones that stand out to me, so maybe try adding them back in first:
v3.8.1...v3.8.3#diff-2126b277e07e3fdd05e7a81da456accf24e5515a46c78c48a79db4eb166043e4L302-L303

@w0rmr1d3r
Copy link
Author

Hi @Dreamsorcerer ,

Thank you for providing the diff between the affected versions.

I'm reading through the code and trying to understand. But you will have more context and knowledge on how this library works. My guessings:

File web_protocol.py on line 505, will always pass the handler as None. Since it's set on line 472 to be the self._task_handler.
Which on line 305 is set as None, and the change on the highlighted lines never cancels the task_handler.

Still, I lack complete context of the code, just writing what I see.

Then, in the web_app.py file, line 437 -> 454, protocol will be a RequestHandler, but is received as None, since as seen in this line:

_RequestFactory = Callable[

it can be "asyncio.Task[None]".

I guess, that is setting it to None, somehow?

Again, just tried to read through the code, but have 0 context on how the library works.

My guesses might be completely wrong, apologies for that.

@Dreamsorcerer
Copy link
Member

Dreamsorcerer commented Nov 17, 2022

File web_protocol.py on line 505, will always pass the handler as None. Since it's set on line 472 to be the self._task_handler.

Line 473 asserts it to be not None, so that's not possible.

My suggestion was to try adding back in the lines 302-303 that were deleted and see if the problem is still reproducible. If it is, then maybe checkout the code at different commits until you find the commit that introduces the problem (you could use git bisect for this). Also remember to do this on the 3.9 branch, not master.

@w0rmr1d3r
Copy link
Author

Hello @Dreamsorcerer ,
Sorry for my late response.

So, what you mean is, adding back those lines in the next release and see if the issue is fixed? I understand that to be version 3.9.0?
And if those lines make the issue go away, re-review the commits on branch 3.9?

Just wanted to clarify and get right the steps on this.

Thank you for providing an answer and maintain this project 🚀

@Dreamsorcerer
Copy link
Member

Yes, just clone the 3.9 branch, add those lines in, then test if the issue is fixed (also check the issue is reproducible before adding the line in).
If so, make a PR and we'll look at releasing it in 3.9.

@w0rmr1d3r
Copy link
Author

Hello @Dreamsorcerer ,
I haven't had time to put effort on this.
We received the dependabot bump PR for 3.8.4 and saw the same error.

Hopefully in the future I might be able to put some effort on this to debug it or something.

@Dreamsorcerer
Copy link
Member

3.8.4 only added a moderate severity fix. If you can figure out this issue, we'll put it into the 3.9 release.

@jsoucheiron
Copy link
Contributor

jsoucheiron commented Apr 17, 2023

Just here to confirm that we can see this issue happening in python 3.9, 3.10 and 3.11 with aiohttp 3.8.4

@odesenfans
Copy link

I can also confirm that this issue occurs in our project with aiohttp 3.8.3 and Python 3.11.

@odesenfans
Copy link

There aren't too many changes (if you ignore formatting/typing changes), so if anyone can take a look and figure out which change is the issue, that'd be great.
These 2 removed lines are the only ones that stand out to me, so maybe try adding them back in first:
v3.8.1...v3.8.3#diff-2126b277e07e3fdd05e7a81da456accf24e5515a46c78c48a79db4eb166043e4L302-L303

@Dreamsorcerer I tried reverting these lines on top of v3.8.3 and this does not fix the issue. They were also the only lines standing out to me in the git history, so I'll need to take a deeper look.

Looking at the logic, it seems like this originates from the connection_lost path: the transport is set to None early in BaseProtocol.connection_lost(). I could not reproduce this issue locally though so no way to tell for sure. I didn't see anything out of the ordinary in the code, so maybe the issue is somewhere else, ex: some level of concurrency happening between connection_lost() and the request handler?

My setup:

  • Python 3.11
  • aiohttp 3.8.3
  • gunicorn + uvloop workers.

@Dreamsorcerer
Copy link
Member

@Dreamsorcerer I tried reverting these lines on top of v3.8.3 and this does not fix the issue. They were also the only lines standing out to me in the git history, so I'll need to take a deeper look.

If you can run it from the git repo, then a git bisect or something could help find the commit which introduced the issue.

@odesenfans
Copy link

@Dreamsorcerer I tried reverting these lines on top of v3.8.3 and this does not fix the issue. They were also the only lines standing out to me in the git history, so I'll need to take a deeper look.

If you can run it from the git repo, then a git bisect or something could help find the commit which introduced the issue.

I'd love to, but I have no idea how to build the package properly to test inside a Docker image (only solution I have to reproduce the issue at the moment). I've looked at the docs but I couldn't find a tip, I always end up with "aiohttp/_websocket.c: File not found". Any idea?

@odesenfans
Copy link

In the meantime, I realized I hadn't installed my patched version correctly. It appears that applying the patch below as suggested fixes the issue.

diff --git a/aiohttp/web_protocol.py b/aiohttp/web_protocol.py
index 10a96080..f103b167 100644
--- a/aiohttp/web_protocol.py
+++ b/aiohttp/web_protocol.py
@@ -299,6 +299,8 @@ class RequestHandler(BaseProtocol):
                 exc = ConnectionResetError("Connection lost")
             self._current_request._cancel(exc)
 
+        if self._task_handler is not None:
+            self._task_handler.cancel()
         if self._waiter is not None:
             self._waiter.cancel()

Now, I saw in the commit history that this patch has already been applied, removed and reapplied (cf. the discussion here: #6719). What's the best course of action? I don't see cancellation exceptions in my setup, maybe the whole issue discussed in #6719 has been fixed in the meantime.

@Dreamsorcerer
Copy link
Member

Now, I saw in the commit history that this patch has already been applied, removed and reapplied (cf. the discussion here: #6719). What's the best course of action?

OK, I've got a vague idea on what the cause may be now. Can you try removing the and not self._task_handler.done() (not on a production server though, I expect it to break):
https://github.com/aio-libs/aiohttp/blob/master/aiohttp/web_protocol.py#L287

My theory is that on connection_lost on older releases (and optionally in 3.9) we cancel the handler. Now we are not cancelling, I suspect there may be an exception occurring on self.transport.close() or something similar, because the connection was already closed:
https://github.com/aio-libs/aiohttp/blob/master/aiohttp/web_protocol.py#L610

So, I'm hoping that by making that change, instead of an exception was not retrieved warning, you'll get the actual exception and then we can catch that error in future.

@Dreamsorcerer
Copy link
Member

This also suggests that we may be able to build an actual test, if anyone wants to give that a go? Hopefully, it just means we need an endpoint that takes a couple of seconds and then have a client disconnect prematurely to trigger the warning.

@kjagiello
Copy link

kjagiello commented May 12, 2023

I've tried replicating this issue in a test, but it passes just fine on the 3.8 branch. Am I approaching this in a wrong way?

async def test_client_connection_lost(aiohttp_client: Any) -> None:
    async def handler(_):
        await asyncio.sleep(2)
        return web.Response()

    app = web.Application()
    app.router.add_route("GET", "/", handler)
    client = await aiohttp_client(app)

    with pytest.raises(TimeoutError):
        await asyncio.wait_for(
            client.get("/"),
            timeout=1,
        )
    await client.close()

I've set a breakpoint in BaseProtocol.connection_lost and saw it being called twice, once for the client, then another time for the server.

@Dreamsorcerer
Copy link
Member

I've never managed to reproduce it, which is why I keep asking others to test/debug.

@nosahama
Copy link

I am currently facing this issue, I do not think it's easily reproduced in a test as you need to send some sort of client/server disconnection into the request/response handler.

Task exception was never retrieved
future: <Task finished name='Task-70887' coro=<RequestHandler.start() done, defined at /opt/venv/lib/python3.10/site-packages/aiohttp/web_protocol.py:462> exception=AssertionError()>

I will go down back to 3.8.1.

@Dreamsorcerer
Copy link
Member

If my theory is right, then we should be able to write a test. But, I'm still waiting on someone to confirm the cause by hopefully getting the traceback as described in #6978 (comment)

@odesenfans
Copy link

I have been pretty busy since April but I'll try to reproduce the issue in the weeks to come.

@Dreamsorcerer
Copy link
Member

OK, I played around with the code, manually adding in exceptions etc. and I couldn't figure out any way to reproduce this. Hopefully, #7333 will help, but I don't think there's anything else I can do unless someone figures something else out.

@jsoucheiron
Copy link
Contributor

Once #7333 is released I can definitely give it a try.

@ignaciobolonio
Copy link

Still happening for v3.8.5 which includes #7333

@sedrakk
Copy link

sedrakk commented Jul 31, 2023

Continuing the conversation that started here.

I think this ticket is the correct one for my case given that the traceback is the same here.

  1. The issue reproduces relatively rare (up to 10 times per day) in the prod env, it could happen with any end point randomly.
  2. To test @Dreamsorcerer your suggestions, I need to deploy a custom version of aiohttp to the cloud (GAE), but whatever I tried I couldn't make GAE to use my modified version of the library and it either continued to use the version from PyPI or fail to run.
  3. So instead I ran a local env with the modified version of the library. Then wrote a browser script that from 4 separate tabs triggered 4 different end points every 10 ms. Unfortunately even after about 40K requests, not a single time I got the exception and accordingly self._force_close, self._close were always False and nothing broke from doing this comment neither.

Is there any way that I can add additional logs to test this in the current version of the library that I use in prod (3.8.5)? Maybe call some aiohttp method and print something upon any end point being hit?

Appreciate your help with trying to figure this out.

@Dreamsorcerer
Copy link
Member

2. To test @Dreamsorcerer your suggestions, I need to deploy a custom version of aiohttp to the cloud (GAE), but whatever I tried I couldn't make GAE to use my modified version of the library and it either continued to use the version from PyPI or fail to run.

I'm not failiar with GAE, but can you just drop aiohttp into the project directory, where it loads your app from?

Is there any way that I can add additional logs to test this in the current version of the library that I use in prod (3.8.5)? Maybe call some aiohttp method and print something upon any end point being hit?

I can't think of an easy way. You could try and monkey patch the library, but you'd probably have to copy the entire function (e.g. .start() method), add the print in, and then assign it to the aiohttp class.

@Dreamsorcerer
Copy link
Member

I'm not failiar with GAE, but can you just drop aiohttp into the project directory, where it loads your app from?

Or failing that, copy into the app's directory as a module, and then change the imports to from myapp import aiohttp, if necessary (assuming that doesn't break aiohttp's imports).

@sedrakk
Copy link

sedrakk commented Aug 3, 2023

Good news, I made it work. The issue was that I didn't change the path in the app's yml file:
entrypoint: gunicorn --timeout 60 -b :$PORT main:app_factory --worker-class vendor.aiohttp.GunicornUVLoopWebWorker
With it and with from vendor import aiohttp changed in 50 files haha, it finally worked.

I deployed with both changes you proposed.

                if self._force_close or self._close:
                    print(f"Line waiter.set_result(None): self._force_close = {self._force_close}, "
                          f"self._close = {self._close}")

Will keep posted when it happens.

@sedrakk
Copy link

sedrakk commented Aug 4, 2023

So the Task exception was never retrieved future continued happening without any prints or additional exceptions around it. You can see them below. Note that the trace has my custom version of aiohttp with the changes that @Dreamsorcerer proposed.

python.exc_info: "Traceback (most recent call last):
  File "/workspace/vendor/aiohttp/web_protocol.py", line 512, in start
    request = self._request_factory(message, payload, self, writer, handler)
  File "/workspace/vendor/aiohttp/web_app.py", line 446, in _make_request
    return _cls(
  File "/workspace/vendor/aiohttp/web_request.py", line 811, in __init__
    super().__init__(*args, **kwargs)
  File "/workspace/vendor/aiohttp/web_request.py", line 188, in __init__
    assert transport is not None
AssertionError
"
python.filename: "web_request.py"
python.funcName: "__init__"
python.lineno: 189
python.logger_name: "asyncio"
python.module: "web_request"
python.pid: 15
python.thread: "MainThread"
python.exc_info: "Traceback (most recent call last):
  File "/workspace/vendor/aiohttp/web_protocol.py", line 512, in start
    request = self._request_factory(message, payload, self, writer, handler)
  File "/workspace/vendor/aiohttp/web_app.py", line 446, in _make_request
    return _cls(
  File "/workspace/vendor/aiohttp/web_request.py", line 811, in __init__
    super().__init__(*args, **kwargs)
  File "/workspace/vendor/aiohttp/web_request.py", line 188, in __init__
    assert transport is not None
AssertionError
"
python.filename: "adapters.py"
python.funcName: "__init__"
python.lineno: 75
python.logger_name: "asyncio"
python.module: "adapters"
python.pid: 15
python.thread: "MainThread"
python.exc_info: "Traceback (most recent call last):
  File "/workspace/vendor/aiohttp/web_protocol.py", line 512, in start
    request = self._request_factory(message, payload, self, writer, handler)
  File "/workspace/vendor/aiohttp/web_app.py", line 446, in _make_request
    return _cls(
  File "/workspace/vendor/aiohttp/web_request.py", line 811, in __init__
    super().__init__(*args, **kwargs)
  File "/workspace/vendor/aiohttp/web_request.py", line 188, in __init__
    assert transport is not None
AssertionError
"
python.filename: "poolmanager.py"
python.funcName: "__init__"
python.lineno: 174
python.logger_name: "asyncio"
python.module: "poolmanager"
python.pid: 14
python.thread: "MainThread"

This line below got printed a few times:

            if self._force_close or self._close:
                print(f"Line - request = self._request_factory(message, payload, self, writer, handler): "
                      f"self._force_close = {self._force_close}, self._close = {self._close}")

Result of the print every time:
Line - request = self._request_factory(message, payload, self, writer, handler): self._force_close = True, self._close = False

But I think that happened every time when there was another issue: 2023/08/04 09:00:09 [error] 16#16: *195 upstream prematurely closed connection while reading response header from upstream, client: ***, server: _, request: "GET /v3/user/parameters HTTP/1.1", upstream: "http://127.0.0.1:8081/v3/user/parameters", host: "****.appspot.com".
That's some other issue that I haven't figured out yet, but I don't think it's related to aiohttp.

@Dreamsorcerer
Copy link
Member

Dreamsorcerer commented Aug 4, 2023

This line below got printed a few times:

That's right before the self._request_factory(...), right?

Did you add a print at waiter.set_result(...) as well? (I'm guessing, if you did, it didn't print anything because you added the if round it, in which case it'd confirm my thought that a race condition is occurring).

@sedrakk
Copy link

sedrakk commented Aug 4, 2023

Yes, this never got printed:

            if messages and waiter is not None and not waiter.done():
                # don't set result twice
                if self._force_close or self._close:
                    print(f"Line waiter.set_result(None): self._force_close = {self._force_close}, "
                          f"self._close = {self._close}")
                waiter.set_result(None)

this got printed ~10 times in different places from Task exception was never retrieved

            if self._force_close or self._close:
                print(f"Line - request = self._request_factory(message, payload, self, writer, handler): "
                      f"self._force_close = {self._force_close}, self._close = {self._close}")
            request = self._request_factory(message, payload, self, writer, handler)

and every time self._force_close = True, self._close = False

@Dreamsorcerer
Copy link
Member

Yep, so seems like it's definitely some kind of race condition. Not sure on the exact solution, will look at it later, but I suspect we can probably just add another check in the latter place. I'm a little short on time, so would be great if someone can take a look. A test would be good too, hopefully it's possible to write a test that closes it as soon as the waiter is set, and before the second bit of code runs (probably, waiter.add_done_callback(...) would work).

@sedrakk
Copy link

sedrakk commented Sep 20, 2023

Hi, any chance we can get a fix on this?

@Dreamsorcerer
Copy link
Member

Information is above if you want to try and create a test and fix.

@jsoucheiron
Copy link
Contributor

I've just tested it in 3.9.0 and it's still present.

imatge

@Dreamsorcerer
Copy link
Member

Well, that's because nobody has fixed it yet and why the issue is still open...

@jsoucheiron
Copy link
Contributor

There have been changes to the transport code so it could have been fixed.
I'd love to help debug it, but I haven't been able to reproduce locally, just in production, so it's quite hard for us to debug using custom versions.

@Dreamsorcerer
Copy link
Member

I still haven't had time to look, but what I mentioned before might be possible to reproduce in a test.
Last sentence: #6978 (comment)

@Dreamsorcerer
Copy link
Member

Dreamsorcerer commented Nov 20, 2023

Refamiliarising myself with the code, I think we need the test to get into the .start() method until it is waiting at:

await self._waiter

Then we can do something like r._waiter.add_done_callback(lambda f: r.force_close()) (or maybe shutdown() or connection_lost()).
At worst, that can probably be done by subclassing and changing the property setter to add another add_done_callback().

The objective is to get it to resume running in .start() and reach this code while _force_close == True:

request = self._request_factory(message, payload, self, writer, handler)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

10 participants