
Bad file descriptor problem still exists as of today with uvicorn workers #3013

Open
crankedguy opened this issue Jun 21, 2023 · 12 comments · May be fixed by #3127

Comments

@crankedguy

Hello,

Why was #1877 closed, and why were the comments posted after closure completely ignored?

I must admit the cross-linking between the project issues and the conversation is extremely hard to follow; honestly, I can no longer tell what has been done, what was reverted, and what still has to be done.
encode/uvicorn#1710

Can someone please shed some light on whether and how this will be fixed? This is still an issue as of today with gunicorn 20.1 and uvicorn / uvicorn workers 0.22, and it is very easy to reproduce every single time the server is shut down (e.g. with Ctrl-C).

Maybe @Kludex??

An answer would be highly appreciated, thanks

@crankedguy changed the title from "Bad file descriptor problem still exists as of today" to "Bad file descriptor problem still exists as of today with uvicorn workers" on Jun 21, 2023
@Kludex

Kludex commented Jun 21, 2023

To me, what I read here is: "I don't want to read all the issues, can you summarize for me?".

As I mentioned in #1877 (comment), I believe the problem is on gunicorn's side. In that message, I also mention that this error will appear more frequently, since there was a bigger issue (the shutdown event was not being triggered, which was solved by encode/uvicorn#1710).

I also asked about the SIGQUIT behavior almost 2 years ago on #2604; in #2604 (comment) @benoitc said he would look into it, but I was ignored, and the issue was later closed with this comment: #2604 (comment).

So... I've already offered, and dedicated, too much of my time. Also, I find it very rude of you to ping me here. 👍

@crankedguy
Author

Actually I did read it, but at some point I simply got lost. There are countless threads involved, changes implemented and then reverted... I also saw that this was more or less ignored on this side, sadly. As far as I understood, the SIGQUIT handler was implemented in the uvicorn worker, but the problem persists...

On being rude: although I can't really see any rudeness in a simple question, it was certainly not the intention if you take it that way, so please accept my apology. I won't bother you any further...

@Kludex

Kludex commented Jun 21, 2023

Although I can't really see any rudeness in a simple question

🤷‍♂️

please accept my apology.

Apology accepted.

As far as I understood, the SIGQUIT handler was implemented in the uvicorn worker, but the problem persists...

The problem that was solved is the shutdown event not running when using uvicorn, which was indeed solved. The compromise is that the message "Bad file descriptor" now appears more often in the logs.

The problem that needs to be solved is mentioned here: #1877 (comment).

@crankedguy
Author

crankedguy commented Jun 21, 2023

The problem that needs to be solved is mentioned here: #1877 (comment).

Yes, I also saw that one. That comment came right after that particular issue was closed, and there was no activity on it afterwards except from users still having the problem 🙏

Sometimes you only get the already-documented
[2023-06-21 16:13:39 +0200] [1238732] [INFO] Error while closing socket [Errno 9] Bad file descriptor

and when you get the (also already documented)
[2023-06-21 16:13:39 +0200] [1180582] [WARNING] Worker with pid 1238730 was terminated due to signal 15
you get these trace remnants in the job output:

Jun 21 16:13:39 domain.com gunicorn[1180582]:   File "/usr/lib64/python3.11/logging/__init__.py", line 1094, in flush
Jun 21 16:13:39 domain.com gunicorn[1180582]:     self.stream.flush()
Jun 21 16:13:39 domain.com gunicorn[1180582]:   File "/home/usr/.venvs/xxx_api/lib64/python3.11/site-packages/gunicorn/arbiter.py", line 242, in handle_chld
Jun 21 16:13:39 domain.com gunicorn[1180582]:     self.reap_workers()
Jun 21 16:13:39 domain.com gunicorn[1180582]:   File "/home/usr/.venvs/xxx_api/lib64/python3.11/site-packages/gunicorn/arbiter.py", line 530, in reap_workers
Jun 21 16:13:39 domain.com gunicorn[1180582]:     self.log.warning(
Jun 21 16:13:39 domain.com gunicorn[1180582]:   File "/home/usr/.venvs/xxx_api/lib64/python3.11/site-packages/gunicorn/glogging.py", line 261, in warning
Jun 21 16:13:39 domain.com gunicorn[1180582]:     self.error_log.warning(msg, *args, **kwargs)
Jun 21 16:13:39 domain.com gunicorn[1180582]: Message: 'Worker with pid %s was terminated due to signal %s'
Jun 21 16:13:39 domain.com gunicorn[1180582]: Arguments: (1238732, 15)

EDIT: full trace:

Jun 21 16:35:19 domain.com gunicorn[1441959]: --- Logging error ---
Jun 21 16:35:19 domain.com gunicorn[1441959]: Traceback (most recent call last):
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/usr/lib64/python3.11/logging/__init__.py", line 1114, in emit
Jun 21 16:35:19 domain.com gunicorn[1441959]:     self.flush()
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/usr/lib64/python3.11/logging/__init__.py", line 1094, in flush
Jun 21 16:35:19 domain.com gunicorn[1441959]:     self.stream.flush()
Jun 21 16:35:19 domain.com gunicorn[1441959]: RuntimeError: reentrant call inside <_io.BufferedWriter name='/var/log/xxx_api/gunicorn.log'>
Jun 21 16:35:19 domain.com gunicorn[1441959]: Call stack:
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/home/usr/.venvs/xxx_api/bin/gunicorn", line 8, in <module>
Jun 21 16:35:19 domain.com gunicorn[1441959]:     sys.exit(run())
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/home/usr/.venvs/xxx_api/lib64/python3.11/site-packages/gunicorn/app/wsgiapp.py", line 67, in run
Jun 21 16:35:19 domain.com gunicorn[1441959]:     WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/home/usr/.venvs/xxx_api/lib64/python3.11/site-packages/gunicorn/app/base.py", line 231, in run
Jun 21 16:35:19 domain.com gunicorn[1441959]:     super().run()
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/home/usr/.venvs/xxx_api/lib64/python3.11/site-packages/gunicorn/app/base.py", line 72, in run
Jun 21 16:35:19 domain.com gunicorn[1441959]:     Arbiter(self).run()
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/home/usr/.venvs/xxx_api/lib64/python3.11/site-packages/gunicorn/arbiter.py", line 209, in run
Jun 21 16:35:19 domain.com gunicorn[1441959]:     self.sleep()
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/home/usr/.venvs/xxx_api/lib64/python3.11/site-packages/gunicorn/arbiter.py", line 357, in sleep
Jun 21 16:35:19 domain.com gunicorn[1441959]:     ready = select.select([self.PIPE[0]], [], [], 1.0)
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/home/usr/.venvs/xxx_api/lib64/python3.11/site-packages/gunicorn/arbiter.py", line 242, in handle_chld
Jun 21 16:35:19 domain.com gunicorn[1441959]:     self.reap_workers()
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/home/usr/.venvs/xxx_api/lib64/python3.11/site-packages/gunicorn/arbiter.py", line 530, in reap_workers
Jun 21 16:35:19 domain.com gunicorn[1441959]:     self.log.warning(
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/home/usr/.venvs/xxx_api/lib64/python3.11/site-packages/gunicorn/glogging.py", line 261, in warning
Jun 21 16:35:19 domain.com gunicorn[1441959]:     self.error_log.warning(msg, *args, **kwargs)
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/usr/lib64/python3.11/logging/__init__.py", line 1501, in warning
Jun 21 16:35:19 domain.com gunicorn[1441959]:     self._log(WARNING, msg, args, **kwargs)
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/usr/lib64/python3.11/logging/__init__.py", line 1634, in _log
Jun 21 16:35:19 domain.com gunicorn[1441959]:     self.handle(record)
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/usr/lib64/python3.11/logging/__init__.py", line 1644, in handle
Jun 21 16:35:19 domain.com gunicorn[1441959]:     self.callHandlers(record)
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/usr/lib64/python3.11/logging/__init__.py", line 1706, in callHandlers
Jun 21 16:35:19 domain.com gunicorn[1441959]:     hdlr.handle(record)
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/usr/lib64/python3.11/logging/__init__.py", line 978, in handle
Jun 21 16:35:19 domain.com gunicorn[1441959]:     self.emit(record)
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/usr/lib64/python3.11/logging/__init__.py", line 1230, in emit
Jun 21 16:35:19 domain.com gunicorn[1441959]:     StreamHandler.emit(self, record)
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/usr/lib64/python3.11/logging/__init__.py", line 1114, in emit
Jun 21 16:35:19 domain.com gunicorn[1441959]:     self.flush()
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/usr/lib64/python3.11/logging/__init__.py", line 1094, in flush
Jun 21 16:35:19 domain.com gunicorn[1441959]:     self.stream.flush()
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/home/usr/.venvs/xxx_api/lib64/python3.11/site-packages/gunicorn/arbiter.py", line 242, in handle_chld
Jun 21 16:35:19 domain.com gunicorn[1441959]:     self.reap_workers()
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/home/usr/.venvs/xxx_api/lib64/python3.11/site-packages/gunicorn/arbiter.py", line 530, in reap_workers
Jun 21 16:35:19 domain.com gunicorn[1441959]:     self.log.warning(
Jun 21 16:35:19 domain.com gunicorn[1441959]:   File "/home/usr/.venvs/xxx_api/lib64/python3.11/site-packages/gunicorn/glogging.py", line 261, in warning
Jun 21 16:35:19 domain.com gunicorn[1441959]:     self.error_log.warning(msg, *args, **kwargs)
Jun 21 16:35:19 domain.com gunicorn[1441959]: Message: 'Worker with pid %s was terminated due to signal %s'
Jun 21 16:35:19 domain.com gunicorn[1441959]: Arguments: (1442079, 15)

@benoitc
Owner

benoitc commented Jun 30, 2023

@Kludex you weren't ignored. I offered a response to your issue that didn't match your expectations at the time. This happens sometimes, but don't say it was ignored...

@crankedguy that's the right approach. Closed issues stay closed; if a new problem arises, or if an issue still exists, a new ticket should be created. This is described in the contributing doc.

About this issue: I haven't yet re-tested the latest master with uvicorn. This is planned for this week. I have been sidetracked by some upcoming changes in the HTTP parser and some other changes for the new release. The behavior is not reproduced with other workers.

The expected behaviour is the following: a worker should expect to receive two signals for termination, SIGTERM and SIGQUIT. This allows graceful shutdown. These signals are sent by the arbiter; the arbiter controls the worker and does not simply pass signals through.

Anyway, I will try to understand what design difference there causes this issue.

@crankedguy
Copy link
Author

crankedguy commented Jun 30, 2023

That would be great, because this is a problem not only with command-line-invoked instances but also when you manage them via systemd, which leads to unclean unit shutdowns.

@Judimax

Judimax commented Nov 7, 2023

For what it's worth: in my case, when watchdog and socketio were being used in the application (here Flask), the error would show up as well, since a socket is also a file, in a way. Hopefully this is useful and it helps.

@tilgovi
Collaborator

tilgovi commented Dec 28, 2023

In this comment, @gg349 very helpfully tracked down the uvicorn commit that re-ordered calls to close the asyncio servers and close the sockets.

In the asyncio documentation for loop.create_server, it says:

Note: The sock argument transfers ownership of the socket to the server created. To close the socket, call the server’s close() method.

That means that uvicorn should not call socket.close() because it has given ownership of the socket to the server. Before the referenced commit, uvicorn was closing the socket first and uvloop was tolerating EBADF for its own internal reasons. After the commit, uvicorn is raising an EBADF because the server has already closed the socket.
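
To make the ownership rule concrete, here is a minimal, self-contained sketch of the documented pattern (my own illustration, not code from uvicorn or gunicorn): create and bind a listening socket, hand it to the server via sock=, and then close only the server, which closes the socket it now owns.

import asyncio
import socket

async def main():
    # Create and bind a listening socket ourselves, roughly as gunicorn does.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))

    async def handle(reader, writer):
        writer.close()

    # Per the asyncio docs, sock= transfers ownership of the socket to the server.
    server = await asyncio.start_server(handle, sock=sock)

    # Documented shutdown path: close the server; it closes the socket it owns.
    server.close()
    await server.wait_closed()

    # Closing the socket again here is what uvicorn effectively does after the
    # referenced commit. With a plain CPython socket this is a silent no-op,
    # but with a wrapper or an fd closed elsewhere it can surface EBADF.
    sock.close()

asyncio.run(main())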

There is nothing gunicorn can do for this issue. The error happens in the worker, where uvicorn is closing the socket after it was already closed.

I suggest we close the issue here if this diagnosis makes sense to @crankedguy and @Kludex.

@tilgovi
Collaborator

tilgovi commented Dec 28, 2023

Note that this only happens with the uvloop dependency.

Python sockets can actually be closed multiple times without issue. Python tracks when a socket has been closed and only issues a close to the OS once. However, if you close a file descriptor out from underneath the Python socket, then closing it from Python will cause an EBADF error:

>>> import socket
>>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> socket.close(s.fileno())
>>> s.close()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.11/socket.py", line 503, in close
    self._real_close()
  File "/usr/lib/python3.11/socket.py", line 497, in _real_close
    _ss.close(self)
OSError: [Errno 9] Bad file descriptor
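
For contrast, a quick REPL check (mine, not from the original report) of the claim that closing the same Python socket object twice is harmless:

>>> import socket
>>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> s.close()
>>> s.close()  # no error: CPython remembers the socket is already closed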

Gunicorn actually passes a gunicorn.sock.TCPSocket object, a wrapper around the Python socket object. Gunicorn knows not to close the underlying socket twice, even though doing so should be fine.

So, it doesn't normally cause problems that uvicorn closes both the server and the socket. However, things get weird with uvloop and gunicorn.

When the server gets closed, uvloop calls uv_close() in the underlying libuv. The uv_close() documentation says:

Handles that wrap file descriptors are closed immediately but close_cb will still be deferred to the next iteration of the event loop. It gives you a chance to free up any resources associated with the handle.

In other words, socket handles in uvloop expect that the base handle has already closed the fd and want to simply call detach() on the Python socket. Since the gunicorn socket isn't an instance of the built-in socket type, uvloop calls close() instead of detach(). The gunicorn.sock.TCPSocket wrapper calls close() on the real socket even though its file descriptor is already closed by libuv, raising the EBADF.
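
To illustrate the failure mode just described, here is a hypothetical stand-in (not gunicorn's actual class) for a wrapper like gunicorn.sock.TCPSocket: it is not an instance of socket.socket, its file descriptor gets closed underneath it, and forwarding close() then raises EBADF.

import os
import socket

class SocketWrapper:
    """Hypothetical wrapper: delegates to a real socket, but is not a socket.socket."""

    def __init__(self, sock):
        self.sock = sock

    def fileno(self):
        return self.sock.fileno()

    def close(self):
        # Forwards close() to the wrapped socket unconditionally.
        self.sock.close()

raw = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
wrapped = SocketWrapper(raw)

# Simulate libuv closing the file descriptor underneath the Python socket.
os.close(wrapped.fileno())

# A caller that would detach() a real socket falls back to close() here, and
# the wrapped socket issues a second close(2) on an fd it no longer owns.
try:
    wrapped.close()
except OSError as exc:
    print(exc)  # [Errno 9] Bad file descriptor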

Gunicorn could add a defensive try/except in its socket wrapper, but I don't think it should. Regular Python sockets don't do that. Instead, uvicorn should rely on servers to close their sockets, as documented in the asyncio module.

@tilgovi
Collaborator

tilgovi commented Dec 28, 2023

If uvicorn cannot track which sockets it has and has not passed to servers, it may want to have its own defensive try/except. Better might be to pass s.sock for s in sockets from the UvicornWorker, so that uvicorn is working with the underlying sockets.
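
As a rough sketch of that second idea (mine, and only an assumption about how current UvicornWorker releases hand self.sockets to uvicorn's Server; verify against your installed version), a custom worker class could unwrap gunicorn's socket wrappers before uvicorn ever sees them:

# Hedged sketch, not an official workaround.
from uvicorn.workers import UvicornWorker

class UnwrappedSocketUvicornWorker(UvicornWorker):
    def run(self):
        # Replace gunicorn's TCPSocket wrappers with the underlying
        # socket.socket objects so uvloop treats them as real sockets.
        self.sockets = [getattr(s, "sock", s) for s in self.sockets]
        super().run()

Such a worker could then be selected with gunicorn -k mymodule.UnwrappedSocketUvicornWorker app:app (module and class name are placeholders).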

If we consider any change in gunicorn itself, I'd consider trying to remove the socket wrapper classes, or making them actually inherit from socket.

tilgovi added a commit that referenced this issue Dec 28, 2023
Refactor socket creation to remove the socket wrapper classes so that
these objects have less surprising behavior when used in worker hooks,
worker classes, and custom applications.

Close #3013.
@tilgovi linked a pull request on Dec 28, 2023 that will close this issue
@tilgovi
Collaborator

tilgovi commented Dec 28, 2023

I've opened a draft PR for getting rid of the socket wrapper classes and verified that this fixes this issue with uvloop.

@jsprieto10

Thanks @tilgovi. Waiting for your PR to be ready.
