
Gunicorn gthread deadlock #3146

Open

javiertejero opened this issue Feb 1, 2024 · 3 comments · May be fixed by #3147 or #3157

javiertejero commented Feb 1, 2024

Problem

When multiple connections arrive at the same time, I observe a deadlock that causes the gunicorn worker to freeze indefinitely (the worker timeout does not kill the frozen worker).

This happens when the number of worker_connections is smaller than the number of concurrent connections received.

Affected versions:

  • 21.0.0
  • 21.0.1
  • 21.1.0
  • 21.2.0

How to reproduce

Run a gunicorn app with gthread worker and these options:

    options = {
        # ...
        'workers': 1,
        'threads': 3,
        'worker_connections': 4,
    }

Then run Apache Bench with enough concurrency (higher than 4): `ab -n1000 -c100 http://0.0.0.0:8080/`

I am opening a PR soon with the fix and the script to reproduce this easily (UPDATE: PR #3147).

Proposed solution

Partial revert of #2918

javiertejero added a commit to travelperk/gunicorn that referenced this issue Feb 1, 2024
when opening many sockets the worker gets frozen, consuming 100% CPU

to reproduce:

```
cd examples
python standalone_app_gthread.py
```

and then using the Apache Bench tool:
```
ab -n1000 -c100 http://0.0.0.0:8080/
```

Fixes benoitc#3146
javiertejero (Author) commented

@JorisOnGithub could you please confirm whether this patch breaks any of the improvements needed to solve #2917?

PS: I couldn't reproduce the issues you described, but I don't have WebView2, and I couldn't reproduce them with Microsoft Edge for macOS either.

benoitc added this to the 22.0 milestone Feb 3, 2024
chris-griffin commented

Hi @benoitc and team - First off, thank you all for what you do! I tried to contact security@gunicorn.org about this issue, but the email bounced.

[screenshot of the bounce notification from security@gunicorn.org]

Since this issue is already out in the public, it seemed appropriate to post here instead. This issue is a denial-of-service (DoS) vulnerability, although it is not labeled as such in the original description.

While the default of 1000 for worker_connections should mean that deployments with reasonable rate limits in upstream reverse proxies/WAFs are protected, any deployment whose per-IP rate limit is above 1000 would be trivial to exploit. Additionally, an effective DDoS is plausible with a relatively low number of IP addresses.

I see that gunicorn has not historically used GitHub's security advisory feature. I just wanted to highlight that it makes it very easy to request a CVE in case you are not familiar with this feature.

sylt commented Feb 14, 2024

I've debugged the issue, and I think the smallest diff for fixing it (while maintaining the functionality of #2917) is:

diff --git a/gunicorn/workers/gthread.py b/gunicorn/workers/gthread.py
index c9c42345..9c1a92ad 100644
--- a/gunicorn/workers/gthread.py
+++ b/gunicorn/workers/gthread.py
@@ -119,6 +119,9 @@ class ThreadWorker(base.Worker):
 
     def accept(self, server, listener):
         try:
+            if self.nr_conns >= self.worker_connections:
+                return
+
             sock, client = listener.accept()
             # initialize the connection object
             conn = TConn(self.cfg, sock, client, server)
@@ -218,6 +221,10 @@ class ThreadWorker(base.Worker):
                                       return_when=futures.FIRST_COMPLETED)
             else:
                 # wait for a request to finish
+                events = self.poller.select(0.0)
+                for key, _ in events:
+                    callback = key.data
+                    callback(key.fileobj)
                 result = futures.wait(self.futures, timeout=1.0,
                                       return_when=futures.FIRST_COMPLETED)
 

For those interested, I think the root cause of the bug is the following:

  1. All is good. There are no connections. We're running in run(), specifically in self.poller.select(), waiting for a connection to arrive.
  2. All of a sudden, someone wants to connect. self.poller.select() returns our self.accept() callback, we enter it and accept the connection. The resulting socket returned from accept() is added to the event queue.
  3. Step 2 is repeated until we can't accept any more connections (i.e., self.nr_conns hits the limit self.worker_connections).
  4. Now, it's time for another iteration of the loop in run(). Since we've hit our connection limit, the code skips self.poller.select(), since we shouldn't accept any new connections. Unfortunately, since our accept()ed sockets are also on the event queue, we won't process their requests either! Oops...
  5. So we loop ad infinitum. Since the newly accepted sockets never had their initial handlers run, there are no futures to process either, and calling futures.wait(self.futures) with self.futures = [] returns immediately. It's then back to the beginning of the loop, turning it into a busy loop (see the sketch below).
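
As a stripped-down illustration of step 5 (not gunicorn's code, just the standard-library behaviour it relies on), futures.wait() on an empty collection returns immediately instead of blocking for its timeout, which is what turns the loop into a busy loop:

```
import time
from concurrent import futures

start = time.monotonic()
# No pending futures, so wait() has nothing to block on and returns at once,
# even though a 1 second timeout was requested.
done, not_done = futures.wait([], timeout=1.0,
                              return_when=futures.FIRST_COMPLETED)
print(done, not_done)                               # set(), set()
print(f"elapsed: {time.monotonic() - start:.4f}s")  # ~0.0000s, not 1.0s
```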

About the patch: by running self.poller.select() even in the case when we want to avoid accepting connections, we'll always serve our already accepted connections, thereby fixing the bug. However, we need an extra if statement in self.accept() to protect us from accepting too many connections.

I had some other clean-ups in mind which would fix this bug, but I'm just posting this minimal version early on in case @benoitc would like to incorporate it.

sylt added a commit to sylt/gunicorn that referenced this issue Feb 14, 2024
This change-set aims to change things so that we do everything not related
to the request on the main thread. This way, we can hopefully avoid lots
of strange race conditions and make it easier to reason about the code.

This also fixes benoitc#3146, but there are leaner ways of doing that.

Work in progress.
sylt added a commit to sylt/gunicorn that referenced this issue Feb 17, 2024
The main purpose is to remove complexity from gthread by:

* Removing the lock for handling self._keep and self.poller. This is
  possible since we now do all such manipulation on the main thread
  instead. When a connection is done, it posts a callback through the
  PollableMethodCaller which gets executed on the main thread.

* Having a single event queue (self.poller), as opposed to also
  managing a set of futures. This fixes benoitc#3146 (although there are
  more minimal ways of doing it).

There are other more minor things as well:

* Renaming some variables, e.g. self._keep to self.keepalived_conns.
* Remove self-explanatory comments (what the code does, not why).
* Remove fiddling with connection socket block/not blocked.

Some complexity has been added to the shutdown sequence, but hopefully
for good reason: it's to make sure that all already accepted
connections are served within the grace period.
sylt added a commit to sylt/gunicorn that referenced this issue Feb 18, 2024
The main purpose is to remove complexity from gthread by:

* Removing the lock for handling self._keep and self.poller. This is
  possible since we now do all such manipulation on the main thread
  instead. When a connection is done, it posts a callback through the
  PollableMethodCaller which gets executed on the main thread.

* Having a single event queue (self.poller), as opposed to also
  managing a set of futures. This fixes benoitc#3146 (although there are
  more minimal ways of doing it).

There are other more minor things as well:

* Renaming some variables, e.g. self._keep to self.keepalived_conns.
* Remove self-explanatory comments (what the code does, not why).
* Just decide that socket is blocking.

Some complexity has been added to the shutdown sequence, but hopefully
for good reason: it's to make sure that all already accepted
connections are served within the grace period.
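
The first bullet above describes posting callbacks from worker threads to the main thread through a PollableMethodCaller. A generic sketch of that pattern (hypothetical names and details, not the actual implementation in the PR) could look like this:

```
import os
import queue
import selectors


class PollableMethodCaller:
    """Sketch of a self-pipe based 'run this on the main thread' helper."""

    def __init__(self, selector):
        self.r_fd, self.w_fd = os.pipe()
        os.set_blocking(self.r_fd, False)
        self.calls = queue.SimpleQueue()
        # Waking the pipe makes select() hand back this object's drain method.
        selector.register(self.r_fd, selectors.EVENT_READ, self.drain)

    def defer(self, func, *args):
        """Called from worker threads: run func(*args) on the main thread."""
        self.calls.put((func, args))
        os.write(self.w_fd, b"\0")          # wake up the main thread's select()

    def drain(self, _fd):
        """Called on the main thread when select() reports the pipe readable."""
        os.read(self.r_fd, 4096)            # clear the wake-up bytes
        while not self.calls.empty():
            func, args = self.calls.get()
            func(*args)


# Main-thread loop (sketch): the selector wakes up either for client sockets
# or because some worker thread deferred work onto the main thread:
#
#     for key, _ in selector.select(timeout):
#         key.data(key.fileobj)
```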