
Gunicorn gthread deadlock #3146

Open

javiertejero opened this issue Feb 1, 2024 · 3 comments · May be fixed by #3147 or #3157

javiertejero commented Feb 1, 2024

Problem

When multiple connections arrive at the same time, I observe a deadlock that causes the gunicorn worker to freeze indefinitely (the worker timeout does not kill the frozen worker).

This happens when the number of worker_connections is smaller than the number of concurrent connections received.

Affected versions:

  • 21.0.0
  • 21.0.1
  • 21.1.0
  • 21.2.0

How to reproduce

Run a gunicorn app with gthread worker and these options:

    options = {
        # ...
        'workers': 1,
        'threads': 3,
        'worker_connections': 4,
    }

Then run Apache Bench with enough concurrency (higher than 4): `ab -n1000 -c100 http://0.0.0.0:8080/`

I am opening a PR soon with the fix and the script to reproduce this easily (UPDATE: PR #3147).

Proposed solution

Partial revert of #2918

javiertejero added a commit to travelperk/gunicorn that referenced this issue Feb 1, 2024
when opening many sockets the worker gets frozen, consuming 100% CPU

to reproduce:

```
cd examples
python standalone_app_gthread.py
```

and then using the Apache Bench tool:
```
ab -n1000 -c100 http://0.0.0.0:8080/
```

Fixes benoitc#3146
javiertejero (Author) commented

@JorisOnGithub could you please confirm whether this patch breaks any of the improvements needed to solve #2917?

PS: I couldn't reproduce the issues you described, but I don't have WebView2, and I couldn't reproduce them with Microsoft Edge for macOS either.

benoitc added this to the 22.0 milestone Feb 3, 2024
chris-griffin commented

Hi @benoitc and team - First off, thank you all for what you do! I tried to contact security@gunicorn.org about this issue, but the email bounced.

[screenshot of the bounce notification from security@gunicorn.org]

Since this issue is already out in the public, it seemed appropriate to post here instead. This issue is a denial-of-service (DoS) vulnerability, although it is not labeled as such in the original description.

While the default of 1000 for worker_connections should mean that deployments with reasonable rate limits in upstream reverse proxies/WAFs are protected, any deployment whose per-IP rate limit is above 1000 would be trivial to exploit. Additionally, an effective DDoS is plausible with a relatively low number of IP addresses.

I see that gunicorn has not historically used GitHub's security advisory feature. I just wanted to highlight that it makes it very easy to request a CVE in case you are not familiar with this feature.

sylt commented Feb 14, 2024

I've debugged the issue, and I think the smallest diff for fixing it (while maintaining the functionality of #2917) is:

diff --git a/gunicorn/workers/gthread.py b/gunicorn/workers/gthread.py
index c9c42345..9c1a92ad 100644
--- a/gunicorn/workers/gthread.py
+++ b/gunicorn/workers/gthread.py
@@ -119,6 +119,9 @@ class ThreadWorker(base.Worker):
 
     def accept(self, server, listener):
         try:
+            if self.nr_conns >= self.worker_connections:
+                return
+
             sock, client = listener.accept()
             # initialize the connection object
             conn = TConn(self.cfg, sock, client, server)
@@ -218,6 +221,10 @@ class ThreadWorker(base.Worker):
                                       return_when=futures.FIRST_COMPLETED)
             else:
                 # wait for a request to finish
+                events = self.poller.select(0.0)
+                for key, _ in events:
+                    callback = key.data
+                    callback(key.fileobj)
                 result = futures.wait(self.futures, timeout=1.0,
                                       return_when=futures.FIRST_COMPLETED)
 

For those interested, I think the root cause of the bug is the following:

  1. All is good. There are no connections. We're running in run(), specifically in self.poller.select(), waiting for a connection to arrive.
  2. All of a sudden, someone wants to connect. self.poller.select() returns our self.accept() callback, we enter it and accept the connection. The resulting socket returned from accept() is added to the event queue.
  3. Step 2 is repeated until we can't accept any more connections (i.e., self.nr_conns hits the limit self.worker_connections).
  4. Now, it's time for another iteration of the loop in run(). Since we've hit our connection limit, the code skips self.poller.select(), since we shouldn't accept any new connections. Unfortunately, since our accept()ed sockets are also on the event queue, we won't process their requests either! Oops...
  5. So we loop ad infinitum. Since the newly accepted sockets never had their initial handlers run, there are no futures to process either, and calling futures.wait(self.futures) with self.futures = [] returns immediately. It's then back to the beginning of the loop, turning it into a busy loop (see the sketch below).
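
As a stripped-down illustration of step 5 (not gunicorn's code, just the standard-library behaviour it relies on), futures.wait() on an empty collection returns immediately instead of blocking for its timeout, which is what turns the loop into a busy loop:

```
import time
from concurrent import futures

start = time.monotonic()
# No pending futures, so wait() has nothing to block on and returns at once,
# even though a 1 second timeout was requested.
done, not_done = futures.wait([], timeout=1.0,
                              return_when=futures.FIRST_COMPLETED)
print(done, not_done)                               # set(), set()
print(f"elapsed: {time.monotonic() - start:.4f}s")  # ~0.0000s, not 1.0s
```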

About the patch: by running self.poller.select() even in the case when we want to avoid accepting connections, we'll always serve our already accepted connections, thereby fixing the bug. However, we need an extra if statement in self.accept() to protect us from accepting too many connections.

I had some other clean-ups in mind which would fix this bug, but I'm just posting this minimal version early on in case @benoitc would like to incorporate it.

sylt added a commit to sylt/gunicorn that referenced this issue Feb 14, 2024
This change-set aims to change things so that we do everything not related
to the request on the main thread. This way, we can hopefully avoid lots
of strange race conditions and make it easier to reason about the code.

This also fixes benoitc#3146, but there are leaner ways of doing that.

Work in progress.
sylt added a commit to sylt/gunicorn that referenced this issue Feb 17, 2024
The main purpose is to remove complexity from gthread by:

* Removing the lock for handling self._keep and self.poller. This is
  possible since we now do all such manipulation on the main thread
  instead. When a connection is done, it posts a callback through the
  PollableMethodCaller which gets executed on the main thread.

* Having a single event queue (self.poller), as opposed to also
  managing a set of futures. This fixes benoitc#3146 (although there are
  more minimal ways of doing it).

There are other more minor things as well:

* Renaming some variables, e.g. self._keep to self.keepalived_conns.
* Remove self-explanatory comments (what the code does, not why).
* Remove fiddling with connection socket block/not blocked.

Some complexity has been added to the shutdown sequence, but hopefully
for good reason: it's to make sure that all already accepted
connections are served within the grace period.
sylt added a commit to sylt/gunicorn that referenced this issue Feb 18, 2024
The main purpose is to remove complexity from gthread by:

* Removing the lock for handling self._keep and self.poller. This is
  possible since we now do all such manipulation on the main thread
  instead. When a connection is done, it posts a callback through the
  PollableMethodCaller which gets executed on the main thread.

* Having a single event queue (self.poller), as opposed to also
  managing a set of futures. This fixes benoitc#3146 (although there are
  more minimal ways of doing it).

There are other more minor things as well:

* Renaming some variables, e.g. self._keep to self.keepalived_conns.
* Remove self-explanatory comments (what the code does, not why).
* Just decide that socket is blocking.

Some complexity has been added to the shutdown sequence, but hopefully
for good reason: it's to make sure that all already accepted
connections are served within the grace period.
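
The first bullet above describes posting callbacks from worker threads to the main thread through a PollableMethodCaller. A generic sketch of that pattern (hypothetical names and details, not the actual implementation in the PR) could look like this:

```
import os
import queue
import selectors


class PollableMethodCaller:
    """Sketch of a self-pipe based 'run this on the main thread' helper."""

    def __init__(self, selector):
        self.r_fd, self.w_fd = os.pipe()
        os.set_blocking(self.r_fd, False)
        self.calls = queue.SimpleQueue()
        # Waking the pipe makes select() hand back this object's drain method.
        selector.register(self.r_fd, selectors.EVENT_READ, self.drain)

    def defer(self, func, *args):
        """Called from worker threads: run func(*args) on the main thread."""
        self.calls.put((func, args))
        os.write(self.w_fd, b"\0")          # wake up the main thread's select()

    def drain(self, _fd):
        """Called on the main thread when select() reports the pipe readable."""
        os.read(self.r_fd, 4096)            # clear the wake-up bytes
        while not self.calls.empty():
            func, args = self.calls.get()
            func(*args)


# Main-thread loop (sketch): the selector wakes up either for client sockets
# or because some worker thread deferred work onto the main thread:
#
#     for key, _ in selector.select(timeout):
#         key.data(key.fileobj)
```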