Graceful shutdown for tornado worker #2317

Open
tilgovi opened this issue Apr 21, 2020 · 7 comments
Labels
good first issue, help wanted, Thirdparty/Tornado

Comments

@tilgovi (Collaborator) commented Apr 21, 2020

Following from #1236, the tornado worker should behave like the other workers.

On graceful shutdown:

  • Stop accepting and close the listening socket
  • Start a graceful shutdown timer
  • Force end all connections if there are running requests after the timeout expires

I think that the call to server.stop() is already stopping it from accepting, but I don't know if it closes the listener socket. I don't see any graceful timeout handling.

We should also double-check that calling server.stop() prevents open connections from being reused for keep-alive requests.
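
For reference, a rough sketch of what that sequence could look like on top of Tornado's HTTPServer API (stop() plus close_all_connections()), assuming Tornado 5+ with the asyncio-backed IOLoop; this is only an illustration, not the worker implementation:

import asyncio

from tornado.ioloop import IOLoop


async def graceful_shutdown(server, graceful_timeout):
    # Stop accepting new connections and close the listening sockets.
    server.stop()
    try:
        # Wait for open connections to finish closing, bounded by the
        # graceful timeout; HTTPServer.close_all_connections() resolves
        # once every connection the server is tracking has been closed.
        await asyncio.wait_for(server.close_all_connections(),
                               timeout=graceful_timeout)
    except asyncio.TimeoutError:
        # The timeout expired with requests still running; fall through and
        # stop the loop so remaining connections go down with the worker.
        pass
    IOLoop.current().stop()

In the worker, the timeout would come from self.cfg.graceful_timeout, as it does for the other workers.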

@tilgovi added the good first issue, help wanted, and Thirdparty/Tornado labels on Apr 21, 2020
@vgrebenschikov commented Jan 19, 2021

The proposed solution will not help in the case of keep-alive sockets:

See the proof in this comment: #1236 (comment)

To shut down a keep-alive socket gracefully, we should do either one of the following:

  1. First respond to the current request with Connection: close (downgrading keep-alive to a close connection) and then close the socket (see the sketch at the end of this comment), or
  2. Catch a window in which the client has not yet sent its next request (we have to finish closing before the client fires the next request; otherwise that request fails, which is bad).

P.S. The original issue #1236 also affects us without any Tornado, just plain gunicorn ... I am not sure why it was closed in favor of this one.
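
To make option 1 concrete, a minimal, purely illustrative sketch (hypothetical code, not gunicorn's implementation) of the header downgrade:

# Illustrative only: once the worker has started shutting down, the next
# response on a keep-alive connection carries "Connection: close", and the
# socket is closed after that response is flushed.
def response_headers(body, shutting_down):
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    # Downgrade keep-alive to a close connection during shutdown.
    headers.append(("Connection", "close" if shutting_down else "keep-alive"))
    return headers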

@benoitc (Owner) commented Jan 19, 2021 via email

@vgrebenschikov commented Jan 20, 2021

> I don't think the spec expects such a socket to be closed gracefully if the request has not started.
> The client should normally expect to time out.
> Do you have any example of a client that breaks with the current behavior?

How to reproduce the problem (in fact, it can easily be observed on a busy production system during a graceful shutdown):

Run gunicorn with keep-alive (a trivial app returning some data):
$ gunicorn --max-requests 512 --keep-alive 2 --threads 20 --workers 4 t:app
Run Apache Benchmark:

$ ab -n 10000 -c 20 -s 1 -k -r 127.0.0.1:8000/

...
Concurrency Level: 20
Time taken for tests: 2.693 seconds
Complete requests: 10000
Failed requests: 435

Note the > 4% failed requests, caused purely by restarted workers (restarted here by --max-requests).

Run gunicorn without keep-alive (same app):
$ gunicorn --max-requests 512 --keep-alive 0 --threads 20 --workers 4 t:app
$ ab -n 10000 -c 20 -s 1 -k -r 127.0.0.1:8000/

...
Complete requests: 10000
Failed requests: 0

No failed requests this time.

The errors in the first case are caused by the server closing the keep-alive socket (due to --max-requests) while the client's next request is in progress.
gunicorn 20.0.4
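
For completeness, t:app above is just a trivial WSGI app returning some data; a hypothetical t.py stand-in (the actual app is not included here):

# t.py - hypothetical stand-in for the trivial app used above; any WSGI app
# that returns some data reproduces the same behavior.
def app(environ, start_response):
    body = b"hello world\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]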

@tilgovi (Collaborator, Author) commented Jan 24, 2021

As I answered in #1236 (comment), the design is to send "Connection: close" once the server is shutting down and to wait for the graceful timeout before forcing connections closed. I'm able to reproduce what you see. Based on the distribution of request times, it seems impossible that connections are getting closed during the keepalive timeout; instead, they must be getting closed by the shutdown before they receive "Connection: close".

I'm not able to reproduce this with gevent, but I can reproduce it with eventlet and the threaded worker. I'll see if I can narrow down a bug in either of those.

@tilgovi (Collaborator, Author) commented Jan 24, 2021

I've narrowed down the issue in eventlet to be a failing getsockname() call during graceful shutdown. I'll file a separate bug for that tomorrow and tackle it. I haven't isolated the issue for threaded workers yet, but believe there may be a bug. Thanks for your report @vgrebenschikov.

@tilgovi (Collaborator, Author) commented Jan 24, 2021

Alright, I've isolated the bug for the threaded worker, as well. I'll file issues or PRs to fix both tomorrow.

@tilgovi (Collaborator, Author) commented Jan 24, 2021

Here's the gist if you want to check my work and try it out before I get a chance to look at it with eyes that have slept:

diff --git a/gunicorn/workers/gthread.py b/gunicorn/workers/gthread.py
index d531811..ed70dcb 100644
--- a/gunicorn/workers/gthread.py
+++ b/gunicorn/workers/gthread.py
@@ -220,13 +220,37 @@ class ThreadWorker(base.Worker):
             # handle keepalive timeouts
             self.murder_keepalived()
 
-        self.tpool.shutdown(False)
-        self.poller.close()
-
+        # stop accepting new connections
         for s in self.sockets:
+            self.poller.unregister(s)
             s.close()
 
-        futures.wait(self.futures, timeout=self.cfg.graceful_timeout)
+        # handle existing connections until the graceful timeout
+        deadline = time.time() + self.cfg.graceful_timeout
+        while self.nr_conns:
+            # notify the arbiter we are alive
+            self.notify()
+
+            timeout = deadline - time.time()
+            if timeout <= 0:
+                break
+
+            # clean up finished requests
+            result = futures.wait(self.futures, timeout=0)
+            for fut in result.done:
+                self.futures.remove(fut)
+
+            # handle keepalive requests
+            events = self.poller.select(1.0)
+            for key, _ in events:
+                callback = key.data
+                callback(key.fileobj)
+
+            # handle keepalive timeouts
+            self.murder_keepalived()
+
+        self.tpool.shutdown(False)
+        self.poller.close()
 
     def finish_request(self, fs):
         if fs.cancelled():
@@ -238,7 +262,7 @@ class ThreadWorker(base.Worker):
             (keepalive, conn) = fs.result()
             # if the connection should be kept alived add it
             # to the eventloop and record it
-            if keepalive and self.alive:
+            if keepalive:
                 # flag the socket as non blocked
                 conn.sock.setblocking(False)

The first change is to ensure that the worker continues to handle keepalive requests in progress while shutting down. The second chunk fixes a bug preventing keepalive requests that started before the shutdown began from handling their next request and receiving the "Connection: close" that they need to shut down gracefully.
