New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
De-duplicate and parameterize exit error logging. #3057
base: master
Are you sure you want to change the base?
Conversation
The signal log message had the PID and signal in the message format string, which makes each message unique. This is awkward in log processing with for example Sentry, where you want to see similar messages grouped together. The exit message was emitted twice, from the "if exitcode != 0" and the "if exitcode > 0" clause. You would be seeing things like: [2023-08-17 09:50:10 +0200] [66450] [ERROR] Worker (pid:66451) exited with code 1 [2023-08-17 09:50:10 +0200] [66450] [ERROR] Worker (pid:66451) exited with code 1.
I like this change. Are we abble to create a test for it? |
Something like a logging class in dummy application in test_arbiter.py that captures log messages? I was assuming that arbiter.stop() hits these changes, but instead only reap_workers() hits this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the change and think we should merge it with or without tests. It's pretty straightforward. That said, a simple test would be to just call reap_workers
with os.waitpid
patched on an arbiter constructed with a patched log
.
But it seems excessive to me.
There's also PR 3094. #3094, which is related and prevents "operator confusion" so to say. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please replace the word Child by Worker. Can you tell me why you don't want to use the exitcode= status >> 8
?
if exitcode != 0: | ||
self.log.error('Worker (pid:%s) exited with code %s', wpid, exitcode) | ||
self.log.error("Child exited with code %s (pid: %s)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should use "Worker" instead of "Child" here. This is more meaningful in the context of Gunicorn arbitration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Benoit,
Sorry, I totally missed that I committed to this open PR! That was not my intention. I'll have a look what happened before and what I want from these changes. My intent was to lower the log level on the regular reload/sigterms. And the two raises in between the log statements were irksome to me.
What I find confusing that both master and worker are logging the same event, with the same wording. In the logs it's not clear which process is which.
# 0: as seen from macos/py311 from reload on workers | ||
# where workers from SIGTERM but waitpid/nohang gives | ||
# me 0. | ||
self.log.info("Child terminated (pid: %s)", wpid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same commmenta s above.
@@ -522,37 +522,40 @@ def reap_workers(self): | |||
# A worker was terminated. If the termination reason was | |||
# that it could not boot, we'll shut it down to avoid | |||
# infinite start/stop cycles. | |||
exitcode = status >> 8 | |||
|
|||
exitcode = os.WEXITSTATUS(status) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not reusing the status ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to find out why I was/am seeing 0 on worker reload, where I was expecting to see 15 for SIGTERM. I was hoping that this might give me more, but clearly 0 stays 0.
wpid, sig_name) | ||
|
||
# Additional hint for SIGKILL | ||
if status == signal.SIGKILL: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we miss this extra hint in the change. Could you add it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You miss the hint? I thought it was misleading because it never is a memory issue whenever we get to see these error logs.
The signal log message had the PID and signal in the message format string, which makes each message unique. This is awkward in log processing with for example Sentry, where you want to see similar messages grouped together.
The exit message was emitted twice, from the "if exitcode != 0" and the "if exitcode > 0" clause. You would be seeing things like:
See issue #3056