Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

De-duplicate and parameterize exit error logging. #3057

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

jeroenp
Copy link

@jeroenp jeroenp commented Aug 17, 2023

The signal log message had the PID and signal in the message format string, which makes each message unique. This is awkward in log processing with for example Sentry, where you want to see similar messages grouped together.

The exit message was emitted twice, from the "if exitcode != 0" and the "if exitcode > 0" clause. You would be seeing things like:

[2023-08-17 09:50:10 +0200] [66450] [ERROR] Worker (pid:66451) exited with code 1 
[2023-08-17 09:50:10 +0200] [66450] [ERROR] Worker (pid:66451) exited with code 1.

See issue #3056

The signal log message had the PID and signal in the message format string,
which makes each message unique. This is awkward in log processing with for
example Sentry, where you want to see similar messages grouped together.

The exit message was emitted twice, from the "if exitcode != 0" and
the "if exitcode > 0" clause. You would be seeing things like:

[2023-08-17 09:50:10 +0200] [66450] [ERROR] Worker (pid:66451) exited with code 1
[2023-08-17 09:50:10 +0200] [66450] [ERROR] Worker (pid:66451) exited with code 1.
@benoitc
Copy link
Owner

benoitc commented Sep 5, 2023

I like this change. Are we abble to create a test for it?

@jeroenp
Copy link
Author

jeroenp commented Sep 7, 2023

I like this change. Are we abble to create a test for it?

Something like a logging class in dummy application in test_arbiter.py that captures log messages? I was assuming that arbiter.stop() hits these changes, but instead only reap_workers() hits this.

Copy link
Collaborator

@tilgovi tilgovi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the change and think we should merge it with or without tests. It's pretty straightforward. That said, a simple test would be to just call reap_workers with os.waitpid patched on an arbiter constructed with a patched log.

But it seems excessive to me.

@tilgovi tilgovi added this to the 22.0 milestone Dec 29, 2023
@jeroenp
Copy link
Author

jeroenp commented Jan 2, 2024

There's also PR 3094. #3094, which is related and prevents "operator confusion" so to say.

Copy link
Owner

@benoitc benoitc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please replace the word Child by Worker. Can you tell me why you don't want to use the exitcode= status >> 8 ?

if exitcode != 0:
self.log.error('Worker (pid:%s) exited with code %s', wpid, exitcode)
self.log.error("Child exited with code %s (pid: %s)",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should use "Worker" instead of "Child" here. This is more meaningful in the context of Gunicorn arbitration.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Benoit,

Sorry, I totally missed that I committed to this open PR! That was not my intention. I'll have a look what happened before and what I want from these changes. My intent was to lower the log level on the regular reload/sigterms. And the two raises in between the log statements were irksome to me.

What I find confusing that both master and worker are logging the same event, with the same wording. In the logs it's not clear which process is which.

# 0: as seen from macos/py311 from reload on workers
# where workers from SIGTERM but waitpid/nohang gives
# me 0.
self.log.info("Child terminated (pid: %s)", wpid)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same commmenta s above.

@@ -522,37 +522,40 @@ def reap_workers(self):
# A worker was terminated. If the termination reason was
# that it could not boot, we'll shut it down to avoid
# infinite start/stop cycles.
exitcode = status >> 8

exitcode = os.WEXITSTATUS(status) \
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not reusing the status ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to find out why I was/am seeing 0 on worker reload, where I was expecting to see 15 for SIGTERM. I was hoping that this might give me more, but clearly 0 stays 0.

wpid, sig_name)

# Additional hint for SIGKILL
if status == signal.SIGKILL:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we miss this extra hint in the change. Could you add it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You miss the hint? I thought it was misleading because it never is a memory issue whenever we get to see these error logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants