New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cluster.rb - fixup worker wait logic in #check_workers and #stop_workers #1892
Conversation
pids = [] | ||
while pid = Process.waitpid(-1, Process::WNOHANG) do | ||
pids << pid | ||
@workers.reject! do |w| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code now seems similar enough to be extracted into a new method, yes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did think about that, but see below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic seems like the same one that causing zombies because we will delete from @workers
if a process is dead due to signal but the process is still running. When it actually does exit, we won't call wait on it, so this reintroduces the bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if a process is dead due to signal
I did 'turn around' things so that wait was called before dead?
was checked, but I've kind of left the dead?
check in the because I thought it was previously used for a long time. But, wait
may not have been using WNOHANG
then. Also, I need to review (again) how & when @dead
is set, as I'm not clear on that.
Regardless, what you're saying makes sense to me, should I remove the || w.dead?
code on line 220?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, I removed the || w.dead?
code in my fork, and it passed. Would you prefer that? I think I would...
I don't think there are any tests that create a situation where a worker needs a 'graceful' shutdown and then externally closes it, and the waits to see if a zombie exists. Maybe a bit messy...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a separate array is what I started to do in that other PR. I don't think it's probably necessary if we remove w.dead?
from here, but I'm mentioning it just so that we've at least thought about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked through the code and it's fine. The main effect will be that there will be a very small delay perhaps before spawning a replacement worker and that's ok. The dead status code is sent right before the worker exits anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking at it. Next major computer purchase is a system for Ubuntu/*nix something...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MSP-Greg or just rent an AWS instance by the hour!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After finally noticing the reject!
typo (funny how reject
doesn't work), I think I've got the code for a 'separate array' working. If you'd prefer it, let me know.
Patch: MSP-Greg@f141b02
Travis: https://travis-ci.org/MSP-Greg/puma/builds/569122723
@nateberkopec you're on macOS I assume. You're 90% of the way there already. I might as well be in the next solar system...
t_end = Process.clock_gettime(Process::CLOCK_MONOTONIC) | ||
log format(" worker shutdown time: %6.2f", t_end - t_st) | ||
break if pids.empty? | ||
sleep 0.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we sleep/wait for everyone to die here but not on line 215?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stop_workers
is just that, it needs to wait for all the workers to stop.
check_workers
checks for workers (possibly stopped externally) that have been SIGTERM
'd, and removes then from the array, then moves on. It's not meant to pause, as it essentially runs continuously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I meant to check into is that if a worker is stopped externally, I thought that check_workers
might possibly wait up to Const::WORKER_CHECK_INTERVAL
(5 seconds) before it runs wait
and then respawns? Is that terrible, or ok? Or is it running before then by something setting the force
parameter to true?
Make sense?
A little distracted right now...
end | ||
t_end = Process.clock_gettime(Process::CLOCK_MONOTONIC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a fan of logging? 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know. I think I added that recently. I can add log/debug lines if you'd like...
pids = [] | ||
while pid = Process.waitpid(-1, Process::WNOHANG) do | ||
pids << pid | ||
@workers.reject! do |w| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic seems like the same one that causing zombies because we will delete from @workers
if a process is dead due to signal but the process is still running. When it actually does exit, we won't call wait on it, so this reintroduces the bug.
2278fa0
to
3d1a0d7
Compare
3d1a0d7
to
48d7ca4
Compare
This PR and some of the issues involving worker 'proper exit' do need more tests. One thing I wonder is if there's a need for a For worker 'w', we have logic like the following used in a begin
Process.wait(w.pid, Process::WNOHANG)
rescue Errno::ECHILD
true # child is already terminated
end With the addition of the begin
if Process.wait(w.pid, Process::WNOHANG)
true
else
w.term if w.term?
nil
end
rescue Errno::ECHILD
true # child is already terminated
end EDIT: another possible use for |
Closing in favor of #1908 |
After #1887 was accepted, I was wondering about the wait code in stop_workers. While thinking about that, I was also bothered by:
Process.waitpid(-1, Process::WNOHANG)
can raiseErrno::ECHILD
, but isn't caught in the current code used incheck_workers
. This may be an issue with extenal use ofSIGTERM
.@evanphx expressed a preference for never using
wait(-1)
, which I would agree with.Intermittent failures have happened in CI, but always on Ruby versions < 2.6.
Code in
stop_workers
is Ruby version specific, but not because of a new feature. I've never liked that.JFYI, every job passed in my fork on this. I may run it a few more times.