cluster.rb - fixup worker wait logic in #check_workers and #stop_workers #1892

Closed
MSP-Greg wants to merge 1 commit (branch: cluster-stop-workers)

Conversation

MSP-Greg (Member) commented Aug 7, 2019:

After #1887 was accepted, I was wondering about the wait code in stop_workers. While thinking about that, I was also bothered by:

  1. Process.waitpid(-1, Process::WNOHANG) can raise Errno::ECHILD, but it isn't rescued in the current code used in check_workers. This may be an issue with external use of SIGTERM (see the sketch after this comment).

  2. @evanphx expressed a preference for never using wait(-1), and I agree.

  3. Intermittent failures have happened in CI, but always on Ruby versions < 2.6.

  4. The code in stop_workers is Ruby version specific, but not because of a new feature. I've never liked that.

JFYI, every job passed in my fork with this change. I may run it a few more times.
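(A minimal sketch of the rescue described in point 1, assuming the same waitpid loop as the diff below; this is an illustration, not the patch itself:)

pids = []
begin
  while (pid = Process.waitpid(-1, Process::WNOHANG))
    pids << pid
  end
rescue Errno::ECHILD
  # no children remain to reap
end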

pids = []
while pid = Process.waitpid(-1, Process::WNOHANG) do
  pids << pid
  @workers.reject! do |w|
Member commented:

This code now seems similar enough to be extracted into a new method, yes?
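(For illustration, a hedged sketch of what that extraction might look like; the method name wait_workers is hypothetical:)

def wait_workers
  pids = []
  while (pid = Process.waitpid(-1, Process::WNOHANG))
    pids << pid
  end
  pids
rescue Errno::ECHILD
  pids # no children left to reap
end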

MSP-Greg (author) replied:

I did think about that, but see below.

Member commented:

This logic seems like the same one that was causing zombies: we will delete from @workers if a process is dead due to signal but the process is still running. When it actually does exit, we won't call wait on it, so this reintroduces the bug.

MSP-Greg (author) commented Aug 7, 2019:

> if a process is dead due to signal

I did 'turn around' things so that wait is called before dead? is checked, but I've kind of left the dead? check in because it had previously been used for a long time. But wait may not have been using WNOHANG then. Also, I need to review (again) how & when @dead is set, as I'm not clear on that.

Regardless, what you're saying makes sense to me. Should I remove the || w.dead? code on line 220?

MSP-Greg (author) replied:

@evanphx

As above, I removed the || w.dead? code in my fork, and it passed. Would you prefer that? I think I would...

I don't think there are any tests that create a situation where a worker needs a 'graceful' shutdown, then close it externally, and then wait to see if a zombie exists. Maybe a bit messy...

Member commented:

Using a separate array is what I started to do in that other PR. It's probably not necessary if we remove w.dead? from here, but I'm mentioning it so that we've at least thought about it.

Member commented:

I looked through the code and it's fine. The main effect is that there may be a very small delay before spawning a replacement worker, and that's ok. The dead status code is sent right before the worker exits anyway.

MSP-Greg (author) replied:

Thanks for looking at it. Next major computer purchase is a system for Ubuntu/*nix something...

Member commented:

@MSP-Greg or just rent an AWS instance by the hour!

MSP-Greg (author) commented:

@evanphx

After finally noticing the reject! typo (funny how reject doesn't work), I think I've got the code for a 'separate array' working. If you'd prefer it, let me know.

Patch: MSP-Greg@f141b02
Travis: https://travis-ci.org/MSP-Greg/puma/builds/569122723

@nateberkopec you're on macOS I assume. You're 90% of the way there already. I might as well be in the next solar system...
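(A hedged sketch of what the 'separate array' approach could look like; this is not the linked patch, just the general shape:)

reaped = []
begin
  while (pid = Process.waitpid(-1, Process::WNOHANG))
    reaped << pid
  end
rescue Errno::ECHILD
  # no children left to reap
end
@workers.reject! { |w| reaped.include?(w.pid) }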

t_end = Process.clock_gettime(Process::CLOCK_MONOTONIC)
log format(" worker shutdown time: %6.2f", t_end - t_st)
break if pids.empty?
sleep 0.2
Member commented:

Why do we sleep/wait for everyone to die here but not on line 215?

MSP-Greg (author) commented Aug 7, 2019:

stop_workers is just that: it needs to wait for all the workers to stop.

check_workers checks for workers (possibly stopped externally) that have been SIGTERM'd, removes them from the array, and moves on. It's not meant to pause, as it essentially runs continuously.
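(A hedged sketch of that stop_workers wait loop; t_st, log, sleep 0.2, and @workers come from the surrounding diff, the rest is illustrative:)

t_st = Process.clock_gettime(Process::CLOCK_MONOTONIC)
loop do
  pids = @workers.map(&:pid).select do |pid|
    begin
      Process.wait(pid, Process::WNOHANG).nil? # nil means still running
    rescue Errno::ECHILD
      false # already reaped
    end
  end
  break if pids.empty?
  sleep 0.2
end
t_end = Process.clock_gettime(Process::CLOCK_MONOTONIC)
log format(" worker shutdown time: %6.2f", t_end - t_st)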

MSP-Greg (author) added:

One thing I meant to check into: if a worker is stopped externally, I thought check_workers might wait up to Const::WORKER_CHECK_INTERVAL (5 seconds) before it runs wait and then respawns. Is that terrible, or ok? Or does it run sooner because something sets the force parameter to true?

Make sense?

A little distracted right now...

end
t_end = Process.clock_gettime(Process::CLOCK_MONOTONIC)
Member commented:

not a fan of logging? 😆

MSP-Greg (author) commented Aug 7, 2019:

I don't know. I think I added that recently. I can add log/debug lines if you'd like...

nateberkopec added this to the 4.1.0 milestone Aug 7, 2019
MSP-Greg (author) commented Aug 9, 2019:

@evanphx, @nateberkopec

This PR and some of the issues involving worker 'proper exit' do need more tests. One thing I wonder is whether there's a need for a Worker#term? method to indicate whether Worker#term has been called.

For a worker w, we have logic like the following, used in a @workers.each loop:

begin
  Process.wait(w.pid, Process::WNOHANG)
rescue Errno::ECHILD
  true # child is already terminated
end

With the addition of the #term? flag/attribute/method:

begin
  if Process.wait(w.pid, Process::WNOHANG)
    true
  else
    w.term if w.term?
    nil
  end
rescue Errno::ECHILD
  true # child is already terminated
end

EDIT: another possible use for Worker#term? would be to spawn/cull workers based on the length of @workers.reject { |w| w.term? }, which I think could be dropped in later. Hence, spawn/cull would be based on term?, not on whether the worker has exited...
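(A hedged sketch of that spawn/cull idea; spawn_workers and @options[:workers] are assumed names, and term? is the method proposed above:)

# cull/spawn decisions based on term?, not on exit status
active = @workers.reject { |w| w.term? }
spawn_workers if active.size < @options[:workers]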

MSP-Greg (author) commented:
Closing in favor of #1908

MSP-Greg closed this Aug 13, 2019
MSP-Greg deleted the cluster-stop-workers branch Aug 23, 2019