cluster.rb - fixup worker wait logic in #check_workers and #stop_workers #1892

Closed
MSP-Greg wants to merge 1 commit (branch: cluster-stop-workers)

Conversation

MSP-Greg (Member) commented Aug 7, 2019:

After #1887 was accepted, I was wondering about the wait code in stop_workers. While thinking about that, I was also bothered by:

  1. Process.waitpid(-1, Process::WNOHANG) can raise Errno::ECHILD, but it isn't rescued in the current code used in check_workers. This may be an issue with external use of SIGTERM (see the sketch after this comment).

  2. @evanphx expressed a preference for never using wait(-1), and I agree.

  3. Intermittent failures have happened in CI, but always on Ruby versions < 2.6.

  4. The code in stop_workers is Ruby version specific, but not because of a new feature. I've never liked that.

JFYI, every job passed in my fork with this change. I may run it a few more times.
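(A minimal sketch of the rescue described in point 1, assuming the same waitpid loop as the diff below; this is an illustration, not the patch itself:)

pids = []
begin
  while (pid = Process.waitpid(-1, Process::WNOHANG))
    pids << pid
  end
rescue Errno::ECHILD
  # no children remain to reap
end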

pids = []
while pid = Process.waitpid(-1, Process::WNOHANG) do
  pids << pid
  @workers.reject! do |w|
Member commented:

This code now seems similar enough to be extracted into a new method, yes?
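(For illustration, a hedged sketch of what that extraction might look like; the method name wait_workers is hypothetical:)

def wait_workers
  pids = []
  while (pid = Process.waitpid(-1, Process::WNOHANG))
    pids << pid
  end
  pids
rescue Errno::ECHILD
  pids # no children left to reap
end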

MSP-Greg (author) replied:

I did think about that, but see below.

Member commented:

This logic seems like the same one that was causing zombies: we will delete from @workers if a process is dead due to signal but the process is still running. When it actually does exit, we won't call wait on it, so this reintroduces the bug.

MSP-Greg (author) commented Aug 7, 2019:

> if a process is dead due to signal

I did 'turn around' things so that wait is called before dead? is checked, but I've kind of left the dead? check in because it had previously been used for a long time. But wait may not have been using WNOHANG then. Also, I need to review (again) how & when @dead is set, as I'm not clear on that.

Regardless, what you're saying makes sense to me. Should I remove the || w.dead? code on line 220?

MSP-Greg (author) replied:

@evanphx

As above, I removed the || w.dead? code in my fork, and it passed. Would you prefer that? I think I would...

I don't think there are any tests that create a situation where a worker needs a 'graceful' shutdown, then close it externally, and then wait to see if a zombie exists. Maybe a bit messy...

Member commented:

Using a separate array is what I started to do in that other PR. It's probably not necessary if we remove w.dead? from here, but I'm mentioning it so that we've at least thought about it.

Member commented:

I looked through the code and it's fine. The main effect is that there may be a very small delay before spawning a replacement worker, and that's ok. The dead status code is sent right before the worker exits anyway.

MSP-Greg (author) replied:

Thanks for looking at it. Next major computer purchase is a system for Ubuntu/*nix something...

Member commented:

@MSP-Greg or just rent an AWS instance by the hour!

MSP-Greg (author) commented:

@evanphx

After finally noticing the reject! typo (funny how reject doesn't work), I think I've got the code for a 'separate array' working. If you'd prefer it, let me know.

Patch: MSP-Greg@f141b02
Travis: https://travis-ci.org/MSP-Greg/puma/builds/569122723

@nateberkopec you're on macOS I assume. You're 90% of the way there already. I might as well be in the next solar system...
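(A hedged sketch of what the 'separate array' approach could look like; this is not the linked patch, just the general shape:)

reaped = []
begin
  while (pid = Process.waitpid(-1, Process::WNOHANG))
    reaped << pid
  end
rescue Errno::ECHILD
  # no children left to reap
end
@workers.reject! { |w| reaped.include?(w.pid) }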

t_end = Process.clock_gettime(Process::CLOCK_MONOTONIC)
log format(" worker shutdown time: %6.2f", t_end - t_st)
break if pids.empty?
sleep 0.2
Member commented:

Why do we sleep/wait for everyone to die here but not on line 215?

MSP-Greg (author) commented Aug 7, 2019:

stop_workers is just that: it needs to wait for all the workers to stop.

check_workers checks for workers (possibly stopped externally) that have been SIGTERM'd, removes them from the array, and moves on. It's not meant to pause, as it essentially runs continuously.
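(A hedged sketch of that stop_workers wait loop; t_st, log, sleep 0.2, and @workers come from the surrounding diff, the rest is illustrative:)

t_st = Process.clock_gettime(Process::CLOCK_MONOTONIC)
loop do
  pids = @workers.map(&:pid).select do |pid|
    begin
      Process.wait(pid, Process::WNOHANG).nil? # nil means still running
    rescue Errno::ECHILD
      false # already reaped
    end
  end
  break if pids.empty?
  sleep 0.2
end
t_end = Process.clock_gettime(Process::CLOCK_MONOTONIC)
log format(" worker shutdown time: %6.2f", t_end - t_st)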

MSP-Greg (author) added:

One thing I meant to check into: if a worker is stopped externally, I thought check_workers might wait up to Const::WORKER_CHECK_INTERVAL (5 seconds) before it runs wait and then respawns. Is that terrible, or ok? Or does it run sooner because something sets the force parameter to true?

Make sense?

A little distracted right now...

end
t_end = Process.clock_gettime(Process::CLOCK_MONOTONIC)
Member commented:

not a fan of logging? 😆

MSP-Greg (author) commented Aug 7, 2019:

I don't know. I think I added that recently. I can add log/debug lines if you'd like...

nateberkopec added this to the 4.1.0 milestone Aug 7, 2019
MSP-Greg (author) commented Aug 9, 2019:

@evanphx, @nateberkopec

This PR and some of the issues involving worker 'proper exit' do need more tests. One thing I wonder is whether there's a need for a Worker#term? method to indicate whether Worker#term has been called.

For a worker w, we have logic like the following, used in a @workers.each loop:

begin
  Process.wait(w.pid, Process::WNOHANG)
rescue Errno::ECHILD
  true # child is already terminated
end

With the addition of the #term? flag/attribute/method:

begin
  if Process.wait(w.pid, Process::WNOHANG)
    true
  else
    w.term if w.term?
    nil
  end
rescue Errno::ECHILD
  true # child is already terminated
end

EDIT: another possible use for Worker#term? would be to spawn/cull workers based on the length of @workers.reject { |w| w.term? }, which I think could be dropped in later. Hence, spawn/cull would be based on term?, not on whether the worker has exited...
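(A hedged sketch of that spawn/cull idea; spawn_workers and @options[:workers] are assumed names, and term? is the method proposed above:)

# cull/spawn decisions based on term?, not on exit status
active = @workers.reject { |w| w.term? }
spawn_workers if active.size < @options[:workers]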

MSP-Greg (author) commented:
Closing in favor of #1908

MSP-Greg closed this Aug 13, 2019
MSP-Greg deleted the cluster-stop-workers branch Aug 23, 2019