
Fixes Cluster worker shutdown/restart #1908

Merged
merged 2 commits into puma:master from cluster-stop-workers-term on Aug 23, 2019

Conversation

MSP-Greg
Member

Hopefully fixes issues with correct handling of workers when:

  1. using phased-restart
  2. workers are externally SIGTERM'd

'Misbehaving' workers may not have been handled correctly previously.

Commit 'cluster.rb - fixup worker wait in Cluster, add Worker#term?' - relatively simple code that changes the handling of workers: it makes sure SIGKILL is triggered once @options[:worker_shutdown_timeout] has been exceeded, corrects the wait logic, and does not remove dead? workers from the Cluster @workers array.
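The escalation logic described above can be sketched roughly like this. This is a simplified illustration, not Puma's actual internals; WorkerStub and signal_for are hypothetical names made up for this example:

```ruby
# Sketch: once a worker has been told to terminate (term?), keep waiting
# within worker_shutdown_timeout, then escalate to SIGKILL after it.
class WorkerStub
  attr_reader :pid, :first_term_sent

  def initialize(pid)
    @pid = pid
    @first_term_sent = nil
  end

  def term!
    @first_term_sent ||= Time.now
  end

  def term?
    !@first_term_sent.nil?
  end
end

def signal_for(worker, shutdown_timeout, now = Time.now)
  return :SIGTERM unless worker.term?
  # Already asked to stop: wait within the timeout, SIGKILL only after it.
  (now - worker.first_term_sent) > shutdown_timeout ? :SIGKILL : :wait
end

w = WorkerStub.new(12_345)
p signal_for(w, 2)                # worker not yet termed
w.term!
p signal_for(w, 2)                # still within the timeout
p signal_for(w, 2, Time.now + 3)  # timeout exceeded
```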

Commit 'Adds two tests for worker SIGTERM/respawn and phased-restart' - adds two tests for the above. They're somewhat brittle and may fail, but not often. They take about 30 seconds each.

I added the tests to current master and on top of PR #1892, and both of the new tests failed in every job.

See PR #1892 and Issue #1904 for additional discussion.

begin
Process.wait2(pid)
rescue Errno::ECHILD
15
Member

15?

Member Author

@dentarg

Being a Windows type, I can't test this locally.

The change above in stop_forked_server was made because I did see infrequent (and false) Travis failures from it. But it isn't really part of this PR, so I reverted it.

It is wrong (should be [nil, 15]), but given the 'green' builds, it shows how infrequent the failures were... Thanks.
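For context on why a bare 15 is the wrong shape there: Process.wait2 returns a [pid, status] pair, so callers usually destructure its result, and a rescue fallback should keep that pair shape. A minimal illustration (wait2_with_fallback is a hypothetical name for this sketch):

```ruby
# Process.wait2 returns [pid, status]; a rescue returning a bare 15 breaks
# callers that destructure the result, while [nil, 15] preserves the shape.
def wait2_with_fallback(pid)
  Process.wait2(pid)
rescue Errno::ECHILD
  [nil, 15] # no such child; keep the [pid, status-like] pair shape
end

pid = fork { exit 0 }
reaped, _status = wait2_with_fallback(pid)
p reaped == pid            # the child was reaped normally
# The child is already reaped, so waiting again raises ECHILD internally
# and the fallback pair is returned instead.
p wait2_with_fallback(pid) # => [nil, 15]
```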

@MSP-Greg MSP-Greg force-pushed the cluster-stop-workers-term branch 2 times, most recently from fd640e7 to c3ba779 Compare August 15, 2019 16:27
@MSP-Greg
Member Author

MSP-Greg commented Aug 16, 2019

Tests updated to make use of after_worker_fork hook to track worker creation. Faster and more reliable...

EDIT: after revising the tests, I ran the tests on top of current master, all MRI jobs failed, see:
https://travis-ci.org/MSP-Greg/puma/builds/572576272

c.workers 2
c.worker_shutdown_timeout 2
c.app TestApps::SLEEP
c.after_worker_fork { |idx| workers_booted += 1 }
Member

Why do we need a new mechanism for determining if the process has finished booting? We already have two methods.

l = Puma::Launcher.new conf, :events => @events

t = Thread.new do
Thread.current.abort_on_exception = true
Member

No need to set this; it's already set on all threads in helper.rb.


Thread.kill worker0
Thread.kill worker1
http0 = nil
Member

What's the reason for setting all of these to nil?

assert_empty old_waited, msg
end

def test_worker_phased_restart
Member

These tests are exactly the same except for one line; could we clean that up a bit?
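One common way to do that cleanup, sketched under Minitest: move the shared body into a helper and pass the one differing step in as a block. Class, method, and block names here are illustrative stand-ins, not the PR's actual test code (which signals and restarts real workers):

```ruby
require "minitest/autorun"

class TestWorkerRestartSketch < Minitest::Test
  def test_restart_via_term
    assert_workers_replaced { |pids| pids.map { |pid| pid + 100 } } # stand-in for SIGTERM-ing workers
  end

  def test_restart_via_phased_restart
    assert_workers_replaced { |pids| pids.map { |pid| pid + 200 } } # stand-in for phased-restart
  end

  private

  # Shared body: run the restart trigger, then assert all old pids are gone.
  def assert_workers_replaced
    old_pids = [11, 12]
    new_pids = yield(old_pids)
    assert_equal 2, new_pids.length
    assert_empty new_pids & old_pids, "both workers should be replaced"
  end
end
```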

l.stop
assert_kind_of Thread, t.join, "server didn't stop"

assert_operator (Time.now.to_f - start_time).round(2), :<, 32
Member

Maybe build the bound you're looking for explicitly here, by adding the parts together:

assert_operator (Time.now.to_f - start_time).round(2), :<, (conf.worker_shutdown_timeout + 30)

I forgot, where does the 30 come from?

msg = "old_pids #{old_pids.inspect} new_pids #{new_pids.inspect} old_waited #{old_waited.inspect}"
assert_equal 2, new_pids.length, msg
assert_equal 2, old_pids.length, msg
assert_equal new_pids, (new_pids - old_pids), "#{msg}\nBoth workers should be replaced"
Member

IMO this would be slightly clearer:

assert_empty new_pids & old_pids

pids = []
while pid = Process.waitpid(-1, Process::WNOHANG) do
pids << pid
@workers.reject! do |w|
Member

This looks the same as the looped one, so we should extract a new method.
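The extraction being asked for might look roughly like this. The wait_workers name comes from the PR description, but the body below is a simplified sketch, not Puma's actual code: one method that reaps any exited children without blocking, callable from both the shutdown path and the periodic worker check.

```ruby
# Reap any exited child processes without blocking, and drop the matching
# workers from the list. Returns the reaped pids.
def wait_workers(workers)
  reaped = []
  begin
    while (pid = Process.waitpid(-1, Process::WNOHANG))
      reaped << pid
    end
  rescue Errno::ECHILD
    # No children left at all; nothing more to reap.
  end
  workers.reject! { |w| reaped.include?(w.pid) }
  reaped
end

WorkerRec = Struct.new(:pid) # minimal stand-in for a worker record
child = fork { exit 0 }
sleep 0.3 # give the child time to exit so WNOHANG can reap it
list = [WorkerRec.new(child)]
p wait_workers(list).include?(child) # the exited child is reaped
p list.empty?                        # and removed from the worker list
```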

@nateberkopec nateberkopec added the waiting-for-changes Waiting on changes from the requestor label Aug 22, 2019
1. Cluster - fix wait in #check_workers, #stop_workers
2. Cluster - add private wait_workers method for use in above
3. Worker  - add #term? method for use in above
test_worker_spawn_external_term - sends SIGTERM to workers, checks respawn, etc

test_worker_phased_restart - checking worker handling during phased-restart
@MSP-Greg
Member Author

MSP-Greg commented Aug 23, 2019

@nateberkopec

I rebased and updated most of the items you mentioned. Some of the work in the test file could be shared by other tests, saving that for another PR...

EDIT: just noticed in this Travis log that:

@state_path = "test/test_#{name}_puma.state"
@bind_path = "test/test_#{name}_server.sock"
@control_path = "test/test_#{name}_control.sock"

should lose the test_ at the start of the file name...

@nateberkopec nateberkopec merged commit 9fb1228 into puma:master Aug 23, 2019
@MSP-Greg MSP-Greg deleted the cluster-stop-workers-term branch August 23, 2019 12:16
nateberkopec pushed a commit that referenced this pull request Sep 5, 2019
* cluster.rb - fixup worker wait in Cluster, add Worker#term?

1. Cluster - fix wait in #check_workers, #stop_workers
2. Cluster - add private wait_workers method for use in above
3. Worker  - add #term? method for use in above

* Adds two tests for worker SIGTERM/respawn and phased-restart

test_worker_spawn_external_term - sends SIGTERM to workers, checks respawn, etc

test_worker_phased_restart - checking worker handling during phased-restart
nateberkopec pushed a commit that referenced this pull request Sep 5, 2019
nateberkopec pushed a commit that referenced this pull request Sep 9, 2019
Labels: bug, restart, waiting-for-changes

3 participants