Fixes Cluster worker shutdown/restart #1908
test/test_integration.rb
Outdated
begin
  Process.wait2(pid)
rescue Errno::ECHILD
  15
15?
Being a Windows user, I can't test this locally. The change above in stop_forked_server was done because I did see infrequent (and false) Travis failures from it. But it isn't really part of this PR, so I reverted it. It is wrong (it should be [nil, 15]), but given the 'green' builds, it shows how infrequent the failures were... Thanks.
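For readers following the `[nil, 15]` remark: Process.wait2 returns a two-element array (the pid and a Process::Status), so a rescue fallback that returns a bare 15 changes the shape of the result. The sketch below illustrates that point only. The helper name wait_for is hypothetical, it is not the stop_forked_server helper from the test suite, and it assumes a platform where fork is available.

```ruby
# Hypothetical helper (not Puma's stop_forked_server): waits for a child
# and keeps the same two-element return shape on both code paths.
def wait_for(pid)
  Process.wait2(pid)          # normal path: [pid, Process::Status]
rescue Errno::ECHILD
  [nil, 15]                   # fallback path: same shape, so callers can destructure it
end

pid = fork { exit 0 }

# First call reaps the child normally.
reaped_pid, status = wait_for(pid)

# Second call hits Errno::ECHILD because the child was already reaped.
missing_pid, fallback = wait_for(pid)
```

With a bare `15` as the fallback, a caller written as `_, status = ...` would receive 15 in the first position and nil in the second, which is the mismatch described above.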
fd640e7 to c3ba779
Tests updated to make use of
EDIT: After revising the tests, I ran them on top of current master, and all MRI jobs failed, see:
c.workers 2
c.worker_shutdown_timeout 2
c.app TestApps::SLEEP
c.after_worker_fork { |idx| workers_booted += 1 }
Why do we need a new mechanism for determining if the process has finished booting? We already have two methods.
test/test_integration.rb
Outdated
l = Puma::Launcher.new conf, :events => @events

t = Thread.new do
  Thread.current.abort_on_exception = true
No need to set this; it is already set on all threads in helper.rb.
test/test_integration.rb
Outdated
Thread.kill worker0
Thread.kill worker1
http0 = nil
what's the reason for setting all of these to nil?
test/test_integration.rb
Outdated
  assert_empty old_waited, msg
end

def test_worker_phased_restart
These tests are exactly the same except for 1 line, could we clean that up a bit?
test/test_integration.rb
Outdated
l.stop
assert_kind_of Thread, t.join, "server didn't stop"

assert_operator (Time.now.to_f - start_time).round(2), :<, 32
maybe add together what you're looking for explicitly here:
assert_operator (Time.now.to_f - start_time).round(2), :<, (conf.worker_shutdown_timeout + 30)
I forgot, where does the 30 come from?
test/test_integration.rb
Outdated
msg = "old_pids #{old_pids.inspect} new_pids #{new_pids.inspect} old_waited #{old_waited.inspect}"
assert_equal 2, new_pids.length, msg
assert_equal 2, old_pids.length, msg
assert_equal new_pids, (new_pids - old_pids), "#{msg}\nBoth workers should be replaced"
imo this would be slightly clearer:
assert_empty new_pids & old_pids
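To make the comparison concrete, here is a small sketch of the two checks side by side. The PID values are made up for illustration and the variable names other than old_pids and new_pids are not from the test file.

```ruby
old_pids = [4001, 4002]   # made-up worker PIDs before the restart
new_pids = [4003, 4004]   # made-up worker PIDs after the restart

# Difference form from the test under review:
# holds only if subtracting the old PIDs removes nothing.
difference_ok = (new_pids - old_pids) == new_pids

# Intersection form suggested in the review:
# holds only if the two sets share no PID.
intersection_ok = (new_pids & old_pids).empty?

# A partially replaced set fails both checks, and the
# intersection names the worker that was not replaced.
stale_pids    = [4002, 4003]
stale_overlap = stale_pids & old_pids
stale_diff    = stale_pids - old_pids
```

Both forms agree on pass or fail; the intersection form has the side benefit that a failing assert_empty prints the PIDs that survived the restart.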
lib/puma/cluster.rb
Outdated
pids = []
while pid = Process.waitpid(-1, Process::WNOHANG) do
  pids << pid
  @workers.reject! do |w|
This looks the same as the looped one, so we should extract a new method.
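A rough sketch of what such an extraction could look like follows. It is not the code that landed in cluster.rb: the method name reap_exited_children is hypothetical, the @workers bookkeeping is left out, and it assumes a platform where fork is available.

```ruby
# Hypothetical extracted helper: collects every child that has already
# exited, without blocking on children that are still running.
def reap_exited_children
  pids = []
  begin
    # WNOHANG makes waitpid return nil when no child has exited yet.
    while (pid = Process.waitpid(-1, Process::WNOHANG))
      pids << pid
    end
  rescue Errno::ECHILD
    # Raised when the process has no children left to wait for.
  end
  pids
end

# Usage: fork two short-lived children and poll until both are reaped.
children = 2.times.map { fork { exit 0 } }
reaped   = []
deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + 5

until reaped.size == children.size ||
      Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
  reaped.concat(reap_exited_children)
  sleep 0.05
end
```

Both call sites could then share the helper and apply their own handling to the returned PIDs.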
1. Cluster - fix wait in #check_workers, #stop_workers
2. Cluster - add private wait_workers method for use in above
3. Worker - add #term? method for use in above

test_worker_spawn_external_term - sends SIGTERM to workers, checks respawn, etc
test_worker_phased_restart - checking worker handling during phased-restart
c3ba779 to 9581218
I rebased and updated most of the items you mentioned. Some of the work in the test file could be shared by other tests; I'm saving that for another PR...
EDIT: I just noticed in this Travis log that lines 16 to 18 in 9581218 should lose the
* cluster.rb - fixup worker wait in Cluster, add Worker#term?
  1. Cluster - fix wait in #check_workers, #stop_workers
  2. Cluster - add private wait_workers method for use in above
  3. Worker - add #term? method for use in above
* Adds two tests for worker SIGTERM/respawn and phased-restart
  test_worker_spawn_external_term - sends SIGTERM to workers, checks respawn, etc
  test_worker_phased_restart - checking worker handling during phased-restart
Hopefully fixes issues with correct handling of workers when:
'Misbehaving' workers may not have been previously handled correctly.
Commit 'cluster.rb - fixup worker wait in Cluster, add Worker#term?' - relatively simple code that changes the handling of workers: it makes sure to trigger SIGKILL if @options[:worker_shutdown_timeout] has been exceeded, corrects the wait logic, and does not remove dead? workers from the Cluster @workers array.
Commit 'Adds two tests for worker SIGTERM/respawn and phased-restart' - adds two tests for the above. They're somewhat brittle and may fail, but not often. They take about 30 seconds each.
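The TERM-then-KILL escalation described for the first commit can be sketched in isolation as below. This is an illustration under stated assumptions, not the Cluster code: stop_child and its timeout argument are hypothetical stand-ins for the worker handling and @options[:worker_shutdown_timeout], and fork must be available.

```ruby
# Hypothetical helper: ask a child to stop with SIGTERM, then escalate to
# SIGKILL once the timeout (in seconds) has been exceeded.
def stop_child(pid, timeout)
  Process.kill(:TERM, pid)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  loop do
    # Non-blocking wait: returns the pid once the child has exited.
    return :term if Process.waitpid(pid, Process::WNOHANG)
    break if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
    sleep 0.05
  end

  Process.kill(:KILL, pid)   # timeout exceeded, so force the exit
  Process.waitpid(pid)       # reap it so no zombie is left behind
  :kill
end

# A 'misbehaving' child that ignores SIGTERM. The pipe lets the parent
# wait until the trap is installed before sending any signal.
reader, writer = IO.pipe
stubborn = fork do
  reader.close
  Signal.trap("TERM", "IGNORE")
  writer.write("ready")
  writer.close
  loop { sleep 1 }
end
writer.close
reader.read                  # blocks until the child has closed its end
reader.close

stubborn_result = stop_child(stubborn, 0.5)

# A cooperative child exits on SIGTERM well within the timeout.
polite = fork { sleep 30 }
polite_result = stop_child(polite, 5)
```

The non-blocking wait inside the loop is what keeps the parent responsive while still honoring the timeout.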
I added the tests to current master and on top of PR #1892, and both of the new tests failed in every job.
See PR #1892 and Issue #1904 for additional discussion.