
Fixes Cluster worker shutdown/restart #1908

Merged
merged 2 commits into puma:master from cluster-stop-workers-term on Aug 23, 2019

Conversation

MSP-Greg
Member

Hopefully fixes issues with correct handling of workers when:

  1. using phased-restart
  2. workers are externally SIGTERM'd

'Misbehaving' workers may not have been handled correctly previously.

Commit 'cluster.rb - fixup worker wait in Cluster, add Worker#term?' - relatively simple code that changes the handling of workers: it makes sure SIGKILL is triggered once @options[:worker_shutdown_timeout] has been exceeded, corrects the wait logic, and does not remove dead? workers from the Cluster @workers array.
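The escalation logic described above can be sketched roughly like this. This is a simplified illustration, not Puma's actual internals; WorkerStub and signal_for are hypothetical names made up for this example:

```ruby
# Sketch: once a worker has been told to terminate (term?), keep waiting
# within worker_shutdown_timeout, then escalate to SIGKILL after it.
class WorkerStub
  attr_reader :pid, :first_term_sent

  def initialize(pid)
    @pid = pid
    @first_term_sent = nil
  end

  def term!
    @first_term_sent ||= Time.now
  end

  def term?
    !@first_term_sent.nil?
  end
end

def signal_for(worker, shutdown_timeout, now = Time.now)
  return :SIGTERM unless worker.term?
  # Already asked to stop: wait within the timeout, SIGKILL only after it.
  (now - worker.first_term_sent) > shutdown_timeout ? :SIGKILL : :wait
end

w = WorkerStub.new(12_345)
p signal_for(w, 2)                # worker not yet termed
w.term!
p signal_for(w, 2)                # still within the timeout
p signal_for(w, 2, Time.now + 3)  # timeout exceeded
```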

Commit 'Adds two tests for worker SIGTERM/respawn and phased-restart' - adds two tests for the above. They're somewhat brittle and may fail, but not often. They take about 30 seconds each.

I added the tests to current master and on top of PR #1892, and both of the new tests failed in every job.

See PR #1892 and Issue #1904 for additional discussion.

begin
Process.wait2(pid)
rescue Errno::ECHILD
15
Member

15?

Member Author

@dentarg

Being a Windows type, I can't test this locally.

The change above in stop_forked_server was made because I did see infrequent (and false) Travis failures from it. But it isn't really part of this PR, so I reverted it.

It is wrong (should be [nil, 15]), but given the 'green' builds, it shows how infrequent the failures were... Thanks.
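For context on why a bare 15 is the wrong shape there: Process.wait2 returns a [pid, status] pair, so callers usually destructure its result, and a rescue fallback should keep that pair shape. A minimal illustration (wait2_with_fallback is a hypothetical name for this sketch):

```ruby
# Process.wait2 returns [pid, status]; a rescue returning a bare 15 breaks
# callers that destructure the result, while [nil, 15] preserves the shape.
def wait2_with_fallback(pid)
  Process.wait2(pid)
rescue Errno::ECHILD
  [nil, 15] # no such child; keep the [pid, status-like] pair shape
end

pid = fork { exit 0 }
reaped, _status = wait2_with_fallback(pid)
p reaped == pid            # the child was reaped normally
# The child is already reaped, so waiting again raises ECHILD internally
# and the fallback pair is returned instead.
p wait2_with_fallback(pid) # => [nil, 15]
```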

@MSP-Greg MSP-Greg force-pushed the cluster-stop-workers-term branch 2 times, most recently from fd640e7 to c3ba779 Compare August 15, 2019 16:27
@MSP-Greg
Member Author

MSP-Greg commented Aug 16, 2019

Tests updated to make use of after_worker_fork hook to track worker creation. Faster and more reliable...

EDIT: after revising the tests, I ran the tests on top of current master, all MRI jobs failed, see:
https://travis-ci.org/MSP-Greg/puma/builds/572576272

c.workers 2
c.worker_shutdown_timeout 2
c.app TestApps::SLEEP
c.after_worker_fork { |idx| workers_booted += 1 }
Member

Why do we need a new mechanism for determining if the process has finished booting? We already have two methods.

l = Puma::Launcher.new conf, :events => @events

t = Thread.new do
Thread.current.abort_on_exception = true
Member

No need to set this; it's already set on all threads in helper.rb.


Thread.kill worker0
Thread.kill worker1
http0 = nil
Member

What's the reason for setting all of these to nil?

assert_empty old_waited, msg
end

def test_worker_phased_restart
Member

These tests are exactly the same except for one line; could we clean that up a bit?
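One common way to do that cleanup, sketched under Minitest: move the shared body into a helper and pass the one differing step in as a block. Class, method, and block names here are illustrative stand-ins, not the PR's actual test code (which signals and restarts real workers):

```ruby
require "minitest/autorun"

class TestWorkerRestartSketch < Minitest::Test
  def test_restart_via_term
    assert_workers_replaced { |pids| pids.map { |pid| pid + 100 } } # stand-in for SIGTERM-ing workers
  end

  def test_restart_via_phased_restart
    assert_workers_replaced { |pids| pids.map { |pid| pid + 200 } } # stand-in for phased-restart
  end

  private

  # Shared body: run the restart trigger, then assert all old pids are gone.
  def assert_workers_replaced
    old_pids = [11, 12]
    new_pids = yield(old_pids)
    assert_equal 2, new_pids.length
    assert_empty new_pids & old_pids, "both workers should be replaced"
  end
end
```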

l.stop
assert_kind_of Thread, t.join, "server didn't stop"

assert_operator (Time.now.to_f - start_time).round(2), :<, 32
Member

Maybe build the bound you're looking for explicitly here, by adding the parts together:

assert_operator (Time.now.to_f - start_time).round(2), :<, (conf.worker_shutdown_timeout + 30)

I forgot, where does the 30 come from?

msg = "old_pids #{old_pids.inspect} new_pids #{new_pids.inspect} old_waited #{old_waited.inspect}"
assert_equal 2, new_pids.length, msg
assert_equal 2, old_pids.length, msg
assert_equal new_pids, (new_pids - old_pids), "#{msg}\nBoth workers should be replaced"
Member

IMO this would be slightly clearer:

assert_empty new_pids & old_pids

pids = []
while pid = Process.waitpid(-1, Process::WNOHANG) do
pids << pid
@workers.reject! do |w|
Member

This looks the same as the looped one, so we should extract a new method.
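The extraction being asked for might look roughly like this. The wait_workers name comes from the PR description, but the body below is a simplified sketch, not Puma's actual code: one method that reaps any exited children without blocking, callable from both the shutdown path and the periodic worker check.

```ruby
# Reap any exited child processes without blocking, and drop the matching
# workers from the list. Returns the reaped pids.
def wait_workers(workers)
  reaped = []
  begin
    while (pid = Process.waitpid(-1, Process::WNOHANG))
      reaped << pid
    end
  rescue Errno::ECHILD
    # No children left at all; nothing more to reap.
  end
  workers.reject! { |w| reaped.include?(w.pid) }
  reaped
end

WorkerRec = Struct.new(:pid) # minimal stand-in for a worker record
child = fork { exit 0 }
sleep 0.3 # give the child time to exit so WNOHANG can reap it
list = [WorkerRec.new(child)]
p wait_workers(list).include?(child) # the exited child is reaped
p list.empty?                        # and removed from the worker list
```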

@nateberkopec nateberkopec added the waiting-for-changes Waiting on changes from the requestor label Aug 22, 2019
1. Cluster - fix wait in #check_workers, #stop_workers
2. Cluster - add private wait_workers method for use in above
3. Worker  - add #term? method for use in above
test_worker_spawn_external_term - sends SIGTERM to workers, checks respawn, etc

test_worker_phased_restart - checking worker handling during phased-restart
@MSP-Greg
Member Author

MSP-Greg commented Aug 23, 2019

@nateberkopec

I rebased and updated most of the items you mentioned. Some of the work in the test file could be shared by other tests, saving that for another PR...

EDIT: just noticed in this Travis log that:

@state_path = "test/test_#{name}_puma.state"
@bind_path = "test/test_#{name}_server.sock"
@control_path = "test/test_#{name}_control.sock"

should lose the test_ at the start of the file name...

@nateberkopec nateberkopec merged commit 9fb1228 into puma:master Aug 23, 2019
@MSP-Greg MSP-Greg deleted the cluster-stop-workers-term branch August 23, 2019 12:16
nateberkopec pushed a commit that referenced this pull request Sep 5, 2019
* cluster.rb - fixup worker wait in Cluster, add Worker#term?

1. Cluster - fix wait in #check_workers, #stop_workers
2. Cluster - add private wait_workers method for use in above
3. Worker  - add #term? method for use in above

* Adds two tests for worker SIGTERM/respawn and phased-restart

test_worker_spawn_external_term - sends SIGTERM to workers, checks respawn, etc

test_worker_phased_restart - checking worker handling during phased-restart
nateberkopec pushed a commit that referenced this pull request Sep 5, 2019
nateberkopec pushed a commit that referenced this pull request Sep 9, 2019
Labels: bug, restart, waiting-for-changes

3 participants