Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster.rb - no zombies #1887

Merged
merged 3 commits into from Aug 6, 2019
Merged

cluster.rb - no zombies #1887

merged 3 commits into from Aug 6, 2019

Conversation

MSP-Greg
Copy link
Member

@MSP-Greg MSP-Greg commented Aug 6, 2019

Commits:

  1. 'cluster.rb - fixup for Cluster#check_workers, removes zombies?' - reverts problematic code in Cluster#check_workers to a simplified version of what was used for several years, thru 3.12.1. Current code has been revised from previous due to issues with newer Ruby versions, etc. Simplified version appears to be working correctly.

  2. 'test_integration.rb - add zombie test' - uses test from PR Detach dead workers to prevent zombie processes #1864 (not merged, closed)

  3. 'travis.yml - misc updates' - updates a few items to make use of current/newer releases. Hoping to fix some issues with intermittent failures.

@MSP-Greg
Copy link
Member Author

MSP-Greg commented Aug 6, 2019

Forgot to mention - test added here failed on current master...

EDIT: The build on Travis was https://travis-ci.org/MSP-Greg/puma/builds/568063498. It failed in Ruby 2.2 thru 2.6 on Xenial, but I cancelled before 2.6 completed.

@MSP-Greg
Copy link
Member Author

MSP-Greg commented Aug 6, 2019

Re CI, there are two Ruby 2.2.10 jobs, one failed, which seems to be an intermittent failure. See https://bugs.ruby-lang.org/issues/13239. There was a recent patch for it, but it was only backported to Ruby 2.3.x. The failure may involve #close calls where the io is already closed. Adding closed? might help...

As to the failure in 2.3.8, it passed several times in my fork. Not sure what to think of it...

@evanphx
Copy link
Member

evanphx commented Aug 6, 2019

So a better test, rather than relying on Process.kill to search, is to just perform a Process.waitpid(-1, Process::WNOHANG) and verify that nothing is returned, which will indicate that the server waited on them properly.

Copy link
Contributor

@dzunk dzunk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, as per my comment here, returning to the waitpid(-1) behavior should reap all the zombies sooner or later.

while pid = Process.waitpid(-1, Process::WNOHANG) do
pids << pid
end
@workers.reject! { |w| w.dead? || pids.include?(w.pid) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could move these lines into a reap_workers method to match cull_workers and spawn_workers.

Copy link
Member Author

@MSP-Greg MSP-Greg Aug 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, I'll leave coding style up to others. It's four lines, probably only called by the method it's contained in, and any test would involve that method...

MSP-Greg and others added 2 commits August 6, 2019 09:04
Co-authored-by: Matt Duszynski <mattduszynski@gmail.com>
jruby-head to bionic
xcode10.2 to xcode11
all ruby 2.2 to 2.2.10
@MSP-Greg
Copy link
Member Author

MSP-Greg commented Aug 6, 2019

@evanphx

So a better test

Updated. Thanks.

@MSP-Greg
Copy link
Member Author

MSP-Greg commented Aug 6, 2019

Sometime ago Binder#stop_workers was revised for any issue with Ruby 2.6 and later. In it is the following loop:

puma/lib/puma/cluster.rb

Lines 38 to 45 in 3ea2471

if RUBY_VERSION < '2.6'
@workers.each do |w|
begin
Process.waitpid(w.pid)
rescue Errno::ECHILD
# child is already terminated
end
end

If the first worker in the array is the one needing the longest time to shut down, the code will not check the others until it's waited for the first to complete. Maybe I've got something wrong, but this seems incorrect.

So, should the code simply use a revised version of the code currently used for Ruby 2.6+ (removing the code for Ruby < 2.6)?

If so, a commit here, or another PR?

@@ -37,32 +37,36 @@ rvm:
matrix:
fast_finish: true
include:
- rvm: 2.2
dist: trusty
- rvm: 2.2.10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future @MSP-Greg, I'd prefer you make changes to travis.yml in separate PRs. That way, they won't get blocked by the corresponding feature/bugfix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the same.

While I was testing/debugging, Travis seemed to be acting up, and I wasn't sure if it was affecting the CI results I was seeing.. IOW, didn't just decide to throw it in, I was having failures...

How about I remove it right now & add another PR?

@nateberkopec
Copy link
Member

@MSP-Greg definitely a separate PR for anything regarding #stop_workers

@nateberkopec nateberkopec merged commit b32de94 into puma:master Aug 6, 2019
@MSP-Greg MSP-Greg deleted the no-zombies branch August 6, 2019 16:52
@MSP-Greg
Copy link
Member Author

MSP-Greg commented Aug 6, 2019

@nateberkopec

Does the paragraph after the code snippet make sense?

JFYI, I'm looking at some tests where multiple requests that take a varying amount of time to complete are used, and maybe better testing as far as graceful shutdown/phased restart can be done. Also, externally killing workers and checking respawn. Not today...

@nateberkopec
Copy link
Member

Does the paragraph after the code snippet make sense?

IDK, ask me later on that PR after I release 4.1 😆

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants