Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detach bisect subprocesses to avoid making zombie processes #2739

Merged
merged 11 commits into from Aug 28, 2020

Conversation

benoittgt
Copy link
Member

If we do not waitpid or detach the bisect process become a zombie process.

As mentioned in waitpid doc:

As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes.

detach is a good idea. Thanks @pirj. From the Ruby doc:

Some operating systems retain the status of terminated child processes until the parent collects that status (normally using some variant of wait()). If the parent never collects this status, the child stays around as a zombie process. Process::detach prevents this by setting up a separate Ruby thread whose sole job is to reap the status of the process pid when it terminates. Use detach only when you do not intend to explicitly wait for the child to terminate.

Related:

Thanks @pirj and @andrykonchin

@benoittgt
Copy link
Member Author

benoittgt commented Jun 11, 2020

I am think about a way to test this but for the moment this is not very clean.

To validate the Process.detach(pid) and compared if to other behaviors, I used this snippet.

@read_io, @write_io = IO.pipe

# write into pipe some data
def run_specs(i)
  packet = '*' * i
  @write_io.write("#{packet.bytesize}\n#{packet}")
end

def log(action, pid)
  puts "--> #{action} process with pid: #{pid}"
end

count = 0
(65500..69500).each do |i|
  count += 1
  # create a child process
  pid = fork { run_specs(i) }
  puts "interval: #{i}, pid is: #{pid} and parent pid is: #{Process.ppid}, count: #{count}"

  # testing different behaviors
  if ENV['WAITPID']
    log("waitpid", pid)
    Process.waitpid(pid)
  elsif ENV['DETACH']
    log("detach", pid)
    Process.detach(pid)
  else
    log("no handling of", pid)
  end
  sleep 0.1

  # read result
  packet_size = Integer(@read_io.gets)
  packet = @read_io.read(packet_size)

  puts "packet size: #{packet.size}"
end

@JonRowe
Copy link
Member

JonRowe commented Jun 13, 2020

What where the results of your snippet? IIRC the problem that made us remove waitpid involved large numbers of processes until we hit the limit, have you tried that scenario?

@benoittgt
Copy link
Member Author

benoittgt commented Jun 13, 2020

Sorry Jon. I should have mention this #2669 (comment) but with this script.

  • WAITPID (previous code). ENV['WAITPID'] in script:

  • With nothing (current situation). No ENV set in the script:

    • We have zombie processes that can saturate user's process table if there is too many. For exemple in a scenario where the bisect is hard to determinate or if the OS doesn't have lot's process available. Zombie processes are properly killed when the ruby script exit.
  • With detach. ENV['DETACH'] in script:

    • We do not have zombie processes and we do not hit pipe size limitation. Expect memory increase, because of my script that increase incrementally the IO.pipe. I do not see limitations.

@JonRowe
Copy link
Member

JonRowe commented Jun 15, 2020

So I think we need to add a test for zombie processes, something like:

it 'does not leave zombie processes' do
  bisect(%W[spec/rspec/core/resources/blocking_pipe_bisect_spec.rb_], 1)

  # in theory this should be our bisect, but it might be the parent and not the child?
  pid = Process.last_status.pid

  # might not be necessary but I'm not sure how quickly the OS decides its a zombie
  sleep 1 
 
  # this doesn't work currently because either my system doesn't leave zombies, or 
  # the OS hasn't updated... obviously will only work on POSIX
  expect(%x[ps -ho pid,state -p #{pid}]).to_not include("Z")
end

@JonRowe
Copy link
Member

JonRowe commented Jun 15, 2020

Do you know the name of the limit we hit with pipes? I'd like to make the "broken" spec os dependant, as I think its passing for me regardless... (edit: verified still broken 😂)

@JonRowe
Copy link
Member

JonRowe commented Jun 15, 2020

Managed to get a generalised version of my spec working, I verified it fails on master and passes on this branch.

@JonRowe
Copy link
Member

JonRowe commented Jun 15, 2020

Let me know what you think @benoittgt / @pirj

Copy link
Member

@pirj pirj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fails on master and passes on this branch

Awesome!

# block due to the file descriptor limit on OSX / Linux.
# block due to the file descriptor limit on OSX / Linux. We need
# to detach the process to avoid having zombie process and consume
# slot in the kernel process table.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it makes sense to mention that those zombies are not forever, but only up to the moment when the parent process exits? Is that correct according to your observations, @benoittgt ? Please disregard this note if zombies remain after.

Copy link
Member Author

@benoittgt benoittgt Jun 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zombies are properly killed when the command is done or cancel. I saw this behavior on my script.

With the patch we do not have zombie processes. So I am not sure we need to add more precisions about how zombie processes created by fork behaves when not detach or wait.

spec/integration/bisect_spec.rb Show resolved Hide resolved
@benoittgt
Copy link
Member Author

benoittgt commented Jun 15, 2020

We had a failure on this added test on Ruby 1.8.7. I looked at the documentation on fork and detach and didn't see an obvious difference, I reran and it passed. I hope it will not happen too often. :)

LGTM! 🙌🏻

benoittgt added a commit that referenced this pull request Jun 15, 2020
@JonRowe
Copy link
Member

JonRowe commented Jun 15, 2020

What failed on 1.8.7 before you restarted it?

@benoittgt
Copy link
Member Author

What failed on 1.8.7 before you restarted it?

We had zombie processes. The include was positive.

Maybe the sleep 1 you mentioned here will be necessary but this may be too early.

@benoittgt benoittgt changed the title Detach bisect process to avoid making zombie process Detach bisect subprocesses to avoid making zombie processes Jun 15, 2020
@benoittgt
Copy link
Member Author

benoittgt commented Jun 17, 2020

I reran the CI multiple times without being able to reproduce the random issue on the added spec.

I think we are good.

Changelog.md Outdated Show resolved Hide resolved
@JonRowe
Copy link
Member

JonRowe commented Jun 18, 2020

@benoittgt I've improve the spec to check for pids created by the spec, the loop ensures that the process has finished and entered its final state which will be Z without detatch.

@benoittgt
Copy link
Member Author

benoittgt commented Jun 19, 2020

@JonRowe it's up to you if you want to merge (when CI is green). This looks good to me.

It is very interesting to dig into this subject.

@benoittgt
Copy link
Member Author

😅 It seems we are waiting indefinitely

@benoittgt
Copy link
Member Author

JonRowe pushed a commit that referenced this pull request Jun 29, 2020
@JonRowe JonRowe force-pushed the detach-process-to-avoid-zombie branch from 6d8aed3 to 65cdb17 Compare June 29, 2020 19:29
while ((extra_pids = pids() - original_pids).join =~ /[RE]/i)
raise "Extra process detected" if cursor > 10
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What could be cleaner than raising here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also 10 means nothing. If we wait more than 1s then we raise an error. Is it clear enough?

Otherwise I think the PR can be merged @JonRowe

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really just need a wait of finding the bisect pid, can we grep for the rspec command, does that mean we'll have less processes to check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It requires a little bit more work to find child processes. I have submitted a patch that does that. 0774f41

Without the patch in fork_runner. The actual code returns:

Bisect
  when the bisect command saturates the pipe
    does not hit pipe size limit and does not get stuck
    does not leave zombie processes (FAILED - 1)
  when the spec ordering is inconsistent
    stops bisecting and surfaces the problem to the user
  when a load-time problem occurs while running the suite
    surfaces the stdout and stderr output to the user

Failures:

  1) Bisect when the bisect command saturates the pipe does not leave zombie processes
     Failure/Error:
               expect(zombie_process).to eq([]), <<-MSG
                 Expected no zombie processes got #{zombie_process.count}:
                   #{zombie_process}
               MSG

                 Expected no zombie processes got 2:
                   [#<struct RSpec::Core::RSpecChildProcess::Ps pid="31512", ppid="31503", state="Z+", command="(ruby)">, #<struct RSpec::Core::RSpecChildProcess::Ps pid="31513", ppid="31503", state="Z+", command="(ruby)">]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had no random failures. So I removed the part where we had to wait for the process execution.

@benoittgt
Copy link
Member Author

The last commit can be removed if wanted.

I reproduced the CI issue on a fresh Ubuntu. The "circuit breaker" avoid the timeout on Travis after few minutes of inactivity. The new ps + grep help us to only search for ruby processes.
The test still validate the patch. I tested with and without detach and we properly fail if we do not detach child process.

@benoittgt benoittgt requested a review from JonRowe July 2, 2020 18:57
@JonRowe JonRowe changed the base branch from master to main August 2, 2020 02:07
@benoittgt benoittgt force-pushed the detach-process-to-avoid-zombie branch from 5ae74d0 to 63b13a0 Compare August 28, 2020 13:53
benoittgt added a commit that referenced this pull request Aug 28, 2020
@benoittgt benoittgt force-pushed the detach-process-to-avoid-zombie branch from 63b13a0 to 2c56e15 Compare August 28, 2020 13:54
benoittgt and others added 10 commits August 28, 2020 16:09
If we do not `waitpid` or `detach` the bisect process become a zombie
process.

As mentionned in waitpid doc:
> As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes.

`detach` is a good idea. From the Ruby doc:
> Some operating systems retain the status of terminated child processes until the parent collects that status (normally using some variant of wait()). If the parent never collects this status, the child stays around as a zombie process. Process::detach prevents this by setting up a separate Ruby thread whose sole job is to reap the status of the process pid when it terminates. Use detach only when you do not intend to explicitly wait for the child to terminate.

Related:
- #2669
- https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html
Co-authored-by: Jon Rowe <hello@jonrowe.co.uk>
@benoittgt benoittgt force-pushed the detach-process-to-avoid-zombie branch 2 times, most recently from 281de91 to 0774f41 Compare August 28, 2020 14:44
@benoittgt benoittgt force-pushed the detach-process-to-avoid-zombie branch from 0774f41 to 07754b6 Compare August 28, 2020 15:09
@JonRowe JonRowe merged commit d9e7d3d into main Aug 28, 2020
@JonRowe JonRowe deleted the detach-process-to-avoid-zombie branch August 28, 2020 19:58
JonRowe added a commit that referenced this pull request Sep 30, 2020
Detach bisect subprocesses to avoid making zombie processes
MatheusRich pushed a commit to MatheusRich/rspec-core that referenced this pull request Oct 30, 2020
MatheusRich pushed a commit to MatheusRich/rspec-core that referenced this pull request Oct 30, 2020
Detach bisect subprocesses to avoid making zombie processes
yujinakayama pushed a commit to yujinakayama/rspec-monorepo that referenced this pull request Oct 6, 2021
…cess-to-avoid-zombie

Detach bisect subprocesses to avoid making zombie processes

---
This commit was imported from rspec/rspec-core@efbac94.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants