Detach bisect subprocesses to avoid making zombie processes #2739
I am thinking about a way to test this, but for the moment it is not very clean. To validate the
@read_io, @write_io = IO.pipe
# write into pipe some data
def run_specs(i)
  packet = '*' * i
  @write_io.write("#{packet.bytesize}\n#{packet}")
end

def log(action, pid)
  puts "--> #{action} process with pid: #{pid}"
end

count = 0
(65500..69500).each do |i|
  count += 1
  # create a child process
  pid = fork { run_specs(i) }
  puts "interval: #{i}, pid is: #{pid} and parent pid is: #{Process.ppid}, count: #{count}"
  # testing different behaviors
  if ENV['WAITPID']
    log("waitpid", pid)
    Process.waitpid(pid)
  elsif ENV['DETACH']
    log("detach", pid)
    Process.detach(pid)
  else
    log("no handling of", pid)
  end
  sleep 0.1
  # read result
  packet_size = Integer(@read_io.gets)
  packet = @read_io.read(packet_size)
  puts "packet size: #{packet.size}"
end
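For readers who want a smaller version of the script above, here is a minimal sketch of a single round trip through the pipe. The helper name `roundtrip` is hypothetical and not part of rspec-core. It assumes a platform with `fork` and a modern Ruby. The parent detaches the child and then reads, so a packet larger than the pipe buffer does not leave the child blocked on `write` while the parent blocks on `waitpid`.

```ruby
# Hypothetical helper (not from rspec-core): send one packet of `size` bytes
# from a forked child to the parent through a pipe and return the number of
# bytes the parent received.
def roundtrip(size)
  read_io, write_io = IO.pipe

  pid = fork do
    read_io.close
    packet = '*' * size
    # Same framing as the script above: byte size, newline, payload.
    write_io.write("#{packet.bytesize}\n#{packet}")
    write_io.close
    exit!(0)
  end

  write_io.close
  # Detach instead of waiting: the child may still be blocked writing,
  # and a background thread will reap it once it exits.
  Process.detach(pid)

  packet_size = Integer(read_io.gets)
  packet = read_io.read(packet_size)
  read_io.close
  packet.bytesize
end
```

Calling `roundtrip(70_000)` exercises a payload larger than the sizes where the original script started to block.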
What were the results of your snippet? IIRC the problem that made us remove
Sorry Jon. I should have mentioned this #2669 (comment) but with this script.
So I think we need to add a test for zombie processes, something like:
Do you know the name of the limit we hit with pipes? I'd like to make the "broken" spec OS-dependent,
Managed to get a generalised version of my spec working. I verified it fails on master and passes on this branch.
Let me know what you think @benoittgt / @pirj
fails on master and passes on this branch
Awesome!
lib/rspec/core/bisect/fork_runner.rb
Outdated
- # block due to the file descriptor limit on OSX / Linux.
+ # block due to the file descriptor limit on OSX / Linux. We need
+ # to detach the process to avoid having zombie process and consume
+ # slot in the kernel process table.
Maybe it makes sense to mention that those zombies are not forever, but only up to the moment when the parent process exits? Is that correct according to your observations, @benoittgt ? Please disregard this note if zombies remain after.
Zombies are properly killed when the command is done or cancelled. I saw this behavior with my script.
With the patch we do not have zombie processes, so I am not sure we need to add more detail about how zombie processes created by `fork` behave when we neither `detach` nor `wait`.
We had a failure on this added test on Ruby 1.8.7. I looked at the documentation on LGTM! 🙌🏻
What failed on 1.8.7 before you restarted it?
We had zombie processes. The include was positive. Maybe the
I reran the CI multiple times without being able to reproduce the random issue on the added spec. I think we are good.
@benoittgt I've improved the spec to check for pids created by the spec. The loop ensures that the process has finished and entered its final state which will be
@JonRowe it's up to you if you want to merge (when CI is green). This looks good to me. It is very interesting to dig into this subject.
😅 It seems we are waiting indefinitely
The detailed output can be seen here https://travis-ci.org/github/rspec/rspec-core/jobs/701153751
Force-pushed from 6d8aed3 to 65cdb17
spec/integration/bisect_spec.rb
Outdated
while ((extra_pids = pids() - original_pids).join =~ /[RE]/i)
  raise "Extra process detected" if cursor > 10
What could be cleaner than raising here?
Also 10 means nothing. If we wait more than 1s then we raise an error. Is it clear enough?
Otherwise I think the PR can be merged @JonRowe
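One way to make the "10 means nothing" concern explicit is to express the limit as a duration instead of an iteration count. The following is only a sketch under assumptions: `wait_until` is a hypothetical helper, not something in the rspec-core spec suite, and it uses keyword arguments, so it would not run on the Ruby 1.8.7 build mentioned in this thread.

```ruby
# Hypothetical helper: poll the block until it returns a truthy value,
# raising once `timeout` seconds have elapsed.
def wait_until(timeout: 1.0, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  until yield
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
      raise "condition not met within #{timeout}s"
    end
    sleep interval
  end

  true
end
```

A spec could then write something like `wait_until(timeout: 1.0) { extra_pids.empty? }`, where `extra_pids` stands for whatever check the spec already performs.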
We really just need a way of finding the bisect pid. Can we grep for the rspec command? Would that mean we'd have fewer processes to check?
It requires a little bit more work to find child processes. I have submitted a patch that does that. 0774f41
Without the patch in `fork_runner`, the actual code returns:
Bisect
  when the bisect command saturates the pipe
    does not hit pipe size limit and does not get stuck
    does not leave zombie processes (FAILED - 1)
  when the spec ordering is inconsistent
    stops bisecting and surfaces the problem to the user
  when a load-time problem occurs while running the suite
    surfaces the stdout and stderr output to the user

Failures:

  1) Bisect when the bisect command saturates the pipe does not leave zombie processes
     Failure/Error:
       expect(zombie_process).to eq([]), <<-MSG
         Expected no zombie processes got #{zombie_process.count}:
           #{zombie_process}
       MSG

       Expected no zombie processes got 2:
         [#<struct RSpec::Core::RSpecChildProcess::Ps pid="31512", ppid="31503", state="Z+", command="(ruby)">, #<struct RSpec::Core::RSpecChildProcess::Ps pid="31513", ppid="31503", state="Z+", command="(ruby)">]
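The failure output above shows structs with `pid`, `ppid`, `state` and `command` fields. As a rough illustration of how such a list could be built and filtered, here is a sketch that parses `ps`-style text. The `Ps` struct and the `parse_ps` and `zombies_of` helpers are hypothetical and only mirror the field names in the output. They are not the code from the patch, and the column layout is an assumption.

```ruby
# Hypothetical struct mirroring the fields visible in the failure output.
Ps = Struct.new(:pid, :ppid, :state, :command)

# Parse text with a header row followed by "PID PPID STATE COMMAND" rows.
# The command column may contain spaces, so split into at most 4 fields.
def parse_ps(output)
  output.lines.drop(1).map do |line|
    Ps.new(*line.strip.split(/\s+/, 4))
  end
end

# Select the children of `parent_pid` whose state starts with "Z" (zombie).
def zombies_of(parent_pid, processes)
  processes.select do |ps|
    ps.ppid == parent_pid.to_s && ps.state.to_s.start_with?('Z')
  end
end
```

On a typical Linux or macOS system the input might come from something like `ps -A -o pid,ppid,state,command`, though the exact flags are an assumption and vary between platforms.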
I had no random failures. So I removed the part where we had to wait for the process execution.
The last commit can be removed if wanted. I reproduced the CI issue on a fresh Ubuntu. The "circuit breaker" avoids the timeout on Travis after a few minutes of inactivity. The new
Force-pushed from 5ae74d0 to 63b13a0
Force-pushed from 63b13a0 to 2c56e15
If we do not `waitpid` or `detach`, the bisect process becomes a zombie process. As mentioned in the waitpid doc:

> As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes.

`detach` is a good idea. From the Ruby doc:

> Some operating systems retain the status of terminated child processes until the parent collects that status (normally using some variant of wait()). If the parent never collects this status, the child stays around as a zombie process. Process::detach prevents this by setting up a separate Ruby thread whose sole job is to reap the status of the process pid when it terminates. Use detach only when you do not intend to explicitly wait for the child to terminate.

Related:
- #2669
- https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html
Co-authored-by: Jon Rowe <hello@jonrowe.co.uk>
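The behavior quoted from the Ruby documentation can be observed directly. The sketch below is illustrative only: `detach_and_collect` is a hypothetical name, and it assumes a platform where `fork` is available. It relies on `Process.detach` returning the reaper thread, whose value is the child's `Process::Status`.

```ruby
# Hypothetical helper: fork a child that exits with `exit_code`, detach it,
# and report what the reaper thread collected.
def detach_and_collect(exit_code)
  pid = fork { exit!(exit_code) }

  # Process.detach returns the thread that waits on the child.
  reaper = Process.detach(pid)
  status = reaper.value # Process::Status, available once the child exits

  # The reaper already collected the status, so a further wait on the same
  # pid finds no such child and raises Errno::ECHILD.
  already_reaped =
    begin
      Process.waitpid(pid, Process::WNOHANG)
      false
    rescue Errno::ECHILD
      true
    end

  { exitstatus: status.exitstatus, already_reaped: already_reaped }
end
```

The `already_reaped` flag is the point of interest: once the reaper thread has run, the kernel has no status left to hold, so no zombie entry remains for that child.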
Force-pushed from 281de91 to 0774f41
Force-pushed from 0774f41 to 07754b6
…cess-to-avoid-zombie Detach bisect subprocesses to avoid making zombie processes --- This commit was imported from rspec/rspec-core@efbac94.
If we do not `waitpid` or `detach`, the bisect process becomes a zombie process. As mentioned in the waitpid doc:
`detach` is a good idea. Thanks @pirj. From the Ruby doc:
Related:
Thanks @pirj and @andrykonchin