Avoid bisect command to get stuck #2669

benoittgt · 2019-11-25T20:56:03Z

From Palkan in benoittgt/rspec_repro_bisect_deadlock#1

First, I've tried to play with the number of specs which led to the
interesting conclusion: the process hangs only at 1548+ specs.

 RSpec.describe "a bunch of nothing" do
   (0...(ENV.fetch('N', 3000).to_i)).each do |t|
     it { expect(t).to eq t }
   end
 end

Try to run with N=1547 and N=1548.

Seems suspicious, right?

Let's add pry-byebug to the equation (or Gemfile).

In order it to work we need to tweak our runner code a bit:

- $stdout = $stderr = @spec_output
+ # $stdout = $stderr = @spec_output

After a bit of puts debugging I localized the problem:
@channel.send.

Channel#send calls IO#write here
https://github.com/rspec/rspec-core/blob/7b6b9c3/lib/rspec/core/bisect/utilities.rb#L41:

def send(message)
  packet = Marshal.dump(message)
  @write_io.write("#{packet.bytesize}\n#{packet}")
end

Do you know, what is the packet.bytesize for N=1548? It's 65548.
This number is very important: the pipe size is only 65536 on MacOS
(see docs for IO#write_nonblock
for more).

That makes @write_io.write hangs forever, because no one reads the
buffer: we call Channel#receive only after Process.waitpid(pid),
thus waiting for the write operation to complete.

A basic proposal is to use WNOHANG. From waitpid doc:

If WNOHANG was specified in options and there were no children
in a waitable state, then waitid() returns 0 immediately (...)

To validate this proposal on OSX we run just before running bisect:
lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'

This will give us in loop the PIPE sizes of Ruby processes. Without our
patch we see that quickly we hit 65536 bytes on two pipes, with the patch
we keep pipes at the right size.

COMMAND PID    USER     FD   TYPE DEVICE                SIZE/OFF  NODE NAME
ruby    40134  benoit    3   PIPE 0xf3b025a6a6cd6005    16384     ->0xf3b025a6a6cd5045
ruby    40134  benoit    4   PIPE 0xf3b025a6a6cd5045    16384     ->0xf3b025a6a6cd6005
ruby    40134  benoit    5   PIPE 0xf3b025a6a6cd7805    16384     ->0xf3b025a6a6cd7145
ruby    40134  benoit    7   PIPE 0xf3b025a6a6cd7145    16384     ->0xf3b025a6a6cd7805
ruby    40134  benoit   10   PIPE 0xf3b025a6a6cd6fc5    16384     ->0xf3b025a6a6cd5a05
ruby    40134  benoit   11   PIPE 0xf3b025a6a6cd5a05    16384     ->0xf3b025a6a6cd6fc5
ruby    40144  benoit    3   PIPE 0xf3b025a6a6cd5d05    16384     ->0xf3b025a6a6cd5c45
ruby    40144  benoit    4   PIPE 0xf3b025a6a6cd5c45    16384     ->0xf3b025a6a6cd5d05
ruby    40144  benoit    5   PIPE 0xf3b025a6a6cd7085    16384     ->0xf3b025a6a6cd6785
ruby    40144  benoit    7   PIPE 0xf3b025a6a6cd6785    16384     ->0xf3b025a6a6cd7085
ruby    40144  benoit   10   PIPE 0xf3b025a6a6cd6fc5    16384     ->0xf3b025a6a6cd5a05
ruby    40144  benoit   11   PIPE 0xf3b025a6a6cd5a05    16384     ->0xf3b025a6a6cd6fc5

Improvements:
The bisect command request lot's of ram. The next step should be to
reduce that consumption.

Thanks to Palkan, julik and Samuel Williams for the help. 🙇🏻‍♂️

pirj · 2019-11-25T23:19:04Z

Just my 2c. Wondering why waitpid is needed at all? read on a pipe seems to be a blocking operation, and while the child process has not closed the channel, it won't return.
It might be a hasty conclusion, but it seems that waitpid is preventing the parent process from reading from that buffer and locks the child on a write operation.
I can't say I completely understand the semantics of WNOHANG, but it seems that we're relying on its behaviour of immediately returning anyway. However, we make ourselves dependent on the 'child in a waitable state', potentially risking a racing condition. There's slightly other description of it I've found which states:

If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended.

With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it.

Out-of-context reproduction:

def experiment(n)
  rd, wr = IO.pipe
  pid = fork { rd.close; sleep 1; wr.write('a' * n); sleep 1; wr.close }
  Process.waitpid(pid)
  wr.close
  puts 'reading'
  puts rd.read.size
  puts 'done'
end

> experiment(65536)
reading
65536
done
> experiment(65537)

the second experiment hangs forever.

However, if you comment out Process.waitpid, there are two differences in behaviour:

experiment(65537) doesn't hang
reading is printed immediately, not together with read buffer size and done; two seconds later read buffer size and done appear

With this in mind, what do you think guys of removing Process.waitpid call altogether?

Great in-depth research by the way! 💯

benoittgt · 2019-11-26T09:36:01Z

Thanks a lot @pirj for this very interesting feedback. I think I thought about removing it at first but was not sure about unexpected behaviors.

I redo all my test and removing Process.waitpid seems to be a good idea! I will amend my commit

From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: #2637 - PR discussion: #2669

lib/rspec/core/bisect/fork_runner.rb

From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: #2637 - PR discussion: #2669

JonRowe · 2019-11-27T15:05:10Z

Are we happy this fixes the problem? Have we got a way to check for a regression to this in our tests?

(It's green its just not gotten to github)

benoittgt · 2019-11-27T15:21:17Z

Are we happy this fixes the problem?

I am quite ok with the result even if I would love to test it on a linux and windows.

Have we got a way to check for a regression to this in our tests?

One test we could have is to run bisect for a lot's of test. If you want I can look at this.

JonRowe · 2019-11-27T15:22:11Z

I think we need something...

From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: #2637 - PR discussion: #2669

benoittgt · 2019-12-06T14:26:37Z

Sorry for the delay.

I added a test on integration side. It was the easiest to implement in my case. I was able to reproduce the lock without my patch, then test pass with waitpid removed.

From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: #2637 - PR discussion: #2669

spec/rspec/core/resources/blocking_pipe_bisect_spec.rb_

spec/integration/bisect_spec.rb

lib/rspec/core/bisect/fork_runner.rb

spec/integration/bisect_spec.rb

JonRowe

Really great work you two, thanks for digging into this so much @benoittgt, as a test combo this looks sound, I'd just like some minor tweaks and this should be good to go :)

benoittgt · 2019-12-12T14:37:22Z

Thanks for the review. I am not about the clarity of my comment. Open for reviews.

It was a pleasure to work on this subject.

spec/integration/bisect_spec.rb

lib/rspec/core/bisect/fork_runner.rb

benoittgt · 2019-12-14T13:22:04Z

Thanks @JonRowe. It's much better and shorter with your proposal.

JonRowe · 2019-12-14T14:11:09Z

Happy for you to merge @benoittgt don't forget the change log :)

From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: #2637 - PR discussion: #2669

jcabasc · 2019-12-20T15:42:13Z

Hey guys @benoittgt is there a special way to run this command and take advantage of this implementation? I am running into the same issue, so wanted to double-check with you. By the way, which version is this one? Thanks.

benoittgt · 2019-12-20T15:53:38Z

This is the master version. We did not release a new version with the patch. You could point in your gemfile rspec-core to master.

jcabasc · 2019-12-20T15:54:38Z

Thanks man @benoittgt

Avoid bisect command to get stuck

pirj · 2020-06-11T00:15:29Z

@andrykonchin brought up an interesting point that by removing Process.waitpid we're producing zombie processes in the system.

A quick search showed that to mitigate this we can use Process.detach(pid).

Maybe @ioquatix can advise here?

JonRowe · 2020-06-11T07:48:30Z

@andrykonchin brought up an interesting point that by removing Process.waitpid we're producing zombie processes in the system.

This probably means that we have not fixed the problem, merely ignored it.

benoittgt · 2020-06-11T16:59:36Z

@andrykonchin brought up an interesting point that by removing Process.waitpid we're producing zombie processes in the system.

Ouhao. Very nice article. 👏

This is a google translation but this is written:

On the one hand, they solved the problem with deadlock , but on the other, they created a new one. Now the child process remains in the “zombie” state, since no one gets its completion status by a system call waitpid, which means leak of process descriptors. When the number of processes in the operating system reaches the limit, and it is not so big, it will not be possible to create a new child process.

Mentioned in waitpid doc:

A child that terminates, but has not been waited for becomes a "zombie". The kernel maintains a minimal set of information about the zombie process (PID, termination status, resource usage information) in order to allow the parent to later perform a wait to obtain information about the child. As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. If a parent process terminates, then its "zombie" children (if any) are adopted by init(8), which automatically performs a wait to remove the zombies.

So after few tests, we have probably zombie processes when the bisect commands run. It is not clean while the bisect command is running but it is cleaned after.
If your bisect command needs to run many step to isolate the failings specs suite and your machine doesn't have enough process available it will ended up using all your processes and the machine become unusable.

detach seems to be a good idea. I tested it on the code example below and did not hit any limitations.

I did this test, because I wanted to understand exactly what was happening and also try @pirj proposal:

Create process and do not waitpid like rspec-core at the moment.

@read_io, @write_io = IO.pipe

# write into pipe some data
def run_specs(i)
  packet = '*' * i
  @write_io.write("#{packet.bytesize}\n#{packet}")
end

count = 0
(65500..69500).each do |i|
  count += 1
  # create a child process
  pid = fork { run_specs(i) }
  puts "interval: #{i}, pid is: #{pid} and parent pid is: #{Process.ppid}, count: #{count}"

  # wait for its terminating
  # Process.waitpid(pid)
  sleep 0.1

  # read result
  packet_size = Integer(@read_io.gets)
  packet = @read_io.read(packet_size)

  puts "packet size: #{packet.size}"
end

Get max number of process on my mac for my user

$ sysctl kern.maxprocperuid
kern.maxprocperuid: 1418

Monitor process count for my user

while true; do pstree -u <username> | wc -l ; sleep 1; clear; done

You can also monitor zombie process count while true; do ps axo pid=,stat= | awk '$2~/^Z/ { print }' | wc -l ; sleep 1; clear; done

Run the ruby script. When we reach 1418 process the systems cannot create more process, it's problematic but to go there I was able to run create nearly 1000 process (that become zombie). They are cleared when the ruby script is cancel or over.

If do not `waitpid` or `detach` the process become a zombie process. As mentionned in waitpid doc: > As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. Related: - #2669 - https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html

If we do not `waitpid` or `detach` the bisect process become a zombie process. As mentionned in waitpid doc: > As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. Related: - #2669 - https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html

If we do not `waitpid` or `detach` the bisect process become a zombie process. As mentionned in waitpid doc: > As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. `detach` is a good idea. From the Ruby doc: > Some operating systems retain the status of terminated child processes until the parent collects that status (normally using some variant of wait()). If the parent never collects this status, the child stays around as a zombie process. Process::detach prevents this by setting up a separate Ruby thread whose sole job is to reap the status of the process pid when it terminates. Use detach only when you do not intend to explicitly wait for the child to terminate. Related: - #2669 - https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html

From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: rspec#2637 - PR discussion: rspec#2669

Avoid bisect command to get stuck

If we do not `waitpid` or `detach` the bisect process become a zombie process. As mentionned in waitpid doc: > As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. `detach` is a good idea. From the Ruby doc: > Some operating systems retain the status of terminated child processes until the parent collects that status (normally using some variant of wait()). If the parent never collects this status, the child stays around as a zombie process. Process::detach prevents this by setting up a separate Ruby thread whose sole job is to reap the status of the process pid when it terminates. Use detach only when you do not intend to explicitly wait for the child to terminate. Related: - rspec#2669 - https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html

Avoid bisect command to get stuck --- This commit was imported from rspec/rspec-core@1c31fc4.

This commit was imported from rspec/rspec-core@968fd99.

benoittgt force-pushed the deal_with_stuck_bisect branch from 0fdec07 to 8590b53 Compare November 26, 2019 09:40

JonRowe reviewed Nov 26, 2019

View reviewed changes

lib/rspec/core/bisect/fork_runner.rb Outdated Show resolved Hide resolved

benoittgt force-pushed the deal_with_stuck_bisect branch from 8590b53 to 1f5dcf9 Compare November 26, 2019 14:34

benoittgt force-pushed the deal_with_stuck_bisect branch from 1f5dcf9 to 2ff60d3 Compare December 6, 2019 14:25

benoittgt force-pushed the deal_with_stuck_bisect branch from 2ff60d3 to f746452 Compare December 6, 2019 14:28

pirj reviewed Dec 6, 2019

View reviewed changes

spec/rspec/core/resources/blocking_pipe_bisect_spec.rb_ Show resolved Hide resolved

benoittgt force-pushed the deal_with_stuck_bisect branch from 66ec8ac to f746452 Compare December 11, 2019 14:21

JonRowe reviewed Dec 11, 2019

View reviewed changes

spec/integration/bisect_spec.rb Outdated Show resolved Hide resolved

JonRowe reviewed Dec 11, 2019

View reviewed changes

lib/rspec/core/bisect/fork_runner.rb Show resolved Hide resolved

JonRowe reviewed Dec 11, 2019

View reviewed changes

spec/integration/bisect_spec.rb Show resolved Hide resolved

JonRowe requested changes Dec 11, 2019

View reviewed changes

benoittgt force-pushed the deal_with_stuck_bisect branch from 40c768b to 7c79e17 Compare December 12, 2019 14:38

JonRowe reviewed Dec 14, 2019

View reviewed changes

spec/integration/bisect_spec.rb Outdated Show resolved Hide resolved

JonRowe reviewed Dec 14, 2019

View reviewed changes

lib/rspec/core/bisect/fork_runner.rb Outdated Show resolved Hide resolved

JonRowe approved these changes Dec 14, 2019

View reviewed changes

benoittgt force-pushed the deal_with_stuck_bisect branch from 41c7442 to 4c0e507 Compare December 14, 2019 14:22

benoittgt force-pushed the deal_with_stuck_bisect branch from 4c0e507 to f902067 Compare December 14, 2019 14:22

benoittgt merged commit 89972e6 into master Dec 14, 2019

benoittgt deleted the deal_with_stuck_bisect branch December 14, 2019 14:22

benoittgt added a commit that referenced this pull request Dec 14, 2019

Changelog for #2669

35c6f8d

JonRowe added a commit that referenced this pull request Dec 14, 2019

Update changelog for #2669

91bc14f

JonRowe pushed a commit that referenced this pull request Dec 28, 2019

Avoid bisect command to get stuck (#2669)

1c31fc4

Avoid bisect command to get stuck

JonRowe added a commit that referenced this pull request Dec 28, 2019

Update changelog for #2669

968fd99

benoittgt mentioned this pull request Jan 9, 2020

Bisect hangs / 'minimal set' not very minimal / can't bisect a bisect? #2623

Closed

VegaFromLyra mentioned this pull request Jan 20, 2020

[WIP]: Randomize spec execution forem/forem#5578

Closed

8 tasks

benoittgt mentioned this pull request Jun 4, 2020

--bisect deadlocks when reporting results #2637

Closed

benoittgt mentioned this pull request Jun 11, 2020

Detach bisect subprocesses to avoid making zombie processes #2739

Merged

MatheusRich pushed a commit to MatheusRich/rspec-core that referenced this pull request Oct 30, 2020

Avoid bisect command to get stuck (rspec#2669)

cfa190e

Avoid bisect command to get stuck

MatheusRich pushed a commit to MatheusRich/rspec-core that referenced this pull request Oct 30, 2020

Changelog for rspec#2669

2d424b5

MatheusRich pushed a commit to MatheusRich/rspec-core that referenced this pull request Oct 30, 2020

Update changelog for rspec#2669

898a486

yujinakayama pushed a commit to yujinakayama/rspec-monorepo that referenced this pull request Oct 6, 2021

[core] Avoid bisect command to get stuck (rspec/rspec-core#2669)

ed95a61

Avoid bisect command to get stuck --- This commit was imported from rspec/rspec-core@1c31fc4.

yujinakayama pushed a commit to yujinakayama/rspec-monorepo that referenced this pull request Oct 6, 2021

[core] Update changelog for rspec/rspec-core#2669

610d632

This commit was imported from rspec/rspec-core@968fd99.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid bisect command to get stuck #2669

Avoid bisect command to get stuck #2669

benoittgt commented Nov 25, 2019 •

edited by pirj

pirj commented Nov 25, 2019 •

edited

benoittgt commented Nov 26, 2019

JonRowe commented Nov 27, 2019 •

edited

benoittgt commented Nov 27, 2019

JonRowe commented Nov 27, 2019

benoittgt commented Dec 6, 2019

JonRowe left a comment

benoittgt commented Dec 12, 2019 •

edited

benoittgt commented Dec 14, 2019

JonRowe commented Dec 14, 2019

jcabasc commented Dec 20, 2019 •

edited

benoittgt commented Dec 20, 2019

jcabasc commented Dec 20, 2019

pirj commented Jun 11, 2020

JonRowe commented Jun 11, 2020

benoittgt commented Jun 11, 2020

Avoid bisect command to get stuck #2669

Avoid bisect command to get stuck #2669

Conversation

benoittgt commented Nov 25, 2019 • edited by pirj

pirj commented Nov 25, 2019 • edited

benoittgt commented Nov 26, 2019

JonRowe commented Nov 27, 2019 • edited

benoittgt commented Nov 27, 2019

JonRowe commented Nov 27, 2019

benoittgt commented Dec 6, 2019

JonRowe left a comment

Choose a reason for hiding this comment

benoittgt commented Dec 12, 2019 • edited

benoittgt commented Dec 14, 2019

JonRowe commented Dec 14, 2019

jcabasc commented Dec 20, 2019 • edited

benoittgt commented Dec 20, 2019

jcabasc commented Dec 20, 2019

pirj commented Jun 11, 2020

JonRowe commented Jun 11, 2020

benoittgt commented Jun 11, 2020

benoittgt commented Nov 25, 2019 •

edited by pirj

pirj commented Nov 25, 2019 •

edited

JonRowe commented Nov 27, 2019 •

edited

benoittgt commented Dec 12, 2019 •

edited

jcabasc commented Dec 20, 2019 •

edited