Skip to content

Commit

Permalink
Avoid bisect command to get stuck
Browse files Browse the repository at this point in the history
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1

First, I've tried to play with the number of specs which led to the
interesting conclusion: **the process hangs only at 1548+ specs**.

```diff
 RSpec.describe "a bunch of nothing" do
   (0...(ENV.fetch('N', 3000).to_i)).each do |t|
     it { expect(t).to eq t }
   end
 end
```

Try to run with `N=1547` and `N=1548`.

Seems suspicious, right?

Let's add `pry-byebug` to the equation (or Gemfile).

In order it to work we need to tweak our runner code a bit:

```diff
- $stdout = $stderr = @spec_output
+ # $stdout = $stderr = @spec_output
```

After a bit of `puts` debugging I localized the problem:
[`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122).

 `Channel#send` calls `IO#write` here
 /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41:

 ```ruby
 def send(message)
   packet = Marshal.dump(message)
   @write_io.write("#{packet.bytesize}\n#{packet}")
 end
 ```

 Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**.
 This number is very important: the pipe size is only **65536** on MacOS
 (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock)
 for more).

 That makes `@write_io.write` hangs forever, because no one reads the
 buffer: we call `Channel#receive` only after `Process.waitpid(pid)`,
 thus waiting for the write operation to complete.

 -----------

 A basic proposal will be to use WNOHANG. From waitpid doc:

 > If WNOHANG was specified in options and there were no children
 > in a waitable state, then waitid() returns 0 immediately (...)

 To validate this proposal on OSX we run just before running bisect:
 `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'`

 This will give us in loop the PIPE sizes of Ruby processes. Without our
 patch we see that quickly we hit 65536 bytes on two pipes, with the patch
 we keep pipes at the right size.

```
COMMAND PID    USER     FD   TYPE DEVICE                SIZE/OFF  NODE NAME
ruby    40134  benoit    3   PIPE 0xf3b025a6a6cd6005    16384     ->0xf3b025a6a6cd5045
ruby    40134  benoit    4   PIPE 0xf3b025a6a6cd5045    16384     ->0xf3b025a6a6cd6005
ruby    40134  benoit    5   PIPE 0xf3b025a6a6cd7805    16384     ->0xf3b025a6a6cd7145
ruby    40134  benoit    7   PIPE 0xf3b025a6a6cd7145    16384     ->0xf3b025a6a6cd7805
ruby    40134  benoit   10   PIPE 0xf3b025a6a6cd6fc5    16384     ->0xf3b025a6a6cd5a05
ruby    40134  benoit   11   PIPE 0xf3b025a6a6cd5a05    16384     ->0xf3b025a6a6cd6fc5
ruby    40144  benoit    3   PIPE 0xf3b025a6a6cd5d05    16384     ->0xf3b025a6a6cd5c45
ruby    40144  benoit    4   PIPE 0xf3b025a6a6cd5c45    16384     ->0xf3b025a6a6cd5d05
ruby    40144  benoit    5   PIPE 0xf3b025a6a6cd7085    16384     ->0xf3b025a6a6cd6785
ruby    40144  benoit    7   PIPE 0xf3b025a6a6cd6785    16384     ->0xf3b025a6a6cd7085
ruby    40144  benoit   10   PIPE 0xf3b025a6a6cd6fc5    16384     ->0xf3b025a6a6cd5a05
ruby    40144  benoit   11   PIPE 0xf3b025a6a6cd5a05    16384     ->0xf3b025a6a6cd6fc5
```

But if we look properly from the doc we can even go further.

> If status information is immediately available on an appropriate child process,
waitpid() returns this information. Otherwise, waitpid() returns immediately with
an error code indicating that the information was not available. In other words,
WNOHANG checks child processes without causing the caller to be suspended.

and as pirj mention: "With this in mind, do we really need to check that information
that waitpid returns? We don't seem to use it."

Removing "waitpid" produce the same behavior as with `WNOHANG`.

Improvements:
The bisect command request lot's of ram. The next step should be to
reduce that consumption.

Related:
 - fix: rspec#2637
 - PR discussion: rspec#2669
  • Loading branch information
benoittgt committed Dec 14, 2019
1 parent 408c2bc commit fde186f
Show file tree
Hide file tree
Showing 3 changed files with 21 additions and 2 deletions.
5 changes: 3 additions & 2 deletions lib/rspec/core/bisect/fork_runner.rb
Expand Up @@ -91,8 +91,9 @@ def initialize(runner, channel)
end

def dispatch_specs(run_descriptor)
pid = fork { run_specs(run_descriptor) }
Process.waitpid(pid)
fork { run_specs(run_descriptor) }
# We don't use Process.waitpid here as it was causing bisects to
# block due to the file descriptor limit on OSX / Linux.
end

private
Expand Down
10 changes: 10 additions & 0 deletions spec/integration/bisect_spec.rb
Expand Up @@ -32,5 +32,15 @@ def bisect(cli_args, expected_status=nil)
expect(output).to include("Bisect failed!", "The example ordering is inconsistent")
end
end

context "when the bisect commasaturingnd is long" do
# On OSX and Linux a file descriptor limit meant that the bisect process got stuck at a certain limit.
# This test demonstrates that we can run large bisects above this limit (found to be at time of commit).
# See: https://github.com/rspec/rspec-core/pull/2669
it 'does not hit pipe size limit and does not get stuck' do
output = bisect(%W[spec/rspec/core/resources/blocking_pipe_bisect_spec.rb_], 1)
expect(output).to include("No failures found.")
end
end
end
end
8 changes: 8 additions & 0 deletions spec/rspec/core/resources/blocking_pipe_bisect_spec.rb_
@@ -0,0 +1,8 @@
# Deliberately named *.rb_ to avoid being loaded except when specified

RSpec.describe "1000 tests" do
puts "Try to saturate the pipe in Bisect command"
(0..1000).each do |t|
it { expect(t).to eq t }
end
end

0 comments on commit fde186f

Please sign in to comment.