Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid bisect command to get stuck #2669

Merged
merged 1 commit into from Dec 14, 2019
Merged

Avoid bisect command to get stuck #2669

merged 1 commit into from Dec 14, 2019

Conversation

benoittgt
Copy link
Member

@benoittgt benoittgt commented Nov 25, 2019

Fix: #2637


From Palkan in benoittgt/rspec_repro_bisect_deadlock#1

First, I've tried to play with the number of specs which led to the
interesting conclusion: the process hangs only at 1548+ specs.

 RSpec.describe "a bunch of nothing" do
   (0...(ENV.fetch('N', 3000).to_i)).each do |t|
     it { expect(t).to eq t }
   end
 end

Try to run with N=1547 and N=1548.

Seems suspicious, right?

Let's add pry-byebug to the equation (or Gemfile).

In order it to work we need to tweak our runner code a bit:

- $stdout = $stderr = @spec_output
+ # $stdout = $stderr = @spec_output

After a bit of puts debugging I localized the problem:
@channel.send.

Channel#send calls IO#write here
https://github.com/rspec/rspec-core/blob/7b6b9c3/lib/rspec/core/bisect/utilities.rb#L41:

def send(message)
  packet = Marshal.dump(message)
  @write_io.write("#{packet.bytesize}\n#{packet}")
end

Do you know, what is the packet.bytesize for N=1548? It's 65548.
This number is very important: the pipe size is only 65536 on MacOS
(see docs for IO#write_nonblock
for more).

That makes @write_io.write hangs forever, because no one reads the
buffer: we call Channel#receive only after Process.waitpid(pid),
thus waiting for the write operation to complete.


A basic proposal is to use WNOHANG. From waitpid doc:

If WNOHANG was specified in options and there were no children
in a waitable state, then waitid() returns 0 immediately (...)

To validate this proposal on OSX we run just before running bisect:
lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'

This will give us in loop the PIPE sizes of Ruby processes. Without our
patch we see that quickly we hit 65536 bytes on two pipes, with the patch
we keep pipes at the right size.

COMMAND PID    USER     FD   TYPE DEVICE                SIZE/OFF  NODE NAME
ruby    40134  benoit    3   PIPE 0xf3b025a6a6cd6005    16384     ->0xf3b025a6a6cd5045
ruby    40134  benoit    4   PIPE 0xf3b025a6a6cd5045    16384     ->0xf3b025a6a6cd6005
ruby    40134  benoit    5   PIPE 0xf3b025a6a6cd7805    16384     ->0xf3b025a6a6cd7145
ruby    40134  benoit    7   PIPE 0xf3b025a6a6cd7145    16384     ->0xf3b025a6a6cd7805
ruby    40134  benoit   10   PIPE 0xf3b025a6a6cd6fc5    16384     ->0xf3b025a6a6cd5a05
ruby    40134  benoit   11   PIPE 0xf3b025a6a6cd5a05    16384     ->0xf3b025a6a6cd6fc5
ruby    40144  benoit    3   PIPE 0xf3b025a6a6cd5d05    16384     ->0xf3b025a6a6cd5c45
ruby    40144  benoit    4   PIPE 0xf3b025a6a6cd5c45    16384     ->0xf3b025a6a6cd5d05
ruby    40144  benoit    5   PIPE 0xf3b025a6a6cd7085    16384     ->0xf3b025a6a6cd6785
ruby    40144  benoit    7   PIPE 0xf3b025a6a6cd6785    16384     ->0xf3b025a6a6cd7085
ruby    40144  benoit   10   PIPE 0xf3b025a6a6cd6fc5    16384     ->0xf3b025a6a6cd5a05
ruby    40144  benoit   11   PIPE 0xf3b025a6a6cd5a05    16384     ->0xf3b025a6a6cd6fc5

Improvements:
The bisect command request lot's of ram. The next step should be to
reduce that consumption.

Thanks to Palkan, julik and Samuel Williams for the help. 🙇🏻‍♂️

@pirj
Copy link
Member

pirj commented Nov 25, 2019

Just my 2c. Wondering why waitpid is needed at all? read on a pipe seems to be a blocking operation, and while the child process has not closed the channel, it won't return.
It might be a hasty conclusion, but it seems that waitpid is preventing the parent process from reading from that buffer and locks the child on a write operation.
I can't say I completely understand the semantics of WNOHANG, but it seems that we're relying on its behaviour of immediately returning anyway. However, we make ourselves dependent on the 'child in a waitable state', potentially risking a racing condition. There's slightly other description of it I've found which states:

If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended.

With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it.

Out-of-context reproduction:

def experiment(n)
  rd, wr = IO.pipe
  pid = fork { rd.close; sleep 1; wr.write('a' * n); sleep 1; wr.close }
  Process.waitpid(pid)
  wr.close
  puts 'reading'
  puts rd.read.size
  puts 'done'
end
> experiment(65536)
reading
65536
done
> experiment(65537)

the second experiment hangs forever.

However, if you comment out Process.waitpid, there are two differences in behaviour:

  • experiment(65537) doesn't hang
  • reading is printed immediately, not together with read buffer size and done; two seconds later read buffer size and done appear

With this in mind, what do you think guys of removing Process.waitpid call altogether?

Great in-depth research by the way! 💯

@benoittgt
Copy link
Member Author

Thanks a lot @pirj for this very interesting feedback. I think I thought about removing it at first but was not sure about unexpected behaviors.

I redo all my test and removing Process.waitpid seems to be a good idea! I will amend my commit

benoittgt added a commit that referenced this pull request Nov 26, 2019
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1

First, I've tried to play with the number of specs which led to the
interesting conclusion: **the process hangs only at 1548+ specs**.

```diff
 RSpec.describe "a bunch of nothing" do
   (0...(ENV.fetch('N', 3000).to_i)).each do |t|
     it { expect(t).to eq t }
   end
 end
```

Try to run with `N=1547` and `N=1548`.

Seems suspicious, right?

Let's add `pry-byebug` to the equation (or Gemfile).

In order it to work we need to tweak our runner code a bit:

```diff
- $stdout = $stderr = @spec_output
+ # $stdout = $stderr = @spec_output
```

After a bit of `puts` debugging I localized the problem:
[`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122).

 `Channel#send` calls `IO#write` here
 /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41:

 ```ruby
 def send(message)
   packet = Marshal.dump(message)
   @write_io.write("#{packet.bytesize}\n#{packet}")
 end
 ```

 Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**.
 This number is very important: the pipe size is only **65536** on MacOS
 (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock)
 for more).

 That makes `@write_io.write` hangs forever, because no one reads the
 buffer: we call `Channel#receive` only after `Process.waitpid(pid)`,
 thus waiting for the write operation to complete.

 -----------

 A basic proposal will be to use WNOHANG. From waitpid doc:

 > If WNOHANG was specified in options and there were no children
 > in a waitable state, then waitid() returns 0 immediately (...)

 To validate this proposal on OSX we run just before running bisect:
 `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'`

 This will give us in loop the PIPE sizes of Ruby processes. Without our
 patch we see that quickly we hit 65536 bytes on two pipes, with the patch
 we keep pipes at the right size.

```
COMMAND PID    USER     FD   TYPE DEVICE                SIZE/OFF  NODE NAME
ruby    40134  benoit    3   PIPE 0xf3b025a6a6cd6005    16384     ->0xf3b025a6a6cd5045
ruby    40134  benoit    4   PIPE 0xf3b025a6a6cd5045    16384     ->0xf3b025a6a6cd6005
ruby    40134  benoit    5   PIPE 0xf3b025a6a6cd7805    16384     ->0xf3b025a6a6cd7145
ruby    40134  benoit    7   PIPE 0xf3b025a6a6cd7145    16384     ->0xf3b025a6a6cd7805
ruby    40134  benoit   10   PIPE 0xf3b025a6a6cd6fc5    16384     ->0xf3b025a6a6cd5a05
ruby    40134  benoit   11   PIPE 0xf3b025a6a6cd5a05    16384     ->0xf3b025a6a6cd6fc5
ruby    40144  benoit    3   PIPE 0xf3b025a6a6cd5d05    16384     ->0xf3b025a6a6cd5c45
ruby    40144  benoit    4   PIPE 0xf3b025a6a6cd5c45    16384     ->0xf3b025a6a6cd5d05
ruby    40144  benoit    5   PIPE 0xf3b025a6a6cd7085    16384     ->0xf3b025a6a6cd6785
ruby    40144  benoit    7   PIPE 0xf3b025a6a6cd6785    16384     ->0xf3b025a6a6cd7085
ruby    40144  benoit   10   PIPE 0xf3b025a6a6cd6fc5    16384     ->0xf3b025a6a6cd5a05
ruby    40144  benoit   11   PIPE 0xf3b025a6a6cd5a05    16384     ->0xf3b025a6a6cd6fc5
```

But if we look properly from the doc we can even go further.

> If status information is immediately available on an appropriate child process,
waitpid() returns this information. Otherwise, waitpid() returns immediately with
an error code indicating that the information was not available. In other words,
WNOHANG checks child processes without causing the caller to be suspended.

and as pirj mention: "With this in mind, do we really need to check that information
that waitpid returns? We don't seem to use it."

Removing "waitpid" produce the same behavior as with `WNOHANG`.

Improvements:
The bisect command request lot's of ram. The next step should be to
reduce that consumption.

Related:
 - fix: #2637
 - PR discussion: #2669
benoittgt added a commit that referenced this pull request Nov 26, 2019
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1

First, I've tried to play with the number of specs which led to the
interesting conclusion: **the process hangs only at 1548+ specs**.

```diff
 RSpec.describe "a bunch of nothing" do
   (0...(ENV.fetch('N', 3000).to_i)).each do |t|
     it { expect(t).to eq t }
   end
 end
```

Try to run with `N=1547` and `N=1548`.

Seems suspicious, right?

Let's add `pry-byebug` to the equation (or Gemfile).

In order it to work we need to tweak our runner code a bit:

```diff
- $stdout = $stderr = @spec_output
+ # $stdout = $stderr = @spec_output
```

After a bit of `puts` debugging I localized the problem:
[`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122).

 `Channel#send` calls `IO#write` here
 /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41:

 ```ruby
 def send(message)
   packet = Marshal.dump(message)
   @write_io.write("#{packet.bytesize}\n#{packet}")
 end
 ```

 Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**.
 This number is very important: the pipe size is only **65536** on MacOS
 (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock)
 for more).

 That makes `@write_io.write` hangs forever, because no one reads the
 buffer: we call `Channel#receive` only after `Process.waitpid(pid)`,
 thus waiting for the write operation to complete.

 -----------

 A basic proposal will be to use WNOHANG. From waitpid doc:

 > If WNOHANG was specified in options and there were no children
 > in a waitable state, then waitid() returns 0 immediately (...)

 To validate this proposal on OSX we run just before running bisect:
 `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'`

 This will give us in loop the PIPE sizes of Ruby processes. Without our
 patch we see that quickly we hit 65536 bytes on two pipes, with the patch
 we keep pipes at the right size.

```
COMMAND PID    USER     FD   TYPE DEVICE                SIZE/OFF  NODE NAME
ruby    40134  benoit    3   PIPE 0xf3b025a6a6cd6005    16384     ->0xf3b025a6a6cd5045
ruby    40134  benoit    4   PIPE 0xf3b025a6a6cd5045    16384     ->0xf3b025a6a6cd6005
ruby    40134  benoit    5   PIPE 0xf3b025a6a6cd7805    16384     ->0xf3b025a6a6cd7145
ruby    40134  benoit    7   PIPE 0xf3b025a6a6cd7145    16384     ->0xf3b025a6a6cd7805
ruby    40134  benoit   10   PIPE 0xf3b025a6a6cd6fc5    16384     ->0xf3b025a6a6cd5a05
ruby    40134  benoit   11   PIPE 0xf3b025a6a6cd5a05    16384     ->0xf3b025a6a6cd6fc5
ruby    40144  benoit    3   PIPE 0xf3b025a6a6cd5d05    16384     ->0xf3b025a6a6cd5c45
ruby    40144  benoit    4   PIPE 0xf3b025a6a6cd5c45    16384     ->0xf3b025a6a6cd5d05
ruby    40144  benoit    5   PIPE 0xf3b025a6a6cd7085    16384     ->0xf3b025a6a6cd6785
ruby    40144  benoit    7   PIPE 0xf3b025a6a6cd6785    16384     ->0xf3b025a6a6cd7085
ruby    40144  benoit   10   PIPE 0xf3b025a6a6cd6fc5    16384     ->0xf3b025a6a6cd5a05
ruby    40144  benoit   11   PIPE 0xf3b025a6a6cd5a05    16384     ->0xf3b025a6a6cd6fc5
```

But if we look properly from the doc we can even go further.

> If status information is immediately available on an appropriate child process,
waitpid() returns this information. Otherwise, waitpid() returns immediately with
an error code indicating that the information was not available. In other words,
WNOHANG checks child processes without causing the caller to be suspended.

and as pirj mention: "With this in mind, do we really need to check that information
that waitpid returns? We don't seem to use it."

Removing "waitpid" produce the same behavior as with `WNOHANG`.

Improvements:
The bisect command request lot's of ram. The next step should be to
reduce that consumption.

Related:
 - fix: #2637
 - PR discussion: #2669
@JonRowe
Copy link
Member

JonRowe commented Nov 27, 2019

Are we happy this fixes the problem? Have we got a way to check for a regression to this in our tests?

(It's green its just not gotten to github)

@benoittgt
Copy link
Member Author

Are we happy this fixes the problem?

I am quite ok with the result even if I would love to test it on a linux and windows.

Have we got a way to check for a regression to this in our tests?

One test we could have is to run bisect for a lot's of test. If you want I can look at this.

@JonRowe
Copy link
Member

JonRowe commented Nov 27, 2019

I think we need something...

benoittgt added a commit that referenced this pull request Dec 6, 2019
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1

First, I've tried to play with the number of specs which led to the
interesting conclusion: **the process hangs only at 1548+ specs**.

```diff
 RSpec.describe "a bunch of nothing" do
   (0...(ENV.fetch('N', 3000).to_i)).each do |t|
     it { expect(t).to eq t }
   end
 end
```

Try to run with `N=1547` and `N=1548`.

Seems suspicious, right?

Let's add `pry-byebug` to the equation (or Gemfile).

In order it to work we need to tweak our runner code a bit:

```diff
- $stdout = $stderr = @spec_output
+ # $stdout = $stderr = @spec_output
```

After a bit of `puts` debugging I localized the problem:
[`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122).

 `Channel#send` calls `IO#write` here
 /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41:

 ```ruby
 def send(message)
   packet = Marshal.dump(message)
   @write_io.write("#{packet.bytesize}\n#{packet}")
 end
 ```

 Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**.
 This number is very important: the pipe size is only **65536** on MacOS
 (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock)
 for more).

 That makes `@write_io.write` hangs forever, because no one reads the
 buffer: we call `Channel#receive` only after `Process.waitpid(pid)`,
 thus waiting for the write operation to complete.

 -----------

 A basic proposal will be to use WNOHANG. From waitpid doc:

 > If WNOHANG was specified in options and there were no children
 > in a waitable state, then waitid() returns 0 immediately (...)

 To validate this proposal on OSX we run just before running bisect:
 `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'`

 This will give us in loop the PIPE sizes of Ruby processes. Without our
 patch we see that quickly we hit 65536 bytes on two pipes, with the patch
 we keep pipes at the right size.

```
COMMAND PID    USER     FD   TYPE DEVICE                SIZE/OFF  NODE NAME
ruby    40134  benoit    3   PIPE 0xf3b025a6a6cd6005    16384     ->0xf3b025a6a6cd5045
ruby    40134  benoit    4   PIPE 0xf3b025a6a6cd5045    16384     ->0xf3b025a6a6cd6005
ruby    40134  benoit    5   PIPE 0xf3b025a6a6cd7805    16384     ->0xf3b025a6a6cd7145
ruby    40134  benoit    7   PIPE 0xf3b025a6a6cd7145    16384     ->0xf3b025a6a6cd7805
ruby    40134  benoit   10   PIPE 0xf3b025a6a6cd6fc5    16384     ->0xf3b025a6a6cd5a05
ruby    40134  benoit   11   PIPE 0xf3b025a6a6cd5a05    16384     ->0xf3b025a6a6cd6fc5
ruby    40144  benoit    3   PIPE 0xf3b025a6a6cd5d05    16384     ->0xf3b025a6a6cd5c45
ruby    40144  benoit    4   PIPE 0xf3b025a6a6cd5c45    16384     ->0xf3b025a6a6cd5d05
ruby    40144  benoit    5   PIPE 0xf3b025a6a6cd7085    16384     ->0xf3b025a6a6cd6785
ruby    40144  benoit    7   PIPE 0xf3b025a6a6cd6785    16384     ->0xf3b025a6a6cd7085
ruby    40144  benoit   10   PIPE 0xf3b025a6a6cd6fc5    16384     ->0xf3b025a6a6cd5a05
ruby    40144  benoit   11   PIPE 0xf3b025a6a6cd5a05    16384     ->0xf3b025a6a6cd6fc5
```

But if we look properly from the doc we can even go further.

> If status information is immediately available on an appropriate child process,
waitpid() returns this information. Otherwise, waitpid() returns immediately with
an error code indicating that the information was not available. In other words,
WNOHANG checks child processes without causing the caller to be suspended.

and as pirj mention: "With this in mind, do we really need to check that information
that waitpid returns? We don't seem to use it."

Removing "waitpid" produce the same behavior as with `WNOHANG`.

Improvements:
The bisect command request lot's of ram. The next step should be to
reduce that consumption.

Related:
 - fix: #2637
 - PR discussion: #2669
@benoittgt
Copy link
Member Author

Sorry for the delay.

I added a test on integration side. It was the easiest to implement in my case. I was able to reproduce the lock without my patch, then test pass with waitpid removed.

benoittgt added a commit that referenced this pull request Dec 6, 2019
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1

First, I've tried to play with the number of specs which led to the
interesting conclusion: **the process hangs only at 1548+ specs**.

```diff
 RSpec.describe "a bunch of nothing" do
   (0...(ENV.fetch('N', 3000).to_i)).each do |t|
     it { expect(t).to eq t }
   end
 end
```

Try to run with `N=1547` and `N=1548`.

Seems suspicious, right?

Let's add `pry-byebug` to the equation (or Gemfile).

In order it to work we need to tweak our runner code a bit:

```diff
- $stdout = $stderr = @spec_output
+ # $stdout = $stderr = @spec_output
```

After a bit of `puts` debugging I localized the problem:
[`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122).

 `Channel#send` calls `IO#write` here
 /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41:

 ```ruby
 def send(message)
   packet = Marshal.dump(message)
   @write_io.write("#{packet.bytesize}\n#{packet}")
 end
 ```

 Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**.
 This number is very important: the pipe size is only **65536** on MacOS
 (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock)
 for more).

 That makes `@write_io.write` hangs forever, because no one reads the
 buffer: we call `Channel#receive` only after `Process.waitpid(pid)`,
 thus waiting for the write operation to complete.

 -----------

 A basic proposal will be to use WNOHANG. From waitpid doc:

 > If WNOHANG was specified in options and there were no children
 > in a waitable state, then waitid() returns 0 immediately (...)

 To validate this proposal on OSX we run just before running bisect:
 `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'`

 This will give us in loop the PIPE sizes of Ruby processes. Without our
 patch we see that quickly we hit 65536 bytes on two pipes, with the patch
 we keep pipes at the right size.

```
COMMAND PID    USER     FD   TYPE DEVICE                SIZE/OFF  NODE NAME
ruby    40134  benoit    3   PIPE 0xf3b025a6a6cd6005    16384     ->0xf3b025a6a6cd5045
ruby    40134  benoit    4   PIPE 0xf3b025a6a6cd5045    16384     ->0xf3b025a6a6cd6005
ruby    40134  benoit    5   PIPE 0xf3b025a6a6cd7805    16384     ->0xf3b025a6a6cd7145
ruby    40134  benoit    7   PIPE 0xf3b025a6a6cd7145    16384     ->0xf3b025a6a6cd7805
ruby    40134  benoit   10   PIPE 0xf3b025a6a6cd6fc5    16384     ->0xf3b025a6a6cd5a05
ruby    40134  benoit   11   PIPE 0xf3b025a6a6cd5a05    16384     ->0xf3b025a6a6cd6fc5
ruby    40144  benoit    3   PIPE 0xf3b025a6a6cd5d05    16384     ->0xf3b025a6a6cd5c45
ruby    40144  benoit    4   PIPE 0xf3b025a6a6cd5c45    16384     ->0xf3b025a6a6cd5d05
ruby    40144  benoit    5   PIPE 0xf3b025a6a6cd7085    16384     ->0xf3b025a6a6cd6785
ruby    40144  benoit    7   PIPE 0xf3b025a6a6cd6785    16384     ->0xf3b025a6a6cd7085
ruby    40144  benoit   10   PIPE 0xf3b025a6a6cd6fc5    16384     ->0xf3b025a6a6cd5a05
ruby    40144  benoit   11   PIPE 0xf3b025a6a6cd5a05    16384     ->0xf3b025a6a6cd6fc5
```

But if we look properly from the doc we can even go further.

> If status information is immediately available on an appropriate child process,
waitpid() returns this information. Otherwise, waitpid() returns immediately with
an error code indicating that the information was not available. In other words,
WNOHANG checks child processes without causing the caller to be suspended.

and as pirj mention: "With this in mind, do we really need to check that information
that waitpid returns? We don't seem to use it."

Removing "waitpid" produce the same behavior as with `WNOHANG`.

Improvements:
The bisect command request lot's of ram. The next step should be to
reduce that consumption.

Related:
 - fix: #2637
 - PR discussion: #2669
Copy link
Member

@JonRowe JonRowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really great work you two, thanks for digging into this so much @benoittgt, as a test combo this looks sound, I'd just like some minor tweaks and this should be good to go :)

@benoittgt
Copy link
Member Author

benoittgt commented Dec 12, 2019

Thanks for the review. I am not about the clarity of my comment. Open for reviews.

It was a pleasure to work on this subject.

@benoittgt
Copy link
Member Author

Thanks @JonRowe. It's much better and shorter with your proposal.

@JonRowe
Copy link
Member

JonRowe commented Dec 14, 2019

Happy for you to merge @benoittgt don't forget the change log :)

benoittgt added a commit that referenced this pull request Dec 14, 2019
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1

First, I've tried to play with the number of specs which led to the
interesting conclusion: **the process hangs only at 1548+ specs**.

```diff
 RSpec.describe "a bunch of nothing" do
   (0...(ENV.fetch('N', 3000).to_i)).each do |t|
     it { expect(t).to eq t }
   end
 end
```

Try to run with `N=1547` and `N=1548`.

Seems suspicious, right?

Let's add `pry-byebug` to the equation (or Gemfile).

In order it to work we need to tweak our runner code a bit:

```diff
- $stdout = $stderr = @spec_output
+ # $stdout = $stderr = @spec_output
```

After a bit of `puts` debugging I localized the problem:
[`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122).

 `Channel#send` calls `IO#write` here
 /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41:

 ```ruby
 def send(message)
   packet = Marshal.dump(message)
   @write_io.write("#{packet.bytesize}\n#{packet}")
 end
 ```

 Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**.
 This number is very important: the pipe size is only **65536** on MacOS
 (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock)
 for more).

 That makes `@write_io.write` hangs forever, because no one reads the
 buffer: we call `Channel#receive` only after `Process.waitpid(pid)`,
 thus waiting for the write operation to complete.

 -----------

 A basic proposal will be to use WNOHANG. From waitpid doc:

 > If WNOHANG was specified in options and there were no children
 > in a waitable state, then waitid() returns 0 immediately (...)

 To validate this proposal on OSX we run just before running bisect:
 `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'`

 This will give us in loop the PIPE sizes of Ruby processes. Without our
 patch we see that quickly we hit 65536 bytes on two pipes, with the patch
 we keep pipes at the right size.

```
COMMAND PID    USER     FD   TYPE DEVICE                SIZE/OFF  NODE NAME
ruby    40134  benoit    3   PIPE 0xf3b025a6a6cd6005    16384     ->0xf3b025a6a6cd5045
ruby    40134  benoit    4   PIPE 0xf3b025a6a6cd5045    16384     ->0xf3b025a6a6cd6005
ruby    40134  benoit    5   PIPE 0xf3b025a6a6cd7805    16384     ->0xf3b025a6a6cd7145
ruby    40134  benoit    7   PIPE 0xf3b025a6a6cd7145    16384     ->0xf3b025a6a6cd7805
ruby    40134  benoit   10   PIPE 0xf3b025a6a6cd6fc5    16384     ->0xf3b025a6a6cd5a05
ruby    40134  benoit   11   PIPE 0xf3b025a6a6cd5a05    16384     ->0xf3b025a6a6cd6fc5
ruby    40144  benoit    3   PIPE 0xf3b025a6a6cd5d05    16384     ->0xf3b025a6a6cd5c45
ruby    40144  benoit    4   PIPE 0xf3b025a6a6cd5c45    16384     ->0xf3b025a6a6cd5d05
ruby    40144  benoit    5   PIPE 0xf3b025a6a6cd7085    16384     ->0xf3b025a6a6cd6785
ruby    40144  benoit    7   PIPE 0xf3b025a6a6cd6785    16384     ->0xf3b025a6a6cd7085
ruby    40144  benoit   10   PIPE 0xf3b025a6a6cd6fc5    16384     ->0xf3b025a6a6cd5a05
ruby    40144  benoit   11   PIPE 0xf3b025a6a6cd5a05    16384     ->0xf3b025a6a6cd6fc5
```

But if we look properly from the doc we can even go further.

> If status information is immediately available on an appropriate child process,
waitpid() returns this information. Otherwise, waitpid() returns immediately with
an error code indicating that the information was not available. In other words,
WNOHANG checks child processes without causing the caller to be suspended.

and as pirj mention: "With this in mind, do we really need to check that information
that waitpid returns? We don't seem to use it."

Removing "waitpid" produce the same behavior as with `WNOHANG`.

Improvements:
The bisect command request lot's of ram. The next step should be to
reduce that consumption.

Related:
 - fix: #2637
 - PR discussion: #2669
@benoittgt benoittgt merged commit 89972e6 into master Dec 14, 2019
@benoittgt benoittgt deleted the deal_with_stuck_bisect branch December 14, 2019 14:22
benoittgt added a commit that referenced this pull request Dec 14, 2019
JonRowe added a commit that referenced this pull request Dec 14, 2019
@jcabasc
Copy link

jcabasc commented Dec 20, 2019

Hey guys @benoittgt is there a special way to run this command and take advantage of this implementation? I am running into the same issue, so wanted to double-check with you. By the way, which version is this one? Thanks.

@benoittgt
Copy link
Member Author

This is the master version. We did not release a new version with the patch. You could point in your gemfile rspec-core to master.

@jcabasc
Copy link

jcabasc commented Dec 20, 2019

Thanks man @benoittgt

JonRowe pushed a commit that referenced this pull request Dec 28, 2019
Avoid bisect command to get stuck
JonRowe added a commit that referenced this pull request Dec 28, 2019
@pirj
Copy link
Member

pirj commented Jun 11, 2020

@andrykonchin brought up an interesting point that by removing Process.waitpid we're producing zombie processes in the system.

A quick search showed that to mitigate this we can use Process.detach(pid).

Maybe @ioquatix can advise here?

@JonRowe
Copy link
Member

JonRowe commented Jun 11, 2020

@andrykonchin brought up an interesting point that by removing Process.waitpid we're producing zombie processes in the system.

This probably means that we have not fixed the problem, merely ignored it.

@benoittgt
Copy link
Member Author

@andrykonchin brought up an interesting point that by removing Process.waitpid we're producing zombie processes in the system.

Ouhao. Very nice article. 👏

This is a google translation but this is written:

On the one hand, they solved the problem with deadlock , but on the other, they created a new one. Now the child process remains in the “zombie” state, since no one gets its completion status by a system call waitpid, which means leak of process descriptors. When the number of processes in the operating system reaches the limit, and it is not so big, it will not be possible to create a new child process.

Mentioned in waitpid doc:

A child that terminates, but has not been waited for becomes a "zombie". The kernel maintains a minimal set of information about the zombie process (PID, termination status, resource usage information) in order to allow the parent to later perform a wait to obtain information about the child. As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. If a parent process terminates, then its "zombie" children (if any) are adopted by init(8), which automatically performs a wait to remove the zombies.

So after few tests, we have probably zombie processes when the bisect commands run. It is not clean while the bisect command is running but it is cleaned after.
If your bisect command needs to run many step to isolate the failings specs suite and your machine doesn't have enough process available it will ended up using all your processes and the machine become unusable.

detach seems to be a good idea. I tested it on the code example below and did not hit any limitations.


I did this test, because I wanted to understand exactly what was happening and also try @pirj proposal:

  1. Create process and do not waitpid like rspec-core at the moment.
@read_io, @write_io = IO.pipe

# write into pipe some data
def run_specs(i)
  packet = '*' * i
  @write_io.write("#{packet.bytesize}\n#{packet}")
end

count = 0
(65500..69500).each do |i|
  count += 1
  # create a child process
  pid = fork { run_specs(i) }
  puts "interval: #{i}, pid is: #{pid} and parent pid is: #{Process.ppid}, count: #{count}"

  # wait for its terminating
  # Process.waitpid(pid)
  sleep 0.1

  # read result
  packet_size = Integer(@read_io.gets)
  packet = @read_io.read(packet_size)

  puts "packet size: #{packet.size}"
end
  1. Get max number of process on my mac for my user
$ sysctl kern.maxprocperuid
kern.maxprocperuid: 1418
  1. Monitor process count for my user
while true; do pstree -u <username> | wc -l ; sleep 1; clear; done

You can also monitor zombie process count while true; do ps axo pid=,stat= | awk '$2~/^Z/ { print }' | wc -l ; sleep 1; clear; done

  1. Run the ruby script. When we reach 1418 process the systems cannot create more process, it's problematic but to go there I was able to run create nearly 1000 process (that become zombie). They are cleared when the ruby script is cancel or over.

benoittgt added a commit that referenced this pull request Jun 11, 2020
If do not `waitpid` or `detach` the process become a zombie process.
As mentionned in waitpid doc:
> As long as a zombie is not removed from the system via a wait, it
will consume a slot in the kernel process table, and if this table
fills, it will not be possible to create further processes.

Related:
- #2669
- https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html
benoittgt added a commit that referenced this pull request Jun 11, 2020
If we do not `waitpid` or `detach` the bisect process become a zombie
process.

As mentionned in waitpid doc:
> As long as a zombie is not removed from the system via a wait, it
will consume a slot in the kernel process table, and if this table
fills, it will not be possible to create further processes.

Related:
- #2669
- https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html
benoittgt added a commit that referenced this pull request Jun 11, 2020
If we do not `waitpid` or `detach` the bisect process become a zombie
process.

As mentionned in waitpid doc:
> As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes.

`detach` is a good idea. From the Ruby doc:
> Some operating systems retain the status of terminated child processes until the parent collects that status (normally using some variant of wait()). If the parent never collects this status, the child stays around as a zombie process. Process::detach prevents this by setting up a separate Ruby thread whose sole job is to reap the status of the process pid when it terminates. Use detach only when you do not intend to explicitly wait for the child to terminate.

Related:
- #2669
- https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html
JonRowe pushed a commit that referenced this pull request Jun 29, 2020
If we do not `waitpid` or `detach` the bisect process become a zombie
process.

As mentionned in waitpid doc:
> As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes.

`detach` is a good idea. From the Ruby doc:
> Some operating systems retain the status of terminated child processes until the parent collects that status (normally using some variant of wait()). If the parent never collects this status, the child stays around as a zombie process. Process::detach prevents this by setting up a separate Ruby thread whose sole job is to reap the status of the process pid when it terminates. Use detach only when you do not intend to explicitly wait for the child to terminate.

Related:
- #2669
- https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html
benoittgt added a commit that referenced this pull request Aug 28, 2020
If we do not `waitpid` or `detach` the bisect process become a zombie
process.

As mentionned in waitpid doc:
> As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes.

`detach` is a good idea. From the Ruby doc:
> Some operating systems retain the status of terminated child processes until the parent collects that status (normally using some variant of wait()). If the parent never collects this status, the child stays around as a zombie process. Process::detach prevents this by setting up a separate Ruby thread whose sole job is to reap the status of the process pid when it terminates. Use detach only when you do not intend to explicitly wait for the child to terminate.

Related:
- #2669
- https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html
benoittgt added a commit that referenced this pull request Aug 28, 2020
If we do not `waitpid` or `detach` the bisect process become a zombie
process.

As mentionned in waitpid doc:
> As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes.

`detach` is a good idea. From the Ruby doc:
> Some operating systems retain the status of terminated child processes until the parent collects that status (normally using some variant of wait()). If the parent never collects this status, the child stays around as a zombie process. Process::detach prevents this by setting up a separate Ruby thread whose sole job is to reap the status of the process pid when it terminates. Use detach only when you do not intend to explicitly wait for the child to terminate.

Related:
- #2669
- https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html
MatheusRich pushed a commit to MatheusRich/rspec-core that referenced this pull request Oct 30, 2020
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1

First, I've tried to play with the number of specs which led to the
interesting conclusion: **the process hangs only at 1548+ specs**.

```diff
 RSpec.describe "a bunch of nothing" do
   (0...(ENV.fetch('N', 3000).to_i)).each do |t|
     it { expect(t).to eq t }
   end
 end
```

Try to run with `N=1547` and `N=1548`.

Seems suspicious, right?

Let's add `pry-byebug` to the equation (or Gemfile).

In order it to work we need to tweak our runner code a bit:

```diff
- $stdout = $stderr = @spec_output
+ # $stdout = $stderr = @spec_output
```

After a bit of `puts` debugging I localized the problem:
[`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122).

 `Channel#send` calls `IO#write` here
 /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41:

 ```ruby
 def send(message)
   packet = Marshal.dump(message)
   @write_io.write("#{packet.bytesize}\n#{packet}")
 end
 ```

 Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**.
 This number is very important: the pipe size is only **65536** on MacOS
 (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock)
 for more).

 That makes `@write_io.write` hangs forever, because no one reads the
 buffer: we call `Channel#receive` only after `Process.waitpid(pid)`,
 thus waiting for the write operation to complete.

 -----------

 A basic proposal will be to use WNOHANG. From waitpid doc:

 > If WNOHANG was specified in options and there were no children
 > in a waitable state, then waitid() returns 0 immediately (...)

 To validate this proposal on OSX we run just before running bisect:
 `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'`

 This will give us in loop the PIPE sizes of Ruby processes. Without our
 patch we see that quickly we hit 65536 bytes on two pipes, with the patch
 we keep pipes at the right size.

```
COMMAND PID    USER     FD   TYPE DEVICE                SIZE/OFF  NODE NAME
ruby    40134  benoit    3   PIPE 0xf3b025a6a6cd6005    16384     ->0xf3b025a6a6cd5045
ruby    40134  benoit    4   PIPE 0xf3b025a6a6cd5045    16384     ->0xf3b025a6a6cd6005
ruby    40134  benoit    5   PIPE 0xf3b025a6a6cd7805    16384     ->0xf3b025a6a6cd7145
ruby    40134  benoit    7   PIPE 0xf3b025a6a6cd7145    16384     ->0xf3b025a6a6cd7805
ruby    40134  benoit   10   PIPE 0xf3b025a6a6cd6fc5    16384     ->0xf3b025a6a6cd5a05
ruby    40134  benoit   11   PIPE 0xf3b025a6a6cd5a05    16384     ->0xf3b025a6a6cd6fc5
ruby    40144  benoit    3   PIPE 0xf3b025a6a6cd5d05    16384     ->0xf3b025a6a6cd5c45
ruby    40144  benoit    4   PIPE 0xf3b025a6a6cd5c45    16384     ->0xf3b025a6a6cd5d05
ruby    40144  benoit    5   PIPE 0xf3b025a6a6cd7085    16384     ->0xf3b025a6a6cd6785
ruby    40144  benoit    7   PIPE 0xf3b025a6a6cd6785    16384     ->0xf3b025a6a6cd7085
ruby    40144  benoit   10   PIPE 0xf3b025a6a6cd6fc5    16384     ->0xf3b025a6a6cd5a05
ruby    40144  benoit   11   PIPE 0xf3b025a6a6cd5a05    16384     ->0xf3b025a6a6cd6fc5
```

But if we look properly from the doc we can even go further.

> If status information is immediately available on an appropriate child process,
waitpid() returns this information. Otherwise, waitpid() returns immediately with
an error code indicating that the information was not available. In other words,
WNOHANG checks child processes without causing the caller to be suspended.

and as pirj mention: "With this in mind, do we really need to check that information
that waitpid returns? We don't seem to use it."

Removing "waitpid" produce the same behavior as with `WNOHANG`.

Improvements:
The bisect command request lot's of ram. The next step should be to
reduce that consumption.

Related:
 - fix: rspec#2637
 - PR discussion: rspec#2669
MatheusRich pushed a commit to MatheusRich/rspec-core that referenced this pull request Oct 30, 2020
Avoid bisect command to get stuck
MatheusRich pushed a commit to MatheusRich/rspec-core that referenced this pull request Oct 30, 2020
MatheusRich pushed a commit to MatheusRich/rspec-core that referenced this pull request Oct 30, 2020
MatheusRich pushed a commit to MatheusRich/rspec-core that referenced this pull request Oct 30, 2020
If we do not `waitpid` or `detach` the bisect process become a zombie
process.

As mentionned in waitpid doc:
> As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes.

`detach` is a good idea. From the Ruby doc:
> Some operating systems retain the status of terminated child processes until the parent collects that status (normally using some variant of wait()). If the parent never collects this status, the child stays around as a zombie process. Process::detach prevents this by setting up a separate Ruby thread whose sole job is to reap the status of the process pid when it terminates. Use detach only when you do not intend to explicitly wait for the child to terminate.

Related:
- rspec#2669
- https://andrykonchin.github.io/rails/2019/12/25/deadlock-in-rspec.html
yujinakayama pushed a commit to yujinakayama/rspec-monorepo that referenced this pull request Oct 6, 2021
Avoid bisect command to get stuck

---
This commit was imported from rspec/rspec-core@1c31fc4.
yujinakayama pushed a commit to yujinakayama/rspec-monorepo that referenced this pull request Oct 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

--bisect deadlocks when reporting results
4 participants