
exec.Cmd.Stop() causes a panic when it tries to SIGKILL; fixing it is insufficient due to a data race #291

Open
dsharp-pivotal opened this issue Sep 2, 2023 · 4 comments

dsharp-pivotal commented Sep 2, 2023

What happened:

If the command run by "utils/exec".New().Command() ignores SIGTERM and does not exit, or takes longer than 10 seconds to exit, calling cmd.Stop() results in a panic after 10 seconds.

What you expected to happen:

After SIGTERM fails to stop the process, Stop should send SIGKILL.

How to reproduce it:

I wrote a test that demonstrates the issue. It can be added to exec_test.go:

// Requires "bufio", "os", and "testing" among the imports of exec_test.go.
func TestStopWhenSigtermIgnored(t *testing.T) {
	// Create a process that will ignore SIGTERM.
	pr, pw, err := os.Pipe()
	if err != nil {
		t.Fatalf("pipe: %v", err)
	}
	cmd := New().Command("/bin/sh", "-c", `trap "" TERM && echo "trap" && exec sleep 15`)
	cmd.SetStdin(nil)
	cmd.SetStdout(pw)
	cmd.SetStderr(pw)
	if err := cmd.Start(); err != nil {
		t.Fatalf("expected Start() not to error, got: %v", err)
	}

	// Wait for SIGTERM's disposition to be set before sending the signal.
	r := bufio.NewReader(pr)
	var line string
	if line, err = r.ReadString('\n'); err != nil {
		t.Fatalf("ReadString: %v", err)
	}
	if line != "trap\n" {
		t.Errorf("unexpected write from shell script: %v", line)
	}

	// No race on process state after waiting 10 seconds to send SIGKILL.
	cmd.Stop()
	if err := cmd.Wait(); err == nil || err.Error() != "signal: killed" {
		t.Errorf("expected Wait() to receive an error about a signal, got: %v", err)
	}
}
→ go test -v ./exec/ -test.run ^TestStopWhenSigtermIgnored$
=== RUN   TestStopWhenSigtermIgnored
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1109a40]

goroutine 19 [running]:
os.(*ProcessState).exited(...)
	/usr/local/Cellar/go/1.20.6/libexec/src/os/exec_posix.go:83
os.(*ProcessState).Exited(...)
	/usr/local/Cellar/go/1.20.6/libexec/src/os/exec.go:155
k8s.io/utils/exec.(*cmdWrapper).Stop.func1()
	/Users/dsharp/workspace/kubernetes-utils/exec/exec.go:189 +0x20
created by time.goFunc
	/usr/local/Cellar/go/1.20.6/libexec/src/time/sleep.go:176 +0x32
FAIL	k8s.io/utils/exec	10.188s
FAIL

Anything else we need to know?:

The bug seems to stem from a misconception about "os/exec".Cmd.ProcessState.Exited():

// Exited reports whether the program has exited.
// On Unix systems this reports true if the program exited due to calling exit,
// but false if the program terminated due to a signal.

However:

	// ProcessState contains information about an exited process.
	// If the process was started successfully, Wait or Run will
	// populate its ProcessState when the command completes.

That is, Cmd.ProcessState is nil until Cmd.Wait() successfully returns. Exited() is really only useful for distinguishing whether the process called exit() (or returned normally) or was killed by a signal.
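
A standalone sketch (plain os/exec, independent of this library) illustrating both points:

package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("sleep", "60")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	// ProcessState is nil until Wait returns, so calling
	// cmd.ProcessState.Exited() here would panic.
	fmt.Println(cmd.ProcessState == nil) // true

	cmd.Process.Signal(syscall.SIGKILL)
	fmt.Println(cmd.Wait()) // signal: killed

	// ProcessState is populated now, but Exited() reports false
	// because the process was terminated by a signal rather than
	// by calling exit.
	fmt.Println(cmd.ProcessState.Exited()) // false
}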


I also have a patch to address the panic:

 → git diff
diff --git a/exec/exec.go b/exec/exec.go
index d9c91e3..fd9ead2 100644
--- a/exec/exec.go
+++ b/exec/exec.go
@@ -186,7 +186,7 @@ func (cmd *cmdWrapper) Stop() {
        c.Process.Signal(syscall.SIGTERM)

        time.AfterFunc(10*time.Second, func() {
-               if !c.ProcessState.Exited() {
+               if c.ProcessState == nil {
                        c.Process.Signal(syscall.SIGKILL)
                }
        })

However, I'm concerned about a possible data race on the osexec.Cmd object between the time.AfterFunc goroutine and whichever goroutine calls cmd.Wait(). Running the test under the Go race detector after the patch above has been applied does not report a race. But if the patch is applied with the inverted condition (if c.ProcessState != nil {), the race detector is triggered. (It's not clear to me why detection is sensitive to inverting the condition; presumably the detector only flags a race when it actually observes both conflicting accesses at runtime, which depends on timing.)

→ go test -race -count=1 -v ./exec/ -test.run ^TestStopWhenSigtermIgnored$
=== RUN   TestStopWhenSigtermIgnored
==================
WARNING: DATA RACE
Write at 0x00c0001b20a8 by goroutine 6:
  os/exec.(*Cmd).Wait()
      /usr/local/Cellar/go/1.20.6/libexec/src/os/exec/exec.go:894 +0x1b7
  k8s.io/utils/exec.(*cmdWrapper).Wait()
      /Users/dsharp/workspace/kubernetes-utils/exec/exec.go:157 +0x30
  k8s.io/utils/exec.TestStopWhenSigtermIgnored()
      /Users/dsharp/workspace/kubernetes-utils/exec/exec_test.go:160 +0x4ee
  testing.tRunner()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1576 +0x216
  testing.(*T).Run.func1()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1629 +0x47

Previous read at 0x00c0001b20a8 by goroutine 8:
  k8s.io/utils/exec.(*cmdWrapper).Stop.func1()
      /Users/dsharp/workspace/kubernetes-utils/exec/exec.go:189 +0x44

Goroutine 6 (running) created at:
  testing.(*T).Run()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1629 +0x805
  testing.runTests.func1()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:2036 +0x8d
  testing.tRunner()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1576 +0x216
  testing.runTests()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:2034 +0x87c
  testing.(*M).Run()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1906 +0xb44
  main.main()
      _testmain.go:71 +0x2e9

Goroutine 8 (finished) created at:
  time.goFunc()
      /usr/local/Cellar/go/1.20.6/libexec/src/time/sleep.go:176 +0x47
==================
    exec_test.go:161: expected Wait() to receive an error about a signal, got: <nil>
    testing.go:1446: race detected during execution of test
--- FAIL: TestStopWhenSigtermIgnored (15.02s)
=== NAME
    testing.go:1446: race detected during execution of test
FAIL
FAIL	k8s.io/utils/exec	15.218s
FAIL

If sending SIGKILL is to depend on the termination state of cmd (and it seems like it should), then there needs to be some kind of synchronization point between the time.AfterFunc goroutine and the goroutine calling Wait().
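
One possible shape for that synchronization (a sketch only, not a proposed final API; the stoppableCmd type and its methods are hypothetical): have Wait() close a channel once the process has been reaped, and have the kill timer consult that channel instead of reading c.ProcessState directly:

package stopcmd

import (
	"os/exec"
	"sync"
	"syscall"
	"time"
)

type stoppableCmd struct {
	*exec.Cmd
	once sync.Once
	done chan struct{} // closed by Wait once the process has been reaped
}

func command(name string, args ...string) *stoppableCmd {
	return &stoppableCmd{Cmd: exec.Command(name, args...), done: make(chan struct{})}
}

func (c *stoppableCmd) Wait() error {
	err := c.Cmd.Wait()
	c.once.Do(func() { close(c.done) })
	return err
}

func (c *stoppableCmd) Stop() {
	c.Process.Signal(syscall.SIGTERM)
	time.AfterFunc(10*time.Second, func() {
		select {
		case <-c.done:
			// Wait() has returned; nothing left to kill. (Even if we
			// raced here, os.Process.Signal returns ErrProcessDone
			// after Wait rather than signaling a recycled pid.)
		default:
			c.Process.Signal(syscall.SIGKILL)
		}
	})
}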

It's possible that Stop() should be removed. Searching with Sourcegraph, I could find only one user of "utils/exec".Cmd.Stop() in public repositories: kubernetes/kubernetes/pkg/volume/flexvolume/driver-call.go. That code itself lacks sufficient synchronization, and it lives in a deprecated component.
https://github.com/kubernetes/kubernetes/blob/99190634ab252604a4496882912ac328542d649d/pkg/volume/flexvolume/driver-call.go#L131
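
For what it's worth, if Stop() were removed, callers on Go 1.20+ could get the same SIGTERM-then-SIGKILL escalation with proper internal synchronization from os/exec itself, via the Cmd.Cancel and Cmd.WaitDelay fields. A sketch of what a caller could do instead:

package main

import (
	"context"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	cmd := exec.CommandContext(ctx, "/bin/sh", "-c", `trap "" TERM && exec sleep 15`)
	// When ctx is canceled, send SIGTERM instead of the default SIGKILL.
	cmd.Cancel = func() error { return cmd.Process.Signal(syscall.SIGTERM) }
	// If the process is still running this long after cancellation,
	// os/exec escalates to SIGKILL itself.
	cmd.WaitDelay = 10 * time.Second

	if err := cmd.Start(); err != nil {
		panic(err)
	}
	cancel()       // the equivalent of Stop(): ask the process to exit
	_ = cmd.Wait() // returns once the process exits or is killed
}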

Environment:

  • Kubernetes version (use kubectl version): n/a
  • OS (e.g. from /etc/os-release): MacOS (dev environment)
  • Kernel (e.g. uname -a): Darwin
  • Install tools:
  • Others:
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2024
@dsharp-pivotal (Author)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 29, 2024
@dsharp-pivotal (Author)

I re-tested this, and the panic is still present. While reviewing, I found some errors in my analysis and patch, so I updated the description above.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2024