
exec.Cmd.Stop() causes a panic when it tries to SIGKILL; fixing it is insufficient due to a data race #291

Open
dsharp-pivotal opened this issue Sep 2, 2023 · 4 comments

dsharp-pivotal commented Sep 2, 2023

What happened:

If the command run by "utils/exec".New().Command() ignores SIGTERM and does not exit, or takes longer than 10 seconds to exit, calling cmd.Stop() results in a panic after 10 seconds.

What you expected to happen:

After SIGTERM fails to stop the process, Stop should send SIGKILL.

How to reproduce it:

I wrote a test that demonstrates the issue. It can be added to exec_test.go:

// Requires "bufio", "os", and "testing" among the imports of exec_test.go.
func TestStopWhenSigtermIgnored(t *testing.T) {
	// Create a process that will ignore SIGTERM.
	pr, pw, err := os.Pipe()
	if err != nil {
		t.Fatalf("pipe: %v", err)
	}
	cmd := New().Command("/bin/sh", "-c", `trap "" TERM && echo "trap" && exec sleep 15`)
	cmd.SetStdin(nil)
	cmd.SetStdout(pw)
	cmd.SetStderr(pw)
	if err := cmd.Start(); err != nil {
		t.Fatalf("expected Start() not to error, got: %v", err)
	}

	// Wait for SIGTERM's disposition to be set before sending the signal.
	r := bufio.NewReader(pr)
	var line string
	if line, err = r.ReadString('\n'); err != nil {
		t.Fatalf("ReadString: %v", err)
	}
	if line != "trap\n" {
		t.Errorf("unexpected write from shell script: %v", line)
	}

	// No race on process state after waiting 10 seconds to send SIGKILL.
	cmd.Stop()
	if err := cmd.Wait(); err == nil || err.Error() != "signal: killed" {
		t.Errorf("expected Wait() to receive an error about a signal, got: %v", err)
	}
}
→ go test -v ./exec/ -test.run ^TestStopWhenSigtermIgnored$
=== RUN   TestStopWhenSigtermIgnored
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1109a40]

goroutine 19 [running]:
os.(*ProcessState).exited(...)
	/usr/local/Cellar/go/1.20.6/libexec/src/os/exec_posix.go:83
os.(*ProcessState).Exited(...)
	/usr/local/Cellar/go/1.20.6/libexec/src/os/exec.go:155
k8s.io/utils/exec.(*cmdWrapper).Stop.func1()
	/Users/dsharp/workspace/kubernetes-utils/exec/exec.go:189 +0x20
created by time.goFunc
	/usr/local/Cellar/go/1.20.6/libexec/src/time/sleep.go:176 +0x32
FAIL	k8s.io/utils/exec	10.188s
FAIL

Anything else we need to know?:

The bug seems to stem from a misconception about "os/exec".Cmd.ProcessState.Exited():

// Exited reports whether the program has exited.
// On Unix systems this reports true if the program exited due to calling exit,
// but false if the program terminated due to a signal.

However:

	// ProcessState contains information about an exited process.
	// If the process was started successfully, Wait or Run will
	// populate its ProcessState when the command completes.

That is, Cmd.ProcessState is nil until Cmd.Wait() successfully returns. Exited() is really only useful for distinguishing whether the process called exit() (or returned normally) or was killed by a signal.
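
A standalone sketch (plain os/exec, independent of this library) illustrating both points:

package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("sleep", "60")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	// ProcessState is nil until Wait returns, so calling
	// cmd.ProcessState.Exited() here would panic.
	fmt.Println(cmd.ProcessState == nil) // true

	cmd.Process.Signal(syscall.SIGKILL)
	fmt.Println(cmd.Wait()) // signal: killed

	// ProcessState is populated now, but Exited() reports false
	// because the process was terminated by a signal rather than
	// by calling exit.
	fmt.Println(cmd.ProcessState.Exited()) // false
}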


I also have a patch to address the panic:

 → git diff
diff --git a/exec/exec.go b/exec/exec.go
index d9c91e3..fd9ead2 100644
--- a/exec/exec.go
+++ b/exec/exec.go
@@ -186,7 +186,7 @@ func (cmd *cmdWrapper) Stop() {
        c.Process.Signal(syscall.SIGTERM)

        time.AfterFunc(10*time.Second, func() {
-               if !c.ProcessState.Exited() {
+               if c.ProcessState == nil {
                        c.Process.Signal(syscall.SIGKILL)
                }
        })

However, I'm concerned about a possible data race on the osexec.Cmd object between the time.AfterFunc goroutine and whichever goroutine calls cmd.Wait(). Running the test under the Go race detector after the patch above has been applied does not report a race. But if the patch is applied with the inverted condition (if c.ProcessState != nil {), the race detector is triggered. (It's not clear to me why detection is sensitive to inverting the condition; presumably the detector only flags a race when it actually observes both conflicting accesses at runtime, which depends on timing.)

→ go test -race -count=1 -v ./exec/ -test.run ^TestStopWhenSigtermIgnored$
=== RUN   TestStopWhenSigtermIgnored
==================
WARNING: DATA RACE
Write at 0x00c0001b20a8 by goroutine 6:
  os/exec.(*Cmd).Wait()
      /usr/local/Cellar/go/1.20.6/libexec/src/os/exec/exec.go:894 +0x1b7
  k8s.io/utils/exec.(*cmdWrapper).Wait()
      /Users/dsharp/workspace/kubernetes-utils/exec/exec.go:157 +0x30
  k8s.io/utils/exec.TestStopWhenSigtermIgnored()
      /Users/dsharp/workspace/kubernetes-utils/exec/exec_test.go:160 +0x4ee
  testing.tRunner()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1576 +0x216
  testing.(*T).Run.func1()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1629 +0x47

Previous read at 0x00c0001b20a8 by goroutine 8:
  k8s.io/utils/exec.(*cmdWrapper).Stop.func1()
      /Users/dsharp/workspace/kubernetes-utils/exec/exec.go:189 +0x44

Goroutine 6 (running) created at:
  testing.(*T).Run()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1629 +0x805
  testing.runTests.func1()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:2036 +0x8d
  testing.tRunner()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1576 +0x216
  testing.runTests()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:2034 +0x87c
  testing.(*M).Run()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1906 +0xb44
  main.main()
      _testmain.go:71 +0x2e9

Goroutine 8 (finished) created at:
  time.goFunc()
      /usr/local/Cellar/go/1.20.6/libexec/src/time/sleep.go:176 +0x47
==================
    exec_test.go:161: expected Wait() to receive an error about a signal, got: <nil>
    testing.go:1446: race detected during execution of test
--- FAIL: TestStopWhenSigtermIgnored (15.02s)
=== NAME
    testing.go:1446: race detected during execution of test
FAIL
FAIL	k8s.io/utils/exec	15.218s
FAIL

If sending SIGKILL is to depend on the termination state of cmd (and it seems like it should), then there needs to be some kind of synchronization point between the time.AfterFunc goroutine and the goroutine calling Wait().
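
One possible shape for that synchronization (a sketch only, not a proposed final API; the stoppableCmd type and its methods are hypothetical): have Wait() close a channel once the process has been reaped, and have the kill timer consult that channel instead of reading c.ProcessState directly:

package stopcmd

import (
	"os/exec"
	"sync"
	"syscall"
	"time"
)

type stoppableCmd struct {
	*exec.Cmd
	once sync.Once
	done chan struct{} // closed by Wait once the process has been reaped
}

func command(name string, args ...string) *stoppableCmd {
	return &stoppableCmd{Cmd: exec.Command(name, args...), done: make(chan struct{})}
}

func (c *stoppableCmd) Wait() error {
	err := c.Cmd.Wait()
	c.once.Do(func() { close(c.done) })
	return err
}

func (c *stoppableCmd) Stop() {
	c.Process.Signal(syscall.SIGTERM)
	time.AfterFunc(10*time.Second, func() {
		select {
		case <-c.done:
			// Wait() has returned; nothing left to kill. (Even if we
			// raced here, os.Process.Signal returns ErrProcessDone
			// after Wait rather than signaling a recycled pid.)
		default:
			c.Process.Signal(syscall.SIGKILL)
		}
	})
}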

It's possible that Stop() should be removed. Searching with Sourcegraph, I could find only one user of "utils/exec".Cmd.Stop() in public repositories: kubernetes/kubernetes/pkg/volume/flexvolume/driver-call.go. That code itself lacks sufficient synchronization, and it lives in a deprecated component.
https://github.com/kubernetes/kubernetes/blob/99190634ab252604a4496882912ac328542d649d/pkg/volume/flexvolume/driver-call.go#L131
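
For what it's worth, if Stop() were removed, callers on Go 1.20+ could get the same SIGTERM-then-SIGKILL escalation with proper internal synchronization from os/exec itself, via the Cmd.Cancel and Cmd.WaitDelay fields. A sketch of what a caller could do instead:

package main

import (
	"context"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	cmd := exec.CommandContext(ctx, "/bin/sh", "-c", `trap "" TERM && exec sleep 15`)
	// When ctx is canceled, send SIGTERM instead of the default SIGKILL.
	cmd.Cancel = func() error { return cmd.Process.Signal(syscall.SIGTERM) }
	// If the process is still running this long after cancellation,
	// os/exec escalates to SIGKILL itself.
	cmd.WaitDelay = 10 * time.Second

	if err := cmd.Start(); err != nil {
		panic(err)
	}
	cancel()       // the equivalent of Stop(): ask the process to exit
	_ = cmd.Wait() // returns once the process exits or is killed
}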

Environment:

  • Kubernetes version (use kubectl version): n/a
  • OS (e.g. from /etc/os-release): MacOS (dev environment)
  • Kernel (e.g. uname -a): Darwin
  • Install tools:
  • Others:
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2024
@dsharp-pivotal (Author)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 29, 2024
@dsharp-pivotal (Author)

I re-tested this, and the panic is still present. While reviewing, I found some errors in my analysis and patch, so I updated the description above.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2024