
kubelet: Panic on portforward streams #102480

Closed
mikkeloscar opened this issue Jun 1, 2021 · 9 comments · Fixed by #102489
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@mikkeloscar
Contributor

What happened:

Since upgrading to Kubernetes v1.19.11 (from v1.19.10) we see occasional panics in kubelet related to portforward streams:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x17e78f8]
goroutine 1168860 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/httpstream/spdy.(*connection).RemoveStreams(0xc002b30080, 0xc0032f7160, 0x2, 0x2)
 /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/httpstream/spdy/connection.go:101 +0x78
k8s.io/kubernetes/pkg/kubelet/cri/streaming/portforward.(*httpStreamHandler).removeStreamPair(0xc00349c310, 0x6c657a8, 0x1)
 /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/cri/streaming/portforward/httpstream.go:168 +0x16d
k8s.io/kubernetes/pkg/kubelet/cri/streaming/portforward.(*httpStreamHandler).monitorStreamPair(0xc00349c310, 0xc000f6ef50, 0xc00369fd40)
 /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/cri/streaming/portforward/httpstream.go:148 +0x1fe
created by k8s.io/kubernetes/pkg/kubelet/cri/streaming/portforward.(*httpStreamHandler).run
 /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/cri/streaming/portforward/httpstream.go:227 +0x58a

This issue seems to have been introduced by #99839

What you expected to happen:

kubelet should not panic and crash.

How to reproduce it (as minimally and precisely as possible):

We have not been able to reproduce it on demand, but it has happened 41 times in the last 2 weeks across 7 clusters (~100 nodes).

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.11", GitCommit:"c6a2f08fc4378c5381dd948d9ad9d1080e3e6b33", GitTreeState:"clean", BuildDate:"2021-05-12T12:19:22Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: Self managed clusters on AWS.
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

/sig node

/cc @saschagrunert

@mikkeloscar mikkeloscar added the kind/bug Categorizes issue or PR as related to a bug. label Jun 1, 2021
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 1, 2021
@saschagrunert
Member

saschagrunert commented Jun 1, 2021

Hey @mikkeloscar, thank you for the report! 🙏

Uhh, that's unfortunate. The panic occurs there, at connection.go line 101:

func (c *connection) RemoveStreams(streams ...httpstream.Stream) {
	c.streamLock.Lock()
	for _, stream := range streams {
		delete(c.streams, stream.Identifier())
	}
	c.streamLock.Unlock()
}

  • It looks like c != nil because c.streamLock.Lock() works. The stack trace indicates the same: RemoveStreams(0xc002b30080, …
  • A delete() on nil is a no-op, so c.streams == nil would not be harmful.
  • stream also does not seem to be nil; a crash inside stream.Identifier() would have been visible in the stack trace.

It may be possible that something assigned a typed nil value to the stream interface: a check for stream == nil would then return false, while calling a method on it still causes a segmentation fault. That alone does not explain why the crashing call is not visible in the trace, though.
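
For illustration, a minimal program reproducing that typed-nil scenario could look like the following; the stream interface and fakeStream type are made up for this example and are not the kubelet types:

package main

import "fmt"

type stream interface {
	Identifier() uint32
}

type fakeStream struct{ id uint32 }

// Identifier dereferences the receiver, so calling it on a nil *fakeStream panics.
func (s *fakeStream) Identifier() uint32 { return s.id }

func main() {
	var p *fakeStream           // nil pointer
	var s stream = p            // the interface now holds (type *fakeStream, value nil)
	fmt.Println(s == nil)       // false: the interface itself is not nil
	fmt.Println(s.Identifier()) // panics: invalid memory address or nil pointer dereference
}

Running this prints false and then panics inside Identifier, which matches the point above that a typed nil by itself would still be expected to leave the crashing method visible in the trace.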

@mikkeloscar
Contributor Author

With a quick look at the streaming code, it's likely triggered by a timeout while waiting for streams: https://github.com/saschagrunert/kubernetes/blob/v1.19.11/pkg/kubelet/cri/streaming/portforward/httpstream.go#L141

If that case is hit, it goes down to h.removeStreamPair(p.requestID), and if the timeout occurs before p.complete, then either p.errorStream == nil or p.dataStream == nil: https://github.com/saschagrunert/kubernetes/blob/c6a2f08fc4378c5381dd948d9ad9d1080e3e6b33/pkg/kubelet/cri/streaming/portforward/httpstream.go#L299

And then it attempts to remove the streams where at least one is nil: https://github.com/saschagrunert/kubernetes/blob/c6a2f08fc4378c5381dd948d9ad9d1080e3e6b33/pkg/kubelet/cri/streaming/portforward/httpstream.go#L168

So it probably makes sense to check the streams for nil either before passing them to RemoveStreams or inside the RemoveStreams method where it crashes (see the sketch at the end of this comment).

This theory lines up with our kubelet logs, where I can see this just before the panic (I didn't notice it before):

E0531 18:35:34.270668   98804 httpstream.go:143] (conn=&{0xc006a109a0 map[1479:0xc00582a0a0] {0 0} 0x1d77000}, request=370) timed out waiting for streams
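
For illustration, the in-method guard suggested above might look roughly like this; it is only a sketch and not necessarily the exact change that landed in #102489:

func (c *connection) RemoveStreams(streams ...httpstream.Stream) {
	c.streamLock.Lock()
	for _, stream := range streams {
		// The stream can be nil if the pair timed out before both streams
		// were received, so skip it instead of dereferencing it.
		if stream == nil {
			continue
		}
		delete(c.streams, stream.Identifier())
	}
	c.streamLock.Unlock()
}

The equivalent check could also live in httpstream.go before the call to RemoveStreams. Note that this only catches plain nil interface values (e.g. a never-assigned p.errorStream or p.dataStream); a typed nil stored in the interface, as speculated earlier in the thread, would slip past the == nil comparison.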

@saschagrunert
Member

Thank you for the summary @mikkeloscar, working on a fix in #102489

@saschagrunert
Member

/triage accepted
/assign

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 1, 2021
@saschagrunert
Member

@mikkeloscar do you use dockershim within your cluster?

@mikkeloscar
Contributor Author

@saschagrunert Yes, we still use dockershim.

@saschagrunert
Member

@saschagrunert Yes, we still use dockershim.

Good, that's also why the panic happens within the kubelet.

@aauren
Contributor

aauren commented Jun 9, 2021

Just to make sure I understand the scope of this issue, this only affects things like kubectl port-forward or other things that use the /portforward API, correct?

@saschagrunert
Member

Just to make sure I understand the scope of this issue, this only affects things like kubectl port-forward or other things that use the /portforward API, correct?

Exactly 👍
