data race in nodeshutdown tests #110854
Comments
@kerthcet: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
cc @pohly: we introduced contextual logging a few weeks ago, and I have no idea whether this is related.
Contextual logging is involved here because it enables log output through testing.T.Log. It's interesting that the data race seems to be inside testing.T itself. It does mutex locking, but apparently the access to the mutex itself is racing?!

/assign
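To illustrate the failure mode, here is a minimal hypothetical reproducer (not the actual nodeshutdown code): a goroutine that outlives its test and keeps writing through testing.T.

```go
package example_test

import (
	"testing"
	"time"
)

// TestLeaky leaks a goroutine that logs through testing.T after the test
// has completed; testing.T must not be used once the test returns.
func TestLeaky(t *testing.T) {
	go func() {
		time.Sleep(10 * time.Millisecond)
		// Typically panics with "Log in goroutine after TestLeaky has
		// completed" and can show up as a race under `go test -race`.
		t.Log("still running")
	}()
}

// TestOther only keeps the test binary alive long enough for the leaked
// goroutine from TestLeaky to fire.
func TestOther(t *testing.T) {
	time.Sleep(50 * time.Millisecond)
}
```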
The change which triggered this is from PR #110504.
The main problem seems to be that the
On the one hand, keeping goroutines running in the background after test completion is bad and should be avoided. On the other hand, this used to be okay(ish) in the past (it just made the global log output even less useful) and cannot always be avoided. What could be done to solve this is to "mute" testinglogger once the test completes. But that depends on a klog API extension.
When testing.T.Log gets called after the test has completed, it panics. There's also a data race (kubernetes/kubernetes#110854). Normally that should never happen because tests should ensure that all goroutines have stopped before returning. But sometimes it is not possible to do that. For those cases, "defer Stop(logger)" may be added to a test. When called, it will cause all future usage of the testing.T instance to be skipped.
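A sketch of how the API proposed in kubernetes/klog#337 would be used, assuming a ktesting-style setup (startCodeUnderTest is a hypothetical stand-in; per a later comment, the merged version registers a cleanup function automatically instead of requiring an explicit call):

```go
package nodeshutdown_test

import (
	"testing"

	"github.com/go-logr/logr"
	"k8s.io/klog/v2/ktesting"
)

// startCodeUnderTest is a hypothetical stand-in for code that may leave
// logging goroutines running after the test returns.
func startCodeUnderTest(logger logr.Logger) {
	go logger.Info("background work")
}

func TestWithLeakedGoroutines(t *testing.T) {
	logger, _ := ktesting.NewTestContext(t)
	// Per the proposal: after Stop, all future usage of the testing.T
	// instance by this logger is skipped, so late log calls from leaked
	// goroutines no longer race with or panic inside testing.T.
	defer ktesting.Stop(logger) // proposed API, see kubernetes/klog#337
	startCodeUnderTest(logger)
}
```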
Hi @pohly, isn't the systemDbus defer function called earlier (kubernetes/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux_test.go, lines 380 to 390 in f4abde9),
so that the following line gets executed immediately? Hence the testing.T instance became unusable before the goroutines inside
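A standalone sketch of the ordering being described (hypothetical code, not the actual test): a deferred cleanup runs as soon as the surrounding function returns, while goroutines started inside it may still be running.

```go
package main

import (
	"fmt"
	"time"
)

func run() {
	// Deferred functions run when run() returns (in LIFO order), not when
	// background goroutines finish. In a test, this is the point where a
	// deferred cleanup restores state and testing.T becomes unusable.
	defer fmt.Println("deferred cleanup: runs at return")
	go func() {
		time.Sleep(20 * time.Millisecond)
		fmt.Println("background goroutine: still running after cleanup")
	}()
	fmt.Println("function body done")
}

func main() {
	run()
	time.Sleep(50 * time.Millisecond) // keep the process alive so the goroutine can fire
}
```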
Also, I saw your PR kubernetes/klog#337 above.
We have to add
Scratch that. The version of TestRestart that you linked to doesn't use ktesting, but master does.
Hmm, does this test also have other data races? After adding
That's a race around setting and calling

When checking out the version prior to my contextual logging change, I get:

```console
$ git checkout 65385fec209fb5a6d549129fb03cd529c25a2cff~
Previous HEAD position was 10bea49c12d Merge pull request #110140 from marosset/hpc-sandbox-config-fixes
$ go test -count=5 -race ./pkg/kubelet/nodeshutdown/
...
--- FAIL: TestRestart (0.00s)
panic: close of closed channel [recovered]
	panic: close of closed channel

goroutine 424 [running]:
testing.tRunner.func1.2({0x25cb900, 0x32fe1d0})
	/nvme/gopath/go-1.18.1/src/testing/testing.go:1389 +0x366
testing.tRunner.func1()
	/nvme/gopath/go-1.18.1/src/testing/testing.go:1392 +0x5d2
panic({0x25cb900, 0x32fe1d0})
	/nvme/gopath/go-1.18.1/src/runtime/panic.go:844 +0x258
k8s.io/kubernetes/pkg/kubelet/nodeshutdown.TestRestart(0xc000583ba0)
	/nvme/gopath/src/k8s.io/kubernetes/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux_test.go:422 +0x686
testing.tRunner(0xc000583ba0, 0x28f3468)
	/nvme/gopath/go-1.18.1/src/testing/testing.go:1439 +0x214
created by testing.(*T).Run
	/nvme/gopath/go-1.18.1/src/testing/testing.go:1486 +0x725
FAIL	k8s.io/kubernetes/pkg/kubelet/nodeshutdown	5.162s
FAIL
```

Trying again gives me the same race I saw earlier:
But a better solution would be to shut down all goroutines cleanly...
That would probably fix the non-logging race, too. I bet it is the goroutine from one test which reads
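The suspected pattern, sketched as a hypothetical standalone example (sharedStub stands in for a package-level stub such as systemDbus): a goroutine leaked by one test reads the variable while the next test overwrites it, which `go test -race` reports as a data race.

```go
package example_test

import (
	"testing"
	"time"
)

// sharedStub stands in for a package-level test stub that each test
// overwrites (hypothetical; the real tests replace systemDbus).
var sharedStub func() string

func TestFirst(t *testing.T) {
	sharedStub = func() string { return "first" }
	go func() {
		// Leaked goroutine keeps reading the package-level variable
		// after TestFirst has returned...
		for i := 0; i < 10; i++ {
			_ = sharedStub()
			time.Sleep(5 * time.Millisecond)
		}
	}()
}

func TestSecond(t *testing.T) {
	// ...while TestSecond writes it concurrently: a data race under -race.
	sharedStub = func() string { return "second" }
	time.Sleep(20 * time.Millisecond)
}
```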
I accidentally linked the old one, but as I mentioned, I thought the for loop we added at the end was there to wait for the test to complete, as the following chan read will not wait
Hi @pohly, this way we will not be calling
That doesn't solve the problem that the test leaks goroutines, nor the race I mentioned in #110854 (comment).
I tried to ensure that all goroutines terminate, but some functions blocked in some test cases. So a "proper" solution might not work and we need something like kubernetes/klog#337 |
In kubernetes/klog#337, the new Stop method will only solve the panic case, right? Goroutines will still leak from the test cases.
Correct.
The test would have to ensure that the goroutine is done with logging before checking the output.
Then we have to change the functionality under test to allow injecting something like a channel, and then listen on that chan in the test. This way we can keep the
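A sketch of that channel-injection idea (hypothetical names; worker stands in for the functionality under test): the test blocks until the goroutine signals that it is done logging.

```go
package example_test

import "testing"

// worker stands in for the functionality under test, extended to accept an
// injected channel that its goroutine closes once it has finished logging.
func worker(logf func(args ...any), done chan<- struct{}) {
	go func() {
		defer close(done)
		logf("doing work")
	}()
}

func TestWaitsForGoroutine(t *testing.T) {
	done := make(chan struct{})
	worker(t.Log, done)
	// Block until the goroutine is done logging; only then is it safe to
	// check the output and let the test return.
	<-done
}
```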
When testing.T.Log gets called after the test has completed, it panics. There's also a data race (kubernetes/kubernetes#110854). Normally that should never happen because tests should ensure that all goroutines have stopped before returning. But sometimes it is not possible to do that. ktesting now automatically protects against that by registering a cleanup function and redirecting all future output into klog.
This makes ktesting more resilient against logging from leaked goroutines, which is a problem that came up in kubelet node shutdown tests (kubernetes/kubernetes#110854). Kubernetes-commit: 3581e308835c69b11b2c9437db44073129e0e2bf
The klog update went in; the race in the logging path should be gone now. The kubelet shutdown test still has other data races, but they don't seem to occur in the CI.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
Referring to the comment #110854 (comment), I'd like to close this issue.
@kerthcet: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Which jobs are flaking?
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/110768/pull-kubernetes-unit/1541991414412349440
Which tests are flaking?
TestLocalStorage
in pkg/kubelet/nodeshutdown
Since when has it been flaking?
N/A
Testgrid link
No response
Reason for failure (if possible)
No response
Anything else we need to know?
No response
Relevant SIG(s)
/sig node