data race in nodeshutdown tests #110854
Comments
@kerthcet: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
cc @pohly: we introduced contextual logging a few weeks ago, and I have no idea whether this is related.
Contextual logging is involved here because it enables log output through testing.T.Log. It's interesting that the data race seems to be inside testing.T itself. It does mutex locking, but apparently the access to the mutex itself is racing?!

/assign
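To illustrate the failure mode, here is a minimal hypothetical reproducer (not the actual nodeshutdown code): a goroutine that outlives its test and keeps writing through testing.T.

```go
package example_test

import (
	"testing"
	"time"
)

// TestLeaky leaks a goroutine that logs through testing.T after the test
// has completed; testing.T must not be used once the test returns.
func TestLeaky(t *testing.T) {
	go func() {
		time.Sleep(10 * time.Millisecond)
		// Typically panics with "Log in goroutine after TestLeaky has
		// completed" and can show up as a race under `go test -race`.
		t.Log("still running")
	}()
}

// TestOther only keeps the test binary alive long enough for the leaked
// goroutine from TestLeaky to fire.
func TestOther(t *testing.T) {
	time.Sleep(50 * time.Millisecond)
}
```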
The change which triggered this is from PR #110504.
The main problem seems to be that the
On the one hand, keeping goroutines running in the background after test completion is bad and should be avoided. On the other hand, this used to be okay(ish) in the past (it just made the global log output even less useful) and cannot always be avoided. What could be done to solve this is to "mute" testinglogger once the test completes. But that depends on a klog API extension.
When testing.T.Log gets called after the test has completed, it panics. There's also a data race (kubernetes/kubernetes#110854). Normally that should never happen because tests should ensure that all goroutines have stopped before returning. But sometimes it is not possible to do that. For those cases, "defer Stop(logger)" may be added to a test. When called, it will cause all future usage of the testing.T instance to be skipped.
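A sketch of how the API proposed in kubernetes/klog#337 would be used, assuming a ktesting-style setup (startCodeUnderTest is a hypothetical stand-in; per a later comment, the merged version registers a cleanup function automatically instead of requiring an explicit call):

```go
package nodeshutdown_test

import (
	"testing"

	"github.com/go-logr/logr"
	"k8s.io/klog/v2/ktesting"
)

// startCodeUnderTest is a hypothetical stand-in for code that may leave
// logging goroutines running after the test returns.
func startCodeUnderTest(logger logr.Logger) {
	go logger.Info("background work")
}

func TestWithLeakedGoroutines(t *testing.T) {
	logger, _ := ktesting.NewTestContext(t)
	// Per the proposal: after Stop, all future usage of the testing.T
	// instance by this logger is skipped, so late log calls from leaked
	// goroutines no longer race with or panic inside testing.T.
	defer ktesting.Stop(logger) // proposed API, see kubernetes/klog#337
	startCodeUnderTest(logger)
}
```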
Hi @pohly, isn't the systemDbus defer function called earlier (kubernetes/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux_test.go, lines 380 to 390 in f4abde9),
so that the following line gets executed immediately? Hence the testing.T instance became unusable before the goroutines inside
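A standalone sketch of the ordering being described (hypothetical code, not the actual test): a deferred cleanup runs as soon as the surrounding function returns, while goroutines started inside it may still be running.

```go
package main

import (
	"fmt"
	"time"
)

func run() {
	// Deferred functions run when run() returns (in LIFO order), not when
	// background goroutines finish. In a test, this is the point where a
	// deferred cleanup restores state and testing.T becomes unusable.
	defer fmt.Println("deferred cleanup: runs at return")
	go func() {
		time.Sleep(20 * time.Millisecond)
		fmt.Println("background goroutine: still running after cleanup")
	}()
	fmt.Println("function body done")
}

func main() {
	run()
	time.Sleep(50 * time.Millisecond) // keep the process alive so the goroutine can fire
}
```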
Also, I saw your PR kubernetes/klog#337 above.
We have to add
Scratch that. The version of TestRestart that you linked to doesn't use ktesting, but master does.
Hmm, does this test also have other data races? After adding
That's a race around setting and calling

When checking out the version prior to my contextual logging change, I get:

```console
$ git checkout 65385fec209fb5a6d549129fb03cd529c25a2cff~
Previous HEAD position was 10bea49c12d Merge pull request #110140 from marosset/hpc-sandbox-config-fixes
$ go test -count=5 -race ./pkg/kubelet/nodeshutdown/
...
--- FAIL: TestRestart (0.00s)
panic: close of closed channel [recovered]
	panic: close of closed channel

goroutine 424 [running]:
testing.tRunner.func1.2({0x25cb900, 0x32fe1d0})
	/nvme/gopath/go-1.18.1/src/testing/testing.go:1389 +0x366
testing.tRunner.func1()
	/nvme/gopath/go-1.18.1/src/testing/testing.go:1392 +0x5d2
panic({0x25cb900, 0x32fe1d0})
	/nvme/gopath/go-1.18.1/src/runtime/panic.go:844 +0x258
k8s.io/kubernetes/pkg/kubelet/nodeshutdown.TestRestart(0xc000583ba0)
	/nvme/gopath/src/k8s.io/kubernetes/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux_test.go:422 +0x686
testing.tRunner(0xc000583ba0, 0x28f3468)
	/nvme/gopath/go-1.18.1/src/testing/testing.go:1439 +0x214
created by testing.(*T).Run
	/nvme/gopath/go-1.18.1/src/testing/testing.go:1486 +0x725
FAIL	k8s.io/kubernetes/pkg/kubelet/nodeshutdown	5.162s
FAIL
```

Trying again gives me the same race I saw earlier:
But a better solution would be to shut down all goroutines cleanly...
That would probably fix the non-logging race, too. I bet it is the goroutine from one test which reads
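The suspected pattern, sketched as a hypothetical standalone example (sharedStub stands in for a package-level stub such as systemDbus): a goroutine leaked by one test reads the variable while the next test overwrites it, which `go test -race` reports as a data race.

```go
package example_test

import (
	"testing"
	"time"
)

// sharedStub stands in for a package-level test stub that each test
// overwrites (hypothetical; the real tests replace systemDbus).
var sharedStub func() string

func TestFirst(t *testing.T) {
	sharedStub = func() string { return "first" }
	go func() {
		// Leaked goroutine keeps reading the package-level variable
		// after TestFirst has returned...
		for i := 0; i < 10; i++ {
			_ = sharedStub()
			time.Sleep(5 * time.Millisecond)
		}
	}()
}

func TestSecond(t *testing.T) {
	// ...while TestSecond writes it concurrently: a data race under -race.
	sharedStub = func() string { return "second" }
	time.Sleep(20 * time.Millisecond)
}
```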
I accidentally linked the old one, but as I mentioned, I thought the for loop we added at the end was there to wait for the test to complete, as the following chan read will not wait
Hi @pohly, this way we will not be calling
That doesn't solve the problem that the test leaks goroutines, nor the race I mentioned in #110854 (comment).
I tried to ensure that all goroutines terminate, but some functions blocked in some test cases. So a "proper" solution might not work and we need something like kubernetes/klog#337 |
In kubernetes/klog#337, the new Stop method will only solve the panic case, right? Goroutines will still leak from the test cases.
Correct.
The test would have to ensure that the goroutine is done with logging before checking the output.
Then we have to change the functionality under test to allow injecting something like a channel, and then listen on that chan in the test. This way we can keep the
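A sketch of that channel-injection idea (hypothetical names; worker stands in for the functionality under test): the test blocks until the goroutine signals that it is done logging.

```go
package example_test

import "testing"

// worker stands in for the functionality under test, extended to accept an
// injected channel that its goroutine closes once it has finished logging.
func worker(logf func(args ...any), done chan<- struct{}) {
	go func() {
		defer close(done)
		logf("doing work")
	}()
}

func TestWaitsForGoroutine(t *testing.T) {
	done := make(chan struct{})
	worker(t.Log, done)
	// Block until the goroutine is done logging; only then is it safe to
	// check the output and let the test return.
	<-done
}
```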
When testing.T.Log gets called after the test has completed, it panics. There's also a data race (kubernetes/kubernetes#110854). Normally that should never happen because tests should ensure that all goroutines have stopped before returning. But sometimes it is not possible to do that. ktesting now automatically protects against that by registering a cleanup function and redirecting all future output into klog.
This makes ktesting more resilient against logging from leaked goroutines, which is a problem that came up in kubelet node shutdown tests (kubernetes/kubernetes#110854). Kubernetes-commit: 3581e308835c69b11b2c9437db44073129e0e2bf
The klog update went in; the race in the logging path should be gone now. The kubelet shutdown test still has other data races, but they don't seem to occur in the CI.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
Referring to the comment #110854 (comment), I'd like to close this issue.
@kerthcet: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Which jobs are flaking?
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/110768/pull-kubernetes-unit/1541991414412349440
Which tests are flaking?
TestLocalStorage
in pkg/kubelet/nodeshutdown
Since when has it been flaking?
N/A
Testgrid link
No response
Reason for failure (if possible)
No response
Anything else we need to know?
No response
Relevant SIG(s)
/sig node