
kubelet: Race condition in nodeshutdown unit test #108040

Closed
MadhavJivrajani opened this issue Feb 10, 2022 · 15 comments · Fixed by #108193
Assignees
Labels
kind/flake: Categorizes issue or PR as related to a flaky test.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@MadhavJivrajani
Contributor

What happened?

There seems to be a race condition in the following unit test:

func Test_managerImpl_processShutdownEvent(t *testing.T) {

The race happens between
read at

and write at

klog.V(1).InfoS("Shutdown manager finished killing pod", "pod", klog.KObj(pod))
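A minimal sketch of this class of race, with illustrative names rather than the actual kubelet code: the code under test spawns one goroutine per pod, and those goroutines keep writing shared state (here a slice; in the real test, log output) after the test function has moved on to its assertions. Reading that state without synchronization is what `go test -race` reports.

```go
package main

import (
	"fmt"
	"sync"
)

// recorder guards the shared state that the spawned goroutines write.
type recorder struct {
	mu     sync.Mutex
	killed []string
}

func (r *recorder) add(pod string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.killed = append(r.killed, pod)
}

func (r *recorder) count() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.killed)
}

// processPods spawns one goroutine per pod, mimicking the shape of
// processShutdownEvent. It returns before those goroutines finish.
func processPods(pods []string, r *recorder) *sync.WaitGroup {
	var wg sync.WaitGroup
	for _, p := range pods {
		wg.Add(1)
		go func(pod string) {
			defer wg.Done()
			r.add(pod) // the "write" side of the race lives in this goroutine
		}(p)
	}
	return &wg
}

func main() {
	r := &recorder{}
	wg := processPods([]string{"a", "b", "c"}, r)
	// Without this Wait (and without the mutex in recorder), reading the
	// shared state here is the unsynchronized "read" the detector flags.
	wg.Wait()
	fmt.Println(r.count()) // 3
}
```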

What did you expect to happen?

No race condition

How can we reproduce it (as minimally and precisely as possible)?

cd $KUBE_ROOT/pkg/kubelet/nodeshutdown
go test -c -race
stress ./nodeshutdown.test -test.run ^Test_managerImpl_processShutdownEvent$

Anything else we need to know?

CI logs where race was detected: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/108039/pull-kubernetes-unit/1491676749694504960

Kubernetes version

$ kubectl version
# paste output here

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@MadhavJivrajani MadhavJivrajani added the kind/bug Categorizes issue or PR as related to a bug. label Feb 10, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 10, 2022
@MadhavJivrajani
Contributor Author

/sig node
/cc @adisky @endocrimes

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 10, 2022
@MadhavJivrajani
Contributor Author

MadhavJivrajani commented Feb 10, 2022

/remove-kind bug
/kind flake
(potential flake, did not use the flake template because CI is green for this test)

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. and removed kind/bug Categorizes issue or PR as related to a bug. labels Feb 10, 2022
@ehashman ehashman added this to Triage in SIG Node CI/Test Board Feb 11, 2022
@kerthcet
Member

A related PR, #107774, was merged. @MadhavJivrajani

@utkarsh348
Contributor

/assign

@MadhavJivrajani
Contributor Author

Thanks @utkarsh348 for checking again: #107774 (comment)
@kerthcet FYI

@MadhavJivrajani
Contributor Author

Maybe we need something like this: #98944?

@jonyhy96
Contributor

@utkarsh348 are you working on resolving this issue? I wonder why

doesn't run into a race condition

@MadhavJivrajani
Contributor Author

I think he's still working on it.
The race condition doesn't happen there because

NewDebuggingRoundTripper(rt, tc.levels...).RoundTrip(req)

doesn't start another goroutine. In the kubelet's case, however, the write side of the race happens in a separately spawned goroutine:

go func(pod *v1.Pod, group podShutdownGroup) {

write:
klog.V(1).InfoS("Shutdown manager finished killing pod", "pod", klog.KObj(pod))

@SergeyKanzhelev
Member

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 16, 2022
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Issues - In progress in SIG Node CI/Test Board Feb 16, 2022
@liggitt
Member

liggitt commented Mar 26, 2022

seen in https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/109048/pull-kubernetes-unit/1507725298571939840

this spiked in the last couple days - https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=TestLocalStorage

is this related to the klog bump (#108725)?

cc @pohly

/priority critical-urgent
/milestone v1.24

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Mar 26, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Mar 26, 2022
@liggitt liggitt removed the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Mar 26, 2022
@liggitt
Member

liggitt commented Mar 26, 2022

Increasing priority since it is impacting test runs, and marking as a blocker until it is root-caused and we understand whether it is prod-impacting.

@endocrimes
Member

/assign

I'll take this one up on Monday. @utkarsh348, still feel free to open a PR if you have WIP investigation/research.

@MadhavJivrajani
Contributor Author

MadhavJivrajani commented Mar 27, 2022

@endocrimes please see #108193 if you have comments/input on the already-open PR.

SIG Node CI/Test Board automation moved this from Issues - In progress to Done Mar 27, 2022
@pohly
Contributor

pohly commented Mar 28, 2022

seen in https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/109048/pull-kubernetes-unit/1507725298571939840

this spiked in the last couple days - https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=TestLocalStorage

is this related to the klog bump (#108725)?

cc @pohly

I don't see how the klog bump could have made it worse. Perhaps the change around the flush daemon changed some timing conditions, but that's a rather wild guess. These tests have been faulty all along and need to be fixed.

What klog can do is support unit tests like this better. I've opened two issues:

@MadhavJivrajani
Contributor Author

Increasing priority since it is impacting test runs, and marking as a blocker until it is root-caused and we understand whether it is prod-impacting.

@liggitt I think it's safe to remove the release-blocker here. The root cause is largely as explained here: #108040 (comment) (appears to be test-only)

pohly added a commit to pohly/kubernetes that referenced this issue Jun 24, 2022
The code as it stands now works, but it is still complicated, and previous versions had race conditions (kubernetes#108040). Now the test works without modifying global state. The individual test cases could run in parallel; this just isn't done because they already complete quickly (2 seconds).
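The "without modifying global state" approach the commit describes can be sketched as dependency injection: instead of tests overwriting a package-level hook that concurrent goroutines also touch, the manager carries its dependencies as fields, so each test case builds an isolated instance. All names below are illustrative, not the actual kubelet types.

```go
package main

import "fmt"

// manager holds its pod-kill dependency as a field rather than relying
// on a package-level variable that tests would have to swap out.
type manager struct {
	killPod func(pod string) error
}

func (m *manager) processShutdownEvent(pods []string) {
	for _, p := range pods {
		if err := m.killPod(p); err != nil {
			fmt.Println("failed to kill pod", p)
		}
	}
}

func main() {
	// A test builds its own manager with a test-local fake; no global
	// state is mutated, so cases cannot interfere with each other.
	var killed []string
	m := &manager{killPod: func(pod string) error {
		killed = append(killed, pod)
		return nil
	}}
	m.processShutdownEvent([]string{"a", "b"})
	fmt.Println(killed)
}
```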


9 participants