
kubelet: Race condition in nodeshutdown unit test #108040

Closed
MadhavJivrajani opened this issue Feb 10, 2022 · 15 comments · Fixed by #108193
Assignees
Labels
kind/flake: Categorizes issue or PR as related to a flaky test.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@MadhavJivrajani
Contributor

What happened?

There seems to be a race condition in the following unit test:

func Test_managerImpl_processShutdownEvent(t *testing.T) {

The race happens between
read at

and write at

klog.V(1).InfoS("Shutdown manager finished killing pod", "pod", klog.KObj(pod))
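A minimal sketch of this class of race, with illustrative names rather than the actual kubelet code: the code under test spawns one goroutine per pod, and those goroutines keep writing shared state (here a slice; in the real test, log output) after the test function has moved on to its assertions. Reading that state without synchronization is what `go test -race` reports.

```go
package main

import (
	"fmt"
	"sync"
)

// recorder guards the shared state that the spawned goroutines write.
type recorder struct {
	mu     sync.Mutex
	killed []string
}

func (r *recorder) add(pod string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.killed = append(r.killed, pod)
}

func (r *recorder) count() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.killed)
}

// processPods spawns one goroutine per pod, mimicking the shape of
// processShutdownEvent. It returns before those goroutines finish.
func processPods(pods []string, r *recorder) *sync.WaitGroup {
	var wg sync.WaitGroup
	for _, p := range pods {
		wg.Add(1)
		go func(pod string) {
			defer wg.Done()
			r.add(pod) // the "write" side of the race lives in this goroutine
		}(p)
	}
	return &wg
}

func main() {
	r := &recorder{}
	wg := processPods([]string{"a", "b", "c"}, r)
	// Without this Wait (and without the mutex in recorder), reading the
	// shared state here is the unsynchronized "read" the detector flags.
	wg.Wait()
	fmt.Println(r.count()) // 3
}
```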

What did you expect to happen?

No race condition

How can we reproduce it (as minimally and precisely as possible)?

cd $KUBE_ROOT/pkg/kubelet/nodeshutdown
go test -c -race
stress ./nodeshutdown.test -test.run ^Test_managerImpl_processShutdownEvent$

Anything else we need to know?

CI logs where race was detected: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/108039/pull-kubernetes-unit/1491676749694504960

Kubernetes version

$ kubectl version
# paste output here

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@MadhavJivrajani MadhavJivrajani added the kind/bug Categorizes issue or PR as related to a bug. label Feb 10, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 10, 2022
@MadhavJivrajani
Contributor Author

/sig node
/cc @adisky @endocrimes

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 10, 2022
@MadhavJivrajani
Contributor Author

MadhavJivrajani commented Feb 10, 2022

/remove-kind bug
/kind flake
(potential flake, did not use the flake template because CI is green for this test)

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. and removed kind/bug Categorizes issue or PR as related to a bug. labels Feb 10, 2022
@ehashman ehashman added this to Triage in SIG Node CI/Test Board Feb 11, 2022
@kerthcet
Member

A related PR, #107774, was merged. @MadhavJivrajani

@utkarsh348
Contributor

/assign

@MadhavJivrajani
Contributor Author

Thanks @utkarsh348 for checking again: #107774 (comment)
@kerthcet FYI

@MadhavJivrajani
Contributor Author

Maybe we need something like this: #98944?

@jonyhy96
Contributor

@utkarsh348 are you working on resolving this issue? I wonder why

doesn't run into a race condition

@MadhavJivrajani
Contributor Author

I think he's still working on it.
The race condition doesn't happen there because

NewDebuggingRoundTripper(rt, tc.levels...).RoundTrip(req)

doesn't start another goroutine. In the kubelet's case, however, the write side of the race happens in a separately spawned goroutine:

go func(pod *v1.Pod, group podShutdownGroup) {

write:
klog.V(1).InfoS("Shutdown manager finished killing pod", "pod", klog.KObj(pod))

@SergeyKanzhelev
Member

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 16, 2022
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Issues - In progress in SIG Node CI/Test Board Feb 16, 2022
@liggitt
Member

liggitt commented Mar 26, 2022

seen in https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/109048/pull-kubernetes-unit/1507725298571939840

this spiked in the last couple days - https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=TestLocalStorage

is this related to the klog bump (#108725)?

cc @pohly

/priority critical-urgent
/milestone v1.24

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Mar 26, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Mar 26, 2022
@liggitt liggitt removed the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Mar 26, 2022
@liggitt
Member

liggitt commented Mar 26, 2022

Increasing priority since it is impacting test runs, and marking as a blocker until it is root-caused and we understand whether it is prod-impacting.

@endocrimes
Member

/assign

I'll take this one up on Monday. @utkarsh348, still feel free to open a PR if you have WIP investigation/research.

@MadhavJivrajani
Contributor Author

MadhavJivrajani commented Mar 27, 2022

@endocrimes please see #108193 if you have comments/input on the already-open PR.

SIG Node CI/Test Board automation moved this from Issues - In progress to Done Mar 27, 2022
@pohly
Contributor

pohly commented Mar 28, 2022

seen in https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/109048/pull-kubernetes-unit/1507725298571939840

this spiked in the last couple days - https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=TestLocalStorage

is this related to the klog bump (#108725)?

cc @pohly

I don't see how the klog bump could have made it worse. Perhaps the change around the flush daemon changed some timing conditions, but that's a rather wild guess. These tests have been faulty all along and need to be fixed.

What klog can do is support unit tests like this better. I've opened two issues:

@MadhavJivrajani
Contributor Author

Increasing priority since it is impacting test runs, and marking as a blocker until it is root-caused and we understand whether it is prod-impacting.

@liggitt I think it's safe to remove the release-blocker here. The root cause is largely as explained here: #108040 (comment) (appears to be test-only)

pohly added a commit to pohly/kubernetes that referenced this issue Jun 24, 2022
The code as it stands now works, but it is still complicated, and previous versions had race conditions (kubernetes#108040). Now the test works without modifying global state. The individual test cases could run in parallel; this just isn't done because they already complete quickly (2 seconds).
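The "without modifying global state" approach the commit describes can be sketched as dependency injection: instead of tests overwriting a package-level hook that concurrent goroutines also touch, the manager carries its dependencies as fields, so each test case builds an isolated instance. All names below are illustrative, not the actual kubelet types.

```go
package main

import "fmt"

// manager holds its pod-kill dependency as a field rather than relying
// on a package-level variable that tests would have to swap out.
type manager struct {
	killPod func(pod string) error
}

func (m *manager) processShutdownEvent(pods []string) {
	for _, p := range pods {
		if err := m.killPod(p); err != nil {
			fmt.Println("failed to kill pod", p)
		}
	}
}

func main() {
	// A test builds its own manager with a test-local fake; no global
	// state is mutated, so cases cannot interfere with each other.
	var killed []string
	m := &manager{killPod: func(pod string) error {
		killed = append(killed, pod)
		return nil
	}}
	m.processShutdownEvent([]string{"a", "b"})
	fmt.Println(killed)
}
```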


9 participants