kubelet: Race condition in nodeshutdown unit test #108040
Comments
/sig node
/remove-kind bug
A related PR merged: #107774 @MadhavJivrajani
/assign
Thanks, @utkarsh348, for checking again.
Maybe we need something like this: #98944?
@utkarsh348, are you working on resolving this issue? I wonder why
I think he's still working on it.
doesn't start another goroutine. In the kubelet's case, however, the write part of the race condition happens in another goroutine that is spawned:
write:
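The pattern described in the comment above can be sketched as follows. This is a minimal illustration, not the kubelet's actual code (the `manager` type and its fields are assumptions): a constructor spawns a goroutine that writes a field, while the test reads the same field from the main goroutine. Without synchronization, `go test -race` flags the pair as a data race; guarding the field with a mutex and signalling completion over a channel resolves it.

```go
package main

import (
	"fmt"
	"sync"
)

// manager loosely mirrors the shape of the code under discussion: a spawned
// goroutine performs the write, the test performs the read.
type manager struct {
	mu           sync.Mutex
	shuttingDown bool // written by the spawned goroutine, read by the test
}

// start spawns the goroutine that performs the write, closing done once the
// write has happened.
func (m *manager) start(done chan<- struct{}) {
	go func() {
		m.mu.Lock()
		m.shuttingDown = true // the "write" side of the race
		m.mu.Unlock()
		close(done)
	}()
}

// isShuttingDown performs the "read" side under the same mutex.
func (m *manager) isShuttingDown() bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.shuttingDown
}

func main() {
	m := &manager{}
	done := make(chan struct{})
	m.start(done)
	<-done // wait for the writer instead of sleeping; avoids flaky reads
	fmt.Println(m.isShuttingDown()) // prints true
}
```

Waiting on the channel (rather than a `time.Sleep`) also removes the timing dependence that makes such tests flaky in CI.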
/triage accepted
This spiked in the last couple of days (https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=TestLocalStorage). Is this related to the klog bump (#108725)? cc @pohly /priority critical-urgent
Increasing priority since it is impacting test runs, and marking as a blocker until it is root-caused and we understand whether it is prod-impacting.
/assign ^ I'll take this one up on Monday - @utkarsh348 still feel free to open a PR if you have WIP investigation/research.
@endocrimes please see #108193
I don't see how the klog bump could have made it worse. Perhaps the change around the flush daemon changed some timing conditions, but that's a rather wild guess. These tests have been faulty all along and need to be fixed. What klog can do is support unit tests like this better. I've opened two issues:
@liggitt I think it's safe to remove the release-blocker here. The root cause is largely as explained here: #108040 (comment) (appears to be test-only)
The code as it stands now works, but it is still complicated, and previous versions had race conditions (kubernetes#108040). Now the test works without modifying global state. The individual test cases could run in parallel; this just isn't done because they already complete quickly (2 seconds).
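The "without modifying global state" approach above can be sketched like this. The names and shape here are assumptions for illustration, not the actual kubelet code: instead of the test mutating a package-level variable (which races once goroutines are involved), the dependency is injected per instance, so each test case builds its own manager with its own fake.

```go
package main

import "fmt"

// clockFunc stands in for whatever the real code would otherwise read from
// shared global state.
type clockFunc func() int64

type shutdownManager struct {
	now clockFunc // injected; no shared package-level variable
}

func newShutdownManager(now clockFunc) *shutdownManager {
	return &shutdownManager{now: now}
}

func (m *shutdownManager) timestamp() int64 {
	return m.now()
}

func main() {
	// Each test case constructs its own fake, so nothing is shared between
	// cases and t.Parallel() would be safe if it were ever needed.
	m := newShutdownManager(func() int64 { return 42 })
	fmt.Println(m.timestamp()) // prints 42
}
```

This is also what makes the "could run in parallel" remark possible: with no shared globals, cases cannot interfere with one another.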
What happened?
There seems to be a race condition in the following unit test:
kubernetes/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux_test.go (line 617 in 40c2d04)
The race happens between a read at
kubernetes/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux_test.go (line 707 in 40c2d04)
and a write at
kubernetes/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go (line 328 in 40c2d04)
What did you expect to happen?
No race condition
How can we reproduce it (as minimally and precisely as possible)?
Anything else we need to know?
CI logs where race was detected: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/108039/pull-kubernetes-unit/1491676749694504960
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)