
vm restart when kubelet restart #11662

Open
wavezhang opened this issue Apr 8, 2024 · 5 comments

wavezhang commented Apr 8, 2024

What happened:
VM restarts when kubelet restarts.

What you expected to happen:
VMs should not be affected when kubelet restarts.

How to reproduce it (as minimally and precisely as possible):
Restart kubelet until you see the VM pod stop.

Additional context:

Kubelet log:

predicate.go:129] "Predicate failed on Pod" pod="default/virt-launcher-vm-ftiq8-sk8jk" err="Predicate NodeAffinity failed"

func (h *HeartBeat) heartBeat(heartBeatInterval time.Duration, stopCh chan struct{}) {
	// ensure that the node is synchronized with the actual state
	// especially setting the node to unschedulable if device plugins are not yet ready is very important
	// otherwise workloads get scheduled but are immediately terminated by the kubelet
	h.do()
	// Now wait for 10 seconds for the device plugins to be initialized
	// This is more than fast enough to be not treated as unschedulable by the cluster
	// and ensures that the cluster gets marked as scheduled as soon as the device plugin is ready
	h.waitForDevicePlugins(stopCh)

	// from now on periodically update the node status
	wait.JitterUntil(h.do, heartBeatInterval, 1.2, true, stopCh)
}

It seems that h.waitForDevicePlugins(stopCh) should be moved into h.do()?
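
A rough sketch of that suggestion (my assumption, not a tested patch; the stopCh field, updateNodeStatus, and the stub bodies below are invented for illustration):

package main

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

type HeartBeat struct {
	stopCh chan struct{} // hypothetical: stored so do() can bound its own wait
}

func (h *HeartBeat) waitForDevicePlugins(stopCh chan struct{}) { /* as in virt-handler */ }
func (h *HeartBeat) updateNodeStatus()                         { /* hypothetical: the node status update */ }

// do waits for device-plugin readiness before every status update, not
// only once at startup, so a heartbeat that fires while kubelet is
// restarting blocks until the plugins re-register instead of marking
// the node unschedulable.
func (h *HeartBeat) do() {
	h.waitForDevicePlugins(h.stopCh)
	h.updateNodeStatus()
}

func (h *HeartBeat) heartBeat(heartBeatInterval time.Duration, stopCh chan struct{}) {
	h.stopCh = stopCh
	wait.JitterUntil(h.do, heartBeatInterval, 1.2, true, stopCh)
}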

Environment:

  • KubeVirt version (use virtctl version): N/A
  • Kubernetes version (use kubectl version): N/A
  • VM or VMI specifications: N/A
  • Cloud provider or hardware configuration: N/A
  • OS (e.g. from /etc/os-release): N/A
  • Kernel (e.g. uname -a): N/A
  • Install tools: N/A
  • Others: N/A
akalenyu (Contributor) commented Apr 8, 2024

We had a similar issue a while ago, and it was fixed and backported (kubernetes/kubernetes#118635)
Could you check if the k8s version you're using is impacted?

wavezhang (Author) replied:

> We had a similar issue a while ago, and it was fixed and backported (kubernetes/kubernetes#118635) Could you check if the k8s version you're using is impacted?

It's not the same problem; see the kubelet logs.

victortoso (Member) commented Apr 8, 2024

@wavezhang what KubeVirt and k8s versions are you running?

fabiand (Member) commented Apr 8, 2024

And what container runtime are you using?

But IIRC, as part of the discussion around kubernetes/kubernetes#118635, we today assume that kubelet restarts lead to container restarts.

  1. In controlled cases, we only see node maintenance causing kubelet restarts; in that case we assume the node has been drained (i.e., no VMs are running on it).
  2. In uncontrolled cases, i.e., an error, we do expect VMs to be killed.

Homura222 commented Apr 24, 2024

The virt-handler device_controller watches the kubelet.sock file; when kubelet restarts, dpi.initialized is set to false. If virt-handler runs a heartbeat at that moment, the kubevirt.io/schedulable label of the node hosting that virt-handler is set to "false". After kubelet restarts, it kills pods that do not match the node's labels (the virt-launcher pod has nodeSelector kubevirt.io/schedulable: "true"), which results in every VM on the node being restarted.
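
To make that failure chain concrete, here is a minimal client-go sketch of the label flip (setSchedulable is a hypothetical helper for illustration, not KubeVirt's actual heartbeat code):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// setSchedulable patches only the kubevirt.io/schedulable label on the
// node. Every virt-launcher pod carries
// nodeSelector kubevirt.io/schedulable: "true", so flipping the label
// to "false" makes those pods fail the NodeAffinity admission check
// when kubelet re-admits pods after a restart.
func setSchedulable(client kubernetes.Interface, nodeName string, schedulable bool) error {
	patch := []byte(fmt.Sprintf(`{"metadata":{"labels":{"kubevirt.io/schedulable":"%t"}}}`, schedulable))
	_, err := client.CoreV1().Nodes().Patch(context.TODO(), nodeName,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}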

Related issues:
kubernetes/kubernetes#123980
kubernetes/kubernetes#124586

How to reproduce it (as minimally and precisely as possible):
Restart kubelet every 1s until the kubevirt.io/schedulable label on the node becomes "false". Or: delete the virt-handler pod, set the kubevirt.io/schedulable label on the node to "false", then restart kubelet.

I believe the solution to this problem is one of:
Fix kubelet killing pods that do not match the node labels when it restarts (kubernetes/kubernetes#124367), or improve the virt-handler heartbeat (a rough sketch of the latter follows below).
cc @victortoso @fabiand @akalenyu @rmohr
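
One possible shape for the heartbeat improvement (purely illustrative; pluginsReady, unhealthySince, and gracePeriod are hypothetical names, not existing KubeVirt fields): report the node unschedulable only once the device plugins have been unhealthy for longer than a grace period, so a brief kubelet restart never flips the label while a persistent outage still does.

package main

import "time"

// Hypothetical fields; KubeVirt's real HeartBeat struct differs.
type HeartBeat struct {
	pluginsReady   bool
	unhealthySince time.Time
	gracePeriod    time.Duration
}

// schedulable debounces the unschedulable transition.
func (h *HeartBeat) schedulable(now time.Time) bool {
	if h.pluginsReady {
		// plugins healthy again: clear any pending transition
		h.unhealthySince = time.Time{}
		return true
	}
	if h.unhealthySince.IsZero() {
		// first observation of the outage
		h.unhealthySince = now
	}
	// stay schedulable until the outage persists past the grace period
	return now.Sub(h.unhealthySince) < h.gracePeriod
}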
