Virt-handler heartbeat causes the vm to be restarted when kubelet is restarted. #11843
Comments
I think it makes sense to …
Also, @Homura222 do you see a case not covered by Fabian's #11662 (comment)? That is, …
I think you can look at the `Start` and `healthCheck` functions in `pkg/virt-handler/device-manager/generic_device.go`. When kubelet restarts, it removes the device-plugin socket file, so `healthCheck` returns; the deferred `Stop()` in `Start` then sets `initialized` to `false`:

```go
// pkg/virt-handler/device-manager/generic_device.go
func (dpi *GenericDevicePlugin) healthCheck() error {
	...
	dirName = filepath.Dir(dpi.socketPath)
	err = watcher.Add(dirName)
	if err != nil {
		return fmt.Errorf("failed to add the device-plugin kubelet path to the watcher: %v", err)
	}
	// socketPath is /var/lib/kubelet/device-plugins/kubevirt-${deviceName}.sock
	_, err = os.Stat(dpi.socketPath)
	if err != nil {
		return fmt.Errorf("failed to stat the device-plugin socket: %v", err)
	}
	for {
		select {
		case <-dpi.stop:
			return nil
		case err := <-watcher.Errors:
			logger.Reason(err).Errorf("error watching devices and device plugin directory")
		case event := <-watcher.Events:
			logger.V(4).Infof("health Event: %v", event)
			if event.Name == devicePath {
				// Health in this case is if the device path actually exists
				if event.Op == fsnotify.Create {
					logger.Infof("monitored device %s appeared", dpi.deviceName)
					dpi.health <- pluginapi.Healthy
				} else if (event.Op == fsnotify.Remove) || (event.Op == fsnotify.Rename) {
					logger.Infof("monitored device %s disappeared", dpi.deviceName)
					dpi.health <- pluginapi.Unhealthy
				}
			} else if event.Name == dpi.socketPath && event.Op == fsnotify.Remove {
				logger.Infof("device socket file for device %s was removed, kubelet probably restarted.", dpi.deviceName)
				return nil
			}
		}
	}
}

// Start starts the device plugin
func (dpi *GenericDevicePlugin) Start(stop chan struct{}) (err error) {
	...
	defer dpi.Stop()
	...
	go func() {
		errChan <- dpi.healthCheck()
	}()
	dpi.setInitialized(true)
	logger.Infof("%s device plugin started", dpi.deviceName)
	err = <-errChan
	return err
}

// Stop stops the gRPC server
func (dpi *GenericDevicePlugin) Stop() error {
	defer func() {
		if !IsChanClosed(dpi.done) {
			close(dpi.done)
		}
	}()
	dpi.server.Stop()
	dpi.setInitialized(false)
	return dpi.cleanup()
}
```

If virt-handler executes a heartbeat at this time, the node's `kubevirt.io/schedulable` label will be set to `"false"`:
```go
// pkg/virt-handler/device-manager/device_controller.go
func (c *DeviceController) Initialized() bool {
	c.startedPluginsMutex.Lock()
	defer c.startedPluginsMutex.Unlock()
	for _, dev := range c.startedPlugins {
		if !dev.devicePlugin.GetInitialized() {
			return false
		}
	}
	return true
}

// pkg/virt-handler/heartbeat/heartbeat.go
func (h *HeartBeat) do() {
	...
	kubevirtSchedulable := "true"
	if !h.deviceManagerController.Initialized() {
		kubevirtSchedulable = "false"
	}
	var data []byte
	// Label the node if cpu manager is running on it
	// This is a temporary workaround until k8s bug #66525 is resolved
	cpuManagerEnabled := false
	if h.clusterConfig.CPUManagerEnabled() {
		cpuManagerEnabled = h.isCPUManagerEnabled(h.cpuManagerPaths)
	}
	data = []byte(fmt.Sprintf(`{"metadata": { "labels": {"%s": "%s", "%s": "%t"}, "annotations": {"%s": %s}}}`,
		v1.NodeSchedulable, kubevirtSchedulable,
		v1.CPUManager, cpuManagerEnabled,
		v1.VirtHandlerHeartbeat, string(now),
	))
	_, err = h.clientset.Nodes().Patch(context.Background(), h.host, types.StrategicMergePatchType, data, metav1.PatchOptions{})
	if err != nil {
		log.DefaultLogger().Reason(err).Errorf("Can't patch node %s", h.host)
		return
	}
	...
}
```

After kubelet restarts, kubelet will kill pods that do not match the node labels (the virt-launcher pod nodeSelector is `kubevirt.io/schedulable: "true"`). Some issues about kubelet behavior: …
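The heartbeat above assembles the strategic-merge patch by hand with `fmt.Sprintf`. As an illustration only, here is a minimal stdlib-only sketch that builds an equivalent label patch with `encoding/json`; the helper name `buildHeartbeatPatch` is hypothetical, not part of KubeVirt, and the key names are assumed from the snippet above:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildHeartbeatPatch is a hypothetical helper mirroring what HeartBeat.do
// assembles by hand: a strategic-merge patch that sets the
// kubevirt.io/schedulable label and a heartbeat annotation on the node.
func buildHeartbeatPatch(schedulable bool, heartbeat string) ([]byte, error) {
	patch := map[string]any{
		"metadata": map[string]any{
			"labels": map[string]string{
				"kubevirt.io/schedulable": fmt.Sprintf("%t", schedulable),
			},
			"annotations": map[string]string{
				"kubevirt.io/heartbeat": heartbeat,
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	// A heartbeat while device plugins are down marks the node unschedulable.
	data, err := buildHeartbeatPatch(false, "2024-05-01T00:00:00Z")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))
}
```

Marshaling a map instead of formatting a string avoids broken JSON when a value needs escaping.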
I also think …
/cc @victortoso @rmohr
I think we should ensure that we are using node affinity with `IgnoredDuringExecution` and closely follow kubernetes/kubernetes#124586. Once that is fixed and we use node affinity properly, we should be good. Thanks for raising this. I am not very keen on changing the labeling behaviour as such: it solved quite a few kubelet restart issues for us where pods got rescheduled to the node before device plugins were ready again, kubelet's admission control rejected them all, and they got rescheduled again (or not, because of run strategy once).
Ok, after checking in on kubernetes/kubernetes#124586, I think our path forward is to request the heartbeat label through node affinity with `IgnoredDuringExecution`; that should already give us exactly what we want.
@rmohr Do you mean to move …
Well, I did not try it myself, but I asked for clarification, and according to kubernetes/kubernetes#124586 (comment), in the reported case the pod was not yet running and was therefore pushed out after the restart. If it had been running, it would not have been stopped.
In the reported case, kubelet also kills running pods when kubelet is restarted (kubernetes/kubernetes#123980). kubernetes/kubernetes#124367 can fix this case in kubelet, but it hasn't been approved yet.
This is definitely happening with `requiredDuringSchedulingIgnoredDuringExecution`. Let me also point out that you did not seem to add affinity-related tests, so I am not sure whether you checked this in your PR.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    app: my-app
spec:
  containers:
  - command:
    - bash
    - -c
    - sleep 10000
    name: my-container
    image: toolkits-centos:1.0.0
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: a
            operator: In
            values:
            - b
```

@rmohr you can try it.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    app: my-app
spec:
  containers:
  - command:
    - bash
    - -c
    - sleep 10000
    name: my-container
    image: toolkits-centos:1.0.0.amd64
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: a
            operator: In
            values:
            - b
```

Use `preferredDuringSchedulingIgnoredDuringExecution` as above … But …
I think going back to kubernetes/kubernetes#124586 (comment) (as you did) makes sense. Let's see what is said there.
What happened:

Virt-handler heartbeat causes the VM to be restarted when kubelet is restarted.

The virt-handler device_controller is watching the kubelet.sock file, and when kubelet restarts, `dpi.initialized` is set to `false`. If virt-handler executes a heartbeat at this time, the `kubevirt.io/schedulable` label of the node will be set to `false`. After kubelet restarts, kubelet will kill pods that do not match the node labels (the virt-launcher pod nodeSelector is `kubevirt.io/schedulable: "true"`). The virt-launcher pod will be killed by kubelet, causing the VM to restart.

Related:
- kubernetes/kubernetes#123980
- kubernetes/kubernetes#124586
- #11662

How to reproduce it (as minimally and precisely as possible):

Restart kubelet every 1s until the `kubevirt.io/schedulable` label on the node becomes `false`. Or delete the virt-handler pod, set the `kubevirt.io/schedulable` label on the node to `false`, then restart kubelet.

Environment:
- KubeVirt version (use `virtctl version`): 0.42.1
- Kubernetes version (use `kubectl version`): 1.22
- Kernel (e.g. `uname -a`): N/A

I believe the solution to this problem:
Fix the issue of kubelet killing pods that do not match the node labels when restarting (kubernetes/kubernetes#124367), or improve the virt-handler heartbeat.
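The failure chain described here hinges on the device controller's all-plugins-initialized predicate: a single plugin losing its socket flips the whole node to unschedulable. A minimal stdlib-only Go sketch of that logic (the `plugin` and `controller` types are hypothetical simplifications, not KubeVirt's actual types):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// plugin is a hypothetical stand-in for a device plugin that flips its
// initialized flag to false when kubelet restarts (as GenericDevicePlugin does
// via its deferred Stop).
type plugin struct{ initialized atomic.Bool }

// controller mirrors the shape of DeviceController: a mutex-guarded list of
// started plugins.
type controller struct {
	mu      sync.Mutex
	plugins []*plugin
}

// Initialized reports true only if every started plugin is initialized,
// matching the semantics of DeviceController.Initialized.
func (c *controller) Initialized() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, p := range c.plugins {
		if !p.initialized.Load() {
			return false
		}
	}
	return true
}

func main() {
	a, b := &plugin{}, &plugin{}
	a.initialized.Store(true)
	b.initialized.Store(true)
	c := &controller{plugins: []*plugin{a, b}}
	fmt.Println(c.Initialized()) // all plugins ready

	// Simulate a kubelet restart tearing down one plugin: the next heartbeat
	// would now label the node kubevirt.io/schedulable=false.
	b.initialized.Store(false)
	fmt.Println(c.Initialized())
}
```

This is why a brief kubelet restart window is enough to flip the label if a heartbeat fires at the wrong moment.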