
Volume Lifecycle does not correspond to actual k8s behavior #512

Open
YuikoTakada opened this issue Jun 15, 2022 · 8 comments

Comments

@YuikoTakada

YuikoTakada commented Jun 15, 2022

The current CSI volume lifecycle is not designed for the case where the Node is unreachable.

When a Node is shut down or in a non-recoverable state such as a hardware failure or a broken OS, the Node Plugin cannot issue NodeUnpublishVolume / NodeUnstageVolume. In this case, we want the volume to reach the CREATED state (the volume is detached from the node, and the pods have been evicted to another node and are running).
But in the current CSI volume lifecycle, there is no transition from PUBLISHED / VOL_READY / NODE_READY to CREATED that skips the Node Plugin calls.
As a result, k8s doesn't follow the CSI spec: the volume moves from PUBLISHED to CREATED directly, without going through the VOL_READY and/or NODE_READY states.

We need to update the CSI volume lifecycle to account for the case where the Node is unreachable.
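
To make the gap concrete, here is a minimal sketch (in Go, purely illustrative and not part of the spec or any CO) of the dynamic-provisioning lifecycle as a transition table. The state and RPC names come from the spec's lifecycle figure; everything else is hypothetical. Note there is no entry that takes PUBLISHED or VOL_READY straight to CREATED without the Node Plugin RPCs.

```go
// Hypothetical transition table for the CSI dynamic-provisioning lifecycle
// (plugin with ControllerPublish and NodeStage capabilities).
package main

import "fmt"

type State string

const (
	Created   State = "CREATED"
	NodeReady State = "NODE_READY"
	VolReady  State = "VOL_READY"
	Published State = "PUBLISHED"
)

// transitions maps a state and the RPC that leaves it to the next state.
var transitions = map[State]map[string]State{
	Created:   {"ControllerPublishVolume": NodeReady},
	NodeReady: {"NodeStageVolume": VolReady, "ControllerUnpublishVolume": Created},
	VolReady:  {"NodePublishVolume": Published, "NodeUnstageVolume": NodeReady},
	Published: {"NodeUnpublishVolume": VolReady},
	// Nothing takes PUBLISHED (or VOL_READY) straight to CREATED,
	// which is the transition a CO needs when the node is unreachable.
}

func main() {
	_, ok := transitions[Published]["ControllerUnpublishVolume"]
	fmt.Println("PUBLISHED -> CREATED allowed by the table:", ok) // false
}
```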

@jdef
Member

jdef commented Jun 15, 2022 via email

@YuikoTakada
Author

@jdef Thank you for your comments.

I understand what you're saying.

> At first glance, this seems like a breaking change to the spec. As this (CSI) is not a k8s project, it is not bound by the referenced KEP.

I want to find a good solution which doesn't break existing drivers.
Relatedly, #477 also seems to try to solve this problem. WDYT?

@xing-yang
Contributor

This would trigger the deletion of the VolumeAttachment objects. For CSI drivers, this allows ControllerUnpublishVolume to happen without NodeUnpublishVolume and/or NodeUnstageVolume being called first. Note that there are no additional code changes required for this step; it happens automatically after the proposed change in the previous step to force detach right away.

Note that this "force detach" behavior is not introduced by the Non-Graceful Node Shutdown feature. Kubernetes already supports this behavior without Non-Graceful Node Shutdown; see "Test 2" in the PR description section below. If you forcefully delete the Pods on the shut-down node manually, the volumes will be force-detached by the Attach Detach Controller after a 6-minute wait.

kubernetes/kubernetes#108486

@jsafrane
Contributor

Yes, Kubernetes already breaks the CSI spec and can call ControllerUnpublish without NodeUnpublish / NodeUnstage succeeding if Kubernetes thinks the node is broken - it can't really call NodeUnstage/NodeUnpublish in that case or get their results.

The last attempt to fix this officially in the CSI spec is #477.

@tgross

tgross commented Sep 12, 2022

> Yes, Kubernetes already breaks the CSI spec and can call ControllerUnpublish without NodeUnpublish / NodeUnstage succeeding if Kubernetes thinks the node is broken - it can't really call NodeUnstage/NodeUnpublish in that case or get their results.

Implementor of Nomad's CSI support here! 👋 For what it's worth, we originally implemented the spec as written and it turned out to cause our users a lot of grief. As of Nomad 1.3.0 (shipped in May of this year), we're doing something similar to what k8s has done, where we make a best-effort attempt to NodeUnpublish / NodeUnstage before ControllerUnpublish.

We drive this from the "client node" (our equivalent of the kubelet), so if the client node is merely disconnected and not dead, we can rely on the node unpublish/unstage having happened by the time we try to GC the claim from the control-plane side. The control plane ends up retrying NodeUnstage/NodeUnpublish anyway, but proceeds on to ControllerUnpublish if it can't reach the node.
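
A minimal sketch of that best-effort ordering, assuming hypothetical NodeClient/ControllerClient interfaces rather than Nomad's or Kubernetes' actual types: attempt the node-side cleanup, log and continue on failure, then always proceed to ControllerUnpublishVolume.

```go
package main

import (
	"context"
	"fmt"
)

// NodeClient and ControllerClient are hypothetical stand-ins for the node
// and controller plugin RPC clients; a real CO would wrap the CSI gRPC stubs.
type NodeClient interface {
	NodeUnpublishVolume(ctx context.Context, volID string) error
	NodeUnstageVolume(ctx context.Context, volID string) error
}

type ControllerClient interface {
	ControllerUnpublishVolume(ctx context.Context, volID, nodeID string) error
}

// bestEffortUnpublish tries the node-side RPCs first, logs and continues if
// they fail (e.g. the node is unreachable), and always finishes with
// ControllerUnpublishVolume so the volume can be attached elsewhere.
func bestEffortUnpublish(ctx context.Context, node NodeClient, ctrl ControllerClient, volID, nodeID string) error {
	if err := node.NodeUnpublishVolume(ctx, volID); err != nil {
		fmt.Printf("best-effort NodeUnpublishVolume failed: %v\n", err)
	}
	if err := node.NodeUnstageVolume(ctx, volID); err != nil {
		fmt.Printf("best-effort NodeUnstageVolume failed: %v\n", err)
	}
	return ctrl.ControllerUnpublishVolume(ctx, volID, nodeID)
}

// Fakes standing in for an unreachable node and a working controller.
type deadNode struct{}

func (deadNode) NodeUnpublishVolume(ctx context.Context, volID string) error {
	return fmt.Errorf("node unreachable")
}
func (deadNode) NodeUnstageVolume(ctx context.Context, volID string) error {
	return fmt.Errorf("node unreachable")
}

type okController struct{}

func (okController) ControllerUnpublishVolume(ctx context.Context, volID, nodeID string) error {
	return nil
}

func main() {
	err := bestEffortUnpublish(context.Background(), deadNode{}, okController{}, "vol-1", "node-1")
	fmt.Println("controller unpublish error:", err) // <nil>
}
```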

@jdef
Member

jdef commented Sep 12, 2022

Thanks @tgross.

Are there any concerns from plugin providers that may be relying on CSI-as-written vs. the best-effort described herein?

Also, Mesos is another CO w/ CSI integration - does anyone on that side of the house have input to add here?

@tgross

tgross commented Sep 14, 2022

> Are there any concerns from plugin providers that may be relying on CSI-as-written vs. the best-effort described herein?

IIRC all the plugins I've tested that support ControllerPublish implement that step as "make the API call to the storage provider to attach the volume as a device on the target host" and then implement NodeStage as "call mount(8) to mount that device". So when unpublishing, if the unmount is missed, the API call will try to detach a mounted device.

I know that, for example, the AWS EBS provider just merrily returns OK to that API call and then the device doesn't actually get detached until it's unmounted. (Or the user can "force detach" via the API out-of-band.) So in that case the provider is graceful and has behavior that's eventually correct, as long as the node unpublish happens eventually. If the node unpublish never happens (say the CO has crashed unrecoverably but the host is still live), I think you end up with a hung volume. But arguably that's the right behavior. I just don't know how prevalent that graceful treatment is across the ecosystem.

@YuikoTakada
Author

@tgross Thank you for sharing this information.
I've updated the description to apply to COs as a whole. This issue is not specific to Kubernetes.

tgross added a commit to hashicorp/nomad that referenced this issue Jul 19, 2023
The CSI specification says that we "SHOULD" send no more than one in-flight
request per *volume* at a time, with an allowance for losing state
(ex. leadership transitions) which the plugins "SHOULD" handle gracefully. We
mostly successfully serialize node and controller RPCs for the same volume,
except when Nomad clients are lost.
(See also container-storage-interface/spec#512)

These concurrency requirements in the spec fall short because Storage Provider
APIs aren't necessarily safe to call concurrently on the same host. For example,
concurrently attaching AWS EBS volumes to an EC2 instance results in a race for
device names, which results in failure to attach and confused results when
releasing claims. So in practice many CSI plugins rely on k8s-specific sidecars
for serializing storage provider API calls globally. As a result, we have to be
much more conservative about concurrency in Nomad than the spec allows.

This changeset includes two major changes to fix this:
* Add a serializer method to the CSI volume RPC handler. When the
  RPC handler makes a destructive CSI Controller RPC, we send the RPC thru this
  serializer and only one RPC is sent at a time. Any other RPCs in flight will
  block.
* Ensure that requests go to the same controller plugin instance whenever
  possible by sorting by lowest client ID out of the healthy plugin instances.

Fixes: #15415
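
As a rough illustration of the serializer idea described in that commit (hypothetical code, not Nomad's actual implementation): funnel every destructive Controller RPC through a single lock so only one call is in flight at a time, and any other callers block until it returns.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// controllerSerializer routes every destructive Controller RPC through one
// gate so only a single call is in flight at a time; the rest block.
type controllerSerializer struct {
	mu sync.Mutex
}

func (s *controllerSerializer) Do(ctx context.Context, name string, rpc func(context.Context) error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	fmt.Println("sending", name)
	return rpc(ctx)
}

func main() {
	s := &controllerSerializer{}
	var wg sync.WaitGroup
	for _, name := range []string{"ControllerUnpublishVolume", "DeleteVolume"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			_ = s.Do(context.Background(), name, func(ctx context.Context) error {
				time.Sleep(10 * time.Millisecond) // stand-in for the storage provider API call
				return nil
			})
		}(name)
	}
	wg.Wait()
}
```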
tgross added a commit to hashicorp/nomad that referenced this issue Jul 20, 2023
The CSI specification says that we "SHOULD" send no more than one in-flight
request per *volume* at a time, with an allowance for losing state
(ex. leadership transitions) which the plugins "SHOULD" handle gracefully. We
mostly successfully serialize node and controller RPCs for the same volume,
except when Nomad clients are lost. (See also
container-storage-interface/spec#512)

These concurrency requirements in the spec fall short because Storage Provider
APIs aren't necessarily safe to call concurrently on the same host even for
_different_ volumes. For example, concurrently attaching AWS EBS volumes to an
EC2 instance results in a race for device names, which results in failure to
attach (because the device name is taken already and the API call fails) and
confused results when releasing claims. So in practice many CSI plugins rely on
k8s-specific sidecars for serializing storage provider API calls globally. As a
result, we have to be much more conservative about concurrency in Nomad than the
spec allows.

This changeset includes four major changes to fix this:
* Add a serializer method to the CSI volume RPC handler. When the RPC handler
  makes a destructive CSI Controller RPC, we send the RPC thru this serializer
  and only one RPC is sent at a time. Any other RPCs in flight will block.
* Ensure that requests go to the same controller plugin instance whenever
  possible by sorting by lowest client ID out of the plugin instances.
* Ensure that requests go to _healthy_ plugin instances only.
* Ensure that requests for controllers can go to a controller on any _live_
  node, not just ones eligible for scheduling (which CSI controllers don't care
  about)

Fixes: #15415
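
The instance-pinning point can likewise be sketched with a hypothetical selector: filter to healthy instances and always pick the lowest client ID, so repeated requests land on the same controller plugin instance whenever it stays healthy.

```go
package main

import (
	"fmt"
	"sort"
)

// pluginInstance is a hypothetical view of one running controller plugin.
type pluginInstance struct {
	ClientID string
	Healthy  bool
}

// pickController filters to healthy instances and returns the one with the
// lowest client ID, so successive requests target the same instance
// whenever possible.
func pickController(instances []pluginInstance) (pluginInstance, bool) {
	var healthy []pluginInstance
	for _, in := range instances {
		if in.Healthy {
			healthy = append(healthy, in)
		}
	}
	if len(healthy) == 0 {
		return pluginInstance{}, false
	}
	sort.Slice(healthy, func(i, j int) bool { return healthy[i].ClientID < healthy[j].ClientID })
	return healthy[0], true
}

func main() {
	got, ok := pickController([]pluginInstance{
		{ClientID: "c9", Healthy: true},
		{ClientID: "c2", Healthy: false},
		{ClientID: "c4", Healthy: true},
	})
	fmt.Println(got.ClientID, ok) // c4 true
}
```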
jrasell pushed a commit to hashicorp/nomad that referenced this issue Jul 26, 2023