If kubelet is unavailable, AttachDetachController fails to force detach on pod deletion #65392

Closed
verult opened this issue Jun 23, 2018 · 124 comments
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/storage: Categorizes an issue or PR as relevant to SIG Storage.

Comments

@verult
Contributor

verult commented Jun 23, 2018

/kind bug

What you expected to happen:
When a pod using an attached volume is deleted (gracefully) but kubelet in the corresponding node is down, the AttachDetachController should assume the node is unrecoverable after a timeout (currently 6min) and forcefully detach the volume.

What happened:
The volume is never detached.

How to reproduce it (as minimally and precisely as possible):

  • Create a pod with a volume using any of the attachable plugins (I used GCE PD to test).
  • Stop kubelet inside the node where the pod is scheduled.
  • Delete the pod.
  • Wait for 6min+
  • Check whether the volume is still attached (see the sketch below).
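A minimal reproduction sketch (hedged; it assumes a GCE PD-backed StorageClass named "standard", a systemd-managed kubelet, and gcloud SSH access to the node; all names are placeholders):

# 1. Create a PVC and a pod that uses it.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard   # assumed GCE PD StorageClass
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
EOF

# 2. Stop kubelet on the node running the pod.
node=$(kubectl get pod test-pod -o jsonpath='{.spec.nodeName}')
gcloud compute ssh "$node" -- sudo systemctl stop kubelet

# 3. Delete the pod gracefully (no --force), wait 6+ minutes, then check
#    whether the volume is still reported as attached to the node.
kubectl delete pod test-pod --wait=false
sleep 400
kubectl get node "$node" -o jsonpath='{.status.volumesAttached}'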

Anything else we need to know?:
This doesn't happen if the pod is force-deleted.

It's likely due to the last condition checked in this line: once kubelet is down, container statuses are no longer reported correctly. Inside the AttachDetachController, this function is called by the pod update informer handler, which sets whether the volume should be attached in the desired state of the world. On pod force deletion, the pod object is deleted immediately, which triggers the pod delete informer handler instead, and that handler doesn't call this function.
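A quick way to observe the stale status (a hedged sketch; pod and node names are placeholders):

# With kubelet stopped, the pod's container statuses are never updated, so the
# controller still sees the containers as running even after a graceful delete.
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].state}'

# The node keeps reporting the volume as attached well past the 6-minute timeout.
kubectl get node <node-name> -o jsonpath='{.status.volumesAttached}'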

/sig storage
/cc @saad-ali @gnufied @jingxu97 @NickrenREN
/assign

@k8s-ci-robot added the kind/bug and sig/storage labels on Jun 23, 2018
@saad-ali added the priority/important-soon and status/approved-for-milestone labels on Jun 27, 2018
@saad-ali added this to the v1.12 milestone on Jun 27, 2018
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Issue: Up-to-date for process

@verult

Issue Labels
  • sig/storage: Issue will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the issue owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/bug: Fixes a bug discovered during the current release.

@orainxiong

I may have hit a similar case, though not exactly the same one.

The general sequence is as follows:

  • Configure volumes.kubernetes.io/controller-managed-attach-detach so that the attach/detach controller is in charge of both attach and detach operations;
  • Create a StatefulSet with volumes provisioned by GlusterFS;
  • Use iptables to simulate a master-worker network partition, so the kubelet on the worker node can't talk to the apiserver;
  • The NodeLifecycleController subsequently detects the unreachable node and evicts the pods on it;
  • The evicted pods are recreated and scheduled to another available node with their previous volumes.

The attach/detach controller works as expected. I'm using version 1.9.3.

@verult
Contributor Author

verult commented Jul 13, 2018

@orainxiong is the terminationGracePeriod set inside the statefulset?

@orainxiong

orainxiong commented Jul 14, 2018

terminationGracePeriod is 0 if I remember correctly. I guess it might be the root cause.

@guineveresaenger
Contributor

@saad-ali can we have an update on whether someone is working on this issue?

@NickrenREN
Contributor

I can take a look at this

/assign

@shay-berman

@saad-ali, @verult and I have scheduled a meeting to talk about this issue on Monday, Aug 27, at 11am (New York time).

@shay-berman

Hi @saad-ali , @verult and @gnufied.

I have a concern about how to fix this issue.
Regarding the implementation of #67847 / #67419, I am not sure the PR design is safe enough.
The solution proposed in that PR is to leverage DeletionGracePeriodSeconds instead of using podStatus.ContainerStatuses (HERE).

It's good not to rely on podStatus.ContainerStatuses, because in a crashed-worker scenario the kubelet is unresponsive and the ContainerStatuses remain Running forever (which prevents detaching the pod's volumes). So adding DeletionGracePeriodSeconds as another trigger to force detach the PVCs of a non-responding pod is a good idea.

BUT I think we should be very careful about force-detaching PVCs: in a split-brain scenario (the worker node didn't really crash, it was just a network issue) where the pod is still running and writing I/O to the PVC, detaching the PVC may lead to data corruption.
So it may be too brutal to trust DeletionGracePeriodSeconds alone (AFAIK the customer-facing knob for this timeout is TerminationGracePeriodSeconds, which defaults to 30 seconds if you don't specify it in the pod spec). If we take this PR's implementation as-is, some applications could get a force detach during a split brain, and maybe those applications cannot handle such a detach.

I would like to propose a safer implementation that, on the one hand, detaches volumes for pods on a crashed node, but on the other hand is also safe in a split-brain scenario.
The proposal is to add a new annotation to the pod spec (e.g. AllowForceDetachAfterGracePeriod, defaulting to false). Only if the customer sets it to true in the pod spec will Kubernetes force detach the pod's PVCs; if it is false, then even after DeletionGracePeriodSeconds there is no force detach, which keeps the existing behavior, so the customer has to delete the pod manually (in a safe, deliberate way to prevent data loss).
By adding this safety annotation we give the customer the ability to decide whether the pod can recover from a detach operation. It would also make it safe to cherry-pick the fix to older Kubernetes versions, because it doesn't change any existing behavior (only pods with this annotation get the new behavior).
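For concreteness, a sketch of what opting in with the proposed annotation could look like (the annotation key comes from the proposal above and is hypothetical; it is not an existing Kubernetes API):

# Hypothetical opt-in annotation from the proposal; Kubernetes does not implement it.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pvc-consumer
  annotations:
    AllowForceDetachAfterGracePeriod: "true"   # proposed, not a real annotation
spec:
  terminationGracePeriodSeconds: 30
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-pvc   # assumes an existing PVC
EOF

Pods without the annotation would keep today's behavior and never be force detached.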

Thoughts?
Let me know if this is overkill, or whether it's safer for the user in order to avoid potential data loss.

@verult
Contributor Author

verult commented Aug 29, 2018

@gnufied and I discussed this offline. Please correct me if I'm wrong, Hemant, but your argument was that setting the annotation to true is equivalent to setting TerminationGracePeriod very high, which also "disables" this detach behavior by delaying it indefinitely.

On the other hand, because the default TerminationGracePeriod is 30s, we can't safely cherry-pick this PR back to older versions.

One option is to use an annotation + DeletionGracePeriod for all versions prior to 1.12, and only DeletionGracePeriod for 1.12 and later.

@gnufied
Member

gnufied commented Aug 29, 2018

Another thing we could do is only detach volumes from a node when it is known to have been shut down. So basically, make use of the newly introduced taint to detach the volume. This does mean that volumes never get detached from nodes with unresponsive or crashed kubelets, but that is probably fine. If the user really wants to force the detach, they can always do so via:

~> kubectl delete pod <foo> --force=true --grace-period=0

This means we would have to roll this PR into #67977, and we obviously can't backport it.

Older versions of the AWS, OpenStack, and vSphere cloud providers are not affected by this bug, btw, because those cloud providers deleted a node when it was shut down. Only recently was this behaviour changed so that the node object is no longer deleted; hence pods remain in the "Unknown" phase and the volume never gets detached.

This bug affects versions as far back as 1.9 (or even earlier), but the cloud provider behaviour is a recent change. Another problem: using shutdown to trigger volume detach has been feature-gated as an alpha feature.

I do not like the annotation option that much, tbh. It is hard to document and easy to miss, and since it is an API change it also has to go through alpha and beta phases. Agreed that #67977 is currently feature-gated too, but I do not see a strong reason why it should be. cc @yastij

@yastij
Member

yastij commented Aug 29, 2018

@gnufied - some production clusters rely on quick reboots (i.e. reboot and still have the disks attached). We might want to give them some room to transition.

@gnufied
Member

gnufied commented Aug 30, 2018

@yastij shouldn't the pod eviction timeout (5-minute default), plus the expiry window before a node that has stopped sending heartbeats is considered lost, be enough? There is also the grace period, obviously.

Another reason I don't particularly like the annotation is that detaching volumes from shut-down nodes should be a default behavior and should not require user intervention. If we can differentiate between shut-down nodes and unresponsive kubelets, then we should use that information to detach volumes, and it should be the default rather than something the user has to configure.

@yastij
Member

yastij commented Aug 30, 2018

I agree with @gnufied on annotations.

The plan is to promote this to beta in 1.13. From what I've seen, the eviction timeout impacts all pods, which might be tricky: for aggressive clusters this doesn't work, as the pod timeout and node grace period are set lower than usual. This window (one release) ensures that everything goes smoothly.

@saad-ali
Member

Good arguments on both sides. On the one hand, we have to be very careful not to cause data corruption. On the other hand, we don't want to get into a state where a user's data is stuck and requires manual intervention to be accessible from new workloads.

I agree with @gnufied, we should avoid adding another annotation -- whenever there is a question like this, exposing the option to the end user is the easiest way out, but not necessarily the best. We should try to be smart and do the right thing automatically on behalf of the user.

The right thing in this case is tricky. If a node is not coming back, we should move the disk even if it was not safely unmounted. But how do we differentiate that from a split brain? A timer-based algorithm is pretty crude, but it's the best approximation we currently have. I'm not aware of any way to distinguish a split brain from a dead node, so I think we need to weigh the likelihood of the different scenarios, the worst case for each option, and how willing we are to tolerate it.

I'll leave it up to you guys to weigh those tradeoffs.

@zetaab
Member

zetaab commented Aug 30, 2018

@gnufied actually, in OpenStack nodes are not deleted anymore (#59931). This basically means that in OpenStack, if you shut down an instance that had a pod with a PVC, it cannot fail over to another instance.

However, #67977 will fix this issue.

I just talked with @yastij; we think the feature gate should only control the detach delay. Then volumes can be detached in OpenStack again: with the feature gate turned on, the delay is something like 1 minute; with it off, the detach delay is something like 8-10 minutes.

You are correct that in AWS this problem does not exist today, but if #59930 is approved (as it should be, because the other clouds' behaviour and the spec say nodes should not be deleted), then AWS will get this problem as well if #67977 is not approved.

@shay-berman

shay-berman commented Aug 30, 2018

Some notes I took during the conference meeting today (@verult please add your notes as well):

@guineveresaenger
Contributor

Do we have any notion of the timeframe for resolving this issue? Code freeze is next week, at which point only critical/urgent bugs will be permitted into the milestone.

Thank you! :)

@gnufied
Member

gnufied commented Aug 30, 2018

@shay-berman PR #67977 should still be on hold, because without actual handling of pods on shut-down nodes it has no effect whatsoever; volumes still will not be detached from shut-down nodes.

cc @msau42 @jingxu97

@msau42
Member

msau42 commented Jul 22, 2019

Can you be more specific about what kinds of "various reasons" there are? It's not obvious to me and I want to understand this basic use case.

@jfrederickson

@msau42 I can only speak for my own experience here, but I've seen behavior like this when the underlying hardware for a VM running a node fails. The cluster admin never has the opportunity to cordon/drain the node first in that scenario.

This prevents the volume from being reattached to another node even if the cloud provider knows that the instance is powered down (and therefore that the volume definitely isn't in active use on that node), as to my knowledge there isn't a node taint that lets us signal this status to k8s.

It's less of an issue if you have node autoscaling available (since the volume can be attached to another node as soon as the failed node is deleted), but when that's not an option, as far as I know any pods with attached PVs can't be rescheduled until the node comes back online or is removed from the cluster entirely. (It's been a while since I've run into this myself and looked into it, though; if that's no longer the case, I'd be glad to be wrong about this!)

@adampl

adampl commented Jul 23, 2019

There's ongoing work on a new solution to this problem: kubernetes/enhancements#1116

However, I'm still unsure how it will work with non-cloud clusters.

@msau42 Machines sometimes fail...

@alena1108
Contributor

@adampl thanks for the link. Resolving the issue one way or another, even if limited to cloud provider nodes only, would be great. Do you know what k8s release it's targeted for? cc @smarterclayton

@yastij
Member

yastij commented Jul 24, 2019

@alena1108 - hopefully 1.16

@wenjun93

@NickrenREN Hi, any info on the proposal for node fencing on cloud and bare metal?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Nov 28, 2019
@yastij
Member

yastij commented Nov 28, 2019

/remove-lifecycle stale
/lifecycle frozen

@neujie

neujie commented Jun 1, 2021

@jfrederickson @saad-ali @verult
Maybe we could define an API (similar to CSI) in the NodeLifecycleController to check the status of these nodes in the cluster before evicting pods.
If the API returns true, the NodeLifecycleController force deletes the evicted pods; otherwise, it keeps the current behavior.

@arvisha16

Hi All,

Is this issue still valid ???

@SystemZ

SystemZ commented May 4, 2022

Hi All,

Is this issue still valid ???

It's still a problem on 1.22; I haven't tested on 1.23 yet.
It should be fixed by the alpha feature NodeOutOfServiceVolumeDetach in 1.24: #108486, kubernetes/website#32595

I made a very rough draft of a workaround script. It can cause data loss, so be sure to test it first.
It should make failover faster for services with volumes attached. You just need to mount it via a ConfigMap into a container with kubectl in a Deployment, and add a ClusterRoleBinding for it (see the sketch after the script).

#!/bin/bash
trap "exit 0" SIGINT SIGTERM
while true; do
  echo "sleeping 5 sec..."
  sleep 5
  
  # automatically force remove pods in terminating state
  kubectl get pods -A | awk '$4 ~/Terminating/' | awk '{print "kubectl --namespace "$1" delete pod "$2" --grace-period=0 --force"}' | xargs -I {} sh -c '{};' &

  # next step is to delete volumeattachments to speed up recovery of pods instead of waiting 6 minutes
  # https://github.com/ceph/ceph-csi/issues/740
  node_offline=$(kubectl get node | awk '$2 ~/NotReady/' | awk '{print $1}' | head -n 1)
  kubectl get volumeattachments.storage.k8s.io -o json | jq -r '.items[] | select (.spec.nodeName == "'$node_offline'") | .metadata.name' | xargs -I {} sh -c 'kubectl delete volumeattachments.storage.k8s.io {};' &
done
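One way to wire the script up (a hedged sketch; the names below are made up for illustration, and cluster-admin is the blunt permission choice shown here, so narrow it down in practice):

# Store the script in a ConfigMap so it can be mounted into the workaround pod.
kubectl create configmap force-detach-workaround --from-file=watch.sh=watch.sh

# The pod needs to list nodes/pods and delete pods and volumeattachments.
kubectl create serviceaccount force-detach-workaround
kubectl create clusterrolebinding force-detach-workaround \
  --clusterrole=cluster-admin \
  --serviceaccount=default:force-detach-workaround

A Deployment would then mount the ConfigMap at, say, /scripts and run it under that ServiceAccount using an image that contains kubectl and jq.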

@jingxu97
Contributor

jingxu97 commented May 5, 2022

We now have a new alpha feature, "Non-graceful node shutdown", which should mostly help with this situation. Please check out the KEP for details: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown
There will be a blog post soon.
Please let us know if you have any feedback.

@xing-yang
Contributor

/assign

@xing-yang
Contributor

The non-graceful node shutdown feature is beta now: https://kubernetes.io/blog/2022/12/16/kubernetes-1-26-non-graceful-node-shutdown-beta/

You can use this feature to resolve the problem.
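Roughly, per the blog post above, once you have confirmed the node is really shut down (and not just partitioned, to avoid the data-corruption scenario discussed earlier in this thread), you apply the out-of-service taint so the volumes can be force detached and the pods rescheduled:

# Apply only after confirming the node is actually powered off.
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

# Remove the taint once the node has been recovered or deleted.
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-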

@xing-yang
Contributor

/close

Please re-open if this is still an issue.

@k8s-ci-robot
Contributor

@xing-yang: Closing this issue.

