If kubelet is unavailable, AttachDetachController fails to force detach on pod deletion #65392

Closed
verult opened this issue Jun 23, 2018 · 124 comments
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/storage: Categorizes an issue or PR as relevant to SIG Storage.

Comments

@verult
Contributor

verult commented Jun 23, 2018

/kind bug

What you expected to happen:
When a pod using an attached volume is deleted (gracefully) but kubelet in the corresponding node is down, the AttachDetachController should assume the node is unrecoverable after a timeout (currently 6min) and forcefully detach the volume.

What happened:
The volume is never detached.

How to reproduce it (as minimally and precisely as possible):

  • Create a pod with a volume using any of the attachable plugins (I used GCE PD to test).
  • Stop kubelet inside the node where the pod is scheduled.
  • Delete the pod.
  • Wait for 6min+
  • Check whether the volume is still attached (see the sketch below).
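A minimal reproduction sketch (hedged; it assumes a GCE PD-backed StorageClass named "standard", a systemd-managed kubelet, and gcloud SSH access to the node; all names are placeholders):

# 1. Create a PVC and a pod that uses it.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard   # assumed GCE PD StorageClass
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
EOF

# 2. Stop kubelet on the node running the pod.
node=$(kubectl get pod test-pod -o jsonpath='{.spec.nodeName}')
gcloud compute ssh "$node" -- sudo systemctl stop kubelet

# 3. Delete the pod gracefully (no --force), wait 6+ minutes, then check
#    whether the volume is still reported as attached to the node.
kubectl delete pod test-pod --wait=false
sleep 400
kubectl get node "$node" -o jsonpath='{.status.volumesAttached}'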

Anything else we need to know?:
This doesn't happen if the pod is force-deleted.

It's likely due to the last condition checked in this line: once kubelet is down, container statuses are no longer reported correctly. Inside the AttachDetachController, this function is called by the pod update informer handler, which sets whether the volume should be attached in the desired state of the world. On pod force deletion, the pod object is deleted immediately, which triggers the pod delete informer handler instead, and that handler doesn't call this function.
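A quick way to observe the stale status (a hedged sketch; pod and node names are placeholders):

# With kubelet stopped, the pod's container statuses are never updated, so the
# controller still sees the containers as running even after a graceful delete.
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].state}'

# The node keeps reporting the volume as attached well past the 6-minute timeout.
kubectl get node <node-name> -o jsonpath='{.status.volumesAttached}'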

/sig storage
/cc @saad-ali @gnufied @jingxu97 @NickrenREN
/assign

@k8s-ci-robot added the kind/bug and sig/storage labels on Jun 23, 2018
@saad-ali added the priority/important-soon and status/approved-for-milestone labels on Jun 27, 2018
@saad-ali added this to the v1.12 milestone on Jun 27, 2018
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Issue: Up-to-date for process

@verult

Issue Labels
  • sig/storage: Issue will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the issue owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/bug: Fixes a bug discovered during the current release.

@orainxiong

I may have hit a similar case, though not exactly the same one.

The general sequence is as follows:

  • Configure volumes.kubernetes.io/controller-managed-attach-detach so that the attach/detach controller is in charge of both attach and detach operations;
  • Create a StatefulSet with volumes provisioned by GlusterFS;
  • Use iptables to simulate a master-worker network partition, so the kubelet on the worker node can't talk to the apiserver;
  • The NodeLifecycleController subsequently detects the unreachable node and evicts the pods on it;
  • The evicted pods are recreated and scheduled to another available node with their previous volumes.

The attach/detach controller works as expected. I'm using version 1.9.3.

@verult
Contributor Author

verult commented Jul 13, 2018

@orainxiong is the terminationGracePeriod set inside the statefulset?

@orainxiong

orainxiong commented Jul 14, 2018

terminationGracePeriod is 0 if I remember correctly. I guess it might be the root cause.

@guineveresaenger
Contributor

@saad-ali can we have an update on whether someone is working on this issue?

@NickrenREN
Contributor

I can take a look at this

/assign

@shay-berman

@saad-ali, @verult and I have scheduled a meeting to talk about this issue on Monday, Aug 27, at 11am (New York time).

@shay-berman

Hi @saad-ali , @verult and @gnufied.

I have a concern about how to fix this issue.
Regarding the implementation of #67847 / #67419, I am not sure the PR design is safe enough.
The solution proposed in that PR is to leverage DeletionGracePeriodSeconds instead of using podStatus.ContainerStatuses (HERE).

It's good not to rely on podStatus.ContainerStatuses, because in a crashed-worker scenario the kubelet is unresponsive and the ContainerStatuses remain Running forever (which prevents detaching the pod's volumes). So adding DeletionGracePeriodSeconds as another trigger to force detach the PVCs of a non-responding pod is a good idea.

BUT I think we should be very careful about force-detaching PVCs: in a split-brain scenario (the worker node didn't really crash, it was just a network issue) where the pod is still running and writing I/O to the PVC, detaching the PVC may lead to data corruption.
So it may be too brutal to trust DeletionGracePeriodSeconds alone (AFAIK the customer-facing knob for this timeout is TerminationGracePeriodSeconds, which defaults to 30 seconds if you don't specify it in the pod spec). If we take this PR's implementation as-is, some applications could get a force detach during a split brain, and maybe those applications cannot handle such a detach.

I would like to propose a safer implementation that, on the one hand, detaches volumes for pods on a crashed node, but on the other hand is also safe in a split-brain scenario.
The proposal is to add a new annotation to the pod spec (e.g. AllowForceDetachAfterGracePeriod, defaulting to false). Only if the customer sets it to true in the pod spec will Kubernetes force detach the pod's PVCs; if it is false, then even after DeletionGracePeriodSeconds there is no force detach, which keeps the existing behavior, so the customer has to delete the pod manually (in a safe, deliberate way to prevent data loss).
By adding this safety annotation we give the customer the ability to decide whether the pod can recover from a detach operation. It would also make it safe to cherry-pick the fix to older Kubernetes versions, because it doesn't change any existing behavior (only pods with this annotation get the new behavior).
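For concreteness, a sketch of what opting in with the proposed annotation could look like (the annotation key comes from the proposal above and is hypothetical; it is not an existing Kubernetes API):

# Hypothetical opt-in annotation from the proposal; Kubernetes does not implement it.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pvc-consumer
  annotations:
    AllowForceDetachAfterGracePeriod: "true"   # proposed, not a real annotation
spec:
  terminationGracePeriodSeconds: 30
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-pvc   # assumes an existing PVC
EOF

Pods without the annotation would keep today's behavior and never be force detached.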

Thoughts?
Let me know if this is overkill, or whether it's safer for the user in order to avoid potential data loss.

@verult
Contributor Author

verult commented Aug 29, 2018

@gnufied and I discussed this offline. Please correct me if I'm wrong, Hemant, but your argument was that setting the annotation to true is equivalent to setting TerminationGracePeriod very high, which also "disables" this detach behavior by delaying it indefinitely.

On the other hand, because the default TerminationGracePeriod is 30s, we can't safely cherry-pick this PR back to older versions.

One option is to use an annotation + DeletionGracePeriod for all versions prior to 1.12, and only DeletionGracePeriod for 1.12 and later.

@gnufied
Member

gnufied commented Aug 29, 2018

Another thing we could do is only detach volumes from a node when it is known to have been shut down. So basically, make use of the newly introduced taint to detach the volume. This does mean that volumes never get detached from nodes with unresponsive or crashed kubelets, but that is probably fine. If the user really wants to force the detach, they can always do so via:

~> kubectl delete pod <foo> --force=true --grace-period=0

This means we would have to roll this PR into #67977, and we obviously can't backport it.

Older versions of the AWS, OpenStack, and vSphere cloud providers are not affected by this bug, btw, because those cloud providers deleted a node when it was shut down. Only recently was this behaviour changed so that the node object is no longer deleted; hence pods remain in the "Unknown" phase and the volume never gets detached.

This bug affects versions as far back as 1.9 (or even earlier), but the cloud provider behaviour is a recent change. Another problem: using shutdown to trigger volume detach has been feature-gated as an alpha feature.

I do not like the annotation option that much, tbh. It is hard to document and easy to miss, and since it is an API change it also has to go through alpha and beta phases. Agreed that #67977 is currently feature-gated too, but I do not see a strong reason why it should be. cc @yastij

@yastij
Member

yastij commented Aug 29, 2018

@gnufied - some production clusters rely on quick reboots (i.e. reboot and still have the disks attached). We might want to give them some room to transition.

@gnufied
Member

gnufied commented Aug 30, 2018

@yastij shouldn't the pod eviction timeout (5-minute default), plus the expiry window before a node that has stopped sending heartbeats is considered lost, be enough? There is also the grace period, obviously.

Another reason I don't particularly like the annotation is that detaching volumes from shut-down nodes should be a default behavior and should not require user intervention. If we can differentiate between shut-down nodes and unresponsive kubelets, then we should use that information to detach volumes, and it should be the default rather than something the user has to configure.

@yastij
Member

yastij commented Aug 30, 2018

I agree with @gnufied on annotations.

The plan is to promote this to beta in 1.13. From what I've seen, the eviction timeout impacts all pods, which might be tricky: for aggressive clusters this doesn't work, as the pod timeout and node grace period are set lower than usual. This window (one release) ensures that everything goes smoothly.

@saad-ali
Member

Good arguments on both sides. On the one hand, we have to be very careful not to cause data corruption. On the other hand, we don't want to get into a state where a user's data is stuck and requires manual intervention to be accessible from new workloads.

I agree with @gnufied, we should avoid adding another annotation -- whenever there is a question like this, exposing the option to the end user is the easiest way out, but not necessarily the best. We should try to be smart and do the right thing automatically on behalf of the user.

The right thing in this case is tricky. If a node is not coming back, we should move the disk even if it was not safely unmounted. But how do we differentiate that from a split brain? A timer-based algorithm is pretty crude, but it's the best approximation we currently have. I'm not aware of any way to distinguish a split brain from a dead node, so I think we need to weigh the likelihood of the different scenarios, the worst case for each option, and how willing we are to tolerate it.

I'll leave it up to you guys to weigh those tradeoffs.

@zetaab
Member

zetaab commented Aug 30, 2018

@gnufied actually, in OpenStack nodes are not deleted anymore (#59931). This basically means that in OpenStack, if you shut down an instance that had a pod with a PVC, it cannot fail over to another instance.

However, #67977 will fix this issue.

I just talked with @yastij; we think the feature gate should only control the detach delay. Then volumes can be detached in OpenStack again: with the feature gate turned on, the delay is something like 1 minute; with it off, the detach delay is something like 8-10 minutes.

You are correct that in AWS this problem does not exist today, but if #59930 is approved (as it should be, because the other clouds' behaviour and the spec say nodes should not be deleted), then AWS will get this problem as well if #67977 is not approved.

@shay-berman

shay-berman commented Aug 30, 2018

Some notes I took during the conference meeting today (@verult please add your notes as well):

@guineveresaenger
Contributor

Do we have any notion of the timeframe for resolving this issue? Code freeze is next week, at which point only critical/urgent bugs will be permitted into the milestone.

Thank you! :)

@gnufied
Member

gnufied commented Aug 30, 2018

@shay-berman PR #67977 should still be on hold, because without actual handling of pods on shut-down nodes it has no effect whatsoever; volumes still will not be detached from shut-down nodes.

cc @msau42 @jingxu97

@msau42
Member

msau42 commented Jul 22, 2019

Can you be more specific about what kinds of "various reasons" there are? It's not obvious to me and I want to understand this basic use case.

@jfrederickson

@msau42 I can only speak for my own experience here, but I've seen behavior like this when the underlying hardware for a VM running a node fails. The cluster admin never has the opportunity to cordon/drain the node first in that scenario.

This prevents the volume from being reattached to another node even if the cloud provider knows that the instance is powered down (and therefore that the volume definitely isn't in active use on that node), as to my knowledge there isn't a node taint that lets us signal this status to k8s.

It's less of an issue if you have node autoscaling available (since the volume can be attached to another node as soon as the failed node is deleted), but when that's not an option, as far as I know any pods with attached PVs can't be rescheduled until the node comes back online or is removed from the cluster entirely. (It's been a while since I've run into this myself and looked into it, though; if that's no longer the case, I'd be glad to be wrong about this!)

@adampl

adampl commented Jul 23, 2019

There's ongoing work on a new solution to this problem: kubernetes/enhancements#1116

However, I'm still unsure how it will work with non-cloud clusters.

@msau42 Machines sometimes fail...

@alena1108
Contributor

@adampl thanks for the link. Resolving the issue one way or another, even if limited to cloud provider nodes only, would be great. Do you know what k8s release it's targeted for? cc @smarterclayton

@yastij
Member

yastij commented Jul 24, 2019

@alena1108 - hopefully 1.16

@wenjun93

@NickrenREN Hi, any info on the proposal for node fencing on cloud and bare metal?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Nov 28, 2019
@yastij
Member

yastij commented Nov 28, 2019

/remove-lifecycle stale
/lifecycle frozen

@neujie

neujie commented Jun 1, 2021

@jfrederickson @saad-ali @verult
Maybe we could define an API (similar to CSI) in the NodeLifecycleController to check the status of these nodes in the cluster before evicting pods.
If the API returns true, the NodeLifecycleController force deletes the evicted pods; otherwise, it keeps the current behavior.

@arvisha16

Hi All,

Is this issue still valid ???

@SystemZ

SystemZ commented May 4, 2022

Hi All,

Is this issue still valid ???

It's still a problem on 1.22; I haven't tested on 1.23 yet.
It should be fixed by the alpha feature NodeOutOfServiceVolumeDetach in 1.24: #108486, kubernetes/website#32595

I made a very rough draft of a workaround script. It can cause data loss, so be sure to test it first.
It should make failover faster for services with volumes attached. You just need to mount it via a ConfigMap into a container with kubectl in a Deployment, and add a ClusterRoleBinding for it (see the sketch after the script).

#!/bin/bash
trap "exit 0" SIGINT SIGTERM
while true; do
  echo "sleeping 5 sec..."
  sleep 5
  
  # automatically force remove pods in terminating state
  kubectl get pods -A | awk '$4 ~/Terminating/' | awk '{print "kubectl --namespace "$1" delete pod "$2" --grace-period=0 --force"}' | xargs -I {} sh -c '{};' &

  # next step is to delete volumeattachments to speed up recovery of pods instead of waiting 6 minutes
  # https://github.com/ceph/ceph-csi/issues/740
  node_offline=$(kubectl get node | awk '$2 ~/NotReady/' | awk '{print $1}' | head -n 1)
  kubectl get volumeattachments.storage.k8s.io -o json | jq -r '.items[] | select (.spec.nodeName == "'$node_offline'") | .metadata.name' | xargs -I {} sh -c 'kubectl delete volumeattachments.storage.k8s.io {};' &
done
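One way to wire the script up (a hedged sketch; the names below are made up for illustration, and cluster-admin is the blunt permission choice shown here, so narrow it down in practice):

# Store the script in a ConfigMap so it can be mounted into the workaround pod.
kubectl create configmap force-detach-workaround --from-file=watch.sh=watch.sh

# The pod needs to list nodes/pods and delete pods and volumeattachments.
kubectl create serviceaccount force-detach-workaround
kubectl create clusterrolebinding force-detach-workaround \
  --clusterrole=cluster-admin \
  --serviceaccount=default:force-detach-workaround

A Deployment would then mount the ConfigMap at, say, /scripts and run it under that ServiceAccount using an image that contains kubectl and jq.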

@jingxu97
Contributor

jingxu97 commented May 5, 2022

We now have a new alpha feature, "Non-graceful node shutdown", which should mostly help with this situation. Please check out the KEP for details: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown
There will be a blog post soon.
Please let us know if you have any feedback.

@xing-yang
Contributor

/assign

@xing-yang
Contributor

The non-graceful node shutdown feature is beta now: https://kubernetes.io/blog/2022/12/16/kubernetes-1-26-non-graceful-node-shutdown-beta/

You can use this feature to resolve the problem.
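Roughly, per the blog post above, once you have confirmed the node is really shut down (and not just partitioned, to avoid the data-corruption scenario discussed earlier in this thread), you apply the out-of-service taint so the volumes can be force detached and the pods rescheduled:

# Apply only after confirming the node is actually powered off.
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

# Remove the taint once the node has been recovered or deleted.
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-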

@xing-yang
Contributor

/close

Please re-open if this is still an issue.

@k8s-ci-robot
Contributor

@xing-yang: Closing this issue.

