PVC attaching takes much time #1302

Closed
sergeymatsak87 opened this issue Jul 7, 2022 · 38 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@sergeymatsak87 commented Jul 7, 2022

Duplicating case #1279, as we observe the same issue: detaching/deleting a PVC and attaching it to a new node takes more than 6 minutes and is accompanied by timeouts.

What happened?

  Normal   Scheduled               7m21s                  default-scheduler        Successfully assigned dev-test/bigid-cache-0 to ip-13-118-32-127.ec2.internal
  Warning  FailedAttachVolume      7m21s                  attachdetach-controller  Multi-Attach error for volume "pvc-xxx" Volume is already exclusively attached to one node and can't be attached to another
  Warning  FailedMount             3m2s (x2 over 5m18s)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[redis-data], unattached volumes=[health redis-data config redis-tmp-conf tmp kube-api-access-98lr9 start-scripts]: timed out waiting for the condition
  Normal   SuccessfulAttachVolume  80s                    attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-xxx"
  Normal   Pulling                 64s                    kubelet                  Pulling image "docker.io/bitnami/redis:6.2.7-debian-10-r32"
  Normal   Pulled                  62s                    kubelet                  Successfully pulled image "docker.io/bitnami/redis:6.2.7-debian-10-r32" in 2.812050986s
  Normal   Created                 62s                    kubelet                  Created container redis
  Normal   Started                 62s                    kubelet                  Started container redis

EBS CSI Controller logs
Failure to Delete EBS volume after Detach

I0706 19:22:20.580181       1 request.go:665] Waited for 1.119392703s due to client-side throttling, not priority and fairness, request: PATCH:https://172.20.0.1:443/apis/storage.k8s.io/v1/volumeattachments/csi-49dcfd3771191a910bb4dc00a3862dbc4332449c708072c95e11f43ead631eb3/status
I0706 19:22:21.580381       1 request.go:665] Waited for 1.331653146s due to client-side throttling, not priority and fairness, request: PATCH:https://172.20.0.1:443/apis/storage.k8s.io/v1/volumeattachments/csi-1973672463aeb63fb239e6370c04bb7df8bb50290d8d9f2a0d3690d134f3521a/status
I0706 19:22:21.786329       1 csi_handler.go:286] Failed to save detach error to "csi-49dcfd3771191a910bb4dc00a3862dbc4332449c708072c95e11f43ead631eb3": volumeattachments.storage.k8s.io "csi-49dcfd3771191a910bb4dc00a3862dbc4332449c708072c95e11f43ead631eb3" not found
I0706 19:22:21.786354       1 csi_handler.go:231] Error processing "csi-49dcfd3771191a910bb4dc00a3862dbc4332449c708072c95e11f43ead631eb3": failed to detach: could not mark as detached: volumeattachments.storage.k8s.io "csi-49dcfd3771191a910bb4dc00a3862dbc4332449c708072c95e11f43ead631eb3" not found
I0706 19:22:21.877958       1 csi_handler.go:261] Attached "csi-d0cb7bac7a53511c451712aaa8bc8bac79adc2d3426d993d1f14a135d84fb5ab"
I0706 19:22:21.986228       1 csi_handler.go:286] Failed to save detach error to "csi-a5b254ba7d7a41c45a4d772d05dc55a002bff25d9f53c1ee9bd945d98126b59d": volumeattachments.storage.k8s.io "csi-a5b254ba7d7a41c45a4d772d05dc55a002bff25d9f53c1ee9bd945d98126b59d" not found
I0706 19:22:21.986254       1 csi_handler.go:231] Error processing "csi-a5b254ba7d7a41c45a4d772d05dc55a002bff25d9f53c1ee9bd945d98126b59d": failed to detach: could not mark as detached: volumeattachments.storage.k8s.io "csi-a5b254ba7d7a41c45a4d772d05dc55a002bff25d9f53c1ee9bd945d98126b59d" not found

What you expected to happen?
It should not take this much time to detach a PVC and attach it to a new k8s node. Compared to the default Kubernetes storage driver (gp2), where I do not see this issue, it takes a lot of time.

How to reproduce it (as minimally and precisely as possible)?
You just need to terminate the EC2 instance (k8s node) that the StatefulSet is running on directly.

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version): 1.22
  • Driver version: 1.6.2
@torredil (Member) commented Jul 7, 2022

I believe this is expected behavior. If you terminate your EC2 instance ungracefully (i.e. without safely evicting all of your pods through kubectl drain), the pods that are part of a StatefulSet will be stuck terminating on the shut-down node and cannot move to a new running node, since the kubelet is not available to delete the pods. If there are volumes used by the pods, the VolumeAttachment objects will not be deleted from the original node, so the volumes used by these pods cannot be attached to a new running node. There is work being done upstream to address this: K8s v1.24 introduces alpha support for Non-Graceful Node Shutdown, which would reduce this time. For more information see the KEP and kubernetes/kubernetes#108486.

The 6 minute wait time before timing out is due to the default configuration of Attach/Detach controller timers. More specifically, ReconcilerMaxWaitForUnmountDuration which is the max amount of time the attach/detach controller waits for a volume to be safely unmounted from a node. After 6 minutes, the controller assumes the node is unresponsive and attempts to detach the volume.

@amitde69 commented Jul 25, 2022

Is ReconcilerMaxWaitForUnmountDuration configurable? Is it part of the control plane?
The same issue happens to us, but not on ungraceful shutdown: it takes a long time for the PVC to get reattached to the new node after a pod reschedules.

@torredil (Member)

@amitde69 ReconcilerMaxWaitForUnmountDuration is currently not configurable: kubernetes/kubernetes#105341.
The previously mentioned KEP addresses this 6 minute wait time by issuing a force detach if the out-of-service taint is present on the node.
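
For reference, applying that out-of-service taint (once your cluster version supports the feature) looks like the command below; the node name is a placeholder:

  kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute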

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 23, 2022
@taylorsmithgg

+1, this makes some apps unusable (e.g. the Strimzi Kafka operator) because of the frequent timeouts.
The same works fine on the default Kubernetes Storage driver (gp2).

@carlosjgp

I'm still seeing this
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 23, 2022
@carlosjgp

I'm still seeing this
K8s AWS EKS 1.22.11
EBS image v1.13.0
/remove-lifecycle stale

@maximethebault

Experiencing the same behavior, and this is very concerning as this is a clear regression compared to previous versions of EKS.

We could observe the issue with a pod scheduled with Karpenter.
Karpenter cordoned & drained the node on which the pod was running (during consolidation), and it took several minutes for the pod to get back to the running state. The pod event log looks exactly the same as in the original issue report.

I want to make it clear that the node was not terminated ungracefully.

@samuel-form3

We're experiencing this on EKS 1.24 + EBS 1.12.

We use node-termination-handler to handle the graceful draining of the nodes. We observed (through the AWS console) that the volume becomes detached from the old node, but the attach/detach controller doesn't mark the volume as detached in the Kubernetes API.

This seems to be an issue in the attach/detach controller, to be honest. I'll try to dig into what may be happening.

@stdmje commented Feb 4, 2023

Same here.
EKS 1.24
Karpenter 0.23.0
ebs-csi-driver 1.15.0

/remove-lifecycle stale

@617m4rc commented Feb 21, 2023

Unfortunately, we see this problem as well, and it has a severe negative impact on our workload. Has anyone come up with a solution or workaround yet?

@maximethebault

/kind bug

@torredil @ConnorJC3 With comments and thumbs-ups deduplicated, there are now 12 people who have found their way to this GitHub issue to report this exact same problem. I suspect that number will grow, with EKS 1.22 support ending in June 2023 and the EBS CSI Driver becoming mandatory in EKS 1.23.

Can we please have an official word on this? Do you confirm the issue, or do you still think this is expected, perfectly normal behavior? Do you perhaps lack the resources to look into this? Should we alert our AWS account reps to get more attention on this?

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 25, 2023
@ConnorJC3 (Contributor)

Hi all, I can't make any specific promises before I get a chance to look into it, but I will investigate this bug sometime during the next work week.

In the meantime, can anyone affected here please clarify whether this issue is happening to them (only) when they terminate the EC2 instance behind a node without draining it first?

If you can reproduce this under other scenarios, please provide information on how to do that. Thanks!

@maximethebault

Thanks!

Running a "kubectl drain " on a node running a stateful workload is enough to reproduce this issue. The node doesn't have to be terminated ungracefully.

@jhass commented Feb 26, 2023

For me, killing a pod of a StatefulSet with a PVC attached is already enough to reproduce this.

@igor-nikiforov commented Feb 28, 2023

I have this issue every time we start the node replacement process when updating AWS launch templates. Killing the pods from the StatefulSet is usually not enough, so we also kill the EBS CSI controller pods, and then the issue disappears.

@maximethebault

(Quoting @ConnorJC3's request above for clarification and reproduction scenarios.)

Hi @ConnorJC3

Anything else you need, just let us know!

@ConnorJC3 (Contributor)

Hi all, sorry about the somewhat late response. I have been trying to reproduce this locally via methods such as kubectl drain, but I cannot seem to do so.

If anyone experiencing this issue in a scenario not related to an ungraceful shutdown (ungraceful shutdowns causing this is expected behavior and out of our control) could provide one of the following two, that would be very helpful:

  1. A minimum set of manifests to reproduce the issue with some level of consistency OR
  2. Logs from the EBS CSI Controller ebs-plugin and csi-attacher containers with the log level for both turned up to 4 when the issue occurs

@danielloader

I haven't got reproduction steps per se, but running StatefulSets with PVs/PVCs on Karpenter spot instances with a node lifespan TTL of 15 minutes is a good way to force the issue to happen quickly, as nodes are constantly churning and re-attaching the PVs takes a while.
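
For anyone wanting to reproduce churn this way, here is a minimal sketch of such a Provisioner, assuming the older Karpenter v1alpha5 API (field names differ in newer Karpenter releases):

  apiVersion: karpenter.sh/v1alpha5
  kind: Provisioner
  metadata:
    name: churn-test
  spec:
    ttlSecondsUntilExpired: 900   # expire nodes after 15 minutes so pods constantly reschedule
    requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]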

@tkatrichenko

We are using Spot instances on AWS, and the same issue occurs when an instance is terminated.

EKS: 1.22
ebs-csi-driver: v1.17
node-termination-handler: v1.19

@maximethebault

I don't know what changed exactly, but I cannot reproduce the issue after upgrading to the driver's latest version.

After reverting to an older version like 1.12.1, the issue is back.

I will upgrade all production clusters to the latest version and keep you updated if the issue reappears.
Meanwhile, it would be nice if anyone else could confirm that the issue is gone after upgrading to the driver's latest version.

@maherbeg

@maximethebault After upgrading your production clusters, are you still seeing the issue?

danail-branekov added a commit to cloudfoundry/korifi that referenced this issue May 4, 2023
We want to verify whether VolumeAttachments are leaking, see kubernetes-sigs/aws-ebs-csi-driver#1302 (comment)

Co-authored-by: Danail Branekov <danailster@gmail.com>
Co-authored-by: Georgi Sabev <georgethebeatle@gmail.com>
georgethebeatle added a commit to cloudfoundry/korifi that referenced this issue May 4, 2023
We want to verify whether VolumeAttachments are leaking, see kubernetes-sigs/aws-ebs-csi-driver#1302 (comment)

Co-authored-by: Danail Branekov <danailster@gmail.com>
Co-authored-by: Georgi Sabev <georgethebeatle@gmail.com>
@igor-nikiforov

I can confirm that this issue still exists with EKS 1.25 / 1.26 and the latest EBS driver version, 1.18.0.

@VitaliiBabii

Hi all,
I am also facing PVC (EBS) attach timeouts. Sometimes PVC attachment takes up to 1 hour.
In the csi-attacher container logs I see lots of "client-side throttling" messages. Is there a way to increase the request quota on the csi-attacher side?

I0515 15:12:23.707456       1 request.go:682] Waited for 3.792550031s due to client-side throttling, not priority and fairness, request: PATCH:https://100.64.0.1:443/api/v1/persistentvolumes/pvc-485c1d77-30e1-4006-bad4-396eeee17a6b
I0515 15:12:24.907002       1 request.go:682] Waited for 3.766319675s due to client-side throttling, not priority and fairness, request: PATCH:https://100.64.0.1:443/api/v1/persistentvolumes/pvc-cef2f6e7-df16-4d1d-ad50-7981093c78ad
I0515 15:12:25.907372       1 request.go:682] Waited for 3.392822052s due to client-side throttling, not priority and fairness, request: PATCH:https://100.64.0.1:443/api/v1/persistentvolumes/pvc-20aa26b3-f495-456e-a284-d9327bf73a84
I0515 15:12:27.107446       1 request.go:682] Waited for 3.192762704s due to client-side throttling, not priority and fairness, request: PATCH:https://100.64.0.1:443/api/v1/persistentvolumes/pvc-591f7379-5c9f-45f1-a42d-ef53808d0684

@VitaliiBabii

The PVC attaching delay was fixed by increasing the attacher and provisioner QPS and burst:
--kube-api-burst=20
--kube-api-qps=10.0
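
For context, these are csi-attacher / csi-provisioner sidecar flags. A sketch of setting them through the aws-ebs-csi-driver Helm chart, assuming your chart version exposes additionalArgs for the sidecars (check your chart's values.yaml before relying on these keys):

  sidecars:
    attacher:
      additionalArgs:
        - --kube-api-qps=10.0
        - --kube-api-burst=20
    provisioner:
      additionalArgs:
        - --kube-api-qps=10.0
        - --kube-api-burst=20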

@BrianKopp

Is this related? kubernetes-csi/external-attacher#215

@martysweet commented Jun 20, 2023

In our tests, we were able to reproduce an issue where a volume remained attached to a node after the pod was deleted. This issue is relatively easy to reproduce (within a few hundred volume attachments) and would not clear itself without manual intervention.

It's possible that users in this thread are experiencing multiple distinct issues; however, this one was the most disruptive to our users.

This reproduction is visible in the following places:

  • Node.volumesAttached and Node.volumesInUse were not empty on the old node (the stuck volumes remain there)
  • The volume was still typically attached to the old node in the AWS console
  • The VolumeAttachment object was still present in the cluster for the old node

The following workarounds can be used to detach the volume from the node (example commands are sketched after this list):

  • Delete the VolumeAttachment object - this seems to reattach the disk and kick off the detach process successfully
  • Delete the node - the controller-manager attachdetach component will delete the VolumeAttachment object after 6 minutes
  • Restart the kubelet on the node - the kubelet then detects the correct state (e.g. the volume is not mounted and not in use)
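
A sketch of those workarounds as commands; the VolumeAttachment name and node name are placeholders:

  # 1. Delete the stuck VolumeAttachment object
  kubectl delete volumeattachment <volumeattachment-name>

  # 2. Delete the node object (the attach/detach controller cleans up after ~6 minutes)
  kubectl delete node <node-name>

  # 3. Restart the kubelet on the affected node
  sudo systemctl restart kubelet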

This issue is most likely to occur when a pod moves from one node to another, for example, when draining a node and during karpenter consolidation events.

During investigation, we saw the following calls in CloudWatch to the API server from the kubelet:

filter requestURI like "/api/v1/persistentvolumes/"
| fields @timestamp, requestURI, user.username
| sort @timestamp asc
Time                     URL                                                                                           Auth
2023-06-19 08:11:50.180  /api/v1/persistentvolumes/pvc-42d8283d-38c8-49af-a09b-0822b118b40d                            system:node:ip-10-0-0-47.eu-west-1.compute.internal
2023-06-19 08:13:33.706  /api/v1/persistentvolumes/e706dce472d961b5b5ad8c80796434bb34d486f1ec17127a20f92b2084f4e61a    system:node:ip-10-0-0-137.eu-west-1.compute.internal

What stood out here was the sha256 hash being requested from the API, in place of the expected pvc-* format. This only occurs when the error condition is hit and the volume is not detached from the node. It recurs periodically in the logs because the detach job has a backoff period.

Delving into the kubelet logs, we can see logs similar to the following:

csi_attacher.go:603] kubernetes.io/csi: attacher.UnmountDevice failed to get driver and volume name from device mount path: persistentvolumes "e706dce472d961b5b5ad8c80796434bb34d486f1ec17127a20f92b2084f4e61a" is forbidden: User "system:node:ip-10-0-0-137.eu-west-1.compute.internal" cannot get res

This is the point where the kubelet is trying to unmount the volume (attacher.UnmountDevice) and fails because it cannot read vol_data.json. What was interesting here is that this code path does not exist in the release-1.27 codebase. With the latest patches, the kubelet should not be making this erroneous call to the API server during an unmount.

PR kubernetes/kubernetes#116138 removes the code path that is being taken by kubelet/v1.27.1 (linux/amd64) kubernetes/abfec7d, and the fix has been backported for other versions.

The invalid path was:

  • In the event the vol_data.json file did not exist
  • Call getDriverAndVolNameFromDeviceMountPath
  • This calls pv, err := k8s.CoreV1().PersistentVolumes().Get(context.TODO(), pvName, metav1.GetOptions{})
  • Where pvName is the sha256 hash of the volume handle // dir is now /var/lib/kubelet/plugins/kubernetes.io/csi/pv/{pvname}
  • The API returns a 403 Forbidden error: the PV doesn't exist, and even if it did, the kubelet would not have permission to access it
  • An error is returned, and the job waits to run again, failing indefinitely

So the observed behaviour becomes:

  • An underlying (rare) bug in the kubelet volumemanager results in multiple detach attempts being made for the same volume after a pod is stopped.
  • The first attempt succeeds, unmounting the volume, but the second attempt fails with the persistentvolume API call error above, as vol_data.json was removed by the first attempt.
  • As the second job is failing, the volume remains marked as "attached" and "in-use" in the Nodes status.
  • The second job will retry forever, never succeeding.
  • As the volume is marked "attached" and "in-use" by the node, the controllermanager attachdetach component will not delete the VolumeAttachment object for Cloud API detachment, as it could potentially be unsafe to do so - unless the node is terminated.

The changes in the PR remove this erroneous code path, and return a nil error in the event of a missing vol_data.json which will result in a successful unmount.

It appears the EKS provided version of the kubelet is not patched with the PR commits to include the fix for this issue, otherwise, these log events would not occur.

The PR states this fix has been applied to the upstream branches, and can be checked by looking at the specific release- branch in the kubernetes/kubernetes repo:

FYI. this issue is fixed in 1.24.14, 1.25.10, 1.26.5, and has been fixed in v1.27.0

We have reproduced this behaviour in the following versions of the AWS kubelet, as recorded by the UserAgent string:

  • kubelet/v1.27.1 (linux/amd64) kubernetes/abfec7d
  • kubelet/v1.26.4 (linux/amd64) kubernetes/4a34796
  • kubelet/v1.25.9 (linux/amd64) kubernetes/a1a87a0
  • kubelet/v1.24.13 (linux/amd64) kubernetes/4943330

It looks like we will need AWS to rebuild the kubelet/AMIs with the required upstream patches to resolve this specific issue.

To aid in reproduction, we have created a repository available here https://github.com/martysweet/eks-ebs-node-detachment-reproduction, which provisions a bare-bones eks cluster based off Terraform blueprints. The python script can then be used to cycle the pod between two nodes. It will typically take less than 60 minutes before it becomes stuck on all of the above versions.

Edit:
Further investigation shows that, despite the PR suggesting it had been fixed in v1.27.0, this is not the case:

The other kubelet versions tested (using the default blueprint-sourced AL2 AMIs) have patch versions lower than the ones the PR mentions (e.g. kubelet/v1.24.13 vs 1.24.14).

@ichbestimmtnicht

I can confirm the issue of long re-attachments for the addon v1.19.0-eksbuild.2 from Kubernetes version 1.23 upwards, and I found a somewhat easier workaround while impatiently updating a cluster: I deleted the volume.kubernetes.io/selected-node: [ip...] annotation from all PVCs prior to updating, and the switch to the new nodes was almost smooth.
I tested this for K8s versions 1.24 -> 1.25 -> 1.26. The 1.23 -> 1.24 update was my test update, where I discovered this fix while the pods couldn't be rescheduled.
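
A sketch of that workaround as a command (the trailing dash removes the annotation; PVC name and namespace are placeholders):

  kubectl annotate pvc <pvc-name> -n <namespace> volume.kubernetes.io/selected-node-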

This bug is a real pain, and in my opinion it should be a dealbreaker for anyone who tests EKS thoroughly.

@wallhided commented Aug 14, 2023

I have the same issue with a gp2 volume on 1.24, 1.25, and 1.26. The pod restart process takes from 3 to 20 minutes, even if the node hasn't changed. I opened a case, and a support engineer from AWS suggested adding
fsGroupChangePolicy: "OnRootMismatch" to the securityContext of the StatefulSet.
More info -> https://kubernetes.io/blog/2020/12/14/kubernetes-release-1.20-fsgroupchangepolicy-fsgrouppolicy/
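
A minimal sketch of that setting in the StatefulSet pod template (the fsGroup value is only an example):

  spec:
    template:
      spec:
        securityContext:
          fsGroup: 1001                        # example group; use whatever your image expects
          fsGroupChangePolicy: OnRootMismatch  # skip the recursive chown when ownership already matches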

This option helped me, and the pod started running without significant delays, but only if the node hasn't changed.

The same problem happens when a node is terminated gracefully by Karpenter; as the AWS support engineer described, it is a race condition related to Karpenter. Issue: #1665

I tried draining the node manually before deleting it. This helped, and there were no re-attaching issues for my StatefulSet pod. This could probably be temporarily fixed by integrating node-termination-handler.

@shay-ul commented Sep 13, 2023

For whoever is still impacted by this behaviour - make sure to update to the latest ebs csi driver version.

@TJCNI commented Sep 23, 2023

The problem still exists on EKS v1.27.4-eks-cedffd4, Karpenter v0.30.0, aws-ebs-csi-driver v1.23.0.

@torredil (Member)

@TJCNI The recently introduced PreStop lifecycle hook feature aims to address delayed attachments as a result of the driver node pod getting terminated before workload volumes are cleanly unmounted during graceful termination events. The hook logic will only execute if a node is being drained.

If you happen to be using spot instances along with Karpenter, you'll want to enable the Native Spot Termination Handling feature introduced in aws/karpenter-provider-aws#2546 (disabled by default):

Interruption: If enabled, Karpenter will watch for upcoming involuntary interruption events that could affect your nodes (health events, spot interruption, etc.) and will cordon, drain, and terminate the node(s) ahead of the event to reduce workload disruption.

For more details see Deprovisioning nodes from the Karpenter documentation.
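
A sketch of enabling that via Karpenter's Helm values, assuming a pre-v1 Karpenter settings layout (the exact key and the SQS queue name depend on your Karpenter version and infrastructure):

  settings:
    aws:
      interruptionQueueName: <sqs-queue-name>  # SQS queue receiving spot interruption / health events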

Otherwise, if your instances are indeed being gracefully terminated and you are running into delayed attachment issues, I highly encourage you to create a new issue describing your workflow and attaching driver logs (please include csi-attacher container logs and, if possible, kubelet logs) so we can debug and fix that for you. Thanks!

@hobbsh commented Oct 5, 2023

We have Karpenter termination handling set up and still experience this on EKS 1.27, ebs-csi-driver 1.23, Karpenter 0.30.0. Based on the logs (and probably to no one's surprise), the driver is failing to detach the volume. I also see that the node in question becomes NotReady (but is not cordoned/drained), making me think that something is up with the kubelet, but Karpenter does not seem to notice or care.

E1005 01:55:03.601580       1 driver.go:122] "GRPC error" err="rpc error: code = Internal desc = Could not detach volume \"vol-0691fe2d3686d6859\" from node \"i-07c1e6216907ac9a9\": context canceled"

@ConnorJC3 (Contributor)

@hobbsh EBS detaches rely on the EC2 instance's cooperation (if it is still online) - the OS must "give up" the volume, traditionally done on *nix OSes by unmounting it.

Thus, if your node breaks in such a way that the volume isn't detached on the node side (for example, because the Kubelet is broken and never asks the EBS CSI Driver to unmount the volume) - but the node is still running, the EBS CSI Driver will get stuck indefinitely trying to detach.

This is intentional design, as taking the volume away from a running node without its cooperation (what EBS calls a "force detach") can cause data loss. The EBS CSI Driver will never perform a force detach on your behalf because of the risks associated.

The feature that @torredil mentions above is only applicable when nodes are gracefully terminated (i.e. drained before shutdown). If your nodes are not being gracefully terminated, you'll need to work with Karpenter to figure out why that is occurring.

@doryer commented Oct 31, 2023

Has anyone managed to solve this on their K8s and CSI versions and can share how?

@blowfishpro

The EBS CSI Driver will never perform a force detach on your behalf because of the risks associated.

So how are unrecoverable instance failures supposed to be handled? Things like hardware failures are individually rare, but across a large fleet of nodes become a common enough occurrence. In these cases the volumes are never going to be gracefully unmounted, and we're left to either wait many minutes for some timeouts to be hit on the AWS side and the instance to stop, or to manually force-detach the volumes. Either way some data loss is inevitable, but requiring a human to intervene and force-detach the volumes is generally not what we want to be doing. And waiting for the timeouts to be hit usually means more downtime than we want (for the services we have not yet managed to make HA).

@torredil (Member)

@blowfishpro

So how are unrecoverable instance failures supposed to be handled?

As a first step, ensure that Graceful Node Shutdown is enabled in your cluster by configuring ShutdownGracePeriod and ShutdownGracePeriodCriticalPods. The default values are set to zero, so you will need to configure them as appropriate.

Doing the above will allow the driver's preStop hook to execute even during unexpected shutdowns, and thus prevent the node pod from terminating for a period of 30 seconds, given there are VolumeAttachment objects associated with the relevant instance. If volumes are successfully unmounted in that 30 second window, they will not be "locked up" for 6 minutes.
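
A minimal sketch of those settings in a KubeletConfiguration (durations are examples only; size them to your workloads):

  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  shutdownGracePeriod: 45s              # total time the kubelet delays node shutdown
  shutdownGracePeriodCriticalPods: 15s  # portion of that window reserved for critical pods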

That said, in rare corner cases where Kubelet's Node Shutdown Manager is unable to detect shutdown events and delay them by the ShutdownGracePeriod, you can manually add an out-of-service taint on the Node to reconcile the state before the 6 minute timer lapses; see Non-Graceful Node Shutdown:

kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

Either way some data loss is inevitable, but requiring a human to intervene and force-detach the volumes is generally not what we want to be doing

A different perspective on this: data loss / corruption of any kind is unacceptable. Delayed attachments are certainly highly undesirable and indicative of room for improvements, but potentially compromising a user's data is never the right tradeoff. Relevant to this discussion, Sig-Storage is currently working on a feature that allows for disabling the 6 minute force detach timer in A/D controller (doing so would be the equivalent of replacing 6 minutes with "infinity", requiring operator intervention by instructing Kubernetes that the node has really been shut-down): kubernetes/kubernetes#120328


To the best of my knowledge, the issue of delayed attachments as a result of graceful termination workflows has been addressed. If anyone runs into issues using the latest version of the driver, please report it and be sure to include all relevant context / information that may be helpful in tracking down the root cause so that we can address that promptly for you. Thank you!

/close

@k8s-ci-robot (Contributor)

@torredil: Closing this issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
