kubergrunt eks deploy doesn't gracefully decommission nodes if drain fails #107

Open
aleclerc-sonrai opened this issue Dec 21, 2020 · 1 comment
Labels: bug

@aleclerc-sonrai

I was trying out the `kubergrunt eks deploy` command today to upgrade/roll my ASGs to a new version.

I have several pods with local data, so the drain failed; however, kubergrunt still went ahead and forcefully terminated the nodes:

```
[] INFO[2020-12-21T11:02:04-04:00] Cordoning old instances in cluster ASG dev-eks-client-spot-us-east-1b20190530170243598800000004 to prevent Pod scheduling  name=kubergrunt
[] INFO[2020-12-21T11:02:05-04:00] Running command: kubectl --kubeconfig /Users/adam.leclerc/.kube/config cordon ip-172-25-114-254.ec2.internal
[] INFO[2020-12-21T11:02:05-04:00] Running command: kubectl --kubeconfig /Users/adam.leclerc/.kube/config cordon ip-172-25-116-251.ec2.internal
[] INFO[2020-12-21T11:02:07-04:00] node/ip-172-25-116-251.ec2.internal cordoned
[] INFO[2020-12-21T11:02:07-04:00] node/ip-172-25-114-254.ec2.internal cordoned
[] INFO[2020-12-21T11:02:07-04:00] Successfully cordoned old instances in cluster ASG dev-eks-client-spot-us-east-1b20190530170243598800000004  name=kubergrunt
[] INFO[2020-12-21T11:02:07-04:00] Draining Pods on old instances in cluster ASG dev-eks-client-spot-us-east-1b20190530170243598800000004  name=kubergrunt
[] INFO[2020-12-21T11:02:07-04:00] Running command: kubectl --kubeconfig /Users/adam.leclerc/.kube/config drain ip-172-25-116-251.ec2.internal --ignore-daemonsets --timeout 15m0s
[] INFO[2020-12-21T11:02:07-04:00] Running command: kubectl --kubeconfig /Users/adam.leclerc/.kube/config drain ip-172-25-114-254.ec2.internal --ignore-daemonsets --timeout 15m0s
[] INFO[2020-12-21T11:02:09-04:00] node/ip-172-25-116-251.ec2.internal already cordoned
[] INFO[2020-12-21T11:02:09-04:00] node/ip-172-25-114-254.ec2.internal already cordoned
[] INFO[2020-12-21T11:02:10-04:00] error: unable to drain node "ip-172-25-116-251.ec2.internal", aborting command...
[] INFO[2020-12-21T11:02:10-04:00]
[] INFO[2020-12-21T11:02:10-04:00] There are pending nodes to be drained:
[] INFO[2020-12-21T11:02:10-04:00]  ip-172-25-116-251.ec2.internal
[] INFO[2020-12-21T11:02:10-04:00] error: cannot delete Pods with local storage (use --delete-local-data to override): <<<< REDACTED>>>>
[] INFO[2020-12-21T11:02:10-04:00] error: unable to drain node "ip-172-25-114-254.ec2.internal", aborting command...
[] INFO[2020-12-21T11:02:10-04:00]
[] INFO[2020-12-21T11:02:10-04:00] There are pending nodes to be drained:
[] INFO[2020-12-21T11:02:10-04:00]  ip-172-25-114-254.ec2.internal
[] INFO[2020-12-21T11:02:10-04:00] error: cannot delete Pods with local storage (use --delete-local-data to override): <<<< REDACTED>>>>
[] INFO[2020-12-21T11:02:10-04:00] Successfully drained all scheduled Pods on old instances in cluster ASG dev-eks-client-spot-us-east-1b20190530170243598800000004  name=kubergrunt
[] INFO[2020-12-21T11:02:10-04:00] Removing old nodes from ASG dev-eks-client-spot-us-east-1b20190530170243598800000004  name=kubergrunt
[] INFO[2020-12-21T11:02:10-04:00] Detaching 2 instances from ASG dev-eks-client-spot-us-east-1b20190530170243598800000004  name=kubergrunt
[] INFO[2020-12-21T11:02:11-04:00] Detached 2 instances from ASG dev-eks-client-spot-us-east-1b20190530170243598800000004  name=kubergrunt
[] INFO[2020-12-21T11:02:11-04:00] Terminating 2 instances, in groups of up to 1000 instances  name=kubergrunt
[] INFO[2020-12-21T11:02:11-04:00] Terminated 2 instances from batch 0           name=kubergrunt
[] INFO[2020-12-21T11:02:11-04:00] Waiting for 2 instances to shut down from batch 0  name=kubergrunt
[] INFO[2020-12-21T11:03:42-04:00] Successfully shutdown 2 instances from batch 0  name=kubergrunt
[] INFO[2020-12-21T11:03:42-04:00] Successfully shutdown all 2 instances         name=kubergrunt
[] INFO[2020-12-21T11:03:42-04:00] Successfully removed old nodes from ASG dev-eks-client-spot-us-east-1b20190530170243598800000004  name=kubergrunt
[] INFO[2020-12-21T11:03:42-04:00] Successfully finished roll out for EKS cluster worker group dev-eks-client-spot-us-east-1b20190530170243598800000004 in us-east-1  name=kubergrunt
```

I have two questions:
1. Is this expected, or is it a bug?
2. Is there a way to allow the drain to pass the `--delete-local-data` flag?
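
For reference, the flag I'd want forwarded is the one the drain error itself suggests; run by hand, the drain would presumably need to look something like this:

```sh
kubectl --kubeconfig /Users/adam.leclerc/.kube/config drain ip-172-25-116-251.ec2.internal \
  --ignore-daemonsets --delete-local-data --timeout 15m0s
```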
@yorinasub17 added the bug label on Jan 4, 2021
@yorinasub17 (Contributor)

Thanks for sharing the logs!

> 1. Is this expected, or is it a bug?

There is indeed a bug here: the roll out should not continue when the drain fails. Will take a look today to see what is causing this (a rough sketch of the kind of error check that's missing is at the end of this comment). In the meantime...

> 2. Is there a way to allow the drain to pass the `--delete-local-data` flag?

Yup, you can pass that exact option to `eks deploy` and it will be forwarded to the drain call (`kubergrunt eks deploy --delete-local-data --region REGION --asg-name NAME` should work).
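
For this cluster that would be something like the following (region and ASG name taken from your logs; adjust as needed):

```sh
kubergrunt eks deploy \
  --region us-east-1 \
  --asg-name dev-eks-client-spot-us-east-1b20190530170243598800000004 \
  --delete-local-data
```

As for the bug itself, the missing piece is roughly the check sketched below. This is only an illustration with hypothetical names, not the actual kubergrunt internals: the drain step needs to surface the `kubectl drain` failure so the roll out aborts before any instances are detached and terminated.

```go
package main

import (
	"fmt"
	"os/exec"
)

// drainNodes runs "kubectl drain" for each node and returns an error as soon
// as one drain fails, instead of continuing on to the termination step.
func drainNodes(kubeconfig string, nodes []string, deleteLocalData bool) error {
	for _, node := range nodes {
		args := []string{"--kubeconfig", kubeconfig, "drain", node, "--ignore-daemonsets", "--timeout", "15m0s"}
		if deleteLocalData {
			args = append(args, "--delete-local-data")
		}
		out, err := exec.Command("kubectl", args...).CombinedOutput()
		if err != nil {
			// Surface the failure to the caller; do NOT proceed to detach/terminate.
			return fmt.Errorf("draining node %s failed: %v\n%s", node, err, out)
		}
	}
	return nil
}

func main() {
	nodes := []string{"ip-172-25-116-251.ec2.internal", "ip-172-25-114-254.ec2.internal"}
	if err := drainNodes("/Users/adam.leclerc/.kube/config", nodes, false); err != nil {
		// Abort the roll out; old instances stay cordoned but are left running.
		fmt.Println("aborting roll out:", err)
		return
	}
	fmt.Println("drain succeeded, safe to remove old instances from the ASG")
}
```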
