kubergrunt eks deploy doesn't gracefully decommission nodes if drain fails #107

Open
aleclerc-sonrai opened this issue Dec 21, 2020 · 1 comment
Labels: bug

@aleclerc-sonrai

I was trying out the `kubergrunt eks deploy` command today to upgrade/roll my ASGs to a new version.

I have several pods with local data, so the drain failed; however, kubergrunt still went ahead and forcefully terminated the nodes:

```
[] INFO[2020-12-21T11:02:04-04:00] Cordoning old instances in cluster ASG dev-eks-client-spot-us-east-1b20190530170243598800000004 to prevent Pod scheduling  name=kubergrunt
[] INFO[2020-12-21T11:02:05-04:00] Running command: kubectl --kubeconfig /Users/adam.leclerc/.kube/config cordon ip-172-25-114-254.ec2.internal
[] INFO[2020-12-21T11:02:05-04:00] Running command: kubectl --kubeconfig /Users/adam.leclerc/.kube/config cordon ip-172-25-116-251.ec2.internal
[] INFO[2020-12-21T11:02:07-04:00] node/ip-172-25-116-251.ec2.internal cordoned
[] INFO[2020-12-21T11:02:07-04:00] node/ip-172-25-114-254.ec2.internal cordoned
[] INFO[2020-12-21T11:02:07-04:00] Successfully cordoned old instances in cluster ASG dev-eks-client-spot-us-east-1b20190530170243598800000004  name=kubergrunt
[] INFO[2020-12-21T11:02:07-04:00] Draining Pods on old instances in cluster ASG dev-eks-client-spot-us-east-1b20190530170243598800000004  name=kubergrunt
[] INFO[2020-12-21T11:02:07-04:00] Running command: kubectl --kubeconfig /Users/adam.leclerc/.kube/config drain ip-172-25-116-251.ec2.internal --ignore-daemonsets --timeout 15m0s
[] INFO[2020-12-21T11:02:07-04:00] Running command: kubectl --kubeconfig /Users/adam.leclerc/.kube/config drain ip-172-25-114-254.ec2.internal --ignore-daemonsets --timeout 15m0s
[] INFO[2020-12-21T11:02:09-04:00] node/ip-172-25-116-251.ec2.internal already cordoned
[] INFO[2020-12-21T11:02:09-04:00] node/ip-172-25-114-254.ec2.internal already cordoned
[] INFO[2020-12-21T11:02:10-04:00] error: unable to drain node "ip-172-25-116-251.ec2.internal", aborting command...
[] INFO[2020-12-21T11:02:10-04:00]
[] INFO[2020-12-21T11:02:10-04:00] There are pending nodes to be drained:
[] INFO[2020-12-21T11:02:10-04:00]  ip-172-25-116-251.ec2.internal
[] INFO[2020-12-21T11:02:10-04:00] error: cannot delete Pods with local storage (use --delete-local-data to override): <<<< REDACTED>>>>
[] INFO[2020-12-21T11:02:10-04:00] error: unable to drain node "ip-172-25-114-254.ec2.internal", aborting command...
[] INFO[2020-12-21T11:02:10-04:00]
[] INFO[2020-12-21T11:02:10-04:00] There are pending nodes to be drained:
[] INFO[2020-12-21T11:02:10-04:00]  ip-172-25-114-254.ec2.internal
[] INFO[2020-12-21T11:02:10-04:00] error: cannot delete Pods with local storage (use --delete-local-data to override): <<<< REDACTED>>>>
[] INFO[2020-12-21T11:02:10-04:00] Successfully drained all scheduled Pods on old instances in cluster ASG dev-eks-client-spot-us-east-1b20190530170243598800000004  name=kubergrunt
[] INFO[2020-12-21T11:02:10-04:00] Removing old nodes from ASG dev-eks-client-spot-us-east-1b20190530170243598800000004  name=kubergrunt
[] INFO[2020-12-21T11:02:10-04:00] Detaching 2 instances from ASG dev-eks-client-spot-us-east-1b20190530170243598800000004  name=kubergrunt
[] INFO[2020-12-21T11:02:11-04:00] Detached 2 instances from ASG dev-eks-client-spot-us-east-1b20190530170243598800000004  name=kubergrunt
[] INFO[2020-12-21T11:02:11-04:00] Terminating 2 instances, in groups of up to 1000 instances  name=kubergrunt
[] INFO[2020-12-21T11:02:11-04:00] Terminated 2 instances from batch 0           name=kubergrunt
[] INFO[2020-12-21T11:02:11-04:00] Waiting for 2 instances to shut down from batch 0  name=kubergrunt
[] INFO[2020-12-21T11:03:42-04:00] Successfully shutdown 2 instances from batch 0  name=kubergrunt
[] INFO[2020-12-21T11:03:42-04:00] Successfully shutdown all 2 instances         name=kubergrunt
[] INFO[2020-12-21T11:03:42-04:00] Successfully removed old nodes from ASG dev-eks-client-spot-us-east-1b20190530170243598800000004  name=kubergrunt
[] INFO[2020-12-21T11:03:42-04:00] Successfully finished roll out for EKS cluster worker group dev-eks-client-spot-us-east-1b20190530170243598800000004 in us-east-1  name=kubergrunt
```

I have two questions:
1. Is this expected, or is it a bug?
2. Is there a way to allow the drain to pass the `--delete-local-data` flag?
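
For reference, the flag I'd want forwarded is the one the drain error itself suggests; run by hand, the drain would presumably need to look something like this:

```sh
kubectl --kubeconfig /Users/adam.leclerc/.kube/config drain ip-172-25-116-251.ec2.internal \
  --ignore-daemonsets --delete-local-data --timeout 15m0s
```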
@yorinasub17 added the bug label on Jan 4, 2021
@yorinasub17 (Contributor)

Thanks for sharing the logs!

> 1. Is this expected, or is it a bug?

There is indeed a bug here: the roll out should not continue when the drain fails. Will take a look today to see what is causing this (a rough sketch of the kind of error check that's missing is at the end of this comment). In the meantime...

> 2. Is there a way to allow the drain to pass the `--delete-local-data` flag?

Yup, you can pass that exact option to `eks deploy` and it will be forwarded to the drain call (`kubergrunt eks deploy --delete-local-data --region REGION --asg-name NAME` should work).
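
For this cluster that would be something like the following (region and ASG name taken from your logs; adjust as needed):

```sh
kubergrunt eks deploy \
  --region us-east-1 \
  --asg-name dev-eks-client-spot-us-east-1b20190530170243598800000004 \
  --delete-local-data
```

As for the bug itself, the missing piece is roughly the check sketched below. This is only an illustration with hypothetical names, not the actual kubergrunt internals: the drain step needs to surface the `kubectl drain` failure so the roll out aborts before any instances are detached and terminated.

```go
package main

import (
	"fmt"
	"os/exec"
)

// drainNodes runs "kubectl drain" for each node and returns an error as soon
// as one drain fails, instead of continuing on to the termination step.
func drainNodes(kubeconfig string, nodes []string, deleteLocalData bool) error {
	for _, node := range nodes {
		args := []string{"--kubeconfig", kubeconfig, "drain", node, "--ignore-daemonsets", "--timeout", "15m0s"}
		if deleteLocalData {
			args = append(args, "--delete-local-data")
		}
		out, err := exec.Command("kubectl", args...).CombinedOutput()
		if err != nil {
			// Surface the failure to the caller; do NOT proceed to detach/terminate.
			return fmt.Errorf("draining node %s failed: %v\n%s", node, err, out)
		}
	}
	return nil
}

func main() {
	nodes := []string{"ip-172-25-116-251.ec2.internal", "ip-172-25-114-254.ec2.internal"}
	if err := drainNodes("/Users/adam.leclerc/.kube/config", nodes, false); err != nil {
		// Abort the roll out; old instances stay cordoned but are left running.
		fmt.Println("aborting roll out:", err)
		return
	}
	fmt.Println("drain succeeded, safe to remove old instances from the ASG")
}
```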
