
Failed to delete machinePool for unreachable cluster #10544

Open · serngawy opened this issue May 1, 2024 · 6 comments · May be fixed by #10553
Labels: kind/bug, priority/backlog, triage/accepted

Comments

serngawy commented May 1, 2024

What steps did you take and what happened?

When uninstalling/deprovisioning a cluster while the cluster is unreachable, or while the cluster kubeConfig secret has been deleted (as it is managed by the controlPlaneRef), the MachinePool controller raises the errors below:

E0501 14:52:48.526675       1 controller.go:329] "Reconciler error" err="failed to create cluster accessor: error creating http client and mapper for remote cluster \"ns-rosa-hcp/rosa-hcp-2\": error creating client for remote cluster \"ns-rosa-hcp/rosa-hcp-2\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://api.mydomain.5sub.s3.devshift.org:443/api/v1?timeout=10s\": dial tcp: lookup api.mydomain.5sub.s3.devshift.org on 172.30.0.10:53: no such host" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="ns-rosa-hcp/workers-ex" namespace="ns-rosa-hcp" name="workers-ex" reconcileID="55df7e9c-460a-4cf5-bc19-409c18a6d61e"


E0501 14:53:48.529773       1 controller.go:329] "Reconciler error" err="failed to create cluster accessor: error fetching REST client config for remote cluster \"ns-rosa-hcp/rosa-hcp-2\": failed to retrieve kubeconfig secret for Cluster ns-rosa-hcp/rosa-hcp-2: Secret \"rosa-hcp-2-kubeconfig\" not found" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="ns-rosa-hcp/workers-ex" namespace="ns-rosa-hcp" name="workers-ex" reconcileID="bbd400ce-c773-4ffe-9ab1-f4dfccc6660e"

This issue was initially raised as cluster-api-provider-aws issue #4936: failing to clean up the MachinePool CRs in a GitOps workflow using ArgoCD.

The current workaround is to clean up the MachinePool CR by removing its finalizer manually.
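
For reference, the manual cleanup amounts to clearing the finalizers field on the stuck MachinePool, e.g. with a merge patch (namespace and name taken from the logs above; substitute your own):

kubectl patch machinepool workers-ex -n ns-rosa-hcp --type merge -p '{"metadata":{"finalizers":null}}'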

What did you expect to happen?

I expect to be able to delete the MachinePool CRs without needing to remove the finalizer manually.

Cluster API version

v1.6.3
registry.k8s.io/cluster-api/cluster-api-controller:v1.6.3

Kubernetes version

Kubernetes Version: v1.25.7+eab9cc9

Anything else you would like to add?

Looking at the MachinePool controller's reconcileDelete, one fix could be to check whether the error returned at line 263 is the "failed to create cluster accessor" error (the same error as in the logs above). The change could look like this:

func (r *MachinePoolReconciler) reconcileDelete(ctx context.Context, cluster *clusterv1.Cluster, mp *expv1.MachinePool) error {
	if ok, err := r.reconcileDeleteExternal(ctx, mp); !ok || err != nil {
		// Return early and don't remove the finalizer if we got an error or
		// the external reconciliation deletion isn't ready.
		return err
	}

	// If the returned error is "failed to create cluster accessor", the cluster
	// is unreachable, so proceed with deleting the MachinePool anyway.
	if err := r.reconcileDeleteNodes(ctx, cluster, mp); err != nil {
		if !strings.Contains(err.Error(), "failed to create cluster accessor") {
			// Return early and don't remove the finalizer if we got an error.
			return err
		}
	}

	controllerutil.RemoveFinalizer(mp, expv1.MachinePoolFinalizer)
	return nil
}

Alternatively, if the MachinePool has an InfrastructureRef, then the responsibility for deleting the nodes should fall to the InfrastructureRef CR, not the MachinePool (though this may raise a backward-compatibility issue). The change could look like this:

func (r *MachinePoolReconciler) reconcileDelete(ctx context.Context, cluster *clusterv1.Cluster, mp *expv1.MachinePool) error {
	if ok, err := r.reconcileDeleteExternal(ctx, mp); !ok || err != nil {
		// Return early and don't remove the finalizer if we got an error or
		// the external reconciliation deletion isn't ready.
		return err
	}

	// If the MachinePool has an InfrastructureRef, node deletion is the
	// responsibility of the InfrastructureRef CR, so skip it here.
	if mp.Spec.Template.Spec.InfrastructureRef.Name == "" {
		if err := r.reconcileDeleteNodes(ctx, cluster, mp); err != nil {
			// Return early and don't remove the finalizer if we got an error.
			return err
		}
	}

	controllerutil.RemoveFinalizer(mp, expv1.MachinePoolFinalizer)
	return nil
}

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

k8s-ci-robot added the kind/bug label May 1, 2024
k8s-ci-robot (Contributor) commented

This issue is currently awaiting triage.

CAPI contributors will take a look as soon as possible, apply one of the triage/* labels and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the needs-triage label May 1, 2024
fabriziopandini (Member) commented

/priority awaiting-more-evidence

cc @mboersma @willie-yao @Jont828
to take a look and assign priority

note: an unreachable cluster is handled "gracefully" in the Machine controller; I'd expect the same to happen for MachinePools too.

k8s-ci-robot added the priority/awaiting-more-evidence label May 2, 2024
mboersma (Contributor) commented May 2, 2024

/assign

serngawy (Author) commented May 3, 2024

Thanks for the follow-up. @fabriziopandini, you mentioned handling an unreachable cluster "gracefully" for the MachinePool; would you elaborate? I was thinking we should have an unreachable state for the Cluster CR and give it a timeout for actions like force delete. Does that align with what you mean?

chrischdi (Member) commented

> Thanks for the follow-up. @fabriziopandini, you mentioned handling an unreachable cluster "gracefully" for the MachinePool; would you elaborate? I was thinking we should have an unreachable state for the Cluster CR and give it a timeout for actions like force delete. Does that align with what you mean?

@fabriziopandini is referring to the code in the Machine controller, where we explicitly skip node deletion on certain errors:

err := r.isDeleteNodeAllowed(ctx, cluster, m)
isDeleteNodeAllowed := err == nil
if err != nil {
	switch err {
	case errNoControlPlaneNodes, errLastControlPlaneNode, errNilNodeRef, errClusterIsBeingDeleted, errControlPlaneIsBeingDeleted:
		nodeName := ""
		if m.Status.NodeRef != nil {
			nodeName = m.Status.NodeRef.Name
		}
		log.Info("Skipping deletion of Kubernetes Node associated with Machine as it is not allowed", "Node", klog.KRef("", nodeName), "cause", err.Error())

isDeleteNodeAllowed would be false, so the controller won't even try to create an accessor and run into the mentioned error.

Probably a similar behaviour can be implemented for the machine pool case.
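
A minimal sketch of what that guard could look like in the MachinePool controller's reconcileDelete, placed right before the node-deletion call from the snippets above (the deletion-timestamp check is an illustration, not necessarily what the linked PR implements):

// Sketch only: skip Node deletion when the owning Cluster is itself being
// deleted, mirroring the Machine controller's errClusterIsBeingDeleted case.
// reconcileDeleteNodes and MachinePoolFinalizer come from the snippets above.
if !cluster.DeletionTimestamp.IsZero() {
	// The whole Cluster is going away, so don't try to reach its API
	// server just to delete individual Nodes.
	log.Info("Skipping Node deletion because the Cluster is being deleted")
} else if err := r.reconcileDeleteNodes(ctx, cluster, mp); err != nil {
	return err
}

controllerutil.RemoveFinalizer(mp, expv1.MachinePoolFinalizer)
return nil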

serngawy (Author) commented May 8, 2024

Agreed, that makes sense. The fix I pushed checks here whether the cluster is being deleted; if so, there is no need to create a client.

fabriziopandini added the priority/backlog and triage/accepted labels and removed the priority/awaiting-more-evidence label May 22, 2024
k8s-ci-robot removed the needs-triage label May 22, 2024