Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KCP: add condition to surface remediation state #8188

Closed
sbueringer opened this issue Feb 27, 2023 · 8 comments · Fixed by #10559
Closed

KCP: add condition to surface remediation state #8188

sbueringer opened this issue Feb 27, 2023 · 8 comments · Fixed by #10559
Assignees
Labels
area/control-plane Issues or PRs related to control-plane lifecycle management kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@sbueringer
Copy link
Member

sbueringer commented Feb 27, 2023

We recently extended KCP remediation in #7963.

Goal of this issue is to add a condition to KCP to surface the state of the remedation.

The condition should help to debug issues in remedation. During PR review we identified a deadlock which should very rarely happen:

Let's assume the following happens:

  • Machine create is done
  • KCP annotation is not removed because of controller crash
  • Remediation won't remediate any other machines

This will be automatically resolved once a new Machine is created (e.g. during scale up or rollout) but until then remediation is blocked.
xref: #7963 (comment)

For now we added a log message, but it would be easier for users if they can see the status / progress of the remediation in a condition.

/kind feature

@sbueringer sbueringer added the area/control-plane Issues or PRs related to control-plane lifecycle management label Feb 27, 2023
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 27, 2023
@sbueringer
Copy link
Member Author

cc @fabriziopandini

@fabriziopandini
Copy link
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 20, 2023
@k8s-triage-robot
Copy link

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Mar 19, 2024
@fabriziopandini
Copy link
Member

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Apr 11, 2024
@fabriziopandini
Copy link
Member

/assign
To take a look

@fabriziopandini
Copy link
Member

Current state is:

  • When a remediation start, we already update OwnerRemediated condition setting RemediationInProgress reason.
  • Adding a new condition or a new state for the condition above will have the same risk of the current code, because we patch KCP at the end of reconcile and something can happen in the middle.

One option we might consider is to issue a patch to remove the annotation immediately after create, so we can further reduc the risk of a controller crash happening in between
@sbueringer opinions?

@sbueringer
Copy link
Member Author

sbueringer commented May 6, 2024

I wonder if we can just add some additional logic to automatically drop the annotation when we detect the remediation is done, e.g.

  • Directly before l.102 in reconcileUnhealthyMachines:
    • If there is a KCP Machine with creationTimestamp > KCP.annotations[RemediationInProgressAnnotation].Timestamp => remove the annotation

@fabriziopandini
Copy link
Member

ACK, I will take a look

@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label May 7, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/control-plane Issues or PRs related to control-plane lifecycle management kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants