New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add conditions to PDB status #98127
Add conditions to PDB status #98127
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -49,6 +49,7 @@ import ( | |
"k8s.io/client-go/tools/cache" | ||
"k8s.io/client-go/tools/record" | ||
"k8s.io/client-go/util/workqueue" | ||
pdbhelper "k8s.io/component-helpers/apps/poddisruptionbudget" | ||
"k8s.io/klog/v2" | ||
podutil "k8s.io/kubernetes/pkg/api/v1/pod" | ||
"k8s.io/kubernetes/pkg/controller" | ||
|
@@ -63,7 +64,9 @@ import ( | |
// If the controller is running on a different node it is important that the two nodes have synced | ||
// clock (via ntp for example). Otherwise PodDisruptionBudget controller may not provide enough | ||
// protection against unwanted pod disruptions. | ||
const DeletionTimeout = 2 * 60 * time.Second | ||
const ( | ||
DeletionTimeout = 2 * 60 * time.Second | ||
) | ||
|
||
type updater func(*policy.PodDisruptionBudget) error | ||
|
||
|
@@ -579,7 +582,7 @@ func (dc *DisruptionController) sync(key string) error { | |
} | ||
if err != nil { | ||
klog.Errorf("Failed to sync pdb %s/%s: %v", pdb.Namespace, pdb.Name, err) | ||
return dc.failSafe(pdb) | ||
return dc.failSafe(pdb, err) | ||
} | ||
|
||
return nil | ||
|
@@ -774,9 +777,21 @@ func (dc *DisruptionController) buildDisruptedPodMap(pods []*v1.Pod, pdb *policy | |
// implement the "fail open" part of the design since if we manage to update | ||
// this field correctly, we will prevent the /evict handler from approving an | ||
// eviction when it may be unsafe to do so. | ||
func (dc *DisruptionController) failSafe(pdb *policy.PodDisruptionBudget) error { | ||
func (dc *DisruptionController) failSafe(pdb *policy.PodDisruptionBudget, err error) error { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would it be possible to add another condition to indicate we're hitting failSafe rather than just disruptions allowed. We could consume that info here: https://github.com/kubernetes/kubernetes/blob/release-1.20/pkg/registry/core/pod/storage/eviction.go#L208 Otherwise, we might evict pods with DesiredHealthy && CurrentHealthy being stale. Maybe we can still achieve this by testing Reason == policy.SyncFailedReason if we have a util function, but sync failure seems like it should be it's own status condition rather than embedded in DisruptionsAllowed. WDYT? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We could add a separate condition for this, but I wanted to avoid creating more conditions than necessary. So if we did this, it would be something like a And would conditions be the appropriate way to handle this? The There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
is that true? failSafe sets DisruptionsAllowed to 0, which prevents evictions, right? if SyncFailed=True always implies DisruptionAllowed=False, that seems like a good use of the distinct reason field on a DisruptionAllowed=False condition rather than a separate condition There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Currently, DisruptionsAllowed is an integer, and this is the 'budget' used to determine how many pods are evicted. If we are hoping to explicitly disallow eviction because the PDB has gone stale, we need another field. I don't have a strong preference on the field name or type, so long as it's for that specific purpose. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the same controller that would add the "I am stale because I can't sync" condition is already setting DisruptionsAllowed to 0 when it can't sync... I don't see why another field is required There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
actually, I'm not sure I understand why we need to check CurrentHealthy / DesiredHealthy in this case... removing a not-ready pod should always be ok, right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we do end up needing to determine if the controller is failing to sync, checking for
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I added the discussion issue here: #99598 The current patch is fine if we don't like manipulating CurrentHealthy. My suggestion is to add a util function to detect this specific status, and ensure that document the intent to others that 'SyncFailed' means you probably should not rely on the PDB status counts as they may be stale. This is somewhat of an edge case that probably we should have accounted for prior to the eviction API change, but I missed it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Replied in #99598 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. addressing this in a separate PR sounds good to me. I don't think this change makes the issue worse, and gives us a potential tool to address the issue in a follow up |
||
newPdb := pdb.DeepCopy() | ||
newPdb.Status.DisruptionsAllowed = 0 | ||
|
||
if newPdb.Status.Conditions == nil { | ||
newPdb.Status.Conditions = make([]metav1.Condition, 0) | ||
} | ||
apimeta.SetStatusCondition(&newPdb.Status.Conditions, metav1.Condition{ | ||
Type: policy.DisruptionAllowedCondition, | ||
Status: metav1.ConditionFalse, | ||
Reason: policy.SyncFailedReason, | ||
Message: err.Error(), | ||
ObservedGeneration: newPdb.Status.ObservedGeneration, | ||
}) | ||
|
||
return dc.getUpdater()(newPdb) | ||
} | ||
|
||
|
@@ -797,7 +812,8 @@ func (dc *DisruptionController) updatePdbStatus(pdb *policy.PodDisruptionBudget, | |
pdb.Status.ExpectedPods == expectedCount && | ||
pdb.Status.DisruptionsAllowed == disruptionsAllowed && | ||
apiequality.Semantic.DeepEqual(pdb.Status.DisruptedPods, disruptedPods) && | ||
pdb.Status.ObservedGeneration == pdb.Generation { | ||
pdb.Status.ObservedGeneration == pdb.Generation && | ||
pdbhelper.ConditionsAreUpToDate(pdb) { | ||
return nil | ||
} | ||
|
||
|
@@ -811,6 +827,8 @@ func (dc *DisruptionController) updatePdbStatus(pdb *policy.PodDisruptionBudget, | |
ObservedGeneration: pdb.Generation, | ||
} | ||
|
||
pdbhelper.UpdateDisruptionAllowedCondition(newPdb) | ||
|
||
return dc.getUpdater()(newPdb) | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to be validated with a call to metav1validation.ValidateConditions
since this is a new field, it can be validated even for v1beta1 (which means actually connecting v1beta1 status updates to validation specifically for this field)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added validation of the conditions to
ValidatePodDisruptionBudgetStatus
. This isn't actually used yet, but #99290 addresses this. I will make sure this validation gets run for all versions, while the existing fields will only be validated for v1.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
go ahead and wire the validation here... we don't want to merge unvalidated APIs assuming validation will be added in a follow-up before release
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the code for wiring up status validation from #99290 into this PR.