New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
job controller: set job conditions to false explicitly #100701
Conversation
@adtac: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Signed-off-by: Adhityaa Chandrasekar <adtac@google.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: adtac The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
pkg/controller/job/job_controller.go
Outdated
job.Status.Conditions, isUpdated = ensureJobConditionStatus(job.Status.Conditions, batch.JobComplete, v1.ConditionFalse, "", "") | ||
jobConditionsChanged = jobConditionsChanged || isUpdated | ||
} | ||
if utilfeature.DefaultFeatureGate.Enabled(features.SuspendJob) && manageJobCalled { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a subtle change from the previous implementation:
Before, it looked like we would only suspend a job if didn't already completed. However, now we can theoretically mark a job as suspended even if it's also marked as completed.
For a moment, I thought this wouldn't be possible, but it is, in the coincidence where all pods succeed and a Job is marked as suspended at the same time.
Maybe we can just do:
if !complete && job.Spec.Suspend != nil && *job.Spec.Suspend
also, we should set the condition even if manageJobCalled
is false. And even remove the condition if the feature gate is disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -986,13 +996,21 @@ func ensureJobConditionStatus(list []batch.JobCondition, cType batch.JobConditio | |||
list[i].Reason = reason |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't forget to update the Probe timestamp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I did begin to change LastProbeTime, but there's a problem with that. By definition, LastProbeTime must be updated every time ensureJobConditionStatus is called regardless of whether the status/reason/message matches (because it's just a probe). That means that jobConditionsChanged is always true every time syncJob is called. That means that we'll do a Jobs.UpdatedStatus()
every time syncJob is called and that forget
is always true. These are significantly more far-reaching consequences than I had originally anticipated.
The performance impact of syncing the Job with the API server every time syncJob is called could be disastrous for long running jobs that see very infrequent updates.
So I'd like to hold off from making LastProbeTime change in this PR. It also sends the wrong message to update it only when job condition status value changes (i.e. the if part of this if-else). In fact, AFAICT there are also no other k8s controllers that do anything like LastProbeTime, so I think removing that field might even be the most consistent approach. There is some value in knowing LastProbeTime from the user's perspective, but idk if it's possible to accurately update that without performance concerns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that if the block in line 643 is reached, you can update all LastProbeTime
s
@@ -986,13 +996,21 @@ func ensureJobConditionStatus(list []batch.JobCondition, cType batch.JobConditio | |||
list[i].Reason = reason |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that if the block in line 643 is reached, you can update all LastProbeTime
s
job.Status.StartTime = &now | ||
} | ||
} else { | ||
job.Status.Conditions, isUpdated = ensureJobConditionStatus(job.Status.Conditions, batch.JobComplete, v1.ConditionFalse, "", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this necessary?
Wouldn't it be overridden below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is JobComplete, not JobSuspended
jobConditionsChanged = jobConditionsChanged || isUpdated | ||
} | ||
|
||
if !utilfeature.DefaultFeatureGate.Enabled(features.SuspendJob) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider putting all this logic in a function.
For example, you could just return after removing the condition if the gate is disabled, simplifying the conditionals below.
jobConditionsChanged = jobConditionsChanged || isUpdated | ||
} else if !hasCondition(job.Status.Conditions, batch.JobSuspended) { | ||
// The feature gate is enabled, but the condition doesn't exist. This | ||
// should never happen, insert the default placeholder with status false. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't this happen in the first sync?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/This should never happen/This should not be the case/ :)
Yes, this will happen in the first sync.
@@ -654,9 +643,14 @@ func TestControllerSyncJob(t *testing.T) { | |||
} else if tc.fakeExpectationAtCreation > 0 { | |||
manager.expectations.ExpectCreations(key, int(tc.fakeExpectationAtCreation)) | |||
} | |||
originalConditions := []batch.JobCondition{ | |||
newCondition(batch.JobComplete, v1.ConditionFalse, "", ""), | |||
newCondition(batch.JobFailed, v1.ConditionFalse, "", ""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also newCondition(batch.JobSuspended, v1.ConditionFalse, "", "")
, if feature is enabled.
But, I would rather prefer we have wantConditions []JobCondition
in the test case.
t.Errorf("Unexpected conditions %v", actual.Status.Conditions) | ||
for _, jc := range originalConditions { | ||
if !getCondition(actual, jc.Type, jc.Status, jc.Reason) { | ||
copyIrrelevantJobConditionFields(actual.Status.Conditions, originalConditions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would sort the conditions by name and then do a direct comparison of actual.Status.Condition
and original
. Then call cmp.Diff
with cmpopts.IgnoreFields
@adtac: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-contributor-experience at kubernetes/community. |
is someone else able to take this over? this is an important change but I don't have the time to see it through |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What type of PR is this?
/kind bug
What this PR does / why we need it:
#98727 (comment)
From the guide to writing controllers in the [API conventions doc]:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
This is a user-facing change. If external tools were incorrectly relying on the presence/absence of a particular condition instead of its status, such tools will need to correct their behaviour. Whether or not this is acceptable, I'll leave it to the API reviewers.
However, I'll note that other controllers also make this mistake of removing conditions instead of setting the status to false. Non-exhaustive list of examples:
kubernetes/pkg/controller/deployment/progress.go
Line 42 in a651804
kubernetes/pkg/controller/replicaset/replica_set_utils.go
Line 120 in a651804
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
cc @alculquicondor @lavalamp @soltysh
/sig apps