Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job controller: set job conditions to false explicitly #100701

Closed
wants to merge 2 commits into from

Conversation

adtac
Copy link
Member

@adtac adtac commented Mar 31, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:

#98727 (comment)

From the guide to writing controllers in the [API conventions doc]:

Controllers should apply their conditions to a resource the first time they visit the resource, even if the status is Unknown. This allows other components in the system to know that the condition exists ...

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This is a user-facing change. If external tools were incorrectly relying on the presence/absence of a particular condition instead of its status, such tools will need to correct their behaviour. Whether or not this is acceptable, I'll leave it to the API reviewers.

However, I'll note that other controllers also make this mistake of removing conditions instead of setting the status to false. Non-exhaustive list of examples:

util.RemoveDeploymentCondition(&newStatus, apps.DeploymentProgressing)

RemoveCondition(&newStatus, apps.ReplicaSetReplicaFailure)

Does this PR introduce a user-facing change?

The Job controller now sets Job conditions with a false status value when it previously omitted them

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


cc @alculquicondor @lavalamp @soltysh
/sig apps

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 31, 2021
@k8s-ci-robot
Copy link
Contributor

@adtac: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 31, 2021
Signed-off-by: Adhityaa Chandrasekar <adtac@google.com>
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: adtac
To complete the pull request process, please assign smarterclayton after the PR has been reviewed.
You can assign the PR to them by writing /assign @smarterclayton in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Mar 31, 2021
job.Status.Conditions, isUpdated = ensureJobConditionStatus(job.Status.Conditions, batch.JobComplete, v1.ConditionFalse, "", "")
jobConditionsChanged = jobConditionsChanged || isUpdated
}
if utilfeature.DefaultFeatureGate.Enabled(features.SuspendJob) && manageJobCalled {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a subtle change from the previous implementation:

Before, it looked like we would only suspend a job if didn't already completed. However, now we can theoretically mark a job as suspended even if it's also marked as completed.

For a moment, I thought this wouldn't be possible, but it is, in the coincidence where all pods succeed and a Job is marked as suspended at the same time.

Maybe we can just do:

if !complete && job.Spec.Suspend != nil && *job.Spec.Suspend

also, we should set the condition even if manageJobCalled is false. And even remove the condition if the feature gate is disabled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -986,13 +996,21 @@ func ensureJobConditionStatus(list []batch.JobCondition, cType batch.JobConditio
list[i].Reason = reason
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't forget to update the Probe timestamp

Copy link
Member Author

@adtac adtac Apr 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I did begin to change LastProbeTime, but there's a problem with that. By definition, LastProbeTime must be updated every time ensureJobConditionStatus is called regardless of whether the status/reason/message matches (because it's just a probe). That means that jobConditionsChanged is always true every time syncJob is called. That means that we'll do a Jobs.UpdatedStatus() every time syncJob is called and that forget is always true. These are significantly more far-reaching consequences than I had originally anticipated.

The performance impact of syncing the Job with the API server every time syncJob is called could be disastrous for long running jobs that see very infrequent updates.

So I'd like to hold off from making LastProbeTime change in this PR. It also sends the wrong message to update it only when job condition status value changes (i.e. the if part of this if-else). In fact, AFAICT there are also no other k8s controllers that do anything like LastProbeTime, so I think removing that field might even be the most consistent approach. There is some value in knowing LastProbeTime from the user's perspective, but idk if it's possible to accurately update that without performance concerns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that if the block in line 643 is reached, you can update all LastProbeTimes

Signed-off-by: Adhityaa Chandrasekar <adtac@google.com>
@@ -986,13 +996,21 @@ func ensureJobConditionStatus(list []batch.JobCondition, cType batch.JobConditio
list[i].Reason = reason
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that if the block in line 643 is reached, you can update all LastProbeTimes

job.Status.StartTime = &now
}
} else {
job.Status.Conditions, isUpdated = ensureJobConditionStatus(job.Status.Conditions, batch.JobComplete, v1.ConditionFalse, "", "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this necessary?
Wouldn't it be overridden below?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is JobComplete, not JobSuspended

jobConditionsChanged = jobConditionsChanged || isUpdated
}

if !utilfeature.DefaultFeatureGate.Enabled(features.SuspendJob) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider putting all this logic in a function.

For example, you could just return after removing the condition if the gate is disabled, simplifying the conditionals below.

jobConditionsChanged = jobConditionsChanged || isUpdated
} else if !hasCondition(job.Status.Conditions, batch.JobSuspended) {
// The feature gate is enabled, but the condition doesn't exist. This
// should never happen, insert the default placeholder with status false.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't this happen in the first sync?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/This should never happen/This should not be the case/ :)

Yes, this will happen in the first sync.

@@ -654,9 +643,14 @@ func TestControllerSyncJob(t *testing.T) {
} else if tc.fakeExpectationAtCreation > 0 {
manager.expectations.ExpectCreations(key, int(tc.fakeExpectationAtCreation))
}
originalConditions := []batch.JobCondition{
newCondition(batch.JobComplete, v1.ConditionFalse, "", ""),
newCondition(batch.JobFailed, v1.ConditionFalse, "", ""),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also newCondition(batch.JobSuspended, v1.ConditionFalse, "", ""), if feature is enabled.

But, I would rather prefer we have wantConditions []JobCondition in the test case.

t.Errorf("Unexpected conditions %v", actual.Status.Conditions)
for _, jc := range originalConditions {
if !getCondition(actual, jc.Type, jc.Status, jc.Reason) {
copyIrrelevantJobConditionFields(actual.Status.Conditions, originalConditions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would sort the conditions by name and then do a direct comparison of actual.Status.Condition and original. Then call cmp.Diff with cmpopts.IgnoreFields

@k8s-ci-robot
Copy link
Contributor

@adtac: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 8, 2021
@k8s-triage-robot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 6, 2021
@adtac
Copy link
Member Author

adtac commented Aug 6, 2021

is someone else able to take this over? this is an important change but I don't have the time to see it through

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 5, 2021
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants