
Retry when it fails to update pods status on scheduling loop #109832

Merged
merged 1 commit into kubernetes:master from the retry-on-update branch on Jul 19, 2022

Conversation


@sanposhiho sanposhiho commented May 5, 2022

What type of PR is this?

/kind bug

What this PR does / why we need it:

Retry with exponential backoff when updating a Pod's status fails in the scheduling loop.


The scheduler doesn't retry Pod status updates in the scheduling loop. If it fails to set a Pod's status to unschedulable for some reason (e.g., a flaky connection to the apiserver), the next chance to update the status to unschedulable is when the Pod is scheduled again.
This means that, in the worst case, the Pod's status may not be updated for 5 minutes. (ref #108761)

Which issue(s) this PR fixes:

Fixes #109796

Special notes for your reviewer:

It's difficult for me to determine which errors are retriable; I used EventBroadcaster as a reference for the implementation.
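For illustration, a minimal sketch of the retry-with-exponential-backoff shape described above (not the PR's actual code); updateStatusWithBackoff, updateStatus, and isRetriable are hypothetical placeholders:

```go
package example

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// updateStatusWithBackoff retries a status update with exponential backoff.
// updateStatus and isRetriable stand in for the real patch call and the
// "is this error worth retrying?" predicate.
func updateStatusWithBackoff(updateStatus func() error, isRetriable func(error) bool) error {
	backoff := wait.Backoff{Duration: 100 * time.Millisecond, Factor: 2.0, Jitter: 0.1, Steps: 5}
	var lastErr error
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		lastErr = updateStatus()
		switch {
		case lastErr == nil:
			return true, nil // success, stop retrying
		case isRetriable(lastErr):
			return false, nil // wait for the next backoff step and try again
		default:
			return false, lastErr // non-retriable, give up immediately
		}
	})
	if err == wait.ErrWaitTimeout {
		return lastErr // surface the last patch error rather than the generic timeout
	}
	return err
}
```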

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 5, 2022
@k8s-ci-robot
Contributor

@sanposhiho: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 5, 2022
@sanposhiho
Member Author

/cc @alculquicondor

pkg/scheduler/util/utils.go: 4 outdated review threads (resolved)
@sanposhiho
Member Author

@alculquicondor
Thanks for the review. I addressed your comments and simplified the retry logic.
Please take another look.

pkg/scheduler/util/utils_test.go: 2 outdated review threads (resolved)
@@ -90,8 +93,16 @@ func MoreImportantPod(pod1, pod2 *v1.Pod) bool {
return GetPodStartTime(pod1).Before(GetPodStartTime(pod2))
}

const (
// Parameters for retrying with exponential backoff.
retryBackoffInitialDuration = 100 * time.Millisecond
Member

Actually, let's make this smaller. You are already sleeping the time that the server is telling you to retry.

0.1 seconds is way too much time considering that the scheduler is able to schedule ~300 pods/s if the apiserver is healthy.

maybe the factor can be 1.3
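As a rough sketch (illustrative values only, not necessarily what the PR settled on), these knobs map onto wait.Backoff like this:

```go
package example

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Two illustrative backoff shapes from the discussion above.
var (
	// 100ms start, doubling: 100ms, 200ms, 400ms, 800ms, 1.6s
	originalBackoff = wait.Backoff{Duration: 100 * time.Millisecond, Factor: 2.0, Steps: 5}
	// smaller start with the suggested gentler factor: 10ms, 13ms, ~17ms, ~22ms, ~29ms
	suggestedBackoff = wait.Backoff{Duration: 10 * time.Millisecond, Factor: 1.3, Steps: 5}
)
```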

Member Author
@sanposhiho sanposhiho May 6, 2022

0.1 seconds is way too much time considering that the scheduler is able to schedule ~300 pods/s if the apiserver is healthy.

Makes sense. I'll change it to 10*time.Millisecond.

Member

Did you remove the comment about calling this in a routine? I think that's a good idea. But do it in the caller (the scheduler package, in this case).

Member Author
@sanposhiho sanposhiho May 6, 2022

Ah, yeah, I removed my previous comment because, if we do that and the parallel status update completes after another scheduling cycle has started, I guess the pod would be moved from unschedulableQ to activeQ by the event handler? (I realized this concern after I posted my now-deleted comment 😓)
wdyt?

Member Author
@sanposhiho sanposhiho May 6, 2022

Okay, it seems my concern was wrong. Status updates are ignored and the pod won't be moved in that case.

func (p *PriorityQueue) Update(oldPod, newPod *v1.Pod) error {

func isPodUpdated(oldPod, newPod *v1.Pod) bool {

I will change the scheduler pkg's implementation to update the pod status in a goroutine.

Member Author

Should I also revert the retryBackoffInitialDuration to 0.1 seconds?

Member

yeah, 100ms sounds good.

if typedErr, ok := err.(*errors.StatusError); ok {
var retryAfterSeconds int32
if typedErr.Status().Details != nil {
retryAfterSeconds = typedErr.Status().Details.RetryAfterSeconds
Member

if the api-server is unreachable, who sets this value? the client?

Member Author
@sanposhiho sanposhiho May 6, 2022

If the api-server is unreachable, the error type will not be *errors.StatusError, so we won't reach here.
In that case, the scheduler will wait for the backoff duration and retry.

pkg/scheduler/util/utils.go: outdated review thread (resolved)
@sanposhiho sanposhiho force-pushed the retry-on-update branch 3 times, most recently from e1490c8 to 382d648 on May 6, 2022 03:31
@sanposhiho
Member Author

/retest

@sanposhiho
Member Author

(Retest due to #109848 and #109847.)

pkg/scheduler/util/utils_test.go: outdated review thread (resolved)
@sanposhiho
Member Author

sanposhiho commented May 6, 2022

@alculquicondor @ahg-g
Could you please take another look at this?

  • changed retryBackoffInitialDuration to 0.1 seconds
  • ran updatePod in a goroutine (pkg/scheduler/schedule_one.go)
  • applied the suggestion on pkg/scheduler/util/utils_test.go

pkg/scheduler/schedule_one.go: outdated review thread (resolved)
}
// Otherwise we can retry after retryAfterSeconds.
klog.ErrorS(err, "Server rejected Pod patch (may retry after sleeping)", "pod", klog.KObj(old))
time.Sleep(time.Duration(retryAfterSeconds))
Member

you could actually use the backoff.Duration to find the diff. The field is updated every time.

Member Author
@sanposhiho sanposhiho May 6, 2022

you could actually use the backoff.Duration to find the diff. The field is updated every time.

I guessed so too, but it doesn't seem to be updated every time, since both wait.ExponentialBackoff() and retry.OnError() receive a non-pointer wait.Backoff (not a *wait.Backoff)...

func OnError(backoff wait.Backoff, retriable func(error) bool, fn func() error) error {

func ExponentialBackoff(backoff Backoff, condition ConditionFunc) error {

Member

oh I missed that :(

Do you have any idea how long retryAfterSeconds usually is?

Member Author
@sanposhiho sanposhiho May 6, 2022

Do you have any idea how long retryAfterSeconds usually is?

It depends on the type of error:

  • GenerateNameConflict → 1s
  • TooManyRequests → 10s
  • ServerTimeout → 2s
  • Timeout → 10s

(investigated by searching the usages of the NewXXXXX constructors defined in k8s.io/apimachinery/pkg/api/errors/errors.go)
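As a sketch of what reading that server-suggested delay looks like in code (the helper name retryAfterSecondsOf is made up for illustration; apimachinery's errors package also provides a SuggestsClientDelay helper that serves a similar purpose):

```go
package example

import (
	"errors"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// retryAfterSecondsOf extracts the server-suggested delay, if any, from an
// API error such as one built with NewTooManyRequests(...) or NewServerTimeout(...).
func retryAfterSecondsOf(err error) (int32, bool) {
	var statusErr *apierrors.StatusError
	if errors.As(err, &statusErr) && statusErr.Status().Details != nil {
		return statusErr.Status().Details.RetryAfterSeconds, true
	}
	return 0, false
}
```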

Member

Ah, that's great!

I'm happy with the backoffs you have set.

/lgtm
/approve
/hold
@ahg-g anything to add?

Member
@Huang-Wei Huang-Wei May 8, 2022

I'm curious about the exact failure reasons, b/c a "PATCH" operation is a pretty lightweight call that won't hit errors like ErrConflict, as it doesn't compare objects' resourceVersions. Based on a quick search of a failure log, it all shows:

weih@m1max:/tmp|⇒  grep "Error updating pod" build-log.txt.2
E0504 17:05:35.053963  119469 schedule_one.go:832] "Error updating pod" err="Patch \"http://127.0.0.1:40021/api/v1/namespaces/postfilter1-08d6ab0a-daf1-4df2-acf9-0e0fd8ca453e/pods/test-pod/status\": dial tcp 127.0.0.1:40021: connect: connection refused" pod="postfilter1-08d6ab0a-daf1-4df2-acf9-0e0fd8ca453e/test-pod"
E0504 17:05:39.733402  119469 schedule_one.go:832] "Error updating pod" err="Patch \"http://127.0.0.1:33249/api/v1/namespaces/postfilter2-00d09641-4bc2-4c7f-a980-f194e12091d1/pods/test-pod/status\": dial tcp 127.0.0.1:33249: connect: connection refused" pod="postfilter2-00d09641-4bc2-4c7f-a980-f194e12091d1/test-pod"
E0504 17:05:44.416195  119469 schedule_one.go:832] "Error updating pod" err="Patch \"http://127.0.0.1:39869/api/v1/namespaces/postfilter3-c090a3c6-5479-435a-8fb8-3d36e8ff9be7/pods/test-pod/status\": dial tcp 127.0.0.1:39869: connect: connection refused" pod="postfilter3-c090a3c6-5479-435a-8fb8-3d36e8ff9be7/test-pod"
E0504 17:07:41.533230  119469 schedule_one.go:832] "Error updating pod" err="Patch \"http://127.0.0.1:37893/api/v1/namespaces/permit-plugins16c02792-1845-48c3-9be5-904831c2c383/pods/test-pod/status\": dial tcp 127.0.0.1:37893: connect: connection refused" pod="permit-plugins16c02792-1845-48c3-9be5-904831c2c383/test-pod"

Member

If we can figure out the exact failure reasons, I'd propose restricting the retry logic to those particular error(s), like the "connection refused" one:

func IsConnectionRefused(err error) bool {
can directly pin the error.

Member Author

I'm curious about the exact failure reasons,

If we can figure out the exact failure reasons, I'd propose restricting the retry logic to those particular error(s), like the "connection refused" one

At least the error we have seen is "connection refused".
#109783 (comment)

But when the api-server is very busy, it could return ServerTimeout or TooManyRequests errors, and we'd face the same issue as #109796.
So I think it's better to retry not only on the connection refused error, but also on errors that contain retryAfterSeconds.

Member Author

But the current implementation retries on all unknown errors (i.e., all errors whose type is not *errors.StatusError), so it would make sense to change it to retry only on connection refused when the error is not a *errors.StatusError.

@sanposhiho
Member Author

/retest

@sanposhiho
Member Author

Flaky test: #109783.
Rebased this PR to include the change from #109834.

@sanposhiho
Member Author

Next flaky test...
#109889

@sanposhiho
Member Author

/retest

@sanposhiho
Member Author

sanposhiho commented May 8, 2022

/retest

Flaky: #109182

@@ -115,13 +126,48 @@ func PatchPodStatus(cs kubernetes.Interface, old *v1.Pod, newStatus *v1.PodStatu
return nil
}

_, err = cs.CoreV1().Pods(old.Namespace).Patch(context.TODO(), old.Name, types.StrategicMergePatchType, patchBytes, metav1.PatchOptions{}, "status")
return err
backoff := wait.Backoff{
Member

Can we use retry.DefaultBackoff?

Member

10ms, 50ms, 250ms, 1.25s SGTM.
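For reference, those numbers match client-go's default backoff, which is defined approximately like this (check the vendored k8s.io/client-go/util/retry source for the exact values):

```go
package example

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Approximately what k8s.io/client-go/util/retry.DefaultBackoff looks like:
// 10ms, 50ms, 250ms, 1.25s, each with ~10% jitter.
var defaultBackoff = wait.Backoff{
	Steps:    4,
	Duration: 10 * time.Millisecond,
	Factor:   5.0,
	Jitter:   0.1,
}
```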

}
if nnnNeedsUpdate {
podStatusCopy.NominatedNodeName = nominatingInfo.NominatedNodeName
}
return util.PatchPodStatus(client, pod, podStatusCopy)
go func() {
Member

I'm a bit concerned with this goroutine, which may cause unintended behavior. It's better to keep it a blocking call.

Member

Can you elaborate?

Member

If you want this function to be used with retries, I think you should do it as a wait.ConditionFunc.
If you spawn this as a goroutine, how do you know when to stop and declare an error? Or when it was successful and you should not retry? Or whether you leak goroutines and create a new one even though the previous one didn't finish?

Member

We are using client-go's retry.OnError, which uses a wait.ConditionFunc underneath.

or when it was successful and you should not retry?

see the update in pkg/scheduler/utils

or if you leak goroutines and create a new one despite the previous didn't finish?

We only retry 6 times (but will probably change it to 4); is that a big concern?

Member

Can you elaborate?

When it comes to goroutines, it increases uncertainty. For the changes introduced here, I don't see a significant benefit that can outweigh the uncertainty.

If we agree it's a transient error, retrying in a blocking manner makes more sense to me. Whereas, if the error is persistent, retrying in goroutines would increase the burden on the APIServer.

Member Author

Okay, understood. If a new request to update a Pod (like a bind request or another patch request from the next scheduling cycle) is sent from the scheduler (and the Pod is updated successfully) while the past one has not yet finished its retries, the new Pod state set by the new request may be overwritten by the old request. That's a problem.

If we agree it's a transient error, retrying in a blocking manner makes more sense. Whereas, if the error is persistent, retrying in goroutines would increase the burden on the APIServer.

the scheduler will not be able to make too much progress with the api-server not reachable anyways.

These make sense. When the api-server is unreachable from the scheduler, other requests are likely to fail as well, making it difficult for the scheduler to continue scheduling successfully. And putting more load on the api-server with many requests from these goroutines is no solution.

I'll update the PR to make the retry here blocking.

Member

Just FYI, maybe it's already known, but client-go already implements retries: https://gist.github.com/aojea/31ab71c894c15f46a567c5e8aa235a17

Member Author

Oh, I didn't know that...

client-go implements some retry logic for requests it considers retryable, generally:

  • no write requests (PUT, POST); a GET is retryable
  • transient network errors: connection reset by peer or EOF
  • retryable http errors: 50x code with a Retry-After header

So, we need to retry on "connection refused" ourselves, but client-go already retries for ~10s on errors that carry Retry-After, like ServerTimeout.
This means the current implementation retries on top of what client-go already does. Can we remove our retry in that case? What do you all think?

Member

Do we have to do anything to benefit from the builtin retries?

Is it the case that "connection refused" is not included in the retriable errors?

Member Author

Do we have to do anything to benefit from the builtin retries?

No. (right? @aojea)

Is it the case that "connection refused" is not included in the retriable errors?

I guess this is also no, because "connection refused" is not returned by the api-server and it doesn't come with a Retry-After header.


}

// DeletePod deletes the given <pod> from API server
func DeletePod(cs kubernetes.Interface, pod *v1.Pod) error {
-return cs.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{})
+return cs.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, metav1.DeleteOptions{})
Member

this makes no difference. (in theory, we should pass in a context)

@sanposhiho
Member Author

@Huang-Wei @alculquicondor @aojea

Sorry for leaving this for a while.
Given the discussion, I changed the implementation to retry only on the "connection refused" error.
Please take another look 🙏
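As a rough, simplified sketch of what "retry only on connection refused" looks like with client-go's helpers (patchPodStatus here is a stand-in, not the exact diff):

```go
package example

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	utilnet "k8s.io/apimachinery/pkg/util/net"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// patchPodStatus patches the pod's status subresource, retrying with
// retry.DefaultBackoff only while the error is "connection refused".
func patchPodStatus(cs kubernetes.Interface, pod *v1.Pod, patchBytes []byte) error {
	return retry.OnError(retry.DefaultBackoff, utilnet.IsConnectionRefused, func() error {
		_, err := cs.CoreV1().Pods(pod.Namespace).Patch(context.TODO(), pod.Name,
			types.StrategicMergePatchType, patchBytes, metav1.PatchOptions{}, "status")
		return err
	})
}
```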

@alculquicondor
Member

can you fix the verify?

@sanposhiho
Member Author

@alculquicondor

can you fix the verify?

Thanks for pinging me. (I wasn't aware of it. 🙏 🙏 🙏 )
Fixed.

@@ -91,7 +93,7 @@ func MoreImportantPod(pod1, pod2 *v1.Pod) bool {
}

// PatchPodStatus calculates the delta bytes change from <old.Status> to <newStatus>,
// and then submit a request to API server to patch the pod changes.
// and then submit a request to API server to patch the pod changes with retries.
Member

Suggested change
// and then submit a request to API server to patch the pod changes with retries.
// and then submit a request to API server to patch the pod changes with retries when connection is refused.

client.PrependReactor("patch", "pods", func(action clienttesting.Action) (bool, runtime.Object, error) {
defer func() { reqcount++ }()
if reqcount >= 4 {
// return error if requests comes in more than six times.
Member

four?

But why change the error? It should stop retrying even if the error is the same.

Member Author

Thanks. Fixed typo. 🙏


But why change the error? It should stop retrying even if the error is the same.

This "requests comes in more than four times." error shouldn't be returned in the expected scenario.

The expected scenario is:

  1. The PatchPodStatus func sends a patch request to the fake client.
  2. The fake client returns a "connection refused" error.
  3. PatchPodStatus retries (1) three times, and the fake client returns "connection refused" errors for those three requests.
  4. PatchPodStatus gives up retrying and returns the "connection refused" error. (The test then checks whether the returned error is "connection refused".)

PrependReactor returns the "connection refused" error only four times, and if there is a bug like "the system retries forever", the "requests comes in more than four times" error will be returned and this test case will fail (because the test checks whether the returned error is "connection refused").
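As a sketch of the reactor pattern described above (newRefusingClient is a hypothetical helper, simplified from the test under discussion): the fake client refuses the first few PATCH calls and flags any extra call with a sentinel error, so a retry-forever bug surfaces as the wrong returned error.

```go
package example

import (
	"fmt"
	"syscall"

	"k8s.io/apimachinery/pkg/runtime"
	clientsetfake "k8s.io/client-go/kubernetes/fake"
	clienttesting "k8s.io/client-go/testing"
)

// newRefusingClient returns a fake clientset whose first maxRefusals PATCH
// calls on pods fail with a wrapped ECONNREFUSED; any further call gets a
// sentinel error, which makes a "retries forever" bug fail the test.
func newRefusingClient(maxRefusals int) *clientsetfake.Clientset {
	client := clientsetfake.NewSimpleClientset()
	reqcount := 0
	client.PrependReactor("patch", "pods", func(action clienttesting.Action) (bool, runtime.Object, error) {
		defer func() { reqcount++ }()
		if reqcount >= maxRefusals {
			return true, nil, fmt.Errorf("requests came in more than %d times", maxRefusals)
		}
		return true, nil, fmt.Errorf("connect: %w", syscall.ECONNREFUSED)
	})
	return client
}
```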

@sanposhiho
Member Author

@alculquicondor Fixed as suggested. 🙏

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 14, 2022
@alculquicondor
Member

Please rebase....
But note that I'm giving priority to KEP reviews this week.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 17, 2022
@sanposhiho
Member Author

rebased and squashed.

@sanposhiho
Member Author

Fixed bad rebase.

@sanposhiho
Member Author

@alculquicondor Could you check this when you have a chance 🙏

@alculquicondor
Member

/lgtm

Thanks

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 11, 2022
@Huang-Wei
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 18, 2022
@sanposhiho
Member Author

sanposhiho commented Jul 19, 2022

/retest

#108891

@k8s-ci-robot k8s-ci-robot merged commit 92cb0ae into kubernetes:master Jul 19, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone Jul 19, 2022
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unschedulable Pod might take a long time to get the condition set
7 participants