
Retry when it fails to update pods status on scheduling loop #109832

Merged
merged 1 commit into kubernetes:master from the retry-on-update branch on Jul 19, 2022

Conversation


@sanposhiho sanposhiho commented May 5, 2022

What type of PR is this?

/kind bug

What this PR does / why we need it:

Retry with exponential backoff when updating a Pod's status fails in the scheduling loop.


The scheduler doesn't retry Pod status updates in the scheduling loop. If it fails to set a Pod's status to unschedulable for some reason (e.g., a flaky connection to the apiserver), the next chance to update the status to unschedulable is when the Pod is scheduled again.
This means that, in the worst case, the Pod's status may not be updated for 5 minutes. (ref #108761)

Which issue(s) this PR fixes:

Fixes #109796

Special notes for your reviewer:

It's difficult for me to determine which errors are retriable; I used EventBroadcaster as a reference for the implementation.
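For illustration, a minimal sketch of the retry-with-exponential-backoff shape described above (not the PR's actual code); updateStatusWithBackoff, updateStatus, and isRetriable are hypothetical placeholders:

```go
package example

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// updateStatusWithBackoff retries a status update with exponential backoff.
// updateStatus and isRetriable stand in for the real patch call and the
// "is this error worth retrying?" predicate.
func updateStatusWithBackoff(updateStatus func() error, isRetriable func(error) bool) error {
	backoff := wait.Backoff{Duration: 100 * time.Millisecond, Factor: 2.0, Jitter: 0.1, Steps: 5}
	var lastErr error
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		lastErr = updateStatus()
		switch {
		case lastErr == nil:
			return true, nil // success, stop retrying
		case isRetriable(lastErr):
			return false, nil // wait for the next backoff step and try again
		default:
			return false, lastErr // non-retriable, give up immediately
		}
	})
	if err == wait.ErrWaitTimeout {
		return lastErr // surface the last patch error rather than the generic timeout
	}
	return err
}
```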

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 5, 2022
@k8s-ci-robot
Contributor

@sanposhiho: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 5, 2022
@sanposhiho
Member Author

/cc @alculquicondor

pkg/scheduler/util/utils.go: 4 outdated review threads (resolved)
@sanposhiho
Member Author

@alculquicondor
Thanks for the review. I addressed your comments and simplified the retry logic.
Please take another look.

pkg/scheduler/util/utils_test.go: 2 outdated review threads (resolved)
@@ -90,8 +93,16 @@ func MoreImportantPod(pod1, pod2 *v1.Pod) bool {
return GetPodStartTime(pod1).Before(GetPodStartTime(pod2))
}

const (
// Parameters for retrying with exponential backoff.
retryBackoffInitialDuration = 100 * time.Millisecond
Member

Actually, let's make this smaller. You are already sleeping the time that the server is telling you to retry.

0.1 seconds is way too much time considering that the scheduler is able to schedule ~300 pods/s if the apiserver is healthy.

maybe the factor can be 1.3
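As a rough sketch (illustrative values only, not necessarily what the PR settled on), these knobs map onto wait.Backoff like this:

```go
package example

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Two illustrative backoff shapes from the discussion above.
var (
	// 100ms start, doubling: 100ms, 200ms, 400ms, 800ms, 1.6s
	originalBackoff = wait.Backoff{Duration: 100 * time.Millisecond, Factor: 2.0, Steps: 5}
	// smaller start with the suggested gentler factor: 10ms, 13ms, ~17ms, ~22ms, ~29ms
	suggestedBackoff = wait.Backoff{Duration: 10 * time.Millisecond, Factor: 1.3, Steps: 5}
)
```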

Member Author
@sanposhiho sanposhiho May 6, 2022

0.1 seconds is way too much time considering that the scheduler is able to schedule ~300 pods/s if the apiserver is healthy.

Makes sense. I'll change it to 10*time.Millisecond.

Member

Did you remove the comment about calling this in a routine? I think that's a good idea. But do it in the caller (the scheduler package, in this case).

Member Author
@sanposhiho sanposhiho May 6, 2022

Ah, yeah, I removed my previous comment because, if we do that and the parallel status update completes after another scheduling cycle has started, I guess the pod would be moved from unschedulableQ to activeQ by the event handler? (I realized this concern after I posted my now-deleted comment 😓)
wdyt?

Member Author
@sanposhiho sanposhiho May 6, 2022

Okay, it seems my concern was wrong. Status updates are ignored and the pod won't be moved in that case.

func (p *PriorityQueue) Update(oldPod, newPod *v1.Pod) error {

func isPodUpdated(oldPod, newPod *v1.Pod) bool {

I will change the scheduler pkg's implementation to update the pod status in a goroutine.

Member Author

Should I also revert the retryBackoffInitialDuration to 0.1 seconds?

Member

yeah, 100ms sounds good.

if typedErr, ok := err.(*errors.StatusError); ok {
var retryAfterSeconds int32
if typedErr.Status().Details != nil {
retryAfterSeconds = typedErr.Status().Details.RetryAfterSeconds
Member

if the api-server is unreachable, who sets this value? the client?

Member Author
@sanposhiho sanposhiho May 6, 2022

If the api-server is unreachable, the error type will not be *errors.StatusError, so we won't reach here.
In that case, the scheduler will wait for the backoff duration and retry.

pkg/scheduler/util/utils.go: outdated review thread (resolved)
@sanposhiho sanposhiho force-pushed the retry-on-update branch 3 times, most recently from e1490c8 to 382d648 on May 6, 2022 03:31
@sanposhiho
Member Author

/retest

@sanposhiho
Member Author

(Retest due to #109848 and #109847.)

pkg/scheduler/util/utils_test.go: outdated review thread (resolved)
@sanposhiho
Member Author

sanposhiho commented May 6, 2022

@alculquicondor @ahg-g
Could you please take another look at this?

  • changed retryBackoffInitialDuration to 0.1 seconds
  • ran updatePod in a goroutine (pkg/scheduler/schedule_one.go)
  • applied the suggestion on pkg/scheduler/util/utils_test.go

pkg/scheduler/schedule_one.go: outdated review thread (resolved)
}
// Otherwise we can retry after retryAfterSeconds.
klog.ErrorS(err, "Server rejected Pod patch (may retry after sleeping)", "pod", klog.KObj(old))
time.Sleep(time.Duration(retryAfterSeconds))
Member

you could actually use the backoff.Duration to find the diff. The field is updated every time.

Member Author
@sanposhiho sanposhiho May 6, 2022

you could actually use the backoff.Duration to find the diff. The field is updated every time.

I guessed so too, but it doesn't seem to be updated every time, since both wait.ExponentialBackoff() and retry.OnError() receive a non-pointer wait.Backoff (not a *wait.Backoff)...

func OnError(backoff wait.Backoff, retriable func(error) bool, fn func() error) error {

func ExponentialBackoff(backoff Backoff, condition ConditionFunc) error {

Member

oh I missed that :(

Do you have any idea how long retryAfterSeconds usually is?

Member Author
@sanposhiho sanposhiho May 6, 2022

Do you have any idea how long retryAfterSeconds usually is?

It depends on the type of error:

  • GenerateNameConflict → 1s
  • TooManyRequests → 10s
  • ServerTimeout → 2s
  • Timeout → 10s

(investigated by searching the usages of the NewXXXXX constructors defined in k8s.io/apimachinery/pkg/api/errors/errors.go)
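As a sketch of what reading that server-suggested delay looks like in code (the helper name retryAfterSecondsOf is made up for illustration; apimachinery's errors package also provides a SuggestsClientDelay helper that serves a similar purpose):

```go
package example

import (
	"errors"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// retryAfterSecondsOf extracts the server-suggested delay, if any, from an
// API error such as one built with NewTooManyRequests(...) or NewServerTimeout(...).
func retryAfterSecondsOf(err error) (int32, bool) {
	var statusErr *apierrors.StatusError
	if errors.As(err, &statusErr) && statusErr.Status().Details != nil {
		return statusErr.Status().Details.RetryAfterSeconds, true
	}
	return 0, false
}
```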

Member

Ah, that's great!

I'm happy with the backoffs you have set.

/lgtm
/approve
/hold
@ahg-g anything to add?

Member
@Huang-Wei Huang-Wei May 8, 2022

I'm curious about the exact failure reasons, b/c a "PATCH" operation is a pretty lightweight call that won't hit errors like ErrConflict, as it doesn't compare objects' resourceVersions. Based on a quick search of a failure log, it all shows:

weih@m1max:/tmp|⇒  grep "Error updating pod" build-log.txt.2
E0504 17:05:35.053963  119469 schedule_one.go:832] "Error updating pod" err="Patch \"http://127.0.0.1:40021/api/v1/namespaces/postfilter1-08d6ab0a-daf1-4df2-acf9-0e0fd8ca453e/pods/test-pod/status\": dial tcp 127.0.0.1:40021: connect: connection refused" pod="postfilter1-08d6ab0a-daf1-4df2-acf9-0e0fd8ca453e/test-pod"
E0504 17:05:39.733402  119469 schedule_one.go:832] "Error updating pod" err="Patch \"http://127.0.0.1:33249/api/v1/namespaces/postfilter2-00d09641-4bc2-4c7f-a980-f194e12091d1/pods/test-pod/status\": dial tcp 127.0.0.1:33249: connect: connection refused" pod="postfilter2-00d09641-4bc2-4c7f-a980-f194e12091d1/test-pod"
E0504 17:05:44.416195  119469 schedule_one.go:832] "Error updating pod" err="Patch \"http://127.0.0.1:39869/api/v1/namespaces/postfilter3-c090a3c6-5479-435a-8fb8-3d36e8ff9be7/pods/test-pod/status\": dial tcp 127.0.0.1:39869: connect: connection refused" pod="postfilter3-c090a3c6-5479-435a-8fb8-3d36e8ff9be7/test-pod"
E0504 17:07:41.533230  119469 schedule_one.go:832] "Error updating pod" err="Patch \"http://127.0.0.1:37893/api/v1/namespaces/permit-plugins16c02792-1845-48c3-9be5-904831c2c383/pods/test-pod/status\": dial tcp 127.0.0.1:37893: connect: connection refused" pod="permit-plugins16c02792-1845-48c3-9be5-904831c2c383/test-pod"

Member

If we can figure out the exact failure reasons, I'd propose restricting the retry logic to those particular error(s), like the "connection refused" one:

func IsConnectionRefused(err error) bool {
can directly pin the error.

Member Author

I'm curious about the exact failure reasons,

If we can figure out the exact failure reasons, I'd propose restricting the retry logic to those particular error(s), like the "connection refused" one

At least the error we have seen is "connection refused".
#109783 (comment)

But when the api-server is very busy, it could return ServerTimeout or TooManyRequests errors, and we'd face the same issue as #109796.
So I think it's better to retry not only on the connection refused error, but also on errors that contain retryAfterSeconds.

Member Author

But the current implementation retries on all unknown errors (i.e., all errors whose type is not *errors.StatusError), so it would make sense to change it to retry only on connection refused when the error is not a *errors.StatusError.

@sanposhiho
Member Author

/retest

@sanposhiho
Member Author

Flaky test: #109783.
Rebased this PR to include the change from #109834.

@sanposhiho
Member Author

Next flaky test...
#109889

@sanposhiho
Member Author

/retest

@sanposhiho
Member Author

sanposhiho commented May 8, 2022

/retest

Flaky: #109182

@@ -115,13 +126,48 @@ func PatchPodStatus(cs kubernetes.Interface, old *v1.Pod, newStatus *v1.PodStatu
return nil
}

_, err = cs.CoreV1().Pods(old.Namespace).Patch(context.TODO(), old.Name, types.StrategicMergePatchType, patchBytes, metav1.PatchOptions{}, "status")
return err
backoff := wait.Backoff{
Member

Can we use retry.DefaultBackoff?

Member

10ms, 50ms, 250ms, 1.25s SGTM.
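For reference, those numbers match client-go's default backoff, which is defined approximately like this (check the vendored k8s.io/client-go/util/retry source for the exact values):

```go
package example

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Approximately what k8s.io/client-go/util/retry.DefaultBackoff looks like:
// 10ms, 50ms, 250ms, 1.25s, each with ~10% jitter.
var defaultBackoff = wait.Backoff{
	Steps:    4,
	Duration: 10 * time.Millisecond,
	Factor:   5.0,
	Jitter:   0.1,
}
```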

}
if nnnNeedsUpdate {
podStatusCopy.NominatedNodeName = nominatingInfo.NominatedNodeName
}
return util.PatchPodStatus(client, pod, podStatusCopy)
go func() {
Member

I'm a bit concerned with this goroutine, which may cause unintended behavior. It's better to keep it a blocking call.

Member

Can you elaborate?

Member

If you want this function to be used with retries, I think you should do it as a wait.ConditionFunc.
If you spawn this as a goroutine, how do you know when to stop and declare an error? Or when it was successful and you should not retry? Or whether you leak goroutines and create a new one even though the previous one didn't finish?

Member

We are using client-go's retry.OnError, which uses a wait.ConditionFunc underneath.

or when it was successful and you should not retry?

see the update in pkg/scheduler/utils

or if you leak goroutines and create a new one despite the previous didn't finish?

We only retry 6 times (but will probably change it to 4); is that a big concern?

Member

Can you elaborate?

When it comes to goroutines, it increases uncertainty. For the changes introduced here, I don't see a significant benefit that can outweigh the uncertainty.

If we agree it's a transient error, retrying in a blocking manner makes more sense to me. Whereas, if the error is persistent, retrying in goroutines would increase the burden on the APIServer.

Member Author

Okay, understood. If a new request to update a Pod (like a bind request or another patch request from the next scheduling cycle) is sent from the scheduler (and the Pod is updated successfully) while the past one has not yet finished its retries, the new Pod state set by the new request may be overwritten by the old request. That's a problem.

If we agree it's a transient error, retrying in a blocking manner makes more sense. Whereas, if the error is persistent, retrying in goroutines would increase the burden on the APIServer.

the scheduler will not be able to make too much progress with the api-server not reachable anyways.

These make sense. When the api-server is unreachable from the scheduler, other requests are likely to fail as well, making it difficult for the scheduler to continue scheduling successfully. And putting more load on the api-server with many requests from these goroutines is no solution.

I'll update the PR to make the retry here blocking.

Member

Just FYI, maybe it's already known, but client-go already implements retries: https://gist.github.com/aojea/31ab71c894c15f46a567c5e8aa235a17

Member Author

Oh, I didn't know that...

client-go implements some retry logic for requests it considers retryable, generally:

  • no write requests (PUT, POST); a GET is retryable
  • transient network errors: connection reset by peer or EOF
  • retryable http errors: 50x code with a Retry-After header

So, we need to retry on "connection refused" ourselves, but client-go already retries for ~10s on errors that carry Retry-After, like ServerTimeout.
This means the current implementation retries on top of what client-go already does. Can we remove our retry in that case? What do you all think?

Member

Do we have to do anything to benefit from the builtin retries?

Is it the case that "connection refused" is not included in the retriable errors?

Member Author

Do we have to do anything to benefit from the builtin retries?

No. (right? @aojea)

Is it the case that "connection refused" is not included in the retriable errors?

I guess this is also no, because "connection refused" is not returned by the api-server and it doesn't come with a Retry-After header.


}

// DeletePod deletes the given <pod> from API server
func DeletePod(cs kubernetes.Interface, pod *v1.Pod) error {
-return cs.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{})
+return cs.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, metav1.DeleteOptions{})
Member

this makes no difference. (in theory, we should pass in a context)

@sanposhiho
Member Author

@Huang-Wei @alculquicondor @aojea

Sorry for leaving this for a while.
Given the discussion, I changed the implementation to retry only on the "connection refused" error.
Please take another look 🙏
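As a rough, simplified sketch of what "retry only on connection refused" looks like with client-go's helpers (patchPodStatus here is a stand-in, not the exact diff):

```go
package example

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	utilnet "k8s.io/apimachinery/pkg/util/net"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// patchPodStatus patches the pod's status subresource, retrying with
// retry.DefaultBackoff only while the error is "connection refused".
func patchPodStatus(cs kubernetes.Interface, pod *v1.Pod, patchBytes []byte) error {
	return retry.OnError(retry.DefaultBackoff, utilnet.IsConnectionRefused, func() error {
		_, err := cs.CoreV1().Pods(pod.Namespace).Patch(context.TODO(), pod.Name,
			types.StrategicMergePatchType, patchBytes, metav1.PatchOptions{}, "status")
		return err
	})
}
```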

@alculquicondor
Member

can you fix the verify?

@sanposhiho
Member Author

@alculquicondor

can you fix the verify?

Thanks for pinging me. (I wasn't aware of it. 🙏 🙏 🙏 )
Fixed.

@@ -91,7 +93,7 @@ func MoreImportantPod(pod1, pod2 *v1.Pod) bool {
}

// PatchPodStatus calculates the delta bytes change from <old.Status> to <newStatus>,
// and then submit a request to API server to patch the pod changes.
// and then submit a request to API server to patch the pod changes with retries.
Member

Suggested change
// and then submit a request to API server to patch the pod changes with retries.
// and then submit a request to API server to patch the pod changes with retries when connection is refused.

client.PrependReactor("patch", "pods", func(action clienttesting.Action) (bool, runtime.Object, error) {
defer func() { reqcount++ }()
if reqcount >= 4 {
// return error if requests comes in more than six times.
Member

four?

But why change the error? It should stop retrying even if the error is the same.

Member Author

Thanks. Fixed typo. 🙏


But why change the error? It should stop retrying even if the error is the same.

This "requests comes in more than four times." error shouldn't be returned in the expected scenario.

The expected scenario is:

  1. The PatchPodStatus func sends a patch request to the fake client.
  2. The fake client returns a "connection refused" error.
  3. PatchPodStatus retries (1) three times, and the fake client returns "connection refused" errors for those three requests.
  4. PatchPodStatus gives up retrying and returns the "connection refused" error. (The test then checks whether the returned error is "connection refused".)

PrependReactor returns the "connection refused" error only four times, and if there is a bug like "the system retries forever", the "requests comes in more than four times" error will be returned and this test case will fail (because the test checks whether the returned error is "connection refused").
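As a sketch of the reactor pattern described above (newRefusingClient is a hypothetical helper, simplified from the test under discussion): the fake client refuses the first few PATCH calls and flags any extra call with a sentinel error, so a retry-forever bug surfaces as the wrong returned error.

```go
package example

import (
	"fmt"
	"syscall"

	"k8s.io/apimachinery/pkg/runtime"
	clientsetfake "k8s.io/client-go/kubernetes/fake"
	clienttesting "k8s.io/client-go/testing"
)

// newRefusingClient returns a fake clientset whose first maxRefusals PATCH
// calls on pods fail with a wrapped ECONNREFUSED; any further call gets a
// sentinel error, which makes a "retries forever" bug fail the test.
func newRefusingClient(maxRefusals int) *clientsetfake.Clientset {
	client := clientsetfake.NewSimpleClientset()
	reqcount := 0
	client.PrependReactor("patch", "pods", func(action clienttesting.Action) (bool, runtime.Object, error) {
		defer func() { reqcount++ }()
		if reqcount >= maxRefusals {
			return true, nil, fmt.Errorf("requests came in more than %d times", maxRefusals)
		}
		return true, nil, fmt.Errorf("connect: %w", syscall.ECONNREFUSED)
	})
	return client
}
```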

@sanposhiho
Member Author

@alculquicondor Fixed as suggested. 🙏

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 14, 2022
@alculquicondor
Member

Please rebase....
But note that I'm giving priority to KEP reviews this week.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 17, 2022
@sanposhiho
Member Author

rebased and squashed.

@sanposhiho
Member Author

Fixed bad rebase.

@sanposhiho
Member Author

@alculquicondor Could you check this when you have a chance 🙏

@alculquicondor
Member

/lgtm

Thanks

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 11, 2022
@Huang-Wei
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 18, 2022
@sanposhiho
Member Author

sanposhiho commented Jul 19, 2022

/retest

#108891

@k8s-ci-robot k8s-ci-robot merged commit 92cb0ae into kubernetes:master Jul 19, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone Jul 19, 2022
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unschedulable Pod might take a long time to get the condition set
7 participants