Retry when it fails to update pods status on scheduling loop #109832
Conversation
@sanposhiho: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc @alculquicondor
(force-pushed from fa14703 to f1fbc71)
@alculquicondor
pkg/scheduler/util/utils.go
Outdated
@@ -90,8 +93,16 @@ func MoreImportantPod(pod1, pod2 *v1.Pod) bool {
	return GetPodStartTime(pod1).Before(GetPodStartTime(pod2))
}

+const (
+	// Parameters for retrying with exponential backoff.
+	retryBackoffInitialDuration = 100 * time.Millisecond
Actually, let's make this smaller. You are already sleeping the time that the server is telling you to retry.
0.1 seconds is way too much time considering that the scheduler is able to schedule ~300 pods/s if the apiserver is healthy.
maybe the factor can be 1.3
0.1 seconds is way too much time considering that the scheduler is able to schedule ~300 pods/s if the apiserver is healthy.
Makes sense. Changed it to 10*time.Millisecond.
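For reference, a minimal sketch of what the discussed parameters could look like with apimachinery's wait.Backoff (the Jitter and Steps values here are illustrative assumptions, not taken from the PR):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// 10ms initial duration and factor 1.3, as discussed above;
	// Jitter and Steps are assumed values for this sketch.
	backoff := wait.Backoff{
		Duration: 10 * time.Millisecond,
		Factor:   1.3,
		Jitter:   0.1,
		Steps:    5,
	}
	// Step() returns the next delay and advances the backoff state,
	// so this prints the sequence of waits a retry loop would use.
	for backoff.Steps > 0 {
		fmt.Println(backoff.Step())
	}
}
```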
Did you remove the comment about calling this in a routine? I think that's a good idea. But do it in the caller (the scheduler package, in this case).
Ah, yeah, I removed my previous comments because, if we do so and the parallel status update completes after another scheduling cycle has started, I guess the pod will be moved from unschedulableQ to activeQ by the event handler? (I realized this concern after I posted my deleted comment 😓 )
wdyt?
Okay, it seems my concern was wrong. Any status updates are ignored and the pod won't be moved in that case.
func (p *PriorityQueue) Update(oldPod, newPod *v1.Pod) error {
func isPodUpdated(oldPod, newPod *v1.Pod) bool {
I will change the scheduler package's implementation to update the pod status in a goroutine.
Should I also revert the retryBackoffInitialDuration to 0.1 seconds?
yeah, 100ms sounds good.
pkg/scheduler/util/utils.go
Outdated
if typedErr, ok := err.(*errors.StatusError); ok {
	var retryAfterSeconds int32
	if typedErr.Status().Details != nil {
		retryAfterSeconds = typedErr.Status().Details.RetryAfterSeconds
if the api-server is unreachable, who sets this value? the client?
If the api-server is unreachable, the error type will not be *errors.StatusError, so we won't reach here. In that case, the scheduler will wait for a backoff period and retry again.
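To make that branching concrete, here is a hedged sketch (the helper name is mine, not the PR's) of how a *errors.StatusError can be separated from transport-level errors like "connection refused":

```go
package util

import (
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// retryAfter is an illustrative helper: it reports whether err came from
// the apiserver as a *StatusError and, if so, the server-suggested wait.
func retryAfter(err error) (time.Duration, bool) {
	typedErr, ok := err.(*apierrors.StatusError)
	if !ok {
		// e.g. "connection refused": not a StatusError, so the caller
		// falls back to plain exponential backoff.
		return 0, false
	}
	if details := typedErr.Status().Details; details != nil {
		// Note the unit conversion: RetryAfterSeconds is an int32 count
		// of seconds, so it must be multiplied by time.Second.
		return time.Duration(details.RetryAfterSeconds) * time.Second, true
	}
	return 0, true
}
```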
(force-pushed from e1490c8 to 382d648)
/retest
(force-pushed from 382d648 to a0b1bdc)
@alculquicondor @ahg-g
pkg/scheduler/util/utils.go
Outdated
}
// Otherwise we can retry after retryAfterSeconds.
klog.ErrorS(err, "Server rejected Pod patch (may retry after sleeping)", "pod", klog.KObj(old))
time.Sleep(time.Duration(retryAfterSeconds))
You could actually use the backoff.Duration to find the diff. The field is updated every time.
you could actually use the backoff.Duration to find the diff. The field is updated every time.
I guessed so too, but it doesn't seem to be updated every time, since both wait.ExponentialBackoff() and retry.OnError() receive a non-pointer wait.Backoff (not a pointer *wait.Backoff)...
func OnError(backoff wait.Backoff, retriable func(error) bool, fn func() error) error {
func ExponentialBackoff(backoff Backoff, condition ConditionFunc) error {
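A tiny self-contained demonstration of that pass-by-value point (an assumed example, not PR code): Step() advances only the callee's copy, leaving the caller's Backoff unchanged:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// consume takes wait.Backoff by value, just like retry.OnError and
// wait.ExponentialBackoff do, so its mutations stay local.
func consume(b wait.Backoff) {
	for b.Steps > 0 {
		b.Step() // mutates only the local copy
	}
}

func main() {
	b := wait.Backoff{Duration: 10 * time.Millisecond, Factor: 2, Steps: 3}
	consume(b)
	fmt.Println(b.Duration, b.Steps) // still 10ms and 3 in the caller
}
```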
Oh, I missed that :(
Do you have any idea how long retryAfterSeconds usually is?
Do you have any idea how long usually retryAfterSeconds is?
It depends on the type of error:
GenerateNameConflict → 1s
TooManyRequests → 10s
ServerTimeout → 2s
Timeout → 10s
(Investigated by searching for usages of the NewXXXXX constructors defined in k8s.io/apimachinery/pkg/api/errors/errors.go.)
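Those numbers can be sanity-checked against the constructors themselves — a quick sketch, assuming the signatures below match that package (they may drift between releases):

```go
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	// TooManyRequests carries the retry-after hint its caller passes in.
	tmr := apierrors.NewTooManyRequests("too much load", 10)
	fmt.Println(tmr.Status().Details.RetryAfterSeconds) // 10

	// ServerTimeout does as well (2 is what the callers found above pass).
	st := apierrors.NewServerTimeout(schema.GroupResource{Resource: "pods"}, "patch", 2)
	fmt.Println(st.Status().Details.RetryAfterSeconds) // 2
}
```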
Ah, that's great!
I'm happy with the backoffs you have set.
/lgtm
/approve
/hold
@ahg-g anything to add?
I'm curious about the exact failure reasons, b/c a "PATCH" operation is a pretty lightweight call that won't hit errors like ErrConflict, as it doesn't compare objects' resourceVersions. Based on a quick search of a failure log, it all shows:
weih@m1max:/tmp|⇒ grep "Error updating pod" build-log.txt.2
E0504 17:05:35.053963 119469 schedule_one.go:832] "Error updating pod" err="Patch \"http://127.0.0.1:40021/api/v1/namespaces/postfilter1-08d6ab0a-daf1-4df2-acf9-0e0fd8ca453e/pods/test-pod/status\": dial tcp 127.0.0.1:40021: connect: connection refused" pod="postfilter1-08d6ab0a-daf1-4df2-acf9-0e0fd8ca453e/test-pod"
E0504 17:05:39.733402 119469 schedule_one.go:832] "Error updating pod" err="Patch \"http://127.0.0.1:33249/api/v1/namespaces/postfilter2-00d09641-4bc2-4c7f-a980-f194e12091d1/pods/test-pod/status\": dial tcp 127.0.0.1:33249: connect: connection refused" pod="postfilter2-00d09641-4bc2-4c7f-a980-f194e12091d1/test-pod"
E0504 17:05:44.416195 119469 schedule_one.go:832] "Error updating pod" err="Patch \"http://127.0.0.1:39869/api/v1/namespaces/postfilter3-c090a3c6-5479-435a-8fb8-3d36e8ff9be7/pods/test-pod/status\": dial tcp 127.0.0.1:39869: connect: connection refused" pod="postfilter3-c090a3c6-5479-435a-8fb8-3d36e8ff9be7/test-pod"
E0504 17:07:41.533230 119469 schedule_one.go:832] "Error updating pod" err="Patch \"http://127.0.0.1:37893/api/v1/namespaces/permit-plugins16c02792-1845-48c3-9be5-904831c2c383/pods/test-pod/status\": dial tcp 127.0.0.1:37893: connect: connection refused" pod="permit-plugins16c02792-1845-48c3-9be5-904831c2c383/test-pod"
If we can figure out the exact failure reasons, I'd propose restricting the retry logic to those particular error(s), like the "connection refused" one:
func IsConnectionRefused(err error) bool {
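Something along these lines — a sketch assuming apimachinery's utilnet.IsConnectionRefused and client-go's retry.OnError are used, with patchFn standing in for the actual status patch call:

```go
package util

import (
	utilnet "k8s.io/apimachinery/pkg/util/net"
	"k8s.io/client-go/util/retry"
)

// patchWithRetry retries patchFn only when the error is a refused
// connection; any other error is returned to the caller immediately.
func patchWithRetry(patchFn func() error) error {
	return retry.OnError(retry.DefaultBackoff, utilnet.IsConnectionRefused, patchFn)
}
```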
I'm curious about the exact failure reasons,
If we can figure the exact failure reasons, I'd propose to restrict the retry logic to those particular error(s), like the "connection refused" one
At least the error we have seen is "connection refused".
#109783 (comment)
But when the api-server is very busy it could return ServerTimeout or TooManyRequests errors, and we'd face the same issue as #109796.
So, I think it's better to retry not only for the connection refused error, but also for errors which contain retryAfterSeconds.
But the current implementation retries for all unknown errors (i.e., all errors whose type is not *errors.StatusError), so it would make sense to change it to retry only for connection refused if the error is not *errors.StatusError.
/retest
(force-pushed from 9dc18c7 to e7dbcfe)
Next flaky test...
/retest
/retest Flaky: #109182
(force-pushed from e7dbcfe to 9a622e4)
pkg/scheduler/util/utils.go
Outdated
@@ -115,13 +126,48 @@ func PatchPodStatus(cs kubernetes.Interface, old *v1.Pod, newStatus *v1.PodStatu
	return nil
}

-	_, err = cs.CoreV1().Pods(old.Namespace).Patch(context.TODO(), old.Name, types.StrategicMergePatchType, patchBytes, metav1.PatchOptions{}, "status")
-	return err
+	backoff := wait.Backoff{
Can we use retry.DefaultBackoff?
10ms, 50ms, 250ms, 1.25s SGTM.
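For reference, that sequence comes from the definition of retry.DefaultBackoff in client-go (as of this discussion):

```go
// From k8s.io/client-go/util/retry: 10ms * 5^n with 10% jitter,
// over 4 steps -> roughly 10ms, 50ms, 250ms, 1.25s.
var DefaultBackoff = wait.Backoff{
	Steps:    4,
	Duration: 10 * time.Millisecond,
	Factor:   5.0,
	Jitter:   0.1,
}
```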
pkg/scheduler/schedule_one.go
Outdated
}
if nnnNeedsUpdate {
	podStatusCopy.NominatedNodeName = nominatingInfo.NominatedNodeName
}
return util.PatchPodStatus(client, pod, podStatusCopy)
go func() {
I'm a bit concerned with this goroutine, which may cause unintended behavior. It's better to keep it a blocking call.
Can you elaborate?
If you want to make this function usable with retries, I think you should do it as a wait.ConditionFunc.
If you spawn this as a goroutine, how do you know when to stop and declare an error? Or when it was successful and you should not retry? Or whether you leak goroutines and create a new one even though the previous one didn't finish?
We are using client-go's retry.OnError, which uses a wait.ConditionFunc underneath.
or when it was successful and you should not retry?
see the update in pkg/scheduler/utils
or if you leak goroutines and create a new one despite the previous didn't finish?
We only retry 6 times (but will probably change to 4); is that a big concern?
Can you elaborate?
When it comes to goroutines, they increase uncertainty. For the changes introduced here, I don't see a benefit significant enough to outweigh that uncertainty.
If we agree it's a transient error, retrying in a blocking manner makes more sense to me; while if the error is persistent, retrying in goroutines would increase the burden on the APIServer.
Okay, understood. If a new request to update a Pod (like a bind request, or another patch request from the next scheduling cycle) is sent from the scheduler (and the Pod is updated successfully) while the past one is still being retried, the new Pod state set by the new request may be overwritten by the past request. That's a problem.
If we agree it's a transient error, retrying in a blocking manner makes more sense; while if the error is persistent, retrying in goroutines would increase the burden on the APIServer.
the scheduler will not be able to make too much progress with the api-server not reachable anyways.
These make sense. When the api-server is unreachable from the scheduler, other requests are likely to fail as well, making it difficult for the scheduler to continue scheduling successfully. And putting more load on the api-server with many requests from these goroutines is no solution.
I'll update the PR to retry in a blocking manner here.
Just FYI, maybe it's already known, but client-go already implements retries: https://gist.github.com/aojea/31ab71c894c15f46a567c5e8aa235a17
Oh, I didn't know that...
client-go already implements some retry logic for requests it considers retryable, generally:
- no write requests (PUT, POST); a GET is retryable
- transient network errors: connection reset by peer or EOF
- retryable http errors: 50x codes with a Retry-After header
So, we need to retry for "connection refused" ourselves, but client-go already retries for 10s for errors that contain Retry-After, like ServerTimeout.
This means the current implementation retries on top of what client-go does. Can we remove the retry in this case? What do you all think?
Do we have to do anything to benefit from the builtin retries?
Is it the case that "connection refused" is not included in the retriable errors?
Do we have to do anything to benefit from the builtin retries?
No. (right? @aojea)
Is it the case that "connection refused" is not included in the retriable errors?
I guess this is also No, because "connection refused" is not returned from the api-server, and it doesn't have a Retry-After header.
pkg/scheduler/util/utils.go
Outdated
}

// DeletePod deletes the given <pod> from API server
func DeletePod(cs kubernetes.Interface, pod *v1.Pod) error {
-	return cs.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{})
+	return cs.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, metav1.DeleteOptions{})
this makes no difference. (in theory, we should pass in a context)
@Huang-Wei @alculquicondor @aojea Sorry for leaving this for a while.
Can you fix the verify failure?
(force-pushed from 1c022c2 to 4bd39bb)
Thanks for pinging me. (I wasn't aware of it. 🙏 🙏 🙏 )
pkg/scheduler/util/utils.go
Outdated
@@ -91,7 +93,7 @@ func MoreImportantPod(pod1, pod2 *v1.Pod) bool {
}

// PatchPodStatus calculates the delta bytes change from <old.Status> to <newStatus>,
-// and then submit a request to API server to patch the pod changes.
+// and then submit a request to API server to patch the pod changes with retries.
Suggested change:
-// and then submit a request to API server to patch the pod changes with retries.
+// and then submit a request to API server to patch the pod changes with retries when connection is refused.
pkg/scheduler/util/utils_test.go
Outdated
client.PrependReactor("patch", "pods", func(action clienttesting.Action) (bool, runtime.Object, error) {
	defer func() { reqcount++ }()
	if reqcount >= 4 {
		// return error if requests comes in more than six times.
four?
But why change the error? It should stop retrying even if the error is the same.
Thanks. Fixed the typo. 🙏
But why change the error? It should stop retrying even if the error is the same.
This "requests comes in more than four times." error shouldn't be returned in the expected scenario.
The expected scenario is:
1. PatchPodStatus sends a patch request to the fake client.
2. The fake client returns a "connection refused" error.
3. PatchPodStatus retries step 1 three times, and the fake client returns "connection refused" errors for those three requests as well.
4. PatchPodStatus gives up retrying and returns the "connection refused" error. (The test checks whether the returned error is "connection refused".)
PrependReactor returns the "connection refused" error only four times, so if there is a bug like "the system retries forever", the "requests comes in more than four times." error will be returned and this test case will fail (because the test checks whether the returned error is "connection refused").
@alculquicondor Fixed as suggested. 🙏
Please rebase...
(force-pushed from 8454870 to c75b09a)
Rebased and squashed.
(force-pushed from c75b09a to 8b18f7c)
Fixed bad rebase.
@alculquicondor Could you check this when you have a chance? 🙏
/lgtm Thanks
/hold cancel
/retest
What type of PR is this?
/kind bug
What this PR does / why we need it:
Retry with exponential backoff when the scheduler fails to update a pod's status in the scheduling loop.
The scheduler doesn't retry when updating pod status in the scheduling loop. If it fails to update a pod's status to unschedulable for some reason (e.g., a flaky connection to the apiserver), the next chance for the status to be updated to unschedulable is when the pod is scheduled again.
This means that in the worst case, that Pod status may not be updated for 5 minutes. (ref #108761)
Which issue(s) this PR fixes:
Fixes #109796
Special notes for your reviewer:
It's difficult for me to determine which errors are retriable. I used EventBroadcaster as a reference for the implementation:
kubernetes/staging/src/k8s.io/client-go/tools/record/event.go, Line 258 in d582814
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: