
add timeout for executor signal #13011

Closed
fyp711 opened this issue May 6, 2024 · 7 comments · Fixed by #13012
Labels
area/controller · area/executor · P3 Low priority · type/bug

Comments

@fyp711
Contributor

fyp711 commented May 6, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

The method ExecPodContainerAndGetOutput in signal.go can get stuck: GetExecutorOutput blocks while waiting for the remote stdout output, so the goroutine can block for a long time, or even indefinitely. We should add a timeout to prevent this problem.

This can block all four pod cleanup goroutines (the default worker count), so the pod_cleanup queue depth grows indefinitely.
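For illustration, here is a minimal sketch (not the actual fix in #13012) of how a timeout could bound the exec call. It assumes client-go >= v0.26, which provides StreamWithContext; the helper name execWithTimeout is hypothetical.

```go
package example

import (
	"bytes"
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// execWithTimeout runs a command in a container (e.g. the signal sent during pod
// cleanup) and gives up once the timeout expires instead of waiting on the
// remote stdout/stderr forever.
func execWithTimeout(config *rest.Config, clientset kubernetes.Interface, namespace, pod, container string, command []string, timeout time.Duration) (string, string, error) {
	req := clientset.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace(namespace).
		Name(pod).
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: container,
			Command:   command,
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		return "", "", err
	}

	// The deadline is what prevents a stalled SPDY stream from pinning the
	// calling goroutine (e.g. a pod cleanup worker) indefinitely.
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	var stdout, stderr bytes.Buffer
	err = exec.StreamWithContext(ctx, remotecommand.StreamOptions{
		Stdout: &stdout,
		Stderr: &stderr,
	})
	return stdout.String(), stderr.String(), err
}
```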

Version

main

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

none

Logs from the workflow controller

none

Logs from your workflow's wait container

none
@fyp711
Contributor Author

fyp711 commented May 6, 2024

@agilgur5 Could you help me take a look at this issue? Thanks!

agilgur5 added the area/controller, area/executor, and area/gc labels May 6, 2024
@Joibel
Member

Joibel commented May 7, 2024

Could you explain or give examples of when this is a problem please?

agilgur5 changed the title from "add timeout for SPDY executor stream" to "add timeout for executor signal" May 7, 2024
@fyp711
Contributor Author

fyp711 commented May 8, 2024

Could you explain or give examples of when this is a problem please?

The logic of the first piece of code is to send a kubectl exec command over the SPDY protocol and get its output. While fetching the output, it blocks waiting for the stdout and stderr streams to complete; see this code:
https://github.com/kubernetes/client-go/blob/4ebe42d8c9c18f464fcc7b4f15b3a632db4cbdb2/tools/remotecommand/v4.go#L54-L76

It blocks down here. In some unusual cases the goroutine gets stuck here forever, and if all the pod cleanup workers are blocked here, something very serious will happen:

https://github.com/kubernetes/client-go/blob/4ebe42d8c9c18f464fcc7b4f15b3a632db4cbdb2/tools/remotecommand/v4.go#L67-L72
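As a rough sketch (a hypothetical wrapper, assuming client-go >= v0.26 and an already-constructed remotecommand.Executor), a context deadline is enough to unblock those copy loops:

```go
package example

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"time"

	"k8s.io/client-go/tools/remotecommand"
)

// streamWithDeadline is a hypothetical helper: StreamWithContext returns once the
// deadline fires, even if the stdout/stderr copy loops in v4.go never complete,
// so a pod cleanup worker cannot stay stuck forever.
func streamWithDeadline(exec remotecommand.Executor, timeout time.Duration) (string, string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	var stdout, stderr bytes.Buffer
	err := exec.StreamWithContext(ctx, remotecommand.StreamOptions{
		Stdout: &stdout,
		Stderr: &stderr,
	})
	if errors.Is(err, context.DeadlineExceeded) {
		// Surface the timeout explicitly so the caller can log it and move on.
		return stdout.String(), stderr.String(), fmt.Errorf("signal exec timed out after %s: %w", timeout, err)
	}
	return stdout.String(), stderr.String(), err
}
```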

@fyp711
Contributor Author

fyp711 commented May 8, 2024

@Joibel Please take a look at my comment above. Thank you!

@fyp711
Contributor Author

fyp711 commented May 8, 2024

@Joibel This is already happening in our internal test environment. All pod cleanup workers are blocked, preventing pods from being cleaned.

@Joibel
Member

Joibel commented May 8, 2024

I can see that these might block, and that your fix fixes this.
I'm surprised you're seeing a problem though, and suspect you must have some other issue and that this fix is just going to mask that other problem.
Do you know why you're seeing this on your test environment?
Is there something unusual about what you're doing in the pods that are getting stuck?
Are the nodes they are on being terminated, are they producing a LOT of logs?
Is there intermittent connectivity?
Why do you think that other people don't have this issue? We know lots of users who have high workloads and their controllers can run for weeks or months without problems.

@fyp711
Contributor Author

fyp711 commented May 8, 2024

Is there intermittent connectivity?
Why do you think that other people don't have this issue? We know lots of users who have high workloads and their controllers can run for weeks or months without problems.

@Joibel Thanks for your review. This call executes a kill -15 command; the kill command produces very little log output and does not block. I guess the cause is a problem in the connection through multiple proxies, but I can't find out why, because the pod has actually already been cleaned up. So I think we need to prevent this from happening; it is a good preventive improvement for this kind of situation. What do you think?

First of all, maybe no one here has noticed this problem yet, but it was discovered by users in the Kubernetes community, and the ability to cancel the stream was added there; for details see kubernetes/kubernetes#103177.

Second, I think this is a good optimization point, and we should make some preventive improvements in this code.

agilgur5 removed the area/gc label May 16, 2024
agilgur5 added this to the v3.5.x patches milestone May 16, 2024
agilgur5 added the P3 Low priority label May 16, 2024