
add timeout for executor signal #13011

Closed
fyp711 opened this issue May 6, 2024 · 7 comments · Fixed by #13012
Labels
area/controller · area/executor · P3 Low priority · type/bug

Comments

@fyp711
Contributor

fyp711 commented May 6, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

The method ExecPodContainerAndGetOutput in signal.go can get stuck: GetExecutorOutput blocks while waiting for the remote stdout output, so the goroutine can block for a long time, or even indefinitely. We should add a timeout to prevent this problem.

This can block all four pod cleanup goroutines (the default worker count), so the pod_cleanup queue depth grows indefinitely.
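For illustration, here is a minimal sketch (not the actual fix in #13012) of how a timeout could bound the exec call. It assumes client-go >= v0.26, which provides StreamWithContext; the helper name execWithTimeout is hypothetical.

```go
package example

import (
	"bytes"
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// execWithTimeout runs a command in a container (e.g. the signal sent during pod
// cleanup) and gives up once the timeout expires instead of waiting on the
// remote stdout/stderr forever.
func execWithTimeout(config *rest.Config, clientset kubernetes.Interface, namespace, pod, container string, command []string, timeout time.Duration) (string, string, error) {
	req := clientset.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace(namespace).
		Name(pod).
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: container,
			Command:   command,
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		return "", "", err
	}

	// The deadline is what prevents a stalled SPDY stream from pinning the
	// calling goroutine (e.g. a pod cleanup worker) indefinitely.
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	var stdout, stderr bytes.Buffer
	err = exec.StreamWithContext(ctx, remotecommand.StreamOptions{
		Stdout: &stdout,
		Stderr: &stderr,
	})
	return stdout.String(), stderr.String(), err
}
```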

Version

main

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

none

Logs from the workflow controller

none

Logs from your workflow's wait container

none
@fyp711
Contributor Author

fyp711 commented May 6, 2024

@agilgur5 Could you help me take a look at this issue? Thanks!

agilgur5 added the area/controller, area/executor, and area/gc labels May 6, 2024
@Joibel
Member

Joibel commented May 7, 2024

Could you explain or give examples of when this is a problem please?

agilgur5 changed the title from "add timeout for SPDY executor stream" to "add timeout for executor signal" May 7, 2024
@fyp711
Contributor Author

fyp711 commented May 8, 2024

Could you explain or give examples of when this is a problem please?

The logic of the first piece of code is to send a kubectl exec command over the SPDY protocol and get its output. While fetching the output, it blocks waiting for the stdout and stderr streams to complete; see this code:
https://github.com/kubernetes/client-go/blob/4ebe42d8c9c18f464fcc7b4f15b3a632db4cbdb2/tools/remotecommand/v4.go#L54-L76

It blocks down here. In some unusual cases the goroutine gets stuck here forever, and if all the pod cleanup workers are blocked here, something very serious will happen:

https://github.com/kubernetes/client-go/blob/4ebe42d8c9c18f464fcc7b4f15b3a632db4cbdb2/tools/remotecommand/v4.go#L67-L72
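As a rough sketch (a hypothetical wrapper, assuming client-go >= v0.26 and an already-constructed remotecommand.Executor), a context deadline is enough to unblock those copy loops:

```go
package example

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"time"

	"k8s.io/client-go/tools/remotecommand"
)

// streamWithDeadline is a hypothetical helper: StreamWithContext returns once the
// deadline fires, even if the stdout/stderr copy loops in v4.go never complete,
// so a pod cleanup worker cannot stay stuck forever.
func streamWithDeadline(exec remotecommand.Executor, timeout time.Duration) (string, string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	var stdout, stderr bytes.Buffer
	err := exec.StreamWithContext(ctx, remotecommand.StreamOptions{
		Stdout: &stdout,
		Stderr: &stderr,
	})
	if errors.Is(err, context.DeadlineExceeded) {
		// Surface the timeout explicitly so the caller can log it and move on.
		return stdout.String(), stderr.String(), fmt.Errorf("signal exec timed out after %s: %w", timeout, err)
	}
	return stdout.String(), stderr.String(), err
}
```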

@fyp711
Contributor Author

fyp711 commented May 8, 2024

@Joibel Please take a look at my comment above. Thank you!

@fyp711
Contributor Author

fyp711 commented May 8, 2024

@Joibel This is already happening in our internal test environment. All pod cleanup workers are blocked, preventing pods from being cleaned.

@Joibel
Member

Joibel commented May 8, 2024

I can see that these might block, and that your fix fixes this.
I'm surprised you're seeing a problem though, and suspect you must have some other issue and that this fix is just going to mask that other problem.
Do you know why you're seeing this on your test environment?
Is there something unusual about what you're doing in the pods that are getting stuck?
Are the nodes they are on being terminated, are they producing a LOT of logs?
Is there intermittent connectivity?
Why do you think that other people don't have this issue? We know lots of users who have high workloads and their controllers can run for weeks or months without problems.

@fyp711
Contributor Author

fyp711 commented May 8, 2024

Is there intermittent connectivity?
Why do you think that other people don't have this issue? We know lots of users who have high workloads and their controllers can run for weeks or months without problems.

@Joibel Thanks for your review. This call executes a kill -15 command; the kill command produces very little log output and does not block. I guess the cause is a problem in the connection through multiple proxies, but I can't find out why, because the pod has actually already been cleaned up. So I think we need to prevent this from happening; it is a good preventive improvement for this kind of situation. What do you think?

First of all, maybe no one here has noticed this problem yet, but it was discovered by users in the Kubernetes community, and the ability to cancel the stream was added there; for details see kubernetes/kubernetes#103177.

Second, I think this is a good optimization point, and we should make some preventive improvements in this code.

agilgur5 removed the area/gc label May 16, 2024
agilgur5 added this to the v3.5.x patches milestone May 16, 2024
agilgur5 added the P3 Low priority label May 16, 2024