
ETCD throttling #9781

Open
makzzz1986 opened this issue Oct 10, 2022 · 6 comments
Labels
type/feature Feature request

Comments

@makzzz1986

Summary

Argo Workflows produces a lot of API calls to etcd, and some of these requests can be cancelled, disrupting Workflow transitions between states. You can spot this error on the controller:
cannot validate Workflow: rpc error: code = ResourceExhausted desc = etcdserver: throttle: too many requests

Use Cases

AWS EKS calculates its throttling limits dynamically, and if your K8s cluster is small, etcd can throttle some Workflow updates. For example, we have many CronWorkflows, and etcd can drop the update that transitions a Workflow from the Running state to Finished; the Workflow then gets stuck forever because the controller keeps waiting for it to finish. The pods of such a Workflow are gone once they complete, so no sidecar containers will try to update the status again.

Some retry mechanism should probably be implemented to avoid this and guarantee that the Workflow state transition eventually succeeds (a rough sketch of the idea is below).
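
As an illustration only, not how the controller is actually implemented: a minimal Go sketch of retrying a status update with exponential backoff, using client-go's generic retry helper. `updateWorkflowStatus` is a hypothetical placeholder for the real update call, and treating gRPC `ResourceExhausted` as the retriable condition is an assumption based on the error message above.

```go
package retrysketch

import (
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// isThrottled reports whether err is the etcd throttling error
// ("rpc error: code = ResourceExhausted desc = etcdserver: throttle: too many requests").
// Assumption: the throttle surfaces as a gRPC ResourceExhausted status.
func isThrottled(err error) bool {
	return status.Code(err) == codes.ResourceExhausted
}

// updateWithRetry re-runs updateWorkflowStatus (a hypothetical stand-in for
// the controller's real update call) with exponential backoff for as long as
// the failure looks like etcd throttling.
func updateWithRetry(updateWorkflowStatus func() error) error {
	backoff := wait.Backoff{
		Steps:    5,                      // give up after 5 attempts
		Duration: 200 * time.Millisecond, // initial delay
		Factor:   2.0,                    // double the delay each retry
		Jitter:   0.1,                    // randomize to avoid synchronized retries
	}
	// retry.OnError retries the function while isThrottled(err) returns true.
	return retry.OnError(backoff, isThrottled, updateWorkflowStatus)
}
```

In practice the throttle may also reach the controller as an HTTP 429 via the API server rather than a raw gRPC status, so a real implementation might additionally check `apierrors.IsTooManyRequests(err)` from k8s.io/apimachinery.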

For AWS EKS there is a workaround: scale your cluster up for a short period of time to raise the throttling limit.


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

@makzzz1986 makzzz1986 added the type/feature Feature request label Oct 10, 2022
@makzzz1986
Author

Unfortunately, AWS EKS cluster API limits change dynamically: if your cluster has shrunk, the limits will be decreased as well.

@mcntrn

mcntrn commented Dec 3, 2022

Experienced a similar situation on small EKS clusters (3 EC2 nodes) that use Fargate exclusively to run hundreds of CronWorkflows simultaneously. Apparently, launching too many new nodes simultaneously can overload the EKS control plane and start throwing the too many requests error. Switching the CronWorkflows to run on EC2 instances seems to mitigate the issue (at least for us).

@watkinsmike

watkinsmike commented Apr 5, 2023

@makzzz1986 We are also facing this issue. Did you get any confirmation from AWS about the dynamic API limit resizing and what those thresholds are? What goes into calculating those limits: the number of nodes, the size of nodes, or both?

@makzzz1986
Author

makzzz1986 commented Apr 5, 2023

@watkinsmike I got confirmation from AWS that it can be an issue for small clusters. They also suspect that Argo Workflows sends multiple requests to the same resource at the same moment, which causes throttling. They are able to tweak the limit, and we are checking how it behaves. I suggest you open a support case about it and share its number; I will try to pass it on to the AWS team that owns this case.

@devjerry0

devjerry0 commented Nov 23, 2023

We have been seeing the same issue on a small cluster: 15 m5.2xlarge nodes (8 cores, 32 GB RAM each), with 10 workflows each using 1 core and 512 MB RAM. 8 of those failed with etcd throttling 🤷

@tooptoop4
Contributor

Did you solve this?
