
ETCD throttling #9781

Open
makzzz1986 opened this issue Oct 10, 2022 · 6 comments
Labels
type/feature Feature request

Comments

@makzzz1986

Summary

Argo Workflows produces a lot of API calls to etcd, and some of these requests can be cancelled, disrupting Workflow transitions between states. You can spot this error on the controller:
cannot validate Workflow: rpc error: code = ResourceExhausted desc = etcdserver: throttle: too many requests

Use Cases

AWS EKS calculates its throttling limits dynamically, and if your K8s cluster is small, etcd can throttle some Workflow updates. For example, we have many CronWorkflows, and etcd can drop the update that transitions a Workflow from the Running state to Finished; the Workflow then gets stuck forever because the controller keeps waiting for it to finish. The pods of such a Workflow are gone once they complete, so no sidecar containers will try to update the status again.

Some retry mechanism should probably be implemented to avoid this and guarantee that the Workflow state transition eventually succeeds (a rough sketch of the idea is below).
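
As an illustration only, not how the controller is actually implemented: a minimal Go sketch of retrying a status update with exponential backoff, using client-go's generic retry helper. `updateWorkflowStatus` is a hypothetical placeholder for the real update call, and treating gRPC `ResourceExhausted` as the retriable condition is an assumption based on the error message above.

```go
package retrysketch

import (
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// isThrottled reports whether err is the etcd throttling error
// ("rpc error: code = ResourceExhausted desc = etcdserver: throttle: too many requests").
// Assumption: the throttle surfaces as a gRPC ResourceExhausted status.
func isThrottled(err error) bool {
	return status.Code(err) == codes.ResourceExhausted
}

// updateWithRetry re-runs updateWorkflowStatus (a hypothetical stand-in for
// the controller's real update call) with exponential backoff for as long as
// the failure looks like etcd throttling.
func updateWithRetry(updateWorkflowStatus func() error) error {
	backoff := wait.Backoff{
		Steps:    5,                      // give up after 5 attempts
		Duration: 200 * time.Millisecond, // initial delay
		Factor:   2.0,                    // double the delay each retry
		Jitter:   0.1,                    // randomize to avoid synchronized retries
	}
	// retry.OnError retries the function while isThrottled(err) returns true.
	return retry.OnError(backoff, isThrottled, updateWorkflowStatus)
}
```

In practice the throttle may also reach the controller as an HTTP 429 via the API server rather than a raw gRPC status, so a real implementation might additionally check `apierrors.IsTooManyRequests(err)` from k8s.io/apimachinery.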

For AWS EKS there is a workaround: scale your cluster up for a short period of time to raise the throttling limit.


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

@makzzz1986 makzzz1986 added the type/feature Feature request label Oct 10, 2022
@makzzz1986
Author

Unfortunately, AWS EKS cluster API limits change dynamically: if your cluster has shrunk, the limits will be decreased as well.

@mcntrn

mcntrn commented Dec 3, 2022

Experienced a similar situation on small EKS clusters (3 EC2 nodes) that use Fargate exclusively to run hundreds of CronWorkflows simultaneously. Apparently, launching too many new nodes simultaneously can overload the EKS control plane and start throwing the too many requests error. Switching the CronWorkflows to run on EC2 instances seems to mitigate the issue (at least for us).

@watkinsmike

watkinsmike commented Apr 5, 2023

@makzzz1986 We are also facing this issue. Did you get any confirmation from AWS about the dynamic API limit resizing and what those thresholds are? What goes into calculating those limits: the number of nodes, the size of nodes, or both?

@makzzz1986
Author

makzzz1986 commented Apr 5, 2023

@watkinsmike I got confirmation from AWS that it can be an issue for small clusters. They also suspect that Argo Workflows sends multiple requests to the same resource at the same moment, which causes throttling. They are able to tweak the limit, and we are checking how it behaves. I suggest you open a support case about it and share its number; I will try to pass it on to the AWS team that owns this case.

@devjerry0

devjerry0 commented Nov 23, 2023

We have been seeing the same issue on a small cluster: 15 m5.2xlarge nodes (8 cores, 32 GB RAM each), with 10 workflows each using 1 core and 512 MB RAM. 8 of those failed with etcd throttling 🤷

@tooptoop4
Contributor

Did you solve this?
