Events on CRDs cause full cluster discovery #523

Open
torfjor opened this issue May 25, 2023 · 0 comments

torfjor commented May 25, 2023

Hi!

We have a setup where a central admin cluster running Argo CD manages Applications on a fleet of workload clusters. The admin cluster connects to the workload clusters through Anthos Fleet and the Connect Gateway. The workload clusters are a mix of Anthos Bare Metal and GKE.

We ran into an issue where we hit the default Connect Gateway API quota with only two registered workload clusters and a handful of deployed Applications. Investigation showed that Argo CD was performing full API discovery requests on registered workload clusters multiple times per minute. Further investigation led us to this event loop in gitops-engine, where c.startMissingWatches() performs a non-cached discovery of the target cluster each time a CRD changes.
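To make the amplification concrete, here is a minimal sketch of the pattern described above. This is not the actual gitops-engine code; the structure and names are my own illustration. The point is only that every watch event on a CRD turns into a full, non-cached discovery call against the target cluster, so a CRD that gets patched several times per minute produces that many discovery round trips through the Connect Gateway:

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

// Illustrative sketch (not gitops-engine itself): a watch on CRDs where
// every event triggers full, non-cached API discovery of the cluster.
func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}

	dyn, err := dynamic.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	disco, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	crdGVR := schema.GroupVersionResource{
		Group:    "apiextensions.k8s.io",
		Version:  "v1",
		Resource: "customresourcedefinitions",
	}

	// Watch all CRDs in the cluster.
	w, err := dyn.Resource(crdGVR).Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}

	for ev := range w.ResultChan() {
		// On every CRD change (including the frequent patches from the
		// Backup for GKE addon-manager), re-run full API discovery.
		// ServerGroupsAndResources is not cached, so each call goes all
		// the way to the API server / Connect Gateway.
		_, _, discoErr := disco.ServerGroupsAndResources()
		log.Printf("CRD event %q -> full discovery (err=%v)", ev.Type, discoErr)
	}
}
```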

This turns out to be problematic for GKE clusters with Backup for GKE enabled, because the system-provided addon-manager will patch its CRDs very often:

[Screenshot (2023-05-25): the addon-manager repeatedly patching the Backup for GKE CRDs]

Looking at the Connect Gateway API traffic, you can see a sharp drop when we added a resource exclusion for gkebackup.gke.io/*:

[Chart: Connect Gateway API traffic by response code, showing the drop after the exclusion was added]
(PS: The last sudden spike was caused by us temporarily removing the resource exclusion.)

For our use case, excluding gkebackup.gke.io/* entirely is fine. We contacted Google Support about the issue, and they confirmed that the rapid patching of these CRDs is intended behaviour. Their immediate answer to the chatty nature of Argo CD was simply to raise the quota for affected customers.
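For anyone hitting the same problem, the exclusion is the standard resource.exclusions setting in the argocd-cm ConfigMap. The snippet below is a sketch of roughly what such an exclusion looks like; adjust the clusters matcher to your own setup:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Stop Argo CD from watching (and reacting to) anything in the
  # gkebackup.gke.io API group, on all managed clusters.
  resource.exclusions: |
    - apiGroups:
        - "gkebackup.gke.io"
      kinds:
        - "*"
      clusters:
        - "*"
```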

Writing up this issue because this behaviour might not be evident to people running Argo CD or Flux against GKE clusters unless they have good visibility into their API server traffic.

Possibly related:

Update 05-31-2023:

Just heard back from the Backup for GKE product team: a fix for the rapidly patched CRDs will be rolled out next week.
