One apiserver can destabilize apiservers for all clusters in the given seed #13321

Open
dharapvj opened this issue Apr 22, 2024 · 1 comment
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@dharapvj
Contributor

What happened?

For one of our clients, one cluster's apiserver pods became overloaded (presumably on network throughput, though memory could also be the cause). The telltale signs were:

  1. Lots of context cancelled errors in the rest of the control-plane components (controller-manager, csi-driver, etc.) when sending get/put requests to the apiserver.
  2. Lots of context cancelled errors in the Kyverno admission controller running in the cluster.
  3. Many pods were in CrashLoopBackOff; almost all of them were trying to talk to the apiserver, which could not respond in time.
  4. The konnectivity-server container in the apiserver pod emitted a LOT of log lines like "Receive channel from agent is full" Ref
  5. Memory for the apiserver (as seen via VPA) had also shot up.
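
To compare how strongly sign 4 shows up in each cluster, the konnectivity-server log lines can be counted per cluster namespace. The following is a minimal sketch, not part of the original report: it assumes the logs have already been collected into lists of lines (for example from saved `kubectl logs` output), and the function name and the namespace names in the usage note are hypothetical.

```python
from collections import Counter

# Message text as quoted in the report above.
NEEDLE = "Receive channel from agent is full"


def count_full_channel_lines(lines_by_namespace):
    """Count log lines containing NEEDLE for each cluster namespace.

    lines_by_namespace maps a namespace name to an iterable of log lines
    taken from that cluster's konnectivity-server container.
    """
    counts = Counter()
    for namespace, lines in lines_by_namespace.items():
        counts[namespace] = sum(1 for line in lines if NEEDLE in line)
    return counts
```

Calling `count_full_channel_lines({"cluster-a": lines_a, "cluster-b": lines_b}).most_common()` lists the namespaces from noisiest to quietest, which helps separate the overloaded cluster from the ones that are only affected indirectly.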

The main issue is that although all of this was true for only one cluster, it impacted the apiservers of ALL the other clusters managed via this seed! The other clusters also saw apiserver timeout errors, even though the network throughput increased on only one cluster.

Check the graph:
[image: apiserver network receive and transmit throughput per cluster namespace]

Expected behavior

One cluster's issues should not impact other clusters in any way.

How to reproduce the issue?

  1. Create 3-4 user clusters.
  2. Deploy a deliberately abusive operator that overwhelms the apiserver of one cluster.
  3. Observe cluster operations, such as new workload creation, in the other clusters.
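
For step 3, it can help to measure the impact on the other clusters rather than eyeball it. The sketch below is an assumption about how one might do that, not something used in the original investigation: `probe` is a hypothetical helper that repeatedly invokes a caller-supplied `request` callable (for example, a small read against another cluster's apiserver) and records latencies and failures.

```python
import time


def probe(request, attempts=10, pause_s=0.0):
    """Call request() `attempts` times.

    Returns (latencies_s, error_count): the latency in seconds of every
    successful call, and the number of calls that raised an exception
    (timeouts included).
    """
    latencies = []
    errors = 0
    for _ in range(attempts):
        start = time.monotonic()
        try:
            request()
        except Exception:
            errors += 1
        else:
            latencies.append(time.monotonic() - start)
        if pause_s:
            time.sleep(pause_s)
    return latencies, errors
```

Running the same probe against an unaffected cluster before and after deploying the operator from step 2 gives a simple before/after comparison of latency and error count.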

How is your environment configured?

  • KKP version: 2.24.12
  • Shared or separate master/seed clusters?: Both a shared master/seed and dedicated seeds are in use; the issue was seen on a dedicated seed.

Provide your KKP manifest here (if applicable)

# paste manifest here

What cloud provider are you running on?

vsphere

What operating system are you running in your user cluster?

Ubuntu 22.04

Additional information

Further details are in this internal Slack thread.

@dharapvj dharapvj added the kind/bug Categorizes issue or PR as related to a bug. label Apr 22, 2024
@dharapvj
Contributor Author

Queries used to produce the network utilization graph above:

A: sum by (namespace) (rate(container_network_receive_bytes_total{namespace=~"cluster-.*",pod=~"apiserver-.*"}[$__rate_interval]))
B: sum by (namespace) (rate(container_network_transmit_bytes_total{namespace=~"cluster-.*",pod=~"apiserver-.*"}[$__rate_interval]))
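
Outside Grafana, `$__rate_interval` is not available, so the queries need a fixed range window. The following is a rough sketch of running query A against the Prometheus HTTP API (`/api/v1/query`) and ranking namespaces by throughput. The 5m window, the helper names, and the Prometheus base URL in the usage note are assumptions, not values from this setup.

```python
import json
import urllib.parse


def receive_rate_query(window="5m"):
    """Query A with the Grafana variable replaced by a fixed window (assumed 5m)."""
    return (
        "sum by (namespace) (rate(container_network_receive_bytes_total"
        '{namespace=~"cluster-.*",pod=~"apiserver-.*"}[' + window + "]))"
    )


def instant_query_url(base_url, query):
    """Build the URL for a Prometheus instant query."""
    params = urllib.parse.urlencode({"query": query})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def busiest_namespaces(response_body, top=3):
    """Parse an instant-query JSON response; return the top (namespace, bytes/s) pairs."""
    payload = json.loads(response_body)
    rows = [
        (sample["metric"].get("namespace", ""), float(sample["value"][1]))
        for sample in payload["data"]["result"]
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]
```

A typical use would be to fetch `instant_query_url("http://prometheus.example:9090", receive_rate_query())` with any HTTP client and pass the response body to `busiest_namespaces`. A single namespace far ahead of the rest would match the pattern described in this issue.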
