What happened?
For one of our clients, one cluster's apiserver pods got overloaded (presumably by network throughput, but possibly also by memory). The telltale signs were:
- Lots of "context canceled" errors in the rest of the control-plane components (controller-manager, csi-driver, etc.) when issuing get/put requests to the apiserver.
- Lots of "context canceled" errors in the Kyverno admission controller running in the cluster.
- Many pods in CrashLoopBackOff, almost all of them talking to the apiserver, with the apiserver unable to respond in time.
- The konnectivity-server container in the apiserver pod emits a LOT of logs like "Receive channel from agent is full" (Ref).
- Memory (via VPA) for the apiserver had also shot up.
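A quick way to confirm these symptoms is to scan component logs for the error strings above. A minimal sketch (the pattern names, sample log lines, and kubectl invocation in the comment are illustrative assumptions, not from the incident):

```python
import re
from collections import Counter

# Telltale error patterns from the symptoms above. "cancelled" vs "canceled"
# spelling varies between components, so match both.
PATTERNS = {
    "context_canceled": re.compile(r"context cancell?ed", re.IGNORECASE),
    "konnectivity_channel_full": re.compile(r"Receive channel from agent is full"),
}

def count_symptoms(log_lines):
    """Count how often each telltale pattern appears in a log stream."""
    counts = Counter()
    for line in log_lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

# Illustrative log lines, e.g. collected via
# `kubectl -n cluster-<id> logs <apiserver-pod> -c konnectivity-server`:
sample = [
    "E0512 ... rpc error: code = Canceled desc = context canceled",
    "W0512 ... Receive channel from agent is full",
    "W0512 ... Receive channel from agent is full",
]
print(count_symptoms(sample))
```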
Now, the main issue here is that while all of this was true for one cluster, it impacted the apiservers of ALL the other clusters managed via this seed! All the other clusters also observed apiserver timeout errors, even though the network throughput increase was only on the one cluster.
Check the graph:
Expected behavior
One cluster's issues should not impact other clusters in any way.
How to reproduce the issue?
1. Create 3-4 user clusters.
2. Deploy a purposefully nasty operator that overwhelms the apiserver of one cluster.
3. Observe cluster operations in the other clusters, e.g. creating new workloads.
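For step 2, one way to approximate a "purposefully nasty" operator is a loop of full LIST requests against the target cluster's apiserver, which is expensive server-side. This is a rough sketch only (the server URL, token handling, worker counts, and request mix are assumptions; certificate verification is skipped, so run it only against a throwaway test cluster):

```python
import ssl
import threading
import urllib.request

def list_url(server, resource, namespace=None):
    """Build a LIST URL for a core-group (v1) resource; cluster-scoped when namespace is None."""
    base = f"{server}/api/v1"
    return f"{base}/namespaces/{namespace}/{resource}" if namespace else f"{base}/{resource}"

def hammer(server, token, resource="pods", workers=50, requests_per_worker=1000):
    """Spawn workers issuing repeated full LIST requests to drive apiserver load."""
    ctx = ssl._create_unverified_context()  # test clusters only: skips cert verification

    def worker():
        url = list_url(server, resource)
        for _ in range(requests_per_worker):
            req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
            try:
                urllib.request.urlopen(req, timeout=5, context=ctx).read()
            except Exception:
                pass  # errors/timeouts are expected once the apiserver saturates

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```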
How is your environment configured?
KKP version: 2.24.12
Shared or separate master/seed clusters?: shared master/seed as well as dedicated seeds; the issue here was seen on a dedicated seed.
Provide your KKP manifest here (if applicable)
# paste manifest here
What cloud provider are you running on?
vsphere
What operating system are you running in your user cluster?
Ubuntu 22.04

Queries used to plot the network utilization in the graph above:
A: sum by(namespace) (rate(container_network_receive_bytes_total{namespace=~"cluster-.*",pod=~"apiserver-.*"}[$__rate_interval]))
B: sum by (namespace) (rate(container_network_transmit_bytes_total{namespace=~"cluster-.*",pod=~"apiserver-.*"}[$__rate_interval]))
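These PromQL expressions can also be run directly against the seed's Prometheus HTTP API, to find which cluster namespace is driving the traffic. A sketch (the Prometheus URL is an assumption; the query is the receive-side one above, with a fixed 5m window substituted for Grafana's `$__rate_interval`):

```python
import json
import urllib.parse
import urllib.request

# Receive-side query from above, with $__rate_interval replaced by a fixed window.
RECEIVE_QUERY = (
    'sum by(namespace) (rate(container_network_receive_bytes_total'
    '{namespace=~"cluster-.*",pod=~"apiserver-.*"}[5m]))'
)

def build_query_url(prometheus_url, query):
    """Build a Prometheus instant-query URL (/api/v1/query) for a PromQL expression."""
    params = urllib.parse.urlencode({"query": query})
    return f"{prometheus_url}/api/v1/query?{params}"

def top_namespaces(prometheus_url, query=RECEIVE_QUERY, limit=5):
    """Return (namespace, bytes/s) pairs with the highest values, descending."""
    with urllib.request.urlopen(build_query_url(prometheus_url, query)) as resp:
        result = json.load(resp)["data"]["result"]
    result.sort(key=lambda sample: float(sample["value"][1]), reverse=True)
    return [(s["metric"]["namespace"], float(s["value"][1])) for s in result[:limit]]
```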
Additional information
More details are available in this internal Slack thread.