What happened?
For one of our clients, one cluster's apiserver pods got overloaded (presumably by network throughput, but possibly also by memory). The telltale signs were:
- Lots of "context canceled" errors in the rest of the control-plane components (controller-manager, csi-driver, etc.) when issuing get/put requests to the apiserver.
- Lots of "context canceled" errors in the Kyverno admission controller running in the cluster.
- Many pods in CrashLoopBackOff, almost all of them talking to the apiserver, with the apiserver unable to respond in time.
- The konnectivity-server container in the apiserver pod emits a LOT of logs like "Receive channel from agent is full" (Ref).
- Memory (via VPA) for the apiserver had also shot up.
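A quick way to confirm these symptoms is to scan component logs for the error strings above. A minimal sketch (the pattern names, sample log lines, and kubectl invocation in the comment are illustrative assumptions, not from the incident):

```python
import re
from collections import Counter

# Telltale error patterns from the symptoms above. "cancelled" vs "canceled"
# spelling varies between components, so match both.
PATTERNS = {
    "context_canceled": re.compile(r"context cancell?ed", re.IGNORECASE),
    "konnectivity_channel_full": re.compile(r"Receive channel from agent is full"),
}

def count_symptoms(log_lines):
    """Count how often each telltale pattern appears in a log stream."""
    counts = Counter()
    for line in log_lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

# Illustrative log lines, e.g. collected via
# `kubectl -n cluster-<id> logs <apiserver-pod> -c konnectivity-server`:
sample = [
    "E0512 ... rpc error: code = Canceled desc = context canceled",
    "W0512 ... Receive channel from agent is full",
    "W0512 ... Receive channel from agent is full",
]
print(count_symptoms(sample))
```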
Now, the main issue here is that while all of this was true for one cluster, it impacted the apiservers of ALL the other clusters managed via this seed! All the other clusters also observed apiserver timeout errors, even though the network throughput increase was only on the one cluster.
Check the graph:
Expected behavior
One cluster's issues should not impact other clusters in any way.
How to reproduce the issue?
1. Create 3-4 user clusters.
2. Deploy a purposefully nasty operator that overwhelms the apiserver of one cluster.
3. Observe cluster operations in the other clusters, e.g. creating new workloads.
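For step 2, one way to approximate a "purposefully nasty" operator is a loop of full LIST requests against the target cluster's apiserver, which is expensive server-side. This is a rough sketch only (the server URL, token handling, worker counts, and request mix are assumptions; certificate verification is skipped, so run it only against a throwaway test cluster):

```python
import ssl
import threading
import urllib.request

def list_url(server, resource, namespace=None):
    """Build a LIST URL for a core-group (v1) resource; cluster-scoped when namespace is None."""
    base = f"{server}/api/v1"
    return f"{base}/namespaces/{namespace}/{resource}" if namespace else f"{base}/{resource}"

def hammer(server, token, resource="pods", workers=50, requests_per_worker=1000):
    """Spawn workers issuing repeated full LIST requests to drive apiserver load."""
    ctx = ssl._create_unverified_context()  # test clusters only: skips cert verification

    def worker():
        url = list_url(server, resource)
        for _ in range(requests_per_worker):
            req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
            try:
                urllib.request.urlopen(req, timeout=5, context=ctx).read()
            except Exception:
                pass  # errors/timeouts are expected once the apiserver saturates

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```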
How is your environment configured?
KKP version: 2.24.12
Shared or separate master/seed clusters?: shared master/seed as well as dedicated seeds; the issue here was seen on a dedicated seed.
Provide your KKP manifest here (if applicable)
# paste manifest here
What cloud provider are you running on?
vsphere
What operating system are you running in your user cluster?
Ubuntu 22.04

Queries used to plot the network utilization in the graph above:
A: sum by(namespace) (rate(container_network_receive_bytes_total{namespace=~"cluster-.*",pod=~"apiserver-.*"}[$__rate_interval]))
B: sum by (namespace) (rate(container_network_transmit_bytes_total{namespace=~"cluster-.*",pod=~"apiserver-.*"}[$__rate_interval]))
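These PromQL expressions can also be run directly against the seed's Prometheus HTTP API, to find which cluster namespace is driving the traffic. A sketch (the Prometheus URL is an assumption; the query is the receive-side one above, with a fixed 5m window substituted for Grafana's `$__rate_interval`):

```python
import json
import urllib.parse
import urllib.request

# Receive-side query from above, with $__rate_interval replaced by a fixed window.
RECEIVE_QUERY = (
    'sum by(namespace) (rate(container_network_receive_bytes_total'
    '{namespace=~"cluster-.*",pod=~"apiserver-.*"}[5m]))'
)

def build_query_url(prometheus_url, query):
    """Build a Prometheus instant-query URL (/api/v1/query) for a PromQL expression."""
    params = urllib.parse.urlencode({"query": query})
    return f"{prometheus_url}/api/v1/query?{params}"

def top_namespaces(prometheus_url, query=RECEIVE_QUERY, limit=5):
    """Return (namespace, bytes/s) pairs with the highest values, descending."""
    with urllib.request.urlopen(build_query_url(prometheus_url, query)) as resp:
        result = json.load(resp)["data"]["result"]
    result.sort(key=lambda sample: float(sample["value"][1]), reverse=True)
    return [(s["metric"]["namespace"], float(s["value"][1])) for s in result[:limit]]
```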
Additional information
More details are available in this internal Slack thread.