Kubewarden controller potentially doing too many requests to apiserver #645

Closed · 1 task done
brunorene opened this issue Feb 20, 2024 · 12 comments · Fixed by #741
Labels: status/waiting-feedback

brunorene commented Feb 20, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I noticed that, compared to other operators, the kubewarden controller seems to send a large number of requests to the api server, probably to support reconciliation of the policy server. Based on information from my ElasticSearch instance, here are some stats:

  • 10 million log entries in 24 hours coming from the kubewarden controller
  • Of those, 2.5M are create requests for services, deployments and secrets called policy-server-default that always fail with 409 because those resources already exist. They are spread evenly, so it seems the kubewarden controller is constantly trying to create these 3 resources. Each of them (services, deployments and secrets) has about 5 log entries per second, all with a 409 status
  • Another big chunk is patch operations on deployments and configmaps, also called policy-server-default, which total 1.7M logs. Here we get 2-3 log lines per second for each deployment and configmap
  • Next are updates on services called policy-server-default, totalling 859K; these come in at roughly 1 per second
  • Finally, 5M of them are gets on configmaps, at 4-5 log lines per second

I think there is some room for improvement by slowing down some of these cycles because, in my opinion, it should be enough to send these requests every 2-3 seconds, which would dramatically decrease the load on the apiserver. Again, I don't have deep knowledge of the reasons for this cadence of requests, but maybe some of these request categories can be streamlined to take some load off the apiserver.

Thanks

Expected Behavior

I would expect a lower request load on the apiserver.

Steps To Reproduce

Analyse the apiserver logs collected in ElasticSearch - usually called audit logs.
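
The numbers above came from ElasticSearch queries over these audit logs; purely as an illustration, an equivalent aggregation over an audit log file could look like the sketch below (the log file path and the controller ServiceAccount username are assumptions, adjust them to your cluster):

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// auditEvent captures the few fields of an audit.k8s.io event we care about.
type auditEvent struct {
	Verb string `json:"verb"`
	User struct {
		Username string `json:"username"`
	} `json:"user"`
	ObjectRef struct {
		Resource string `json:"resource"`
		Name     string `json:"name"`
	} `json:"objectRef"`
	ResponseStatus struct {
		Code int `json:"code"`
	} `json:"responseStatus"`
}

func main() {
	// Assumption: the audit log is available locally in JSON-lines format.
	f, err := os.Open("audit.log")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Assumption: the controller ServiceAccount username; it differs per install
	// (e.g. cattle-kubewarden-system:rancher-kubewarden-controller on Rancher).
	const controllerSA = "system:serviceaccount:kubewarden:kubewarden-controller"

	counts := map[string]int{}
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024)
	for scanner.Scan() {
		var ev auditEvent
		if err := json.Unmarshal(scanner.Bytes(), &ev); err != nil {
			continue
		}
		if ev.User.Username != controllerSA {
			continue
		}
		// Tally verb + resource + object name + HTTP status code.
		key := fmt.Sprintf("%s %s/%s -> %d", ev.Verb, ev.ObjectRef.Resource, ev.ObjectRef.Name, ev.ResponseStatus.Code)
		counts[key]++
	}
	for k, v := range counts {
		fmt.Printf("%7d  %s\n", v, k)
	}
}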

Environment

- OS: Linux
- Architecture: amd64

Anything else?

Not really. Thanks


flavio commented Feb 23, 2024

I tried to reproduce the issue on a test cluster. I deployed the latest version of Kubewarden and enabled the default policies.

I didn't see any suspicious activity going on. There were regular WATCH and LIST events done by the controller to monitor the status of the watched resources, but nothing that invasive. Aside from that, the audit-scanner feature was generating some noise, but only when the CronJob was started.

How many policies do you have deployed, and are they all active? Is there by any chance some policy in a pending state, maybe because of a wrong configuration or because it points to a wasm module that cannot be fetched?

Is something making changes to the policy resources or to the PolicyServer resource?

It seems like the controller is trying to reconcile a state, something goes wrong, it does a couple of retries and then sleeps for X minutes. Then the cycle starts again 🤔

flavio added the status/waiting-feedback label on Feb 27, 2024

brunorene commented Feb 27, 2024

We have 22 active policies at the moment, each on its own policy server, so 22 policy servers


brunorene commented Feb 27, 2024

This is the number of apiserver calls coming from the kubewarden controller during 15 minutes:
[Screenshot: breakdown of apiserver calls over 15 minutes, 2024-02-27 22:57]


brunorene commented Feb 27, 2024

The final breakdown is 1 slot per policy server

[Screenshot: per-policy-server breakdown of apiserver calls, 2024-02-27 23:06]

Getting a configmap 1000 times in 15 minutes seems like a lot :)

brunorene commented

Before continuing with this, I'll check whether it could be something on our side too.


brunorene commented Feb 27, 2024

[Attachment: Untitled discover search.csv]

To be honest, looking at a 1-minute search with just a single policy server, it does not seem like a lot, but I'm not sure whether there is still room for improvement.

brunorene commented

[Attachment: Untitled discover search (1).csv.zip]

Here are 15 mins with all policy servers

brunorene changed the title from "Kubewarden controller potentially doing to many requests to apiserver" to "Kubewarden controller potentially doing too many requests to apiserver" on Mar 21, 2024

mpepping commented Apr 3, 2024

Seeing the same behavior with ~50 policies in a single policy-server instance. All requests originate from the cattle-kubewarden-system:rancher-kubewarden-controller SA.

  • 100+ requests/second, matching the characteristics described in this issue (lots of 409s and 200s)
  • 0-10 requests per second when all policies are deleted
  • 0-10 requests per second when loading policies in chunks of 5-10 policies
  • 100+ requests/second when a certain combination or order of policies is loaded, although the combinations are not fixed

I now have a scenario where I can reproduce the issue: with 11 policies loaded (0-10 reqs/sec), when the 12th policy is loaded, requests surge to 100+ per second. When that 12th policy is deleted, requests drop back to 0-10 per second.

The downside is that it seems this can be caused by multiple combinations of policies. In this case, the policy triggering the issue is:

apiVersion: policies.kubewarden.io/v1
kind: ClusterAdmissionPolicy
metadata:
  annotations:
    io.kubewarden.policy.category: PSP
    io.kubewarden.policy.description: Pod Security Policy that controls usage of volumes
    io.kubewarden.policy.severity: medium
    io.kubewarden.policy.title: cluster-volumes policy in monitor for namespaces
    meta.helm.sh/release-name: kubewarden-policies
    meta.helm.sh/release-namespace: cattle-kubewarden-system
  creationTimestamp: "2024-04-02T07:47:03Z"
  finalizers:
  - kubewarden
  generation: 1
  labels:
    app.kubernetes.io/component: policy
    app.kubernetes.io/instance: kubewarden-policies
  name: cluster-volumes-monitor-namespaces
  resourceVersion: "23286973"
  uid: 3771d03b-59fa-4df8-9c38-611455909336
spec:
  backgroundAudit: true
  mode: monitor
  module: example.com/kubewarden/policies/volumes-psp:v0.1.11
  mutating: false
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values:
      - cattle-kubewarden-system
      - cattle-system
      - kube-node-lease
      - kube-system
    - key: kubernetes.io/metadata.name
      operator: In
      values:
      - foo
      - bar
  policyServer: default
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    resources:
    - pods
  settings:
    allowedTypes:
    - configMap
    - downwardAPI
    - emptyDir
    - persistentVolumeClaim
    - secret
    - projected
  timeoutSeconds: 10


flavio commented Apr 3, 2024

Thanks, this is definitely one of the issues we will look into during the next month

flavio added this to the 1.13 milestone on Apr 19, 2024

mpepping commented May 6, 2024

Over the course of one hour we see a single clusterAdmissionPolicy being modified ~600 times, with the following status.conditions values:

[Screenshot: status.conditions values observed on the modified policy, 2024-05-06 13:12]

A full audit event for one of the MODIFIED events:

{
  "type": "MODIFIED",
  "object": {
    "kind": "ClusterAdmissionPolicy",
    "metadata": {
      "uid": "dde481c0-9207-4916-9ede-782dba6c823e",
      "annotations": {
        "io.kubewarden.policy.title": "cluster-disallow-secret-in-env policy in monitor for namespaces",
        "meta.helm.sh/release-name": "kubewarden-policies",
        "meta.helm.sh/release-namespace": "cattle-kubewarden-system",
        "io.kubewarden.policy.category": "PSP",
        "io.kubewarden.policy.description": "Policy that inspects env vars and rejects a request if a secret was found",
        "io.kubewarden.policy.severity": "medium"
      },
      "labels": {
        "app.kubernetes.io/component": "policy",
        "app.kubernetes.io/instance": "kubewarden-policies",
        "app.kubernetes.io/managed-by": "Helm"
      },
      "name": "cluster-disallow-secret-in-env-monitor-namespaces"
    },
    "status": {
      "mode": "monitor",
      "policyStatus": "active",
      "conditions": [
        {
          "status": "False",
          "type": "PolicyActive",
          "lastTransitionTime": "2024-05-06T11:08:51Z",
          "message": "The policy webhook has not been created",
          "reason": "PolicyActive"
        },
        {
          "lastTransitionTime": "2024-05-06T11:08:51Z",
          "message": "The latest replica set is not uniquely reachable",
          "reason": "LatestReplicaSetIsNotUniquelyReachable",
          "status": "False",
          "type": "PolicyUniquelyReachable"
        },
        {
          "message": "Configuration for this policy is up to date",
          "reason": "ConfigurationVersionMatch",
          "status": "True",
          "type": "PolicyServerConfigurationUpToDate",
          "lastTransitionTime": "2024-04-23T18:30:45Z"
        }
      ]
    },
    "apiVersion": "policies.kubewarden.io/v1"
  }
}

It seems something is modifying the admission policies once every 10 seconds. Hope this provides some clues for finding the cause.
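
The MODIFIED events above look like output from a watch on the ClusterAdmissionPolicy resources; purely as an illustration, a minimal watch with client-go's dynamic client (the kubeconfig location below is an assumption) is enough to observe the ~10-second cadence:

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: use the kubeconfig from the default home location.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// ClusterAdmissionPolicy is cluster-scoped, group policies.kubewarden.io/v1.
	gvr := schema.GroupVersionResource{
		Group:    "policies.kubewarden.io",
		Version:  "v1",
		Resource: "clusteradmissionpolicies",
	}
	w, err := client.Resource(gvr).Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	// Print a timestamp per event so the modification cadence becomes visible.
	for ev := range w.ResultChan() {
		fmt.Printf("%s %s\n", time.Now().Format(time.RFC3339), ev.Type)
	}
}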

@jvanz
Copy link
Member

jvanz commented May 7, 2024

I've been thinking about this information on the policy status updates. My gut feeling tells me that we do have an issue in the policy reconciliation, not only because of the information shared here, but also because when we run the integration tests we see a lot of errors like this:

  2024-05-07T18:14:15-03:00     ERROR   Reconciler error        {"controller": "admissionpolicy", "controllerGroup": "policies.kubewarden.io", "controllerKind": "AdmissionPolicy", "AdmissionPolicy": {"name":"validating-policy-yu57kw2a","namespace":"admission-policy-controller-test"}, "namespace": "admission-policy-controller-test", "name": "validating-policy-yu57kw2a", "reconcileID": "7a775ee2-1688-456a-976a-1a25340423af", "error": "update admission policy status error: Operation cannot be fulfilled on admissionpolicies.policies.kubewarden.io \"validating-policy-yu57kw2a\": the object has been modified; please apply your changes to the latest version and try again"}

I'm not sure if this is the only cause of the excessive requests sent to the API server, but it looks to me like the controller has a bug in the policy reconciliation, because many times when the controller tries to update the status, the resource is not at the latest version. I still do not know why that happens; maybe too many reconciliation requests are re-enqueued? I do not know, I have not spent time on this yet. Just sharing a thought here so we don't lose it. ;)
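
As a side note, the usual mitigation for this specific error is to re-read the object and retry the status update when the API server reports a conflict. A sketch of that pattern using client-go's retry helper is below; the import path, type and field names are assumptions for illustration, not the controller's actual code:

package controllers

import (
	"context"

	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Assumption: import path of the Kubewarden policy API types.
	policiesv1 "github.com/kubewarden/kubewarden-controller/api/policies/v1"
)

// updatePolicyStatus is an illustrative sketch: it re-fetches the policy and
// retries the status update whenever the API server returns an
// optimistic-concurrency conflict (HTTP 409), instead of failing the whole
// reconciliation and re-enqueueing it.
func updatePolicyStatus(ctx context.Context, c client.Client, key client.ObjectKey) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var policy policiesv1.AdmissionPolicy
		if err := c.Get(ctx, key, &policy); err != nil {
			return err
		}
		// Illustrative only: real code would compute the desired status here.
		policy.Status.PolicyStatus = "active"
		return c.Status().Update(ctx, &policy)
	})
}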


jvanz commented May 27, 2024

Okay, let me share some thoughts I had while working on this issue during the past day.

First of all, sorry for taking so long to give initial feedback; busy days with personal stuff. But let's get back to business....

I think the comments in this issue point to 3 possible improvements in our controller. They are:

  1. Reduce the number of requests returning 409 errors. This should be mitigated in the PR, which uses the controller-runtime helper functions to decide whether a resource should be created or updated/patched (see the sketch after this list). This may slightly increase the number of get requests, but I do not expect that to be an issue because most of the resources are cached, so there are no additional hits on the API server.
  2. Reduce the number of get requests for configmaps. This is indeed an issue. Our controller gets the configmaps many times in the reconciliation loops, because both the policy and the policy server reconciliation loops need that information to properly set the policy status and so on. However, we can mitigate this issue by enabling the cache for this resource again. We did that in this PR.
  3. Item 3 is related to the second one, but it will require a more careful change. The number of requests sent to update policy statuses is big. This is a result of how the controller works: the policy reconciler waits for the policy server reconciler to finish its reconciliation loop before it can properly set the policy status. Therefore, the policy reconcile loop gets a lot of resources (which is why it is related to item 2) and sets the policy status multiple times while waiting for the policy server to be ready to answer requests. To improve that, we need to reorganize the controller reconciliation loops and try to simplify them, which will require bigger changes. Therefore, I'll open another issue to first investigate whether there is another way to mitigate this, or whether it is possible at all and, if it is, fix it.
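
For reference, a minimal sketch of the controller-runtime helper mentioned in item 1 (the namespace and labels below are illustrative assumptions): controllerutil.CreateOrUpdate reads the object first, typically from the controller cache, and only issues a create or an update request when the desired state actually changed, instead of the blind creates that come back with 409 Conflict.

package controllers

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// reconcileDeployment is an illustrative sketch of the create-or-update
// pattern: the helper gets the Deployment (served from the cache when it is
// enabled for the type) and only talks to the API server when the mutate
// function changes something.
func reconcileDeployment(ctx context.Context, c client.Client) error {
	deployment := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "policy-server-default",
			Namespace: "kubewarden", // assumption: the controller namespace
		},
	}
	_, err := controllerutil.CreateOrUpdate(ctx, c, deployment, func() error {
		// Set only the fields owned by the controller; labels here are examples.
		if deployment.Labels == nil {
			deployment.Labels = map[string]string{}
		}
		deployment.Labels["app.kubernetes.io/component"] = "policy-server"
		return nil
	})
	return err
}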
