
[Bug] Pod.status subresource not working after upgrade from 1.9.4 #10182

Open
PhilippMT opened this issue May 6, 2024 · 4 comments
Labels
bug Something isn't working triage Default label assigned to all new issues indicating label curation is needed to fully organize.

PhilippMT commented May 6, 2024

Kyverno Version

1.12.0

Kubernetes Version

1.29.x

Kubernetes Platform

EKS

Kyverno Rule Type

Mutate

Description

Hello,

I just upgraded Kyverno from 1.9.4 to the latest 1.12.1 with a completely new installation. I have a ClusterPolicy that was working fine with 1.9.4 but is not working with 1.12.1; I also tried version 1.11.4, with the same result. The ClusterPolicy is triggered as soon as a pod of the datadog DaemonSet is in the ready state, and it mutates the node the pod runs on to remove a startup taint.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: datadog-agent-startup-taint
spec:
  failurePolicy: Fail
  rules:
    - name: remove-startup-taint
      match:
        any:
          - resources:
              kinds:
                - v1/Pod.status
              namespaces:
                - datadog
              names:
                - datadog-*
      preconditions:
        all:
          - key: '{{request.object.metadata.ownerReferences[0].kind}}'
            operator: Equals
            value: DaemonSet
          - key: '{{request.object.metadata.ownerReferences[0].name}}'
            operator: Equals
            value: datadog
          - key: "{{ to_string((request.object.status.containerStatuses[?name == 'agent'].ready)[0] || false ) }}"
            operator: Equals
            value: 'true'
          - key: '{{request.operation}}'
            operator: In
            value:
              - UPDATE
      # Mutates the target Node to remove the startup taint.
      mutate:
        targets:
          - apiVersion: v1
            kind: Node
            name: '{{request.object.spec.nodeName}}'
        patchStrategicMerge:
          spec:
            taints: "{{ target.spec.taints[?key != 'node.datadog.eu/agent-not-ready'] }}"

These are the values I deployed the Helm chart with:

    config:
      resourceFiltersIncludeNamespaces:
        - flux-system
        - karpenter
      resourceFiltersExclude:
        - '[Node,*,*]'
        - '[Node/*,*,*]'
    features:
      logging:
        format: json
    admissionController:
      rbac:
        clusterRole:
          extraResources:
            - apiGroups:
                - ''
              resources:
                - nodes
              verbs:
                - update
                - list
                - get
      replicas: 3
      priorityClassName: system-cluster-critical
      tolerations:
        - key: node.datadog.eu/agent-not-ready
          operator: Exists
          effect: NoSchedule
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
                - key: eks.amazonaws.com/nodegroup
                  operator: Exists
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
      podDisruptionBudget:
        enabled: true
      initContainer:
        resources:
          requests: {cpu: 10m, memory: 64Mi}
          limits: {cpu: 100m, memory: 256Mi}
      container:
        resources:
          requests: {cpu: 100m, memory: 128Mi}
          limits: {memory: 384Mi}
    backgroundController:
      rbac:
        clusterRole:
          extraResources:
            - apiGroups:
                - ''
              resources:
                - nodes
              verbs:
                - update
                - list
                - get
      replicas: 2
      priorityClassName: system-cluster-critical
      tolerations:
        - key: node.datadog.eu/agent-not-ready
          operator: Exists
          effect: NoSchedule
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
                - key: eks.amazonaws.com/nodegroup
                  operator: Exists
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
      podDisruptionBudget:
        enabled: true
      resources:
        requests: {cpu: 100m, memory: 64Mi}
        limits: {memory: 128Mi}
    cleanupController:
      rbac:
        clusterRole:
          extraResources: []
      replicas: 2
      priorityClassName: system-cluster-critical
      tolerations:
        - key: node.datadog.eu/agent-not-ready
          operator: Exists
          effect: NoSchedule
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
                - key: eks.amazonaws.com/nodegroup
                  operator: Exists
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
      podDisruptionBudget:
        enabled: true
      resources:
        requests: {cpu: 100m, memory: 64Mi}
        limits: {memory: 128Mi}
    reportsController:
      rbac:
        clusterRole:
          extraResources: []
      replicas: 2
      priorityClassName: system-cluster-critical
      tolerations:
        - key: node.datadog.eu/agent-not-ready
          operator: Exists
          effect: NoSchedule
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
                - key: eks.amazonaws.com/nodegroup
                  operator: Exists
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
      podDisruptionBudget:
        enabled: true
      resources:
        requests: {cpu: 100m, memory: 64Mi}
        limits: {memory: 128Mi}

The generated webhook configuration looks fine to me:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  annotations:
    admissions.enforcer/disabled: "true"
  creationTimestamp: "2024-05-06T11:08:32Z"
  generation: 11
  labels:
    webhook.kyverno.io/managed-by: kyverno
  name: kyverno-resource-validating-webhook-cfg
  resourceVersion: "1422173941"
  uid: b7d4d77e-0caf-441a-ba22-8ba4bcb4fa95
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUM3VENDQWRXZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFZTVJZd0ZBWURWUVFERE
    service:
      name: kyverno-svc
      namespace: kyverno
      path: /validate/fail
      port: 443
  failurePolicy: Fail
  matchPolicy: Equivalent
  name: validate.kyverno.svc-fail
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values:
      - kube-system
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values:
      - kyverno
  objectSelector: {}
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    - DELETE
    - CONNECT
    resources:
    - configmaps
    - pods
    - pods/ephemeralcontainers
    - pods/status
    scope: Namespaced
  - apiGroups:
    - apps
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    - DELETE
    - CONNECT
    resources:
    - daemonsets/status
    - deployments/status
    scope: Namespaced
  sideEffects: NoneOnDryRun
  timeoutSeconds: 10

The "Pod.status" subresource seems not to work at all anymore. I also created an example with generate rules that create ConfigMaps for UPDATE operations on "Deployment.status" and "DaemonSet.status", which works as expected, but nothing happens on "Pod.status" updates.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: check
spec:
  generateExisting: true
  rules:
  - name: generate-config-map
    match:
      any:
      - resources:
          kinds:
          - Pod.status
          operations:
          - UPDATE
    generate:
      kind: ConfigMap
      synchronize: true
      apiVersion: v1
      name: test-pod-abc
      namespace: kyverno
      data:
        data:
          test: |
            Hello World
  - name: generate-daemonset-config-map
    match:
      any:
      - resources:
          kinds:
          - DaemonSet.status
          operations:
          - UPDATE
    generate:
      kind: ConfigMap
      synchronize: true
      apiVersion: v1
      name: test-daemonset-abc
      namespace: kyverno
      data:
        data:
          test: |
            Hello World
  - name: generate-deployment-config-map
    match:
      any:
      - resources:
          kinds:
          - Deployment.status
          operations:
          - UPDATE
    generate:
      kind: ConfigMap
      synchronize: true
      apiVersion: v1
      name: test-deployment-abc
      namespace: kyverno
      data:
        data:
          test: |
            Hello World

Does anybody have any advice for me in case this is a misconfiguration?

Best regards
Philipp

Steps to reproduce

  1. Deploy the last ClusterPolicy "check"
  2. Manually delete the pod of any DaemonSet (not in kyverno or kube-system namespaces)
  3. Manually scale any Deployment (not in kyverno or kube-system namespaces)
  4. Wait for the pods to become ready

Result: The two ConfigMaps "test-daemonset-abc" and "test-deployment-abc" are created, but not the ConfigMap "test-pod-abc".

Expected behavior

I expect the ClusterPolicy to react to Pod.status UPDATE events.

Screenshots

No response

Kyverno logs

No error logs, and also no logs that give any indication that the ClusterPolicy was executed.

Slack discussion

No response

Troubleshooting

  • I have read and followed the documentation AND the troubleshooting guide.
  • I have searched other issues in this repository and mine is not recorded.
@PhilippMT PhilippMT added bug Something isn't working triage Default label assigned to all new issues indicating label curation is needed to fully organize. labels May 6, 2024

welcome bot commented May 6, 2024

Thanks for opening your first issue here! Be sure to follow the issue template!

@chipzoller
Member

  - key: '{{request.operation}}'
    operator: In
    value:
      - UPDATE

The In operator no longer functions as stated in the release notes for 1.12.0 here. Either move this to operations[] in the match block or specify it like:

  - key: '{{request.operation}}'
    operator: Equals
    value: UPDATE
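
For reference, the match-block form of the same restriction would look roughly like this (a sketch; the namespaces and names selectors from the original rule are left out):

  match:
    any:
      - resources:
          kinds:
            - v1/Pod.status
          operations:
            - UPDATE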

@PhilippMT
Author

PhilippMT commented May 9, 2024

The In operator no longer functions as stated in the release notes for 1.12.0 here. Either move this to operations[] in the match block or specify it like:

Hi,

Sorry, that was my fault. This is the ClusterPolicy I currently use:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: datadog-agent-startup-taint
spec:
  failurePolicy: Fail
  background: false
  webhookTimeoutSeconds: 30
  mutateExistingOnPolicyUpdate: false
  rules:
    - name: remove-startup-taint
      match:
        any:
          - resources:
              kinds:
                - v1/Pod.status
              namespaces:
                - datadog
              names:
                - datadog-*
      preconditions:
        all:
          - key: '{{request.object.metadata.ownerReferences[0].kind}}'
            operator: Equals
            value: DaemonSet
          - key: '{{request.object.metadata.ownerReferences[0].name}}'
            operator: Equals
            value: datadog
          - key: "{{ to_string((request.object.status.containerStatuses[?name == 'agent'].ready)[0] || false ) }}"
            operator: Equals
            value: 'true'
      # Mutates the target Node to remove the startup taint.
      mutate:
        targets:
          - apiVersion: v1
            kind: Node
            name: '{{request.object.spec.nodeName}}'
        patchStrategicMerge:
          spec:
            taints: "{{ target.spec.taints[?key != 'node.datadog.eu/agent-not-ready'] }}"

I also tried adding the UPDATE operation to the match block:

- resources:
      kinds:
        - v1/Pod.status
      namespaces:
        - datadog
      names:
        - datadog-*
      operations:
        - UPDATE

It is not working, even if I completely remove the whole preconditions block. It looks like Kyverno does not handle Pod.status updates at all. I also tried all kind variants: v1/Pod.status, v1/Pod/status, Pod.status, and Pod/status.

I also tried to remove all other selectors except for kinds and operations:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: datadog-agent-startup-taint
spec:
  admission: true
  background: false
  failurePolicy: Fail
  mutateExistingOnPolicyUpdate: false
  rules:
  - match:
      any:
      - resources:
          kinds:
          - v1/Pod.status
          operations:
          - UPDATE
    mutate:
      patchStrategicMerge:
        spec:
          taints: '{{ target.spec.taints[?key != ''node.datadog.eu/agent-not-ready'']
            }}'
      targets:
      - apiVersion: v1
        kind: Node
        name: '{{request.object.spec.nodeName}}'
    name: remove-startup-taint
    skipBackgroundRequests: true
  validationFailureAction: Audit
  webhookTimeoutSeconds: 30

I now get some error logs from the background-controller, even though background: false is set.

{
	"content": {
		"service": "background-controller",
		"message": "",
		"attributes": {
			"caller": "mutate/mutate.go:165",
			"level": "error",
			"resource": "v1/Pod/[namespace removed]/[podname removed]",
			"logger": {
				"name": "background"
			},
			"name": "ur-kdxxd",
			"error": "failed to mutate existing resource, rule remove-startup-taint, response error: : failed to substitute variables in target[0].Name {{request.object.spec.nodeName}}, value: <nil>, err: failed to resolve request.object.spec.nodeName at path : JMESPath query failed: Unknown key \"nodeName\" in path",
			"ts": "2024-05-09T18:22:53Z",
			"policy": "datadog-agent-startup-taint"
		}
	}
}

and

cannot generate events for empty target resource

I get these errors as long as the pod is pending and its status is updated, e.g. by the cluster-autoscaler. But as soon as the pod is scheduled and started, Kyverno does not handle any further Pod.status update events, e.g. when a container's state changes.
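
One way to avoid the nil nodeName errors on pending Pods would be an extra precondition that skips the rule until the Pod is scheduled. A minimal sketch, not part of the original policy, using a JMESPath || default so that a missing spec.nodeName resolves to an empty string instead of failing the substitution:

          - key: "{{ request.object.spec.nodeName || '' }}"
            operator: NotEquals
            value: ''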

@chipzoller
Member

@PhilippMT, remove the excludeGroups: system:nodes line in the Kyverno ConfigMap, or change the value to something other than system:nodes. This will probably fix the issue. Kyverno is filtering out requests from the system:nodes group, which is the group the kubelet belongs to, and the kubelet is responsible for updating Pod status.
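
If the Helm chart manages that ConfigMap, the corresponding value should be config.excludeGroups, added to the existing config block in the values above (a sketch, assuming the chart version in use exposes this value as recent chart versions do; its default includes system:nodes):

    config:
      excludeGroups: []   # default filters the system:nodes group; clearing it lets kubelet Pod.status updates reach Kyverno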
