
[BUG] fleet-agent fails to deploy on an RKE2 Windows Custom cluster #39372

Closed · jameson-mcghee opened this issue on Oct 20, 2022 · 2 comments
Assignees: HarrisonWAffel
Labels: area/fleet, area/windows, kind/bug, release-note, team/fleet

@jameson-mcghee (Contributor)

Rancher Server Setup

  • Rancher version: v2.6.8 --> v2.7.0-rc4
  • Installation option (Docker install/Helm Chart): Helm
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE1
  • Proxy/Cert Details: N/A

Information about the Cluster

  • Kubernetes version: 1.23.10+rke2r1
  • Cluster Type (Local/Downstream): Downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): Custom Windows

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom) Standard User with Cluster Owner Role
    • If custom, define the set of permissions: N/A

Describe the bug
When performing an HA upgrade on an RKE2 Custom Windows cluster, the new pod created by the fleet-agent deployment fails to come up, causing the fleet-agent deployment itself to eventually time out and fail.

To Reproduce

  1. On an HA Rancher server v2.6.8, create an RKE2 Custom cluster with:
    • 3 Linux ETCD nodes
    • 2 Linux Control Plane nodes
    • 3 Linux Worker Nodes
    • 3 Windows Worker Nodes
  2. Perform an HA Upgrade from v2.6.8 --> v2.7.0-rc4
  3. Once the HA Upgrade is complete and the cluster returns to an Active state, navigate to the Cluster Explorer --> Deployments page

Result
Note that the fleet-agent deployment remains in an Updating state for a time, until it eventually goes into a Failed state. A new fleet-agent pod is generated with the image rancher/fleet-agent:v0.5.0-rc2, and the pod goes into a ContainerCreating/Waiting state and never completes. The Events log shows several warnings from the kubelet for the pod (see screenshots).

Expected Result
The new pod that is created and the fleet-agent deployment both become Active without error.

Screenshots
(Four screenshots attached, showing the failed fleet-agent deployment, the new pod stuck in ContainerCreating, and the kubelet warning events.)

@jameson-mcghee added the kind/bug, area/fleet, and team/hostbusters labels on Oct 20, 2022
@sowmyav27 added this to the v2.7.1 milestone on Oct 20, 2022
@sowmyav27 added the release-note label on Oct 20, 2022
@Sahota1225 modified the milestones: 2023-Q1-v2.7x, v2.7.2 on Nov 9, 2022
@Sahota1225 modified the milestones: v2.7.2, 2023-Q2-v2.7x on Dec 21, 2022
@sowmyav27 changed the title from "[BUG] fleet-agent deployment fails when performing HA upgrade on an RKE2 Windows Custom cluster" to "[BUG] fleet-agent fails to deploy on an RKE2 Windows Custom cluster" on Jan 7, 2023
@HarrisonWAffel self-assigned this on Mar 9, 2023
@aiyengar2 (Contributor)

This is most directly related to kubernetes/kubernetes#102849 and seems like an issue with the manifest used to deploy fleet-agent onto Windows nodes.

Given that this has been observed between Fleet 0.3.11 and 0.5.0-rc2, and given the upstream issue, I would assume the Windows regression was introduced by rancher/fleet@5c163dd, which added securityContext.runAsUser to the fleet-agent manifest and Helm chart. That setting is precisely why the chown operation is taking place, and it does not apply to Windows.
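
For reference, this is the pod-level securityContext in question (the same block appears in the full Deployment quoted further down). runAsUser and runAsGroup are Linux-specific identity fields; Windows containers use securityContext.windowsOptions.runAsUserName instead:

securityContext:
  runAsGroup: 1000
  runAsNonRoot: true
  runAsUser: 1000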

However, Fleet's own codebase adds nodeSelectors and tolerations specifically to deploy the agent only on Linux nodes, which means fleet-agent should not be installed onto Windows nodes in the first place.

TL;DR: the issue is a discrepancy between the fleet-agent Helm chart and what gets deployed during Manager-Initiated Registration by creating the objects here: Fleet only adds a toleration that allows the agent to run on Linux nodes, but does not add an affinity or nodeSelector that forces it to be scheduled only on Linux.

I can confirm this is the case simply by looking at the fleet-agent Deployment on any provisioned cluster (here taken from Rancher 2.7.0), which shows that securityContext.runAsUser is provided while nodeSelector is not set and the affinity does not contain anything Windows-related:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
    meta.helm.sh/release-name: <REDACTED>
    meta.helm.sh/release-namespace: cattle-fleet-system
    objectset.rio.cattle.io/applied: <REDACTED>
    objectset.rio.cattle.io/id: fleet-agent-bootstrap
  creationTimestamp: <REDACTED>
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
    objectset.rio.cattle.io/hash: <REDACTED>
  managedFields: <REDACTED>
  name: fleet-agent
  namespace: cattle-fleet-system
  resourceVersion: <REDACTED>
  uid: <REDACTED>
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: fleet-agent
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: fleet-agent
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: fleet.cattle.io/agent
                operator: In
                values:
                - "true"
            weight: 1
      containers:
      - env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: AGENT_SCOPE
        - name: CHECKIN_INTERVAL
          value: 0s
        - name: GENERATION
          value: bundle
        image: rancher/fleet-agent:v0.5.0
        imagePullPolicy: IfNotPresent
        name: fleet-agent
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 1000
      serviceAccount: fleet-agent
      serviceAccountName: fleet-agent
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node.cloudprovider.kubernetes.io/uninitialized
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: cattle.io/os
        operator: Equal
        value: linux
status:
  ...
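
For illustration only, a minimal sketch of the kind of scheduling constraint the pod template above is missing, assuming the standard kubernetes.io/os node label set by the kubelet (the actual fix belongs in Fleet's agent manifest generation and could equally be expressed as a required node affinity):

# hypothetical patch to the fleet-agent pod template, not the actual Fleet fix
spec:
  template:
    spec:
      # restrict scheduling to Linux nodes; kubernetes.io/os is populated by the kubelet
      nodeSelector:
        kubernetes.io/os: linux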

Since this is a pure Fleet issue and the fix has been diagnosed, I'm moving this issue back over to the Fleet team to address.

@aiyengar2 added the team/fleet and [zube]: To Triage labels and removed the team/hostbusters and [zube]: Next Up labels on Mar 29, 2023
@slickwarren (Contributor)

Closing in favor of the above-linked ticket.

@Sahota1225 removed this from the 2023-Q2-v2.7x milestone on Apr 4, 2023
The zube bot removed the [zube]: Done label on Jun 29, 2023