
[BUG] fleet-agent fails to deploy on an RKE2 Windows Custom cluster #39372

Closed · jameson-mcghee opened this issue on Oct 20, 2022 · 2 comments
Assignees: HarrisonWAffel
Labels: area/fleet, area/windows, kind/bug, release-note, team/fleet

@jameson-mcghee (Contributor)

Rancher Server Setup

  • Rancher version: v2.6.8 --> v2.7.0-rc4
  • Installation option (Docker install/Helm Chart): Helm
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE1
  • Proxy/Cert Details: N/A

Information about the Cluster

  • Kubernetes version: 1.23.10+rke2r1
  • Cluster Type (Local/Downstream): Downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): Custom Windows

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom) Standard User with Cluster Owner Role
    • If custom, define the set of permissions: N/A

Describe the bug
When performing an HA upgrade on an RKE2 Custom Windows cluster, the new pod created by the fleet-agent deployment fails to come up, causing the fleet-agent deployment itself to eventually time out and fail.

To Reproduce

  1. On an HA Rancher server v2.6.8, create an RKE2 Custom cluster with:
    • 3 Linux ETCD nodes
    • 2 Linux Control Plane nodes
    • 3 Linux Worker Nodes
    • 3 Windows Worker Nodes
  2. Perform an HA Upgrade from v2.6.8 --> v2.7.0-rc4
  3. Once the HA Upgrade is complete and the cluster returns to an Active state, navigate to the Cluster Explorer --> Deployments page

Result
Note that the fleet-agent deployment remains in an Updating state for a time, until it eventually goes into a Failed state. A new fleet-agent pod is generated with the image rancher/fleet-agent:v0.5.0-rc2, and the pod goes into a ContainerCreating/Waiting state and never completes. The Events log shows several warnings from the kubelet for the pod (see screenshots).

Expected Result
The new pod that is created and the fleet-agent deployment both become Active without error.

Screenshots
(Four screenshots attached, showing the failed fleet-agent deployment, the new pod stuck in ContainerCreating, and the kubelet warning events.)

@jameson-mcghee added the kind/bug, area/fleet, and team/hostbusters labels on Oct 20, 2022
@sowmyav27 added this to the v2.7.1 milestone on Oct 20, 2022
@sowmyav27 added the release-note label on Oct 20, 2022
@Sahota1225 modified the milestones: 2023-Q1-v2.7x, v2.7.2 on Nov 9, 2022
@Sahota1225 modified the milestones: v2.7.2, 2023-Q2-v2.7x on Dec 21, 2022
@sowmyav27 changed the title from "[BUG] fleet-agent deployment fails when performing HA upgrade on an RKE2 Windows Custom cluster" to "[BUG] fleet-agent fails to deploy on an RKE2 Windows Custom cluster" on Jan 7, 2023
@HarrisonWAffel self-assigned this on Mar 9, 2023
@aiyengar2 (Contributor)

This is most directly related to kubernetes/kubernetes#102849 and seems like an issue with the manifest used to deploy fleet-agent onto Windows nodes.

Given that this has been observed between Fleet 0.3.11 and 0.5.0-rc2, and given the upstream issue, I would assume the Windows regression was introduced by rancher/fleet@5c163dd, which added securityContext.runAsUser to the fleet-agent manifest and Helm chart. That setting is precisely why the chown operation is taking place, and it does not apply to Windows.
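
For reference, this is the pod-level securityContext in question (the same block appears in the full Deployment quoted further down). runAsUser and runAsGroup are Linux-specific identity fields; Windows containers use securityContext.windowsOptions.runAsUserName instead:

securityContext:
  runAsGroup: 1000
  runAsNonRoot: true
  runAsUser: 1000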

However, Fleet's own codebase adds nodeSelectors and tolerations specifically to deploy the agent only on Linux nodes, which means fleet-agent should not be installed onto Windows nodes in the first place.

TL;DR: the issue is a discrepancy between the fleet-agent Helm chart and what gets deployed during Manager-Initiated Registration by creating the objects here: Fleet only adds a toleration that allows the agent to run on Linux nodes, but does not add an affinity or nodeSelector that forces it to be scheduled only on Linux.

I can confirm this is the case simply by looking at the fleet-agent Deployment on any provisioned cluster (here taken from Rancher 2.7.0), which shows that securityContext.runAsUser is provided while nodeSelector is not set and the affinity does not contain anything Windows-related:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
    meta.helm.sh/release-name: <REDACTED>
    meta.helm.sh/release-namespace: cattle-fleet-system
    objectset.rio.cattle.io/applied: <REDACTED>
    objectset.rio.cattle.io/id: fleet-agent-bootstrap
  creationTimestamp: <REDACTED>
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
    objectset.rio.cattle.io/hash: <REDACTED>
  managedFields: <REDACTED>
  name: fleet-agent
  namespace: cattle-fleet-system
  resourceVersion: <REDACTED>
  uid: <REDACTED>
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: fleet-agent
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: fleet-agent
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: fleet.cattle.io/agent
                operator: In
                values:
                - "true"
            weight: 1
      containers:
      - env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: AGENT_SCOPE
        - name: CHECKIN_INTERVAL
          value: 0s
        - name: GENERATION
          value: bundle
        image: rancher/fleet-agent:v0.5.0
        imagePullPolicy: IfNotPresent
        name: fleet-agent
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 1000
      serviceAccount: fleet-agent
      serviceAccountName: fleet-agent
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node.cloudprovider.kubernetes.io/uninitialized
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: cattle.io/os
        operator: Equal
        value: linux
status:
  ...
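
For illustration only, a minimal sketch of the kind of scheduling constraint the pod template above is missing, assuming the standard kubernetes.io/os node label set by the kubelet (the actual fix belongs in Fleet's agent manifest generation and could equally be expressed as a required node affinity):

# hypothetical patch to the fleet-agent pod template, not the actual Fleet fix
spec:
  template:
    spec:
      # restrict scheduling to Linux nodes; kubernetes.io/os is populated by the kubelet
      nodeSelector:
        kubernetes.io/os: linux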

Since this is a pure Fleet issue and the fix has been diagnosed, I'm moving this issue back over to the Fleet team to address.

@aiyengar2 added the team/fleet and [zube]: To Triage labels and removed the team/hostbusters and [zube]: Next Up labels on Mar 29, 2023
@slickwarren (Contributor)

Closing in favor of the above-linked ticket.

@Sahota1225 removed this from the 2023-Q2-v2.7x milestone on Apr 4, 2023
The zube bot removed the [zube]: Done label on Jun 29, 2023