
[WIP] Default to Azure Linux images #4832

Open
wants to merge 1 commit into base: main

Conversation

@mboersma (Contributor) commented May 10, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Changes the default node image selection logic to prefer Azure Linux (aka Mariner) images, falling back to Ubuntu if no AL image is found for the Kubernetes version required.

This should speed up provisioning a bit, as well as align CAPZ better with Azure service recommendations and security initiatives.
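
For reference, here is a minimal, runnable sketch of the selection flow described above, not the PR's actual diff: GetDefaultAzureLinuxImage mirrors the new helper this PR adds, while the standalone function names, SKU strings, and the version check below are illustrative stand-ins.

package main

import (
	"context"
	"errors"
	"fmt"
)

// Image stands in for CAPZ's image type in this sketch.
type Image struct{ SKU string }

var errNotFound = errors.New("no image found for this Kubernetes version")

// getDefaultAzureLinuxImage mimics the new Azure Linux lookup; here we pretend
// Azure Linux images are only published for one Kubernetes version.
func getDefaultAzureLinuxImage(_ context.Context, _, k8sVersion string) (*Image, error) {
	if k8sVersion == "v1.29.4" {
		return &Image{SKU: "mariner-2-gen2"}, nil
	}
	return nil, errNotFound
}

// getDefaultUbuntuImage mimics the existing Ubuntu lookup used as the fallback.
func getDefaultUbuntuImage(_ context.Context, _, _ string) (*Image, error) {
	return &Image{SKU: "ubuntu-2204-gen2"}, nil
}

// defaultImage prefers Azure Linux and falls back to Ubuntu when no Azure
// Linux image exists for the requested Kubernetes version.
func defaultImage(ctx context.Context, location, k8sVersion string) (*Image, error) {
	// First try Azure Linux, then Ubuntu.
	if img, err := getDefaultAzureLinuxImage(ctx, location, k8sVersion); err == nil {
		return img, nil
	}
	return getDefaultUbuntuImage(ctx, location, k8sVersion)
}

func main() {
	img, _ := defaultImage(context.Background(), "westus3", "v1.27.3")
	fmt.Println(img.SKU) // prints the Ubuntu fallback SKU for this version
}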

Which issue(s) this PR fixes:

Fixes #4828
See also kubernetes-sigs/image-builder#1465

Special notes for your reviewer:

  • cherry-pick candidate

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests – this does need some for the new funcs!
  • update the "mariner-2" string after republishing recent images

Release note:

Default to Azure Linux images

@k8s-ci-robot added the release-note, kind/feature, and cncf-cla: yes labels May 10, 2024
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from mboersma. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/M label May 10, 2024
@mboersma (Contributor, Author)

/retitle [WIP] Default to Azure Linux images

This needs new unit tests and has at least one //TODO: as noted in the description. Also needs to pass the -optional tests and get some miles on it.

@k8s-ci-robot changed the title from "Default to Azure Linux images" to "[WIP] Default to Azure Linux images" May 10, 2024
@k8s-ci-robot added the do-not-merge/work-in-progress label May 10, 2024

codecov bot commented May 10, 2024

Codecov Report

Attention: Patch coverage is 14.28571% with 30 lines in your changes missing coverage. Please review.

Project coverage is 62.08%. Comparing base (aa536cc) to head (1b5fbec).
Report is 4 commits behind head on main.

Files Patch % Lines
azure/services/virtualmachineimages/images.go 9.09% 30 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4832      +/-   ##
==========================================
- Coverage   62.19%   62.08%   -0.11%     
==========================================
  Files         201      201              
  Lines       16878    16910      +32     
==========================================
+ Hits        10497    10499       +2     
- Misses       5591     5621      +30     
  Partials      790      790              


defer done()

// First try Azure Linux, then Ubuntu.
defaultImage, err := s.GetDefaultAzureLinuxImage(ctx, location, k8sVersion)

@mboersma (Contributor, Author) commented on the diff above:

This change can add a couple of API calls to this common code path if it has to fall back to Ubuntu. But they all ultimately call getSKUAndVersion which implements a cache, so in practice it shouldn't cause many new round-trips.
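
To illustrate why the fallback adds few round-trips, here is a rough sketch of the memoization pattern being referred to; the real getSKUAndVersion lives in the virtualmachineimages service and its key, types, and locking differ, so the names and values below are hypothetical.

package main

import (
	"context"
	"fmt"
	"sync"
)

// skuCache memoizes resolved SKU/version strings per (location, k8sVersion),
// so a fallback path that repeats the lookup does not hit the Azure API twice.
// This is an illustrative stand-in, not CAPZ's implementation.
type skuCache struct {
	mu      sync.Mutex
	entries map[string]string
}

func (c *skuCache) getSKUAndVersion(ctx context.Context, location, k8sVersion string,
	fetch func(context.Context) (string, error)) (string, error) {
	key := location + "/" + k8sVersion
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.entries[key]; ok {
		return v, nil // cache hit: no round-trip
	}
	v, err := fetch(ctx) // cache miss: one API call
	if err != nil {
		return "", err
	}
	c.entries[key] = v
	return v, nil
}

func main() {
	calls := 0
	fetch := func(_ context.Context) (string, error) {
		calls++ // pretend this is a VM image list call to Azure
		return "mariner-2-gen2/2024.05.10", nil
	}
	c := &skuCache{entries: map[string]string{}}
	c.getSKUAndVersion(context.Background(), "westus3", "v1.30.0", fetch)
	c.getSKUAndVersion(context.Background(), "westus3", "v1.30.0", fetch)
	fmt.Println("API calls:", calls) // 1, despite two lookups
}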

@k8s-ci-robot added the size/L label and removed the size/M label May 17, 2024
@mboersma (Contributor, Author) commented Jun 3, 2024

The current problem with this PR is that Calico never comes all the way up. The control-plane node and the first worker node get all their Calico pods running, but the calico-node pod on any subsequent worker node gets stuck. No logs are emitted, and kubectl describe pod looks like this:

% KUBECONFIG=k.conf kubectl logs -n calico-system calico-node-cwqkn
Defaulted container "calico-node" out of: calico-node, flexvol-driver (init), install-cni (init)
Error from server: Get "https://10.1.0.4:10250/containerLogs/calico-system/calico-node-cwqkn/calico-node": dial tcp 10.1.0.4:10250: i/o timeout
?1 cluster-api-provider-azure % KUBECONFIG=k.conf kubectl describe pod -n calico-system calico-node-cwqkn
Name:                 calico-node-cwqkn
Namespace:            calico-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      calico-node
Node:                 default-12994-md-0-6g6kz-sc4lz/10.1.0.4
Start Time:           Mon, 03 Jun 2024 11:06:25 -0600
Labels:               app.kubernetes.io/name=calico-node
                      controller-revision-hash=5c8fc7b67d
                      k8s-app=calico-node
                      pod-template-generation=1
Annotations:          hash.operator.tigera.io/cni-config: 1d49cc679bcf7605c0da8c68a653470b79889bb3
                      hash.operator.tigera.io/system: bb4746872201725da2dea19756c475aa67d9c1e9
                      hash.operator.tigera.io/tigera-ca-private: 0e93a8ddcb650aeeaa893b4ce2186dfcd00d2c82
Status:               Running
IP:                   10.1.0.4
IPs:
  IP:           10.1.0.4
Controlled By:  DaemonSet/calico-node
Init Containers:
  flexvol-driver:
    Container ID:    containerd://40d1e4414a38a2d1027024f52c1b652ff9444ac4ab6c67c9091708486f7106cc
    Image:           mcr.microsoft.com/oss/calico/pod2daemon-flexvol:v3.26.1
    Image ID:        mcr.microsoft.com/oss/calico/pod2daemon-flexvol@sha256:7e51c338e4201975ee34610c15ae5c303fbefe98b40528a2ff22758de376936d
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    State:           Terminated
      Reason:        Completed
      Exit Code:     0
      Started:       Mon, 03 Jun 2024 11:06:34 -0600
      Finished:      Mon, 03 Jun 2024 11:06:34 -0600
    Ready:           True
    Restart Count:   0
    Environment:     <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8fzhb (ro)
  install-cni:
    Container ID:    containerd://9c12174c1c8d4f76caabf03be6a8814f3f0d0f67fe65306183a990067bf9fcca
    Image:           mcr.microsoft.com/oss/calico/cni:v3.26.1
    Image ID:        mcr.microsoft.com/oss/calico/cni@sha256:7eb740f75b78c3614ab31cc8dd8a40e270acb23c9ac6a82faa7d8427fbd2a35e
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    Command:
      /opt/cni/bin/install
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 03 Jun 2024 11:07:34 -0600
      Finished:     Mon, 03 Jun 2024 11:07:38 -0600
    Ready:          True
    Restart Count:  1
    Environment:
      CNI_CONF_NAME:            10-calico.conflist
      SLEEP:                    false
      CNI_NET_DIR:              /etc/cni/net.d
      CNI_NETWORK_CONFIG:       <set to the key 'config' of config map 'cni-config'>  Optional: false
      KUBERNETES_SERVICE_HOST:  10.96.0.1
      KUBERNETES_SERVICE_PORT:  443
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8fzhb (ro)
Containers:
  calico-node:
    Container ID:    containerd://1e30ff72f8aed8a3cd6b5d161b0e7ce1d8a0599257ac8418a2adca53fa004fa4
    Image:           mcr.microsoft.com/oss/calico/node:v3.26.1
    Image ID:        mcr.microsoft.com/oss/calico/node@sha256:e3cacb61880218016d18dda7c63801610face22fc0bd39bdedb9d975a7963b11
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    State:           Running
      Started:       Mon, 03 Jun 2024 11:07:46 -0600
    Ready:           False
    Restart Count:   0
    Liveness:        http-get http://localhost:9099/liveness delay=0s timeout=10s period=10s #success=1 #failure=3
    Readiness:       exec [/bin/calico-node -felix-ready] delay=0s timeout=5s period=10s #success=1 #failure=3
    Environment:
      DATASTORE_TYPE:                      kubernetes
      WAIT_FOR_DATASTORE:                  true
      CLUSTER_TYPE:                        k8s,operator
      CALICO_DISABLE_FILE_LOGGING:         false
      FELIX_DEFAULTENDPOINTTOHOSTACTION:   ACCEPT
      FELIX_HEALTHENABLED:                 true
      FELIX_HEALTHPORT:                    9099
      NODENAME:                             (v1:spec.nodeName)
      NAMESPACE:                           calico-system (v1:metadata.namespace)
      FELIX_TYPHAK8SNAMESPACE:             calico-system
      FELIX_TYPHAK8SSERVICENAME:           calico-typha
      FELIX_TYPHACAFILE:                   /etc/pki/tls/certs/tigera-ca-bundle.crt
      FELIX_TYPHACERTFILE:                 /node-certs/tls.crt
      FELIX_TYPHAKEYFILE:                  /node-certs/tls.key
      FIPS_MODE_ENABLED:                   false
      FELIX_TYPHACN:                       typha-server
      CALICO_MANAGE_CNI:                   true
      CALICO_IPV4POOL_CIDR:                192.168.0.0/16
      CALICO_IPV4POOL_VXLAN:               Always
      CALICO_IPV4POOL_BLOCK_SIZE:          26
      CALICO_IPV4POOL_NODE_SELECTOR:       all()
      CALICO_IPV4POOL_DISABLE_BGP_EXPORT:  false
      FELIX_VXLANMTU:                      1350
      FELIX_WIREGUARDMTU:                  1350
      CALICO_NETWORKING_BACKEND:           vxlan
      IP:                                  autodetect
      IP_AUTODETECTION_METHOD:             first-found
      IP6:                                 none
      FELIX_IPV6SUPPORT:                   false
      KUBERNETES_SERVICE_HOST:             10.96.0.1
      KUBERNETES_SERVICE_PORT:             443
    Mounts:
      /etc/pki/tls/cert.pem from tigera-ca-bundle (ro,path="ca-bundle.crt")
      /etc/pki/tls/certs from tigera-ca-bundle (ro)
      /host/etc/cni/net.d from cni-net-dir (rw)
      /lib/modules from lib-modules (ro)
      /node-certs from node-certs (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/calico from var-lib-calico (rw)
      /var/log/calico/cni from cni-log-dir (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/nodeagent from policysync (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8fzhb (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  policysync:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/nodeagent
    HostPathType:  DirectoryOrCreate
  tigera-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tigera-ca-bundle
    Optional:  false
  node-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  node-certs
    Optional:    false
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:  
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:  
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:  
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
  cni-log-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/calico/cni
    HostPathType:  
  flexvol-driver-host:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
    HostPathType:  DirectoryOrCreate
  kube-api-access-8fzhb:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 :NoSchedule op=Exists
                             :NoExecute op=Exists
                             CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  3m1s                  default-scheduler  Successfully assigned calico-system/calico-node-cwqkn to default-12994-md-0-6g6kz-sc4lz
  Normal   Pulling    2m57s                 kubelet            Pulling image "mcr.microsoft.com/oss/calico/pod2daemon-flexvol:v3.26.1"
  Normal   Pulled     2m51s                 kubelet            Successfully pulled image "mcr.microsoft.com/oss/calico/pod2daemon-flexvol:v3.26.1" in 2.473s (5.613s including waiting)
  Normal   Created    2m51s                 kubelet            Created container flexvol-driver
  Normal   Started    2m51s                 kubelet            Started container flexvol-driver
  Normal   Pulling    2m47s                 kubelet            Pulling image "mcr.microsoft.com/oss/calico/cni:v3.26.1"
  Normal   Pulled     2m36s                 kubelet            Successfully pulled image "mcr.microsoft.com/oss/calico/cni:v3.26.1" in 10.698s (10.698s including waiting)
  Normal   Created    111s (x2 over 2m36s)  kubelet            Created container install-cni
  Normal   Started    111s (x2 over 2m36s)  kubelet            Started container install-cni
  Normal   Pulled     111s                  kubelet            Container image "mcr.microsoft.com/oss/calico/cni:v3.26.1" already present on machine
  Normal   Pulling    107s                  kubelet            Pulling image "mcr.microsoft.com/oss/calico/node:v3.26.1"
  Normal   Pulled     99s                   kubelet            Successfully pulled image "mcr.microsoft.com/oss/calico/node:v3.26.1" in 7.375s (7.375s including waiting)
  Normal   Created    99s                   kubelet            Created container calico-node
  Normal   Started    99s                   kubelet            Started container calico-node
  Warning  Unhealthy  99s                   kubelet            Readiness probe failed: calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
W0603 17:07:46.725894      24 feature_gate.go:241] Setting GA feature gate ServiceInternalTrafficPolicy=true. It will be removed in a future release.
  Warning  Unhealthy  98s  kubelet  Readiness probe failed: calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
W0603 17:07:47.572418      45 feature_gate.go:241] Setting GA feature gate ServiceInternalTrafficPolicy=true. It will be removed in a future release.
  Warning  Unhealthy  98s  kubelet  Readiness probe failed: calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
W0603 17:07:47.726779      56 feature_gate.go:241] Setting GA feature gate ServiceInternalTrafficPolicy=true. It will be removed in a future release.
  Warning  Unhealthy  88s  kubelet  Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503
W0603 17:07:57.579934     153 feature_gate.go:241] Setting GA feature gate ServiceInternalTrafficPolicy=true. It will be removed in a future release.
  Warning  Unhealthy  78s  kubelet  Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503
W0603 17:08:07.570650     165 feature_gate.go:241] Setting GA feature gate ServiceInternalTrafficPolicy=true. It will be removed in a future release.
  Warning  Unhealthy  68s  kubelet  Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503

@mboersma (Contributor, Author) commented Jun 3, 2024

All pods are healthy except that one, which never will be:

% KUBECONFIG=k.conf kubectl get pods -A                                    
NAMESPACE          NAME                                                        READY   STATUS    RESTARTS        AGE
calico-apiserver   calico-apiserver-58f97bc954-cdxrw                           1/1     Running   0               4m37s
calico-apiserver   calico-apiserver-58f97bc954-v4g7x                           1/1     Running   0               4m37s
calico-system      calico-kube-controllers-5696b6f5cd-8pnkv                    1/1     Running   0               5m32s
calico-system      calico-node-82ktb                                           1/1     Running   0               5m32s
calico-system      calico-node-cwqkn                                           0/1     Running   0               4m21s
calico-system      calico-node-t2bl7                                           1/1     Running   0               3m56s
calico-system      calico-typha-5768f775d4-797td                               1/1     Running   1 (3m11s ago)   3m53s
calico-system      calico-typha-5768f775d4-nhxpv                               1/1     Running   0               5m32s
calico-system      csi-node-driver-qn4b8                                       2/2     Running   0               3m56s
calico-system      csi-node-driver-tvjrl                                       2/2     Running   0               5m32s
calico-system      csi-node-driver-wx87n                                       2/2     Running   0               4m21s
kube-system        cloud-controller-manager-85f4c7cd6-5k5z7                    1/1     Running   0               6m18s
kube-system        cloud-node-manager-kwrb2                                    1/1     Running   0               3m56s
kube-system        cloud-node-manager-sf5rw                                    1/1     Running   0               6m18s
kube-system        cloud-node-manager-wvqc9                                    1/1     Running   0               4m21s
kube-system        coredns-5dd5756b68-4ckr2                                    1/1     Running   0               6m21s
kube-system        coredns-5dd5756b68-scg4j                                    1/1     Running   0               6m21s
kube-system        etcd-default-12994-control-plane-gwlg6                      1/1     Running   0               6m21s
kube-system        kube-apiserver-default-12994-control-plane-gwlg6            1/1     Running   0               6m21s
kube-system        kube-controller-manager-default-12994-control-plane-gwlg6   1/1     Running   0               6m21s
kube-system        kube-proxy-q4gq9                                            1/1     Running   0               3m56s
kube-system        kube-proxy-xbtpd                                            1/1     Running   0               6m21s
kube-system        kube-proxy-zrq9f                                            1/1     Running   0               4m21s
kube-system        kube-scheduler-default-12994-control-plane-gwlg6            1/1     Running   0               6m21s
tigera-operator    tigera-operator-776f4dcbf5-rcrbn                            1/1     Running   1 (5m36s ago)   6m10s
% KUBECONFIG=k.conf kubectl logs -n calico-system calico-node-cwqkn        
Defaulted container "calico-node" out of: calico-node, flexvol-driver (init), install-cni (init)
Error from server: Get "https://10.1.0.4:10250/containerLogs/calico-system/calico-node-cwqkn/calico-node": dial tcp 10.1.0.4:10250: i/o timeout

@k8s-ci-robot (Contributor)

@mboersma: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-azure-e2e-aks 1b5fbec link true /test pull-cluster-api-provider-azure-e2e-aks
pull-cluster-api-provider-azure-apiversion-upgrade 1b5fbec link true /test pull-cluster-api-provider-azure-apiversion-upgrade
pull-cluster-api-provider-azure-conformance 1b5fbec link false /test pull-cluster-api-provider-azure-conformance
pull-cluster-api-provider-azure-windows-with-ci-artifacts 1b5fbec link false /test pull-cluster-api-provider-azure-windows-with-ci-artifacts
pull-cluster-api-provider-azure-e2e 1b5fbec link true /test pull-cluster-api-provider-azure-e2e
pull-cluster-api-provider-azure-windows-custom-builds 1b5fbec link false /test pull-cluster-api-provider-azure-windows-custom-builds

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
do-not-merge/work-in-progress - Indicates that a PR should not merge because it is a work in progress.
kind/feature - Categorizes issue or PR as related to a new feature.
release-note - Denotes a PR that will be considered when it comes time to generate release notes.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Provision with Azure Linux by default
2 participants