
Onload® Operator for Kubernetes

Use OpenOnload® or EnterpriseOnload® stacks to accelerate your workloads in Kubernetes® and OpenShift® clusters.

Installation requirements

Supported environment

Deployment can also be performed on Kubernetes 1.23+, but full implementation details are not currently provided. The Onload Device Plugin is not currently designed for standalone deployment.

Please see Release Notes for further detail on version compatibility and feature availability.

Access to container images & configuration files

Terminal

Your terminal requires access to your cluster via the kubectl or oc command-line tool. This documentation standardises on kubectl but both are compatible: alias kubectl=oc.

Most users can benefit from the provided container images along with KMM's in-cluster onload-module builds. A more comprehensive development environment is required only for special use cases, such as building custom images (see Build below).

Cluster

Your cluster requires access to the following provided container images:

  • onload-operator
  • onload-device-plugin
  • onload-user
  • onload-source (if in-cluster builds)
  • sfptpd (optional)
  • sfnettest (optional)
  • KMM Operator & dependents
  • DTK (if in-cluster builds on OpenShift)
    • OpenShift includes a driver-toolkit (DTK) image in each release. No action should be required.

The cluster also requires access to the following node-specific kernel module container image(s) which may be provided externally or internally. If using in-cluster builds, push access to an internal registry will be required. Otherwise, only pull access is required if these images are pre-built. Please see Release Notes for further detail on feature availability.

  • onload-module

When using in-cluster builds, other dependencies may be required depending on the method selected. These may include ubi-minimal container image and UBI RPM repositories.

Nodes require 60MB of root-writable local storage, by default in /opt.

Provided Images

This repository's YAML configuration uses the provided images listed above by default.

For restricted networks these container images can be mirrored.
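
For example, with skopeo; the destination registry is a placeholder and the image name and tag are illustrative, not authoritative:

skopeo copy \
  docker://docker.io/onload/onload-operator:v3.0 \
  docker://registry.example.com/onload/onload-operator:v3.0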

Deployment

To accelerate a pod, deploy the Onload Operator and an Onload Custom Resource, then request acceleration in your workload pods, as described in the sections below.

Kubernetes objects deployed (simplified):

[Diagram of Kubernetes objects]

Pods & devices on Nodes:

[Diagram of Pods & devices on Nodes]

Onload Operator

The Onload Operator follows the Kubernetes Operator pattern which links a Kubernetes Controller, implemented here in the onload-operator container image, to one or more Custom Resource Definitions (CRD), implemented here in the Onload kind of CRD.

To deploy the Onload Operator, its controller container and CRD, run:

kubectl apply -k https://github.com/Xilinx-CNS/kubernetes-onload/config/default?ref=v3.0

This deploys the following by default:

The Onload Operator will not deploy the components necessary for accelerating workload pods without an Onload kind of Custom Resource (CR).
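
To confirm the deployment succeeded, a couple of hedged checks (exact namespace and resource names depend on your configuration, hence the broad greps):

kubectl get crd | grep -i onload
kubectl get pods -A | grep onload-operator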

Local Onload Operator images in restricted networks

For restricted networks, the onload-operator and onload-device-plugin image locations will require changing from their Docker Hub defaults. To run the above command using locally hosted container images, clone this repository and use the following overlay:

git clone -b v3.0 https://github.com/Xilinx-CNS/kubernetes-onload && cd kubernetes-onload

cp -r config/samples/default-clusterlocal config/samples/my-operator
$EDITOR config/samples/my-operator/kustomization.yaml
kubectl apply --validate=true -k config/samples/my-operator

Tip

Replacing kubectl apply with kubectl kustomize will output a complete YAML manifest file which can be copied to a network that does not have access to this repository.
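
For example (the output file name is illustrative):

kubectl kustomize config/samples/my-operator > my-operator.yaml
# transfer my-operator.yaml to the restricted network, then:
kubectl apply -f my-operator.yaml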

Onload Device Plugin

The Onload Device Plugin implements the Kubernetes Device Plugin API to expose a Kubernetes Resource named amd.com/onload.

It is distributed as the container image onload-device-plugin. The image location is configured as an environment variable within the Onload Operator deployment (see above), and its ImagePullPolicy is set as part of the Onload Custom Resource (CR), along with its other customisation properties.

The Onload Operator manages an Onload Device Plugin DaemonSet which deploys, to each node selected for acceleration, a pod consisting of 3 containers:

  • Init (init container, onload-user image) -- for copying Onload files to host filesystem and Onload Worker volume.
  • Onload Worker (onload-worker container, onload-device-plugin image) -- provides Onload Control Plane environment; privileged access to network namespaces.
  • Onload Device Plugin (device-plugin container, onload-device-plugin image) -- for Kubernetes Device Plugin API; privileged access to Kubernetes API.
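
Once these pods are running on a node, the amd.com/onload resource should appear among that node's allocatable resources. A hedged check (the node name is a placeholder):

kubectl get node mynode -o jsonpath="{.status.allocatable['amd\.com/onload']}"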

Onload Custom Resource (CR)

Instruct the Onload Operator to deploy the components necessary for accelerating workload pods by deploying an Onload kind of Custom Resource (CR).

If your cluster is internet-connected OpenShift and you want to use in-cluster builds with the current version of OpenOnload, run:

kubectl apply -k https://github.com/Xilinx-CNS/kubernetes-onload/config/samples/onload/overlays/in-cluster-build-ocp?ref=v3.0

This takes a base Onload CR template and adds the appropriate image versions and in-cluster build configuration. To customise this recommended overlay further, see comments in these files and the variant steps below.

The above overlay configures KMM to modprobe onload and modprobe sfc. Both are required, but the latter may occur outside the Onload Operator. Please see Out-of-tree sfc module for options.

For further explanation of the Onload CR's available properties, refer to either the inline comments in these templates or the built-in explain command, e.g. kubectl explain onload.spec.

The schema for the above templates is defined by an Onload Custom Resource Definition (CRD) in onload_types.go which is distributed as part of Onload Operator's generated YAML bundle.

Important

Due to Kubernetes limitations on label lengths, the combined length of the Name and Namespace of the Onload CR must be less than 32 characters.
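
For example, this combination satisfies the limit:

kind: Onload
metadata:
  name: onload-sample   # 13 characters
  namespace: default    # 7 characters; 13 + 7 = 20 < 32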

In-cluster builds in restricted networks

In restricted networks or on other versions of Kubernetes, change the container image locations and build method(s) to suit your environment. For example, to adapt the overlay for in-cluster builds on OpenShift in a restricted network:

git clone -b v3.0 https://github.com/Xilinx-CNS/kubernetes-onload && cd kubernetes-onload

cd config/samples/onload
cp -r overlays/in-cluster-build-ocp-clusterlocal overlays/my-onload
$EDITOR overlays/my-onload/kustomization.yaml
$EDITOR overlays/my-onload/patch-onload.yaml
kubectl apply -k overlays/my-onload

Consider configuring the following (a hedged sketch of these edits follows the list):

  • Onload Operator & Onload Device Plugin container image tags (recommended to match)
    • In above kustomization.yaml
  • Onload Source & Onload User container image tags and Onload version (all must match)
    • In above kustomization.yaml & version attribute in patch-onload.yaml
  • Onload Module build method and tag (match tag to Onload version for clarity)
    • In above kustomization.yaml & build section in patch-onload.yaml
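
A minimal illustrative fragment of such a kustomization.yaml; the image names and 8.2.0 tags here are assumptions for illustration, not authoritative values:

images:
  - name: docker.io/onload/onload-user
    newTag: "8.2.0"   # must match the version attribute in patch-onload.yaml
  - name: docker.io/onload/onload-source
    newTag: "8.2.0"   # Onload Source & Onload User tags must match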

Onload Module in-cluster builds

The Onload Operator supports all of KMM's core methods for providing compiled kernel modules to the nodes.

Some working examples are provided for use with the Onload CR:

  • dtk-ubi -- currently recommended for OpenShift, depends on DTK & UBI
  • dtk-only -- for OpenShift in very restricted networks, depends only on official OpenShift DTK
  • mkdist-direct -- for consistency with non-containerised Onload deployments (not recommended)
  • ubuntu -- representative sample for non-OpenShift clusters

Please see Onload Module pre-built images for the alternative to building in-cluster.

Out-of-tree sfc kernel module

The out-of-tree sfc kernel module is currently required when using the provided onload kernel module with a Solarflare card.

The following methods may be used:

  • Configure the Onload Operator to deploy a KMM Module for sfc. Please see the example in in-cluster build configuration.

  • OpenShift MachineConfig for Day 0/1 sfc. This is for when newer driver features are required at boot time while using OpenShift, or when Solarflare NICs are used for OpenShift machine traffic, so as to avoid kernel module reloads disconnecting nodes.

  • A user-supported method beyond the scope of this document, such as a custom kernel build or in-house OS image.

Tip

Network interface names can be fixed with UDEV rules.

On a RHCOS node within OpenShift, the directory /etc/udev/rules.d/ can be written to with a MachineConfig CR.
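
A minimal sketch of such a MachineConfig, assuming Ignition config version 3.2.0 and a hypothetical MAC address; the URL-encoded contents decode to a single udev rule that names the matching interface sfc0:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-sfc-udev-rules
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/udev/rules.d/99-sfc-names.rules
          mode: 0644
          overwrite: true
          contents:
            # Decodes to: SUBSYSTEM=="net", ATTR{address}=="00:0f:53:00:00:01", NAME="sfc0"
            source: data:,SUBSYSTEM%3D%3D%22net%22%2C%20ATTR%7Baddress%7D%3D%3D%2200%3A0f%3A53%3A00%3A00%3A01%22%2C%20NAME%3D%22sfc0%22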

sfptpd

The Solarflare Enhanced PTP Daemon (sfptpd) is not managed by Onload Operator but deployment instructions are included in this repository.

Please see config/samples/sfptpd/ for documentation and examples.

Operation

After you have completed the Deployment steps, your cluster is configured with the capability to accelerate workloads using Onload.

An easy test to verify everything is correctly configured is the sfnettest example.

Run Onloaded applications

To accelerate your workload, configure a pod with an AMD Solarflare network interface and an amd.com/onload resource:

kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: ipvlan-bond0
spec:
  ...
  containers:
  - ...
    resources:
      limits:
        amd.com/onload: 1

All applications started within the pod environment will be accelerated due to the LD_PRELOAD environment variable, unless setPreload: false is configured in the Onload CR.
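
A quick hedged check from a running accelerated pod (the pod name is a placeholder); it should print the library mount path, by default /opt/onload/usr/lib64/libonload.so:

kubectl exec my-onloaded-pod -- sh -c 'echo $LD_PRELOAD'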

Resource amd.com/onload

This Kubernetes Resource automatically exposes the following to a requesting pod:

Device mounts:

  • /dev/onload
  • /dev/onload_epoll
  • /dev/sfc_char

Library mounts (by default in /opt/onload/usr/lib64/):

  • libonload.so
  • libonload_ext.so

Environment variables (if setPreload is true):

  • LD_PRELOAD=<library-mount>/libonload.so

Binary mounts (if mountOnload is true, by default in /opt/onload/usr/bin/):

  • onload

If you wish to customise where files are mounted in the container's filesystem this can be configured with the fields of spec.devicePlugin in an Onload CR.
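
For example, a fragment of an Onload CR; setPreload and mountOnload appear elsewhere in this document, and any path-override field names should be confirmed with kubectl explain onload.spec.devicePlugin:

spec:
  devicePlugin:
    setPreload: true    # export LD_PRELOAD=<library-mount>/libonload.so into accelerated pods
    mountOnload: true   # also mount the onload binary, by default in /opt/onload/usr/bin/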

Important

Kubernetes Device Plugin only affects initial pod scheduling

Kubernetes Device Plugin is designed to configure pods once only, at creation time. If the Onload CR is re-applied to the cluster with settings that would change the pod environment -- for example, changing the value of setPreload -- then running pods must be recreated before these changes take effect.

Additionally, Kubernetes does not evict pods when node resources are removed; pods do not automatically have a formal dependency on the Onload Device Plugin or Onload Module. This has the advantage that minor changes in Onload Operator behaviour do not affect the workloads its components have already configured.
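
For example, pods managed by a Deployment can be recreated with a rollout restart (the deployment name is a placeholder):

kubectl rollout restart deployment my-onloaded-app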

Example client-server with sfnettest

Please see config/samples/sfnettest.

Using Onload profiles

If you want to run your onloaded application with a runtime profile, we suggest using a ConfigMap to set the environment variables in the pod(s). We have included an example definition for the 'latency' profile in the config/samples/profiles/ directory.

To deploy a ConfigMap named onload-latency-profile in the current namespace:

kubectl apply -k https://github.com/Xilinx-CNS/kubernetes-onload/config/samples/profiles?ref=v3.0

To use this in your pod, add the following to the container spec in your pod definition:

kind: Pod
...
spec:
  ...
  containers:
  - ...
    envFrom:
      - configMapRef:
          name: onload-latency-profile

Converting an existing profile

If you have an existing profile defined as a .opf file, you can generate a new ConfigMap definition from it using the scripts/profiles/profile_to_configmap.sh script.

profile_to_configmap.sh takes a comma-separated list of profiles and outputs the text definition of the ConfigMap, which can be saved to a file or sent straight to the cluster. To apply the generated ConfigMap straight away, run:

./scripts/profiles/profile_to_configmap.sh -p /path/to/profile.opf | kubectl apply -f -

Currently the script produces ConfigMaps with a fixed naming structure; for example, a profile called name.opf generates a ConfigMap named onload-name-profile.

Troubleshooting

Please see dedicated troubleshooting guide.

Build

Onload Module pre-built images

Developing the Onload Operator does not require building the onload-module image, as it can be built in-cluster by KMM.

To build these images outside the cluster, please see ./build/onload-module/ for documentation and examples.

OpenShift MachineConfig for sfc

Please see scripts/machineconfig/ for documentation and examples to deploy an out-of-tree sfc module in Day 0/1 (on boot).

Onload Operator & Onload Device Plugin

Using Onload Operator does not require building these images as official images are available.

Please see DEVELOPING documentation.

Onload Source & Onload User

Developing Onload Operator does not require building these images as official images are available.

If you wish to build these images, please follow 'Distributing as container image' in the Onload repository's DEVELOPING documentation. This includes building debug versions. All Onload images in use must be consistent in exact commit and build parameters. For example, a debug build of onload-user must be used with a debug build of onload-module. Build parameter specification is provided in the sample Onload CRs for the in-cluster build method.

Insecure registries

If your registry is not running with TLS configured, additional configuration may be necessary for accessing and pushing images. For example:

$ oc edit image.config cluster
...
spec:
  registrySources:
    insecureRegistries:
    - image-registry.openshift-image-registry.svc:5000

Ordered upgrades of Onload using Operator

The Onload Operator can upgrade the version of Onload used by a CR. This is done by updating the definition of the Onload CR once it is in the cluster.

Important

To trigger the start of an upgrade, edit the Onload CR and change the spec.onload.version field.

This can be done using kubectl edit, kubectl patch, or by re-applying the edited YAML file with kubectl apply.

The fields that the Operator will propagate during an upgrade are:

  • spec.onload.version
  • spec.onload.userImage
  • spec.kernelMappings

Changes to other fields are ignored by the Operator.

For example, using kubectl patch (note that this is just an illustrative example and shouldn't be applied to a resource in your cluster):

kubectl patch onload onload-sample --type=merge --patch-file=/dev/stdin <<-EOF
{
  "spec": {
    "onload": {
      "kernelMappings": [
        {
          "kernelModuleImage": "docker.io/onload/onload-module:8.2.0",
          "regexp": "^.*\\.x86_64$"
        }
      ],
      "userImage": "docker.io/onload/onload-user:8.2.0",
      "version": "8.2.0"
    }
  }
}
EOF

Upgrade procedure

The upgrade procedure occurs node by node: the Operator picks a node to upgrade (next alphabetically) and starts the procedure for that node. Once the upgrade on that node has completed, it moves on to the next node.

Steps during an upgrade:

  1. Change to spec.onload.version.
  2. Operator picks the next node to upgrade, or stops if all nodes are upgraded. For each node:
  3. Operator stops the Onload Device Plugin.
  4. Operator evicts pods using amd.com/onload resource.
  5. Operator removes the onload Module (and, if applicable, the sfc Module).
  6. Operator adds new Module(s).
  7. Operator re-starts the Onload Device Plugin.

Pods using Onload

During the upgrade procedure on a node, the Onload Operator will evict all pods that have requested an amd.com/onload resource on that node. This is done so that these application pods don't encounter unexpected errors at runtime and so that the upgrade completes as expected. If your application's pods are created by a controller (for example, a Deployment), they will be re-created once the upgrade has completed and amd.com/onload resources are available again; if your pod was created manually, it may have to be re-created manually.

The Operator assumes that all users of either the sfc or onload kernel modules are in pods that have an amd.com/onload resource. If there are pods that are using the sfc interface but do not have a resource registered through the device plugin, please shut them down before starting the upgrade.

Limitations

MCO

The Onload Operator does not interact with the Machine Config Operator, so the sfc driver must be upgraded separately from Onload. We suggest updating the sfc MachineConfig first; when that has finished, trigger the Onload upgrade. This will result in a period of time after the machine has rebooted with the new sfc driver version but an old version of Onload. Onloaded apps are not expected to work during this period, and you should wait until the Onload upgrade has finished before re-starting your workload.

Rollbacks

The Onload Operator does not keep a history of previous versions, so it is not possible to "roll back" an upgrade. If you wish to continue using an older version, you can simply re-follow the upgrade procedure using the earlier version and images.

Verification

The Onload Operator does not perform automatic validation of an upgrade. The status of the cluster should be checked after the upgrade has finished to ensure that things are as expected.
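
For example, some hedged spot checks (resource names are placeholders):

kubectl get onload onload-sample -o yaml   # inspect the applied version in spec
kubectl get modules -A                     # KMM Module resources, if KMM CRDs are installed
kubectl get pods -A | grep onload          # device plugin and workload pods back to Running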

Freeze

Once an upgrade has started, the Onload Operator will try to perform the upgrade on all nodes that match its selector. Therefore it is not currently possible to "freeze" a node in place while others are upgraded. If you want to have heterogeneous Onload versions in the same cluster, you should use multiple Onload CRs with non-overlapping node selectors; each of these can then be upgraded separately.

Unloading modules

Due to the Onload Operator's dependence on KMM v1, it is not possible to guarantee that a kernel module is actually unloaded when the Module CR is deleted. This is a known issue with KMM v1; please try to ensure that there are no other users of the onload (or, if applicable, sfc) kernel modules when the upgrade starts.

Caveats

  • The Onload Operator manages KMM resources on behalf of the user but does not provide feature parity with KMM. Examples of features not included are: in-cluster container image build signing, node version freezing during ordered upgrade (Onload Operator manages these labels), miscellaneous DevicePlugin configuration, configuration of registry credentials (beyond existing cluster configuration), customisation of kernel module parameters and soft dependencies, and customisation of Namespace and Service Account for dependent resources (instead inherited from Onload CR). Configuring PreflightValidation can be performed independently while the Onload Operator is running.

  • Reloading of the kernel modules onload (and optionally sfc) will occur on first deployment and under certain reconfigurations. When using AMD Solarflare interfaces for Kubernetes control plane traffic, ensure node network interface configuration and workloads will regain correct configuration and cluster connectivity after reload.

  • Interface names may change when switching from an in-tree to an out-of-tree sfc kernel module. This is due to changes in default interface names between major versions 4 and 5 of the sfc driver. Ensure appropriate measures have been taken for any additional network configurations that depend on this information.

Footnotes

Trademarks are acknowledged as being the property of their respective owners. Kubernetes® is a trademark of The Linux Foundation. OpenShift® is a trademark of Red Hat, Inc.

Copyright (c) 2023-2024 Advanced Micro Devices, Inc.