rh-ecosystem-edge/accelerator_templates

OpenShift Accelerator Templates

Adding hardware support to OpenShift clusters is more complicated than on non-containerised Linux. While Red Hat CoreOS (RHCOS), the operating system OpenShift sits upon, is essentially Red Hat Enterprise Linux (RHEL), its appliance nature gives it two important differences.

Firstly, RHCOS is designed to be immutable. This means rebooting the node deletes any changes made to the operating system. Secondly, access to the operating system is restricted. While it is possible to connect to an OpenShift node via ssh and change to the root user, this behaviour is discouraged, and installing software on the underlying RHCOS operating system will invalidate support contracts.

Together these make the traditional approach, in which system administrators install packages of drivers and agent software, problematic. Instead, a new approach of containerised drivers managed by OpenShift operators is required.

To support this Red Hat has developed a number of technologies:

Together these provide a rich set of tools for third-party developers to build on to support their own drivers with custom operators and other tooling.

Creating a custom Operator

Every operator is different and will need different components, so the steps required to build the solution will vary, but the following checklist should provide a good starting point for most projects.

  • Work through the Operator Checklist to assess what work has already been done and what is required to be done before shipping.

  • Create the Device Driver and any user land tools. This is the same as for any other version of Linux including RHEL. (Example Source Code and Discussion)

  • Package the device driver into a Driver Container using the Driver Tool Kit (DTK), a toolkit to help create OCI images for kernel modules and drivers. (Example Source Code)

  • Package any user land components required into container images via a Dockerfile and podman build.

  • Create a configuration for the Node Feature Discovery (NFD) operator to label nodes based on their hardware and operating system features. (Example Source)

  • Create a Device Plugin to allow the user land components to request the hardware and make sure they are not being oversubscribed. (Example Source)

  • Create a DaemonSet configuration (as a yaml file), to deploy the user land components. The yaml created here is useful for testing, but will also translate directly into the golang structures that the operator will use to instantiate Kubernetes objects.

  • Create a configuration for the Kernel Module Management operator to load driver containers on nodes that meet set criteria (normally nodes with the given hardware). Again, building this as a manually applied yaml file is both a good sanity check that all the parts work together before they are automated with an Operator, and translates directly into the golang structures the operator needs.

  • Automate the deployment of the components by creating a custom Operator that deploys the Driver Container and the user land components it needs. (Discussion of building Operators, discussion of integrating with KMM and Example source)

  • Add metrics to the operator to report the state of the hardware to the cluster manager.

  • Create any PrometheusRule configuration yaml that might be needed.

  • Add Grafana dashboards and any other supporting components needed to make the cluster operator's life easier.

  • Certify your driver. As only a specific version of the images is certified, this needs to be the final step before release.
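
As a rough illustration of the two manually applied yaml files mentioned in the checklist, the sketch below pairs a Kernel Module Management `Module` with a DaemonSet for the user land components. It is a minimal sketch under stated assumptions, not the configuration used by the example sources: the names `example-driver`, `example-userland`, `example-namespace`, the `registry.example.com` images, the node label `example.com/accelerator.present` (assumed to be set by your NFD configuration) and the resource name `example.com/accelerator` (assumed to be advertised by your Device Plugin) are all hypothetical placeholders.

```yaml
# Hypothetical KMM Module: loads the driver container on labelled nodes.
apiVersion: kmm.sigs.x-k8s.io/v1beta1
kind: Module
metadata:
  name: example-driver              # placeholder name
  namespace: example-namespace      # placeholder namespace
spec:
  moduleLoader:
    container:
      modprobe:
        moduleName: example_driver  # placeholder kernel module name
      kernelMappings:
        # Match every kernel and pick the image tagged with its version.
        - regexp: '^.+$'
          containerImage: "registry.example.com/example/driver:${KERNEL_FULL_VERSION}"
  # Only nodes carrying this (assumed, NFD-applied) label get the module.
  selector:
    example.com/accelerator.present: "true"
---
# Hypothetical DaemonSet: runs the user land components on the same nodes.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-userland
  namespace: example-namespace
spec:
  selector:
    matchLabels:
      app: example-userland
  template:
    metadata:
      labels:
        app: example-userland
    spec:
      nodeSelector:
        example.com/accelerator.present: "true"
      containers:
        - name: agent
          image: registry.example.com/example/userland:latest
          resources:
            limits:
              # Assumed resource name exposed by the Device Plugin.
              example.com/accelerator: 1
```

Applying files like these by hand with `oc apply -f` is one way to confirm that the labels, images and resource names line up before the same fields are encoded in the operator's golang structures.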

 

Direct Links

  1. Kernel Modules

  2. Driver Containers

  3. The Kernel Module Management (kmm) operator

  4. The Node Feature Discovery (nfd) operator

  5. Device Plugin

  6. Operator

  7. Integrating with KMM

  8. Observability and Metrics

  9. Certification For Containers and Operators

  10. Support

Appendices

  1. Checklist

  2. Glossary Of Terms

Links and Related Operators

Red Hat OpenShift Container Platform Life Cycle Policy

Intel Technology Enabling for OpenShift with its related device plugins

 

Corrections and Omissions

If there is something you think needs adding, expanding, or correcting, please file an Issue or, even better, raise a PR.