Kubeflow Operator

The Kubeflow Operator helps deploy, monitor, and manage the lifecycle of Kubeflow. It is built with the Operator Framework, an open source toolkit for building, testing, and packaging operators and managing their lifecycle.

The Operator is currently in the incubation phase and is based on this design doc. It is built on top of the KfDef CR and uses kfctl as the nucleus of its controller. The current roadmap for this Operator is listed here. The Operator is also published on OperatorHub.

Prerequisites

Deployment Instructions

  1. Clone this repository, build the manifests and install the operator
git clone https://github.com/kubeflow/kfctl.git && cd kfctl

export OPERATOR_NAMESPACE=operators
kubectl create ns ${OPERATOR_NAMESPACE}

cd deploy/
kustomize edit set namespace ${OPERATOR_NAMESPACE}
# Optional: uncomment the next line only if the k8s cluster is 1.15+ and has resource quota support.
# The quota allows only one _kfdef_ instance (one deployment of Kubeflow) on the cluster.
# This follows the singleton model, which is the current recommended and supported mode.
# kustomize edit add resource kustomize/include/quota

kustomize build | kubectl apply -f -
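The version condition on the optional quota resource can be checked before uncommenting that line. The following is a minimal sketch, not part of kfctl: the helper name `k8s_at_least_1_15` is hypothetical, and it only covers the version condition, not whether resource quota support is enabled on the cluster.

```shell
# Hypothetical helper (not part of kfctl): succeeds when a Kubernetes version
# string such as "v1.16.3" is at least 1.15.
k8s_at_least_1_15() {
  ver=${1#v}                 # strip a leading "v"
  major=${ver%%.*}           # text before the first dot
  rest=${ver#*.}             # text after the first dot
  minor=${rest%%.*}          # text before the next dot
  minor=${minor%%[!0-9]*}    # drop non-numeric suffixes such as "+"
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 15 ]; }
}

# Usage against a live cluster (not run here; assumes jq is installed):
#   server=$(kubectl version -o json | jq -r '.serverVersion.gitVersion')
#   k8s_at_least_1_15 "${server}" && kustomize edit add resource kustomize/include/quota
```

Keeping the comparison in a small function makes it easy to test without a cluster, since it only operates on the version string.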
  2. Deploy KfDef

KfDef can point to a remote URL or to a local kfdef file. To use the set of default kfdefs from Kubeflow, follow the Deploy with default kfdefs section below.

KUBEFLOW_NAMESPACE=kubeflow
kubectl create ns ${KUBEFLOW_NAMESPACE}
kubectl create -f <kfdef> -n ${KUBEFLOW_NAMESPACE}

Deploy with default kfdefs

To use the set of default kfdefs from Kubeflow, you must insert the metadata.name field before you can apply a kfdef to Kubernetes. The commands below apply a Kubeflow kfdef using the Operator, with OpenShift and IBM Cloud as examples.

If you are pointing to a kfdef file on the local machine, set KFDEF to the kfdef file path and skip the curl command.

First, point to your cloud provider's kfdef. For example, for OpenShift, point to the kfdef in the OpenDataHub repo:

export KFDEF_URL=https://raw.githubusercontent.com/opendatahub-io/manifests/v0.7-branch-openshift/kfdef/kfctl_openshift.yaml

Similarly, for GCP, IBM Cloud, etc., you can point to the respective kfdefs in the Kubeflow repository, e.g.

export KFDEF_URL=https://raw.githubusercontent.com/kubeflow/manifests/master/kfdef/kfctl_ibm.yaml

Then specify the KUBEFLOW_DEPLOYMENT_NAME you want to give to your deployment. Please note that currently multi-user deployments have a hard dependency on using kubeflow as the deployment name.

export KUBEFLOW_DEPLOYMENT_NAME=kubeflow
export KFDEF=$(echo "${KFDEF_URL}" | rev | cut -d/ -f1 | rev)
curl -L ${KFDEF_URL} > ${KFDEF}
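The `rev | cut | rev` pipeline above extracts the last path segment of the URL to use as the local file name. As a hedged alternative, the same result can be obtained with POSIX parameter expansion, which avoids depending on `rev`. The helper name `kfdef_file_name` is hypothetical.

```shell
# Hypothetical helper: print the last "/"-separated segment of a URL,
# which is the value the rev | cut | rev pipeline assigns to KFDEF.
kfdef_file_name() {
  printf '%s\n' "${1##*/}"   # remove the longest prefix ending in "/"
}

kfdef_file_name "https://raw.githubusercontent.com/kubeflow/manifests/master/kfdef/kfctl_ibm.yaml"
# prints: kfctl_ibm.yaml

# Usage in the download step (not run here):
#   export KFDEF=$(kfdef_file_name "${KFDEF_URL}")
#   curl -L "${KFDEF_URL}" -o "${KFDEF}"
```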

Next, update the KFDEF file with the KUBEFLOW_DEPLOYMENT_NAME. We strongly recommend installing the yq tool and running the yq command. If you can't install yq, you can run the perl command instead, which does the same thing provided you are using one of the kfdefs from the manifests repository.

yq w ${KFDEF} 'metadata.name' ${KUBEFLOW_DEPLOYMENT_NAME} > ${KFDEF}.tmp && mv ${KFDEF}.tmp ${KFDEF}
# perl -pi -e $'s@metadata:@metadata:\\\n  name: '"${KUBEFLOW_DEPLOYMENT_NAME}"'@' ${KFDEF}
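Before deploying, it can help to confirm that the yq or perl command actually set metadata.name. The sketch below is an assumption-laden check, not part of kfctl: the helper name `kfdef_name` is hypothetical, and it assumes the kfdef keeps `metadata:` at the top level with two-space indentation, as the kfdefs in the manifests repository do.

```shell
# Hypothetical helper: print metadata.name from a kfdef file and fail if it is
# missing. Assumes a top-level "metadata:" key with two-space indented children.
kfdef_name() {
  name=$(awk '
    /^metadata:/        { in_meta = 1; next }    # entering the metadata block
    in_meta && /^[^ ]/  { in_meta = 0 }          # next top-level key ends it
    in_meta && /^  name:/ { print $2; exit }     # direct child "name"
  ' "$1")
  [ -n "$name" ] && printf '%s\n' "$name"
}

# Usage (not run here):
#   kfdef_name "${KFDEF}" || echo "metadata.name is not set in ${KFDEF}" >&2
```

The function only reads the file, so it is safe to run repeatedly while adjusting the kfdef.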

Lastly, deploy the kfdef resource to the cluster.

kubectl create -f ${KFDEF} -n ${KUBEFLOW_NAMESPACE}

Testing Watcher and Reconciler

One of the major benefits of running kfctl as an Operator is the ability to watch and reconcile your Kubeflow deployments. The Operator watches all cluster events for the KfDef instance, as well as the Delete event for every resource owned by the KfDef instance. Each such event is queued as a request for the reconciler to apply changes to the KfDef instance. For example, if one of the Kubeflow resources is deleted, the reconciler is triggered to re-apply the KfDef instance and re-create the deleted resource on the cluster. As a result, a Kubeflow deployment managed through this KfDef instance recovers automatically from an unexpected delete event.

Try the following to see the operator watcher and reconciler in action:

  1. Check the tf-job-operator deployment is running
kubectl get deploy -n ${KUBEFLOW_NAMESPACE} tf-job-operator
# NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
# tf-job-operator                               1/1     1            1           7m15s
  2. Delete the tf-job-operator deployment
kubectl delete deploy -n ${KUBEFLOW_NAMESPACE} tf-job-operator
# deployment.extensions "tf-job-operator" deleted
  3. Wait for 10 to 15 seconds, then check the tf-job-operator deployment again

You will be able to see that the deployment is being recreated by the Operator's reconciliation logic.

kubectl get deploy -n ${KUBEFLOW_NAMESPACE} tf-job-operator
# NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
# tf-job-operator                               0/1     0            0           10s
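Instead of waiting a fixed 10 to 15 seconds, the check can be polled until the deployment reappears. This is a minimal sketch under stated assumptions: `retry_until` is a hypothetical helper written for this example, and the attempt count and delay are arbitrary.

```shell
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping
# DELAY_SECONDS between tries, and succeed as soon as the command succeeds.
# Usage: retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" >/dev/null 2>&1 && return 0
    [ "$i" -lt "$attempts" ] && sleep "$delay"   # no sleep after the last try
    i=$((i + 1))
  done
  return 1
}

# Usage against a live cluster (not run here):
#   retry_until 15 2 kubectl get deploy -n "${KUBEFLOW_NAMESPACE}" tf-job-operator \
#     && echo "tf-job-operator was recreated"
```

The helper only reports whether the deployment object exists again; it does not wait for the pods to become ready.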

The Kubeflow operator also supports deploying multiple KfDef instances. It watches all KfDef instances and handles reconcile requests for each of them. To learn more about the operator controller behavior, refer to this controller-runtime link.

The operator responds to the following events:

  • When a KfDef instance is created or updated, the operator's reconciler is notified of the event and invokes the Apply function provided by the kfctl package to deploy Kubeflow. The Kubeflow resources specified in the manifests are given the following annotation to indicate that they are owned by this KfDef instance.

    annotations:
      kfctl.kubeflow.io/kfdef-instance: <kfdef-name>.<kfdef-namespace>
    
  • When a KfDef instance is deleted, the operator's reconciler will be notified of the event and invoke the finalizer to run the Delete function provided by the kfctl package and go through all applications and components owned by the KfDef instance.

  • When any resource deployed as part of a KfDef instance is deleted, the operator's reconciler will be notified of the event and invoke the Apply function provided by the kfctl package to re-deploy Kubeflow. The deleted resource will be recreated with the same manifest which was specified when the KfDef instance was created.
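The ownership annotation described above can be used to check which KfDef instance a resource belongs to. The sketch below builds the expected `<kfdef-name>.<kfdef-namespace>` value and compares it with a value read from a resource. The helper names `kfdef_owner_value` and `is_owned_by` are hypothetical and not part of kfctl.

```shell
# Hypothetical helper: build the annotation value "<kfdef-name>.<kfdef-namespace>".
kfdef_owner_value() {
  printf '%s.%s\n' "$1" "$2"
}

# Hypothetical helper: succeed when an annotation value matches the expected owner.
# Usage: is_owned_by ACTUAL_VALUE KFDEF_NAME KFDEF_NAMESPACE
is_owned_by() {
  [ "$1" = "$(kfdef_owner_value "$2" "$3")" ]
}

# Usage against a live cluster (not run here); dots in the annotation key
# are escaped for kubectl's jsonpath output:
#   actual=$(kubectl get deploy tf-job-operator -n "${KUBEFLOW_NAMESPACE}" \
#     -o jsonpath='{.metadata.annotations.kfctl\.kubeflow\.io/kfdef-instance}')
#   is_owned_by "${actual}" "${KUBEFLOW_DEPLOYMENT_NAME}" "${KUBEFLOW_NAMESPACE}" \
#     && echo "owned by this KfDef instance"
```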

Delete Kubeflow

  • Delete the Kubeflow deployment (the KfDef instance)
kubectl delete kfdef -n ${KUBEFLOW_NAMESPACE} --all

Note that the user profile namespaces created by profile-controller will not be deleted. The ${KUBEFLOW_NAMESPACE} namespace, which was created outside of the operator, will not be deleted either. The delete process usually takes up to 15 minutes because the Operator needs to delete each component sequentially to avoid race conditions such as the namespace finalizer issue.

  • Delete Kubeflow Operator
kubectl delete -f deploy/operator.yaml -n ${OPERATOR_NAMESPACE}
kubectl delete clusterrolebinding kubeflow-operator
kubectl delete -f deploy/service_account.yaml -n ${OPERATOR_NAMESPACE}
kubectl delete -f deploy/crds/kfdef.apps.kubeflow.org_kfdefs_crd.yaml
kubectl delete ns ${OPERATOR_NAMESPACE}

Optional: Registering the Operator to OLM Catalog

Please follow the instructions here to register your Operator to OLM if you are using that to install and manage the Operator. If you want to leverage the OperatorHub, please use the default Kubeflow Operator registered there.

Troubleshooting

  • When deleting a Kubeflow deployment, some mutatingwebhookconfigurations may not be removed as they are cluster-wide resources and dynamically created by the individual controller. It's a known issue for some of the Kubeflow components. To remove them, run the following:
kubectl delete mutatingwebhookconfigurations katib-mutating-webhook-config
kubectl delete mutatingwebhookconfigurations cache-webhook-kubeflow
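The two delete commands above report an error if a webhook configuration has already been removed. A hedged variant wraps them in a loop with kubectl's `--ignore-not-found` flag so the cleanup can be rerun safely. The function name `cleanup_webhooks` is hypothetical; the webhook names are the ones listed above.

```shell
# Hypothetical helper: remove the leftover cluster-wide webhook configurations
# named in this section, tolerating ones that are already gone.
cleanup_webhooks() {
  for name in katib-mutating-webhook-config cache-webhook-kubeflow; do
    kubectl delete mutatingwebhookconfigurations "$name" --ignore-not-found
  done
}

# Usage against a live cluster (not run here):
#   cleanup_webhooks
```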

Development Instructions

Prerequisites

  1. Install operator-sdk

  2. Install golang

  3. Install kustomize

Build Instructions

These steps are based on the operator-sdk, with modifications that are specific to this Kubeflow operator.

  1. Clone this repository under your $GOPATH. (e.g. ~/go/src/github.com/kubeflow/)
git clone https://github.com/kubeflow/kfctl
cd kfctl
  2. Build and push the operator
export OPERATOR_IMG=<docker_repo>
make build-operator
make push-operator

Note: replace <docker_repo> with the image repo name and tag, for example, docker.io/example/kubeflow-operator:latest.

  3. Follow the Deployment Instructions section to test the operator with the newly built image

Current Tested Operators and Pre-built Images

The Kubeflow Operator controller logic is based on the kfctl package, so for each major release of kfctl, an operator image is built and tested with that version of the manifests to deploy a KfDef instance. The following table shows which releases have been tested.

| branch tag | operator image | manifests version | kfdef example | note |
| --- | --- | --- | --- | --- |
| v1.0 | aipipeline/kubeflow-operator:v1.0.0 | 1.0.0 | kfctl_k8s_istio.v1.0.0.yaml | |
| v1.0.1 | aipipeline/kubeflow-operator:v1.0.1 | 1.0.1 | kfctl_k8s_istio.v1.0.1.yaml | |
| v1.0.2 | aipipeline/kubeflow-operator:v1.0.2 | 1.0.2 | kfctl_k8s_istio.v1.0.2.yaml | |
| master | aipipeline/kubeflow-operator:master | master | kfctl_k8s_istio.yaml | as of 05/15/2020 |

Note: to build a customized operator for a specific version of Kubeflow, run git checkout on that specific branch tag. Remember to use the matching version of the manifests.
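To keep the branch tag and the manifests in step, the tested combinations from the table can be encoded as a small lookup. This is only an illustrative sketch: the function name `kfdef_example_for` is hypothetical, and it covers just the rows listed in the table.

```shell
# Hypothetical helper: print the kfdef example file tested with a given
# branch tag, according to the table in this section.
kfdef_example_for() {
  case "$1" in
    v1.0)   echo "kfctl_k8s_istio.v1.0.0.yaml" ;;
    v1.0.1) echo "kfctl_k8s_istio.v1.0.1.yaml" ;;
    v1.0.2) echo "kfctl_k8s_istio.v1.0.2.yaml" ;;
    master) echo "kfctl_k8s_istio.yaml" ;;
    *)      echo "untested branch tag: $1" >&2; return 1 ;;
  esac
}

# Usage (not run here):
#   git checkout v1.0.2
#   kfdef_example_for v1.0.2
```

Failing on unknown tags makes it obvious when a build targets a combination that the table does not list as tested.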