
Cleanup k8s DaskCluster resources by introducing a ttlSecondsAfterFinished field respected by the controller? #760

Open
consideRatio opened this issue Oct 25, 2023 · 0 comments

@consideRatio (Collaborator)

When a k8s DaskCluster resource enters a "Stopped" state, for example after being idle-culled by the dask-gateway controller, the DaskCluster resource itself is still retained:

```yaml
apiVersion: gateway.dask.org/v1alpha1
kind: DaskCluster
# ...
status:
  completionTime: "2023-10-25T11:43:39Z"
  credentials: dask-credentials-b3a990d302d84720aae27404f6153ade
  ingressroute: dask-b3a990d302d84720aae27404f6153ade
  ingressroutetcp: dask-b3a990d302d84720aae27404f6153ade
  phase: Stopped
  schedulerPod: dask-scheduler-b3a990d302d84720aae27404f6153ade
  service: dask-b3a990d302d84720aae27404f6153ade
```

Should stopped DaskCluster resources get cleaned up directly, or only after some time?

This is similar to a k8s Job resource creating a Pod to do some work: both the Pod and the Job are then left in a "Completed" state for a while. The Kubernetes documentation has a section about that:

> When a Job completes, no more Pods are created, but the Pods are usually not deleted either. Keeping them around allows you to still view the logs of completed pods to check for errors, warnings, or other diagnostic output. The job object also remains after it is completed so that you can view its status. It is up to the user to delete old jobs after noting their status.

CronJob, a k8s resource that creates Job resources, can clean up the Job resources it creates:

> Finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.
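
As an aside, the "capacity-based cleanup policy" in the quote refers to CronJob's `successfulJobsHistoryLimit` and `failedJobsHistoryLimit` fields. A minimal illustration (the name, schedule, and image below are arbitrary):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example
spec:
  schedule: "*/5 * * * *"
  successfulJobsHistoryLimit: 3  # keep at most 3 completed Jobs
  failedJobsHistoryLimit: 1      # keep at most 1 failed Job
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: work
              image: busybox
              command: ["sh", "-c", "echo done"]
          restartPolicy: Never
```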

It appears that in k8s 1.23+ (by now probably used by most k8s clusters), there is a built-in TTL-after-finished controller that reads the k8s Job resource's `spec.ttlSecondsAfterFinished` field and deletes the Job that many seconds after it finishes. I think it could make sense for the dask-gateway controller to respect a similar field on DaskCluster resources, as sketched below.
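
For comparison, here is the existing Job API next to a hypothetical DaskCluster analog. The `spec.ttlSecondsAfterFinished` field on DaskCluster is the field this issue proposes; it is not something dask-gateway implements today:

```yaml
# Existing k8s API: the TTL-after-finished controller deletes this Job
# 300 seconds after it finishes
apiVersion: batch/v1
kind: Job
spec:
  ttlSecondsAfterFinished: 300
  # ...
---
# Hypothetical analog: dask-gateway's controller could delete this
# DaskCluster 300 seconds after status.phase becomes Stopped
apiVersion: gateway.dask.org/v1alpha1
kind: DaskCluster
spec:
  ttlSecondsAfterFinished: 300  # hypothetical field, proposed by this issue
  # ...
```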
