Adding a suspend field to the dask operator #701
Comments
This sounds great. One of the goals on our roadmap is to stop manipulating Pods directly wherever possible and switch to higher-level abstractions. Today a Perhaps we should replace the I have a few thoughts/questions about how this would behave. Currently, we create the If we switch to a What would happen if the
Yay! Sounds like a good goal.
That sounds like a good idea. Where in the code is the creation of the Pods?
You could always set Parallelism: 1 on the Job to disallow this. Since you control how the Job gets created, you can prevent several Pods from sharing the same DaskCluster.
This is how the Job code works in Kubernetes. We assume that if a Job is suspended, we terminate its existing active Pods.
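To make the two knobs above concrete, here is a minimal sketch of a batch/v1 Job manifest built as a plain Python dict. The `suspend` and `parallelism` fields are part of the upstream Job spec; the function name `build_job_manifest`, the container name `runner`, and the image and command values are placeholders made up for illustration, not anything the operator does today.

```python
def build_job_manifest(name, image, command, suspend=False, parallelism=1):
    """Build a batch/v1 Job manifest as a plain dict.

    Hypothetical helper for illustration only.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # While suspend is true, the Job controller creates no Pods
            # and terminates any Pods that are still active.
            "suspend": suspend,
            # parallelism: 1 means at most one runner Pod at a time.
            "parallelism": parallelism,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "runner",  # placeholder container name
                            "image": image,
                            "command": list(command),
                        }
                    ],
                }
            },
        },
    }


# Example: a Job that is created suspended and waits to be admitted.
manifest = build_job_manifest(
    name="example-runner",           # placeholder
    image="example.invalid/dask",    # placeholder image
    command=["python", "job.py"],    # placeholder command
    suspend=True,
)
```

Leaving `parallelism` as a parameter rather than hard-coding it keeps the option open for users who do want several Pods against one cluster.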
I don't necessarily want to constrain people. There may be valid use cases for parallelism, but all of the parallel Pods would share the same Dask cluster.
Ok perfect
One project that may interest you is https://github.com/kubernetes-sigs/jobset. I'm not sure of the architecture for
Thanks for sharing that.
Kubernetes has started adding ways to queue workloads. The entry point for enabling queueing is implementing the suspend field.
The upstream Kubernetes batch Job API already has this field, but custom CRDs need to implement suspend semantics themselves to support queueing.
There is some work in Kueue for adding suspend capabilities to RayJob and I imagine it would be similar for this project.
Relevant PR for RayJob: ray-project/kuberay#926
Kueue PR to incorporate RayJob: kubernetes-sigs/kueue#667
Documentation for suspend in jobs: https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job
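As a rough illustration of what "suspend semantics" could mean for a custom resource, here is a small sketch that mirrors the documented Job behavior: while suspended, active Pods are removed and none are created; once resumed, a Pod is created if none is running. Everything here (`JobState`, `reconcile`, the action tuples, the `runner` target) is hypothetical and simplified. It ignores completed Pods and status fields, and it is not how dask-kubernetes or Kueue implement this.

```python
from dataclasses import dataclass, field


@dataclass
class JobState:
    """Simplified view of a job-like custom resource (hypothetical)."""
    suspend: bool
    active_pods: list = field(default_factory=list)


def reconcile(state):
    """Return the list of (action, target) steps needed for this state."""
    if state.suspend:
        # Suspended: tear down whatever is running and create nothing new.
        return [("delete", pod) for pod in state.active_pods]
    if not state.active_pods:
        # Not suspended and nothing running: start the workload.
        return [("create", "runner")]
    # Not suspended and already running: nothing to do.
    return []


# A suspended resource with two running Pods gets both removed.
print(reconcile(JobState(suspend=True, active_pods=["pod-a", "pod-b"])))
# A resumed resource with no Pods gets one created.
print(reconcile(JobState(suspend=False)))
```

A queueing system would then only need to flip `suspend` from true to false to admit the workload.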
I think it would make sense to add it to DaskJob, but there could be reasons to implement queueing for other resources as well.