Dask Kubernetes v2 (Stability) Release #216

Closed
12 of 21 tasks
jacobtomlinson opened this issue Apr 14, 2023 · 3 comments
Assignees
Labels
tool/dask-kubernetes Uses the Dask Kubernetes classic cluster manager

Comments

@jacobtomlinson
Member

jacobtomlinson commented Apr 14, 2023

Dask Kubernetes Summer Roadmap

Note
Creating in rapidsai/deployment so I can use tasklists. When tasklists are GA I'll migrate this issue to the dask/dask-kubernetes repo.

At the end of the summer I want to release V2 of the Dask Kubernetes Operator and fully remove the deprecated classic implementations. This issue outlines the roadmap that we need to complete to get us to a point where we can do that.

High-level goals:

  • Improve stability
  • Ensure feature completeness compared to other implementations

Some of the sections here may want to be split off into separate issues, and some tasks may want to be broken down into smaller chunks. But this will be the high-level milestone tracker issue for this work.

Features

Cluster idle timeout

Automatically cleaning up idle clusters becomes critical for cost reduction when deploying at scale, especially when using GPUs.
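A rough sketch of the kind of check this implies: compare the time since the cluster last did any work against a configured timeout. The `IdleTracker` class and the `idle_timeout` and `last_activity` names are hypothetical and only illustrate the idea; they are not the operator's API.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IdleTracker:
    """Hypothetical helper that tracks when a cluster was last active."""

    idle_timeout: float  # seconds of inactivity allowed; 0 disables the check
    last_activity: float = field(default_factory=time.monotonic)

    def mark_active(self, now: Optional[float] = None) -> None:
        # Record activity; `now` can be injected to make testing easy.
        self.last_activity = time.monotonic() if now is None else now

    def should_delete(self, now: Optional[float] = None) -> bool:
        # A timeout of zero (or less) means "never clean up automatically".
        if self.idle_timeout <= 0:
            return False
        now = time.monotonic() if now is None else now
        return (now - self.last_activity) >= self.idle_timeout
```

A controller could call `should_delete()` on a timer and delete the cluster resource once it returns `True`.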

Tasks

  1. enhancement help wanted operator
    jacobtomlinson

Full Istio support

Currently we have partial Istio support where the scheduler uses it but workers do not. This can be a blocker for clusters that enforce Istio on all comms.

Tasks

  1. operator

UX Improvements

UX can always be improved.

Tasks

  1. enhancement operator

Fixes

Replace Pod resources with higher abstractions like Deployment or at least ReplicaSet

Currently we manage bare Pods. This has downsides, such as Pods not being recreated when they are evicted from a node. It would be good to explore higher-level resources and how they could be used to simplify our controller logic.
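For illustration, wrapping an existing worker Pod spec in a Deployment could look roughly like the sketch below. The `apps/v1` Deployment fields are standard Kubernetes; the function name, the resource naming scheme, and the label key are made up for this example and are not what the operator uses.

```python
def worker_deployment(cluster_name: str, replicas: int, pod_spec: dict) -> dict:
    """Build a Deployment manifest that manages worker Pods (sketch only)."""
    # Hypothetical label; the selector and the Pod template must agree on it.
    labels = {"example.dask.org/worker-group": f"{cluster_name}-default"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {
            "name": f"{cluster_name}-default-worker",
            "labels": dict(labels),
        },
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": pod_spec,
            },
        },
    }
```

With a resource like this, an evicted Pod would be replaced by the Deployment's ReplicaSet rather than by our own controller logic.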

Tasks

  1. enhancement operator
    Matt711
  2. bug operator
    Matt711
  3. enhancement operator

Ensure patches to DaskCluster and DaskWorkerGroup are propagated to child resources

In the context of CRUD, we only have create, read and delete implemented for our resources. We also need to handle updates to them correctly.
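A minimal sketch of what an update handler might do, using plain dictionaries in place of real resources. It assumes both resources keep the worker definition under `spec.worker`; that layout and the `propagate_worker_patch` name are assumptions made for the example.

```python
import copy


def propagate_worker_patch(cluster: dict, worker_group: dict) -> bool:
    """Copy the cluster's worker definition to a child worker group.

    Returns True if the child was changed, False if it was already up to date.
    """
    desired = cluster["spec"]["worker"]
    current = worker_group["spec"].get("worker")
    if current == desired:
        # Nothing to do, so repeated calls are harmless.
        return False
    # Deep copy so later edits to the parent do not leak into the child.
    worker_group["spec"]["worker"] = copy.deepcopy(desired)
    return True
```

In a real controller the return value would decide whether a patch request needs to be sent to the Kubernetes API at all.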

Tasks

  1. enhancement operator
    Matt711

Ensure scaling/autoscaling is solid

Some users are reporting unwanted behaviour when autoscaling at scale. This needs to be solid.

Tasks

  1. bug operator
  2. bug needs info operator
  3. bug operator
  4. bug needs info operator
  5. enhancement operator
    Matt711

Input sanitisation

Currently we rely on bad configuration being validated by the Kubernetes API, but this doesn't always happen as we expect. We should do more checking and sanitization before calling the Kubernetes API.
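As one example of an early check, a cluster name could be validated locally before any API call. The sketch below assumes the name has to be a valid RFC 1123 DNS label (lowercase alphanumerics and `-`, starting and ending with an alphanumeric, at most 63 characters); `validate_cluster_name` is a hypothetical helper, and the real constraints depend on which child resources are derived from the name.

```python
import re

# RFC 1123 DNS label: lowercase alphanumerics or '-', alphanumeric at both ends.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
MAX_LABEL_LENGTH = 63


def validate_cluster_name(name: str) -> str:
    """Return the name unchanged if it is valid, otherwise raise ValueError."""
    if len(name) > MAX_LABEL_LENGTH:
        raise ValueError(
            f"Cluster name is {len(name)} characters long, "
            f"the maximum is {MAX_LABEL_LENGTH}"
        )
    if not _DNS_LABEL.fullmatch(name):
        raise ValueError(
            f"Invalid cluster name {name!r}: use lowercase letters, digits "
            "and '-', starting and ending with a letter or digit"
        )
    return name
```

Failing here gives the user a readable error immediately instead of a rejected request, or a partially created cluster, later on.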

Tasks

  1. bug operator
    skirui-source
  2. bug operator
    skirui-source
  3. bug operator
  4. bug operator

Controller idempotency

The controller event handlers should be idempotent, so that calling them multiple times is safe. Today they are not, which can cause problems when the controller is restarted while operations are running.
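The usual pattern is "check, then create": look up the desired child resource first and only create it if it is missing. The sketch below uses an in-memory `FakeAPI` as a stand-in for the Kubernetes API; both it and `ensure_scheduler` are hypothetical names for illustration.

```python
from typing import Dict, Optional


class FakeAPI:
    """In-memory stand-in for the Kubernetes API, for illustration only."""

    def __init__(self) -> None:
        self.objects: Dict[str, dict] = {}
        self.create_calls = 0

    def get(self, name: str) -> Optional[dict]:
        return self.objects.get(name)

    def create(self, name: str, spec: dict) -> None:
        if name in self.objects:
            raise RuntimeError(f"{name} already exists")
        self.create_calls += 1
        self.objects[name] = spec


def ensure_scheduler(api: FakeAPI, cluster_name: str, spec: dict) -> str:
    """Create the scheduler resource only if it does not already exist."""
    name = f"{cluster_name}-scheduler"
    if api.get(name) is None:
        api.create(name, dict(spec))
    # Calling this again after a controller restart is a no-op.
    return name
```

If the handler is replayed after a restart, the second call finds the existing resource and does nothing instead of failing with an "already exists" error.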

Tasks

  1. bug operator
    Matt711

Hygiene/Tech Debt

Migrate Kubernetes client library to kr8s

Today we use pykube-ng, dask_kubernetes.aiopykube, kubernetes_asyncio and subprocess/kubectl to interact with the Kubernetes API. We should consolidate everything around kr8s, which was spun out from here with the intention of unifying our API usage.

Tasks

  1. 18 of 18
    jacobtomlinson

Other

Tasks

  1. jacobtomlinson
  2. jacobtomlinson
@jacobtomlinson added the "tool/dask-kubernetes" label (Uses the Dask Kubernetes classic cluster manager) Apr 14, 2023
@jacobtomlinson changed the title from "Dask Kubernetes v2 (Stability)" to "Dask Kubernetes v2 (Stability) Release" Apr 14, 2023
@tasansal

tasansal commented Jun 7, 2023

@jacobtomlinson would it make sense to add this to the list as well?

dask/dask-kubernetes#605

@skirui-source removed their assignment Jun 7, 2023
@jacobtomlinson
Member Author

@tasansal We have #603 on the list, which that PR closes 😊

@jacobtomlinson
Member Author

I'm going to close this epic out as done. Not all tasks here are complete, but they will be prioritised as part of future work.
