simulator: synchronically replicate cluster state from a real cluster to a fake one for the scheduler testing #327

Open
sanposhiho opened this issue Dec 10, 2023 · 3 comments · May be fixed by #335
Labels
  • area/simulator: Issues or PRs related to the simulator.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • priority/next-release: Issues or PRs related to features should be implemented in time for the next release.

Comments

@sanposhiho
Member

sanposhiho commented Dec 10, 2023

/assign
/kind feature


This issue proposes a simple new component that continuously replicates state from a production cluster to a fake cluster.

Background

Testing the scheduler is a complex challenge. There are countless patterns of operations executed within a cluster, making it impractical to anticipate every scenario with a finite number of tests. More often than not, bugs are discovered only when the scheduler is deployed in an actual cluster.

Having a development or sandbox environment for testing the scheduler (or, indeed, any Kubernetes controller) is a common practice. However, this approach falls short of capturing all the potential scenarios that might arise in a production cluster. A development cluster never sees exactly the same use or exhibits the same behavior as its production counterpart, with notable differences in workload sizes and scaling dynamics.

User story

We have a custom scheduler which has a co-scheduling feature.
We want to test it in a cluster that receives resources similar to those in our production cluster. However, our production cluster is much bigger than our development cluster, so it's unrealistic to expect to catch every bug in the development cluster.

Resources to sync

We shouldn't simply sync everything; we have to decide what to sync and what not to.

All resources involved in scheduling should be synced.
We should also make the set of synced resources configurable, since anyone may have a custom scheduler plugin that schedules Pods based on arbitrary resources.

By default, we should sync:

  • Pods
  • Nodes
  • PersistentVolumes (PVs)
  • PersistentVolumeClaims (PVCs)
  • StorageClasses (SCs)
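
As a rough illustration of the configurable resource set described above, here is a minimal Python sketch. The names `DEFAULT_RESOURCES` and `resources_to_sync` are hypothetical and not part of the simulator. The default entries are the Kubernetes plural resource names for the five kinds in the list, and the `extra`/`exclude` parameters are just one assumed way to expose the configuration.

```python
# Hypothetical sketch: names below are illustrative, not the simulator's API.

# Kubernetes plural resource names for the default kinds listed above.
DEFAULT_RESOURCES = (
    "pods",
    "nodes",
    "persistentvolumes",
    "persistentvolumeclaims",
    "storageclasses",
)


def resources_to_sync(extra=(), exclude=()):
    """Return the resource kinds to replicate: defaults first, no duplicates."""
    excluded = set(exclude)
    selected = []
    for name in (*DEFAULT_RESOURCES, *extra):
        if name not in excluded and name not in selected:
            selected.append(name)
    return selected
```

For example, a user whose plugin also reads PriorityClasses might call `resources_to_sync(extra=["priorityclasses"])`, while someone who does not care about storage might pass `exclude=["storageclasses"]`.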

Scheduled Pods

We cannot simply sync all changes to Pods, because the real cluster has its own scheduler, which schedules all Pods in that cluster.
If we synced all changes to Pods, the real scheduler's scheduling results would also be synced, and they might conflict with the decisions of the other scheduler running in the fake cluster.

So we don't sync any update events for scheduled Pods.
Pods are synced like this:

  1. In the real cluster, Pod-a is created.
  2. In the fake cluster, Pod-a is created. (synced)
  3. In the real cluster, the scheduler schedules Pod-a to Node-a. We don't copy this change to the fake cluster.
  4. In the fake cluster, the scheduler, which is a different one from the scheduler in (3), schedules Pod-a to Node-x.

This means that the scheduling results may differ between the real cluster and the fake cluster, but that's OK.
Our purpose is to create a fake cluster for testing scheduling that receives the same load as the production cluster.
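
The Pod sync rule above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function names and the dict-based Pod shape are hypothetical, a Pod is treated as scheduled once `spec.nodeName` is set, and event types use the Kubernetes watch names (`ADDED`, `MODIFIED`, `DELETED`). The proposal does not say how deletions or already-scheduled Pods seen for the first time should be handled, so the sketch simply syncs them.

```python
# Hypothetical sketch: function names and the dict-based Pod shape are
# assumptions for illustration, not the simulator's implementation.

def is_scheduled(pod):
    """A Pod counts as scheduled once spec.nodeName is set."""
    return bool(pod.get("spec", {}).get("nodeName"))


def should_sync(resource, event_type, obj):
    """Decide whether an event from the real cluster is copied to the fake one.

    event_type uses the Kubernetes watch event names: ADDED, MODIFIED, DELETED.
    """
    if resource == "pods" and event_type == "MODIFIED" and is_scheduled(obj):
        # Updates to scheduled Pods carry the real scheduler's decision.
        return False
    # Assumption: everything else, including deletions, is synced.
    return True


def replay(events):
    """Apply the filtered Pod events to an in-memory stand-in for the fake cluster."""
    fake_pods = {}
    for event_type, pod in events:
        if not should_sync("pods", event_type, pod):
            continue
        name = pod["metadata"]["name"]
        if event_type == "DELETED":
            fake_pods.pop(name, None)
        else:
            fake_pods[name] = pod
    return fake_pods
```

Replaying steps 1 and 3 from the list (an `ADDED` event for Pod-a, then a `MODIFIED` event that sets `nodeName` to Node-a) leaves Pod-a in the fake cluster without a node assignment, so the fake cluster's own scheduler is free to place it.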

@k8s-ci-robot added the kind/feature label Dec 10, 2023
@utam0k
Member

utam0k commented Dec 13, 2023

I'm also interested in this feature.

@sanposhiho
Member Author

/retitle simulator: synchronically replicate cluster state from a real cluster to a fake one for the scheduler testing
/area simulator

On second thought, I'll build this into the simulator.

@k8s-ci-robot changed the title from "mimicube: the tool to synchronically replicate cluster state from a real cluster to a fake one for the scheduler testing" to "simulator: synchronically replicate cluster state from a real cluster to a fake one for the scheduler testing" Jan 7, 2024
@k8s-ci-robot added the area/simulator label Jan 7, 2024
@sanposhiho linked a pull request Feb 18, 2024 that will close this issue
@sanposhiho
Member Author

/priority next-release

@k8s-ci-robot added the priority/next-release label Mar 31, 2024