
Improve failure domains #10476

Open
fabriziopandini opened this issue Apr 22, 2024 · 3 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. kind/proposal Issues or PRs related to proposals. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@fabriziopandini
Member

fabriziopandini commented Apr 22, 2024

Grouping a couple of issues/ideas about failure domains which are not getting attention from the community.

To address this issue we need a proposal that looks into how to handle operations for failure domains (going beyond the initial placement of machines that is currently supported).

#4031

Currently failure domains are assumed to always be available, so during an outage/issue with an AZ a KCP machine would still be created there. The short-term solution is to remove the AZ from the status, but this might be confusing, as someone would see an AZ missing from the list for no apparent reason. As this is a breaking change, we'll likely want to defer it to v1alpha4.

#5667

As a user/operator of a non-cloud-provider cluster (e.g. bare metal), I would like CAPI to label Nodes with the well-known label that corresponds to the failure domain that was selected by CAPI (see the example Node labels after these user stories).

As a user/operator I would like to have more control over how CAPI balances my control-plane and worker nodes across failure domains. For example, one of my failure domains has fewer infra resources than the others; equal distribution, as is done today, would not work well for me.

As an operator who uses (or wants to use) cluster-autoscaler, I want CAPI failure domains and cluster-autoscaler to play nicely together.
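
For reference, the well-known Kubernetes topology labels in question are topology.kubernetes.io/zone and topology.kubernetes.io/region. A minimal sketch of a Node labelled with the failure domain CAPI selected (the Node name and zone/region values are illustrative):

apiVersion: v1
kind: Node
metadata:
  name: worker-az1-abc123    # illustrative name
  labels:
    topology.kubernetes.io/region: region-a
    topology.kubernetes.io/zone: az-1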

#7417

Define how the system reacts to failure domain changes; this is a separate problem, but it kind of builds on being able to identify that a failure domain has changed, so IMO the first point should be addressed first.

/kind feature
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 22, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

CAPI contributors will take a look as soon as possible, apply one of the triage/* labels and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 22, 2024
@fabriziopandini fabriziopandini added priority/backlog Higher priority than priority/awaiting-more-evidence. kind/proposal Issues or PRs related to proposals. labels Apr 22, 2024
@mdbooth
Contributor

mdbooth commented Apr 30, 2024

I have been giving this some thought recently specifically in the context of CAPO, but also with a view to how it could be implemented more generally. The two principal problems we have with the current implementation are:

In OpenStack specifically, a 'failure domain' can in practice be an arbitrarily complex set of configuration spanning separate settings for at least compute, storage, and network. In order to use MachineSpec.FailureDomain we would effectively have to make this a reference to some other data structure. This dramatically increases complexity for both developers and users.

As failure domains are arbitrarily complex configurations, they can change over time. There is currently no component which can recognise that a machine is no longer compliant with its failure domain and perform some remediation.

In OpenShift we have the Control Plane Machine Set operator (CPMS). This works well for us, but that is because, being in OpenShift, it can take a number of liberties which are unlikely to be acceptable in CAPI; specifically, the following are baked directly into the controller:

However, this is the extent of the provider-specific code in CPMS. It's quite a simple interface.

I had an idea that we might be able to borrow ideas from CPMS and the kube scheduler to implement something relatively simple but very flexible. What follows is very rough. It's intended for discussion rather than as a concrete design proposal.

The high-level overview is that we would add a FailureDomainPolicyRef to MachineSpec. If a Machine has a FailureDomainPolicyRef, the Machine controller will not create an InfrastructureMachine until the MachineSpec also has a FailureDomainRef.

A user might create:

MachineTemplate:

spec:
  template:
    spec:
      ...
      failureDomainPolicyRef:
        apiVersion: ...
        kind: DefaultCAPIFailureDomainPolicy
        name: MyClusterControlPlane

DefaultCAPIFailureDomainPolicy:

metadata:
  name: MyClusterControlPlane
spec:
  spreadPolicy: Whatever
  failureDomains:
    apiVersion: ...
    kind: OpenStackFailureDomain
    names:
    - AZ1
    - AZ2
    - AZ3

OpenStackFailureDomain:

metadata:
  name: AZ1
spec:
  computeAZ: Foo
  storageAZ: Bar
  networkAZ: Baz

If OpenStackFailureDomain is immutable, it can only be 'changed' by creating a new one and updating the failure domain policy.

The failure domain policy controller would watch Machines with a failureDomainPolicyRef. It would assign a failureDomain from the list according to the configured policy. It also has the opportunity to notice that a set of Machines is no longer compliant with the policy and remediate by deleting machines so new, compliant machines can replace them.
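
To make that concrete (a rough sketch following the hypothetical field names above, not a settled API), a control-plane Machine that the policy controller has already processed might look something like:

Machine (excerpt):

apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
spec:
  ...
  failureDomainPolicyRef:
    apiVersion: ...
    kind: DefaultCAPIFailureDomainPolicy
    name: MyClusterControlPlane
  failureDomainRef:          # hypothetical field, set by the policy controller
    apiVersion: ...
    kind: OpenStackFailureDomain
    name: AZ2

Only once failureDomainRef has been populated would the Machine controller go ahead and create the InfrastructureMachine.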

Because the failure domain is now a reference to a provider-specific CRD, the infrastructure machine controller can take provider-specific actions to apply the failure domain to an infrastructure machine.

For users who don't need this complexity, the infrastructure cluster controller could create a default policy, much the way it does now, which could be applied to a KCP machine template.

A design like this in the MachineSpec would also have the advantage that it could be used without modification for any set of machines. So, for example, users who want to spread a set of workers in an MD across 2 FDs would be able to do that.
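
For example (again just a sketch using the hypothetical failureDomainPolicyRef field), a MachineDeployment spreading its workers across two failure domains might carry the reference in its template:

MachineDeployment (template excerpt):

spec:
  template:
    spec:
      ...
      failureDomainPolicyRef:    # hypothetical field from the sketch above
        apiVersion: ...
        kind: DefaultCAPIFailureDomainPolicy
        name: MyClusterWorkers

where MyClusterWorkers is a policy whose failureDomains list names only the two OpenStackFailureDomains the workers should be spread across.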

@JoelSpeed
Contributor

I believe something like this would also be effective for vSphere, where failure domains are also complex, as one Kubernetes cluster could in theory span multiple vSphere clusters. Not sure exactly how this is handled in CAPV today.
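
To illustrate the parallel (purely a sketch; these field names are illustrative and not CAPV's actual API), a vSphere analogue of the OpenStackFailureDomain above might bundle the per-domain pieces in the same way:

VSphereFailureDomain (illustrative):

metadata:
  name: zone-1
spec:
  computeCluster: cluster-1    # illustrative fields, not CAPV's real schema
  datastore: datastore-1
  network: network-1

i.e. the same pattern of a small provider-specific CRD capturing an arbitrarily complex per-failure-domain configuration.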
