
nodeReuse and automatedCleaningMode implementations: support for removal of a Metal3MachineTemplate #1319

Open
tmmorin opened this issue Nov 17, 2023 · 23 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. triage/accepted Indicates an issue is ready to be actively worked on.

Comments

@tmmorin

tmmorin commented Nov 17, 2023

Context: the controller needs the Metal3MachineTemplate to determine nodeReuse on Metal3Machine deletion

Today, when a Metal3Machine is deleted, the Metal3Machine controller tries to fetch the Metal3MachineTemplate that was used to create it and, if it finds it, uses it to determine whether node reuse is desired (reading Metal3MachineTemplate.spec.nodeReuse).

if m.hasTemplateAnnotation() {
    m3mtKey := client.ObjectKey{
        Name:      m.Metal3Machine.ObjectMeta.GetAnnotations()[clusterv1.TemplateClonedFromNameAnnotation],
        Namespace: m.Metal3Machine.Namespace,
    }
    if err := m.client.Get(ctx, m3mtKey, m3mt); err != nil {
        // we are here, because while normal deprovisioning, Metal3MachineTemplate will be deleted first
        // and we can't get it even though Metal3Machine has reference to it. We consider it nil and move
        // forward with normal deprovisioning.
        m3mt = nil
        m.Log.Info("Metal3MachineTemplate associated with Metal3Machine is deleted")
    } else {
        // in case of upgrading, Metal3MachineTemplate will not be deleted and we can fetch it,
        // in order to check for node reuse feature in the next step.
        m.Log.Info("Found Metal3machineTemplate", "metal3machineTemplate", m3mtKey.Name)
    }
}
if m3mt != nil {
    if m3mt.Spec.NodeReuse {

Implications on how cluster manifests are managed

If the tool used to manage Metal3Machine and Metal3MachineTemplate resources does garbage collection, it is possible that it would remove a Metal3MachineTemplate before the Metal3Machine using it is removed. The result is that nodeReuse can't reliably be used.

We've hit this issue in the context of the Sylva project, where we use a Helm chart to produce both Metal3Machine and Metal3MachineTemplate resources (along with other CAPI resources), and we use dynamic names for Metal3MachineTemplates to compensate for their immutability and ensure that a rolling upgrade is triggered when the content of the template changes. What happens is that, on a change of the template content, the old Metal3MachineTemplate is removed prior to the rolling upgrade, so by the time a Metal3Machine is deleted the template no longer exists, and nodeReuse is considered disabled.

Discussion

We can work around this by disabling automatic garbage collection and adding our own GC tool, aware of this use of Metal3MachineTemplates by Metal3Machines, that does smarter garbage collection after a rolling upgrade.

This is however not elegant, requires additional tooling, and Metal3 users need to know about this issue to be able to avoid it.

Would it be possible to improve the controller behavior to avoid that?

In particular:

  • could finalizers or ownerRefs be used to block the deletion of Metal3MachineTemplates until no Metal3Machine remains that refers to it via a cluster.x-k8s.io/cloned-from-name annotation?
  • if not, would it be possible to have the controller fall back on a "nodeReuse" annotation set directly on the Metal3Machine at creation time if it can't find the Metal3MachineTemplate? (see the sketch after this list)
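
To illustrate the second option, here is a minimal sketch of such a fallback, assuming a hypothetical metal3.io/node-reuse annotation set on the Metal3Machine by whatever tool creates it (the annotation name and the helper are illustrative only, not existing CAPM3 API):

package main

import (
    "fmt"
    "strconv"
)

// nodeReuseAnnotation is a hypothetical annotation that a tool could set on the
// Metal3Machine at creation time to record the template's nodeReuse intent.
const nodeReuseAnnotation = "metal3.io/node-reuse"

// nodeReuseFromAnnotations is the fallback the controller could use when the
// Metal3MachineTemplate can no longer be fetched: read the intent directly from
// the Metal3Machine annotations instead of from the template.
func nodeReuseFromAnnotations(annotations map[string]string) bool {
    v, ok := annotations[nodeReuseAnnotation]
    if !ok {
        return false
    }
    reuse, err := strconv.ParseBool(v)
    // An unparsable value falls back to "no node reuse", today's default.
    return err == nil && reuse
}

func main() {
    m3mAnnotations := map[string]string{nodeReuseAnnotation: "true"}
    fmt.Println(nodeReuseFromAnnotations(m3mAnnotations)) // true
}

The higher-level tooling (Helm chart, GitOps, ...) would set the annotation alongside the cloned-from annotations, so the intent survives the deletion of the template.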

Same for automatedCleaningMode

(Added on 2023-11-28)

During the discussion around the initial description of this issue, it came up that the same design question, and possible code evolution, applies to automatedCleaningMode.

@metal3-io-bot metal3-io-bot added the needs-triage Indicates an issue lacks a `triage/foo` label and requires one. label Nov 17, 2023
@tmmorin
Author

tmmorin commented Nov 17, 2023

/cc @hardys

@hardys
Member

hardys commented Nov 17, 2023

/cc @furkatgofurov7

@furkatgofurov7
Member

furkatgofurov7 commented Nov 22, 2023

If the tool used to manage Metal3Machine and Metal3MachineTemplate resources does garbage collection, it is possible that it would remove a Metal3MachineTemplate before the Metal3Machine using it is removed. The result is that nodeReuse can't reliably be used.

I am unsure what tool is being used, but this is a breaking point of the node-reuse workflow in your use case. The feature was implemented as part of a "bigger" feature where we have the following use case:

  • We have a node whose data we want to keep available on the disk without being cleaned, and later reuse the same node, since there is no point in keeping the disk of the node intact and not reusing it. So, in that sense, automated-cleaning and node-reuse are used as a bundle.
    The node-reuse feature is heavily dependent on the scale-in feature of control-plane/MD, where machines are removed one by one before new ones are created (delete-create method), so that the machine controller has a chance to set the label on the host and filter on it in the next provisioning.

Are you using the scale-in feature by any chance, and would that help?

@tmmorin
Author

tmmorin commented Nov 22, 2023

If the tool used to manage Metal3Machine and Metal3MachineTemplate resources does garbage collection, it is possible that it would remove a Metal3MachineTemplate before the Metal3Machine using it is removed. The result is that nodeReuse can't reliably be used.

I am unsure what tool is being used, [...]

In our case, a Helm chart defines the CAPI/Metal3 manifests (https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-capi-cluster), and it is instantiated by a FluxCD HelmRelease.

But my understanding is that the issue discussed here is quite generic: any tool doing garbage collection would have the same issue (e.g. kustomize used with a GitOps tool like FluxCD or Argo, or a CRD/operator generating CAPI/Metal3 manifests for a cluster, etc.). Also, I would tend to think that any tool not doing garbage collection on unused resource templates would be problematic: garbage collection is needed to avoid leaking resources in the long run.

but this is a breaking point of the node-reuse workflow in your use case. The feature was implemented as part of a "bigger" feature where we have the following use case:

  • We have a node whose data we want to keep available on the disk without being cleaned, and later reuse the same node, since there is no point in keeping the disk of the node intact and not reusing it. So, in that sense, automated-cleaning and node-reuse are used as a bundle.

Yes, this is how we intend to use nodeReuse: along with disabling automated cleaning. The use case is explicitly to preserve PV data.

  • The node-reuse feature is heavily dependent on the scale-in feature of control-plane/MD, where machines are removed one by one before new ones are created (delete-create method), so that the machine controller has a chance to set the label on the host and filter on it in the next provisioning.

Are you using the scale-in feature by any chance, and would that help?

Well, yes, the intent definitely is to use maxSurge: 0 for this purpose (we worked with RKE2 bootstrap provider devs @richardcase @alexander-demicev to have this control plane support maxSurge).

Using maxSurge: 0 does help, yes, in the sense that it is necessary for nodeReuse to make any sense, but it does not address the problem raised in this issue: we need a way to preserve the Metal3MachineTemplates for nodeReuse to be effective.

@Rozzii
Member

Rozzii commented Nov 22, 2023

/triage accepted
/help

@metal3-io-bot
Contributor

@Rozzii:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/triage accepted
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@metal3-io-bot metal3-io-bot added triage/accepted Indicates an issue is ready to be actively worked on. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed needs-triage Indicates an issue lacks a `triage/foo` label and requires one. labels Nov 22, 2023
@Rozzii
Member

Rozzii commented Nov 22, 2023

@lentzi90 @kashifest @zaneb

@lentzi90
Member

could finalizers or ownerRefs be used to block the deletion of Metal3MachineTemplates until no Metal3Machine remains that refers to it via a cluster.x-k8s.io/cloned-from-name annotation?

I think this could be doable (finalizers, not ownerRefs). However, I have a few things to note first.

  1. The infrastructure provider (CAPM3) is not normally interacting with the InfraMachineTemplate (M3MT). That CAPM3 checks the nodeReuse from it is an exception. Normally it is just CAPI that reads the InfraMachineTemplate and creates InfraMachines from it. This indicates to me that we need to consider carefully before we introduce more "unusual" behavior. Perhaps we should also consider removing the check we have for nodeReuse.
  2. You talk about GC. That would to me indicate that the templates are no longer used, but in fact they still are, so this seems like the wrong terminology. I understand the issue though: you want to delete the old template at the same time that you create the new one. In the "manual" case I would expect users to keep the old M3MT for at least one cycle so that it is easy to roll back in case of issues. I.e. if we start with m3mt-1 and roll to m3mt-2, I would expect m3mt-1 to be left until I start rolling to m3mt-3. Would this be possible? If not, how do you handle rollbacks? What if the rollout gets stuck?
  3. Have you considered using a ClusterClass? I think this could help with the issue but we need to implement it first (Implement the ClusterClass API #1267). It could also introduce more issues though... With a ClusterClass you would also need to switch from one version to the next and I would always expect to have the previous ClusterClass available at least during the transition, since I would not want to switch all clusters at the same time. In other words, this issue seems generic and I would not expect to be able to solve it completely in CAPM3.
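
Regarding the finalizer idea quoted above, here is a rough sketch of what the check gating finalizer removal could look like, assuming a controller-runtime client and the CAPM3 v1beta1 API types (the function and its wiring into a Metal3MachineTemplate reconciler are illustrative, not existing CAPM3 code):

// Sketch only: a finalizer on the Metal3MachineTemplate would only be removed
// once no Metal3Machine in the same namespace still references the template
// through the cluster.x-k8s.io/cloned-from-name annotation.
package sketch

import (
    "context"

    clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
    "sigs.k8s.io/controller-runtime/pkg/client"

    infrav1 "github.com/metal3-io/cluster-api-provider-metal3/api/v1beta1"
)

// templateStillReferenced reports whether any Metal3Machine was cloned from the
// given Metal3MachineTemplate and still exists.
func templateStillReferenced(ctx context.Context, c client.Client, m3mt *infrav1.Metal3MachineTemplate) (bool, error) {
    m3mList := &infrav1.Metal3MachineList{}
    if err := c.List(ctx, m3mList, client.InNamespace(m3mt.Namespace)); err != nil {
        return false, err
    }
    for _, m3m := range m3mList.Items {
        if m3m.GetAnnotations()[clusterv1.TemplateClonedFromNameAnnotation] == m3mt.Name {
            // Keep the finalizer: nodeReuse still needs this template on deletion.
            return true, nil
        }
    }
    // Safe to remove the finalizer and let the deletion proceed.
    return false, nil
}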

@tmmorin
Author

tmmorin commented Nov 23, 2023

could finalizers or ownerRefs be used to block the deletion of Metal3MachineTemplates until no Metal3Machine remains that refers to it via a cluster.x-k8s.io/cloned-from-name annotation?

I think this could be doable (finalizers, not ownerRefs).

👍

However, I have a few things to note first.

1. The infrastructure provider (CAPM3) is not normally interacting with the InfraMachineTemplate (M3MT). That CAPM3 checks the nodeReuse from it is an exception. Normally it is just CAPI that reads the InfraMachineTemplate and creates InfraMachines from it. This indicates to me that we need to consider carefully before we introduce more "unusual" behavior. Perhaps we should also consider removing the check we have for nodeReuse.

I had the same reaction.

I'm actually wondering if a solution would not be to ensure that Metal3Machine would keep track of the intention of having nodeReuse expressed in the template, for instance by having a spec.nodeReuse field (?).
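
For illustration, a minimal sketch of what such a field could look like on the Metal3Machine API type (this is hypothetical, not an existing CAPM3 field):

// Hypothetical sketch: copy the template's intent onto the Metal3Machine at
// clone time, so that deletion no longer needs the Metal3MachineTemplate.
package v1beta1

type Metal3MachineSpec struct {
    // ... existing fields ...

    // NodeReuse records whether the BareMetalHost backing this machine should
    // be kept and reused for the next provisioning. It would be copied from
    // Metal3MachineTemplate.spec.nodeReuse when the machine is created.
    // +optional
    NodeReuse bool `json:"nodeReuse,omitempty"`
}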

2. You talk about GC. That would to me indicate that the templates are no longer used, but in fact they still are, so this seems like the wrong terminology. I understand the issue though: you want to delete the old template at the same time that you create the new one. In the "manual" case I would expect users to keep the old M3MT for at least one cycle so that it is easy to roll back in case of issues. I.e. if we start with m3mt-1 and roll to m3mt-2, I would expect m3mt-1 to be left until I start rolling to m3mt-3.

Yes, this is the idea (or until m3mt-1 is not used by any Machine).

My use of the term "garbage collection" was only about deleting m3mt-1 once it stops being used.

Would this be possible? If not, how do you handle rollbacks? What if the rollout gets stuck?

I would definitely not propose to delete m3mt-1 before the rollout with m3mt-2 is done.

3. Have you considered using a ClusterClass? I think this could help with the issue but we need to implement it first ([Implement the ClusterClass API #1267](https://github.com/metal3-io/cluster-api-provider-metal3/issues/1267)). It could also introduce more issues though... With a ClusterClass you would also need to switch from one version to the next and I would always expect to have the previous ClusterClass available at least during the transition, since I would not want to switch all clusters at the same time. In other words, this issue seems generic and I would not expect to be able to solve it completely in CAPM3.

I guess this would be one option.

The tooling we built for Sylva was designed with the short-term need to fully support baremetal scenarios with Metal3 in mind, the fact that no work had been started on ClusterClass support in Metal3, and the feeling that the kind of manifest customization ClusterClass offers would require us to write runtime extensions for many, many things. We thus opted not to try ClusterClass.

@lentzi90
Member

I'm actually wondering if a solution would not be to ensure that Metal3Machine would keep track of the intention of having nodeReuse expressed in the template, for instance by having a spec.nodeReuse field (?).

This sounds worth investigating. It would be good to hear from those that designed the current API if this was considered already, and if it was, why it was not implemented this way.

I would definitely not propose to delete m3mt-1 before the rollout with m3mt-2 is done.

I'm confused. I thought this is what you already do and why you created the issue? If you can avoid doing this then what is the issue?

@tmmorin
Author

tmmorin commented Nov 27, 2023

I would definitely not propose to delete m3mt-1 before the rollout with m3mt-2 is done.

I'm confused. I thought this is what you already do and why you created the issue? If you can avoid doing this then what is the issue?

Sorry for being confusing: I mixed what I would like (to be able to delete m3mt-1) with what we are constrained to do today (we have to preserve it), and some confusion comes from the distinction between "an API request is made to delete the object" and "the object is actually removed" (all finalizers are cleared).

Our goal would be the following: on an upgrade, create m3mt-2 from which new m3m will be built, and delete m3mt-1 (*). We're assuming here that we have a tool (for instance a GitOps tool, or Helm) that is able to produce the earlier versions of all resources, and if a rollback is wanted, it will be handled by that tool, which will produce a Metal3MachineTemplate with the same spec as m3mt-1.

(*): what does "delete m3mt-1" do? That would depend on how Metal3 evolves to allow it...

  • one possibility is that an API deletion is done on the m3mt object, and Metal3 has a finalizer to prevent actual deletion until no m3m refers to the m3mt
  • another possibility is that an API deletion is done on the m3mt object without any finalizer being set, assuming the Metal3 code does not need to look at the m3mt on m3m deletion

Sorry again for making this more complex than it has to be ;)

@lentzi90
Member

Makes sense! Thanks for explaining!
I wonder if it would be worth creating an issue in CAPI regarding the finalizer. 🤔 That would be the most logical place to handle it.
(Adding the nodeReuse field to Metal3Machines is still worth considering here)

@furkatgofurov7
Member

furkatgofurov7 commented Nov 27, 2023

Yes, this is how we intend to use nodeReuse: along with disabling automated cleaning. The use case is explicitly to preserve PV data.

Interesting, how are you using the disabling of the automated cleaning feature then? AFAICT it also uses the M3MT to sync the value to the M3M.

The infrastructure provider (CAPM3) is not normally interacting with the InfraMachineTemplate (M3MT). That CAPM3 checks the nodeReuse from it is an exception. Normally it is just CAPI that reads the InfraMachineTemplate and creates InfraMachines from it. This indicates to me that we need to consider carefully before we introduce more "unusual" behavior. Perhaps we should also consider removing the check we have for nodeReuse.

I don't agree with that, then what is the use of

// don't synchronize AutomatedCleaningMode between metal3MachineTemplate
// and metal3Machine if unset in metal3MachineTemplate.
if m.Metal3MachineTemplate.Spec.Template.Spec.AutomatedCleaningMode != nil {
    m3m.Spec.AutomatedCleaningMode = m.Metal3MachineTemplate.Spec.Template.Spec.AutomatedCleaningMode
    if err := m.client.Update(ctx, m3m); err != nil {
        return errors.Wrapf(err, "failed to update metal3Machine: %s", m3m.Name)
    }
    m.Log.Info("Synchronized automatedCleaningMode between ", "Metal3MachineTemplate", fmt.Sprintf("%v/%v", m.Metal3MachineTemplate.Namespace, m.Metal3MachineTemplate.Name), "Metal3Machine", fmt.Sprintf("%v/%v", m3m.Namespace, m3m.Name))
}

in the CAPM3 code? I mean, we already have that unusualness, as you describe it, in other places in the code base.

I'm actually wondering if a solution would not be to ensure that Metal3Machine would keep track of the intention of having nodeReuse expressed in the template, for instance by having a spec.nodeReuse field (?).

This sounds worth investigating. It would be good to hear from those that designed the current API if this was considered already, and if it was, why it was not implemented this way.

I don't recall the exact specifics, but I believe the M3MT was considered the source of truth since it is the object that is created first and cloned to create the M3M afterward, meaning the intention was to have the user's intent (the nodeReuse field) set right at the higher-level object.

@lentzi90
Member

The infrastructure provider (CAPM3) is not normally interacting with the InfraMachineTemplate (M3MT). That CAPM3 checks the nodeReuse from it is an exception. Normally it is just CAPI that reads the InfraMachineTemplate and creates InfraMachines from it. This indicates to me that we need to consider carefully before we introduce more "unusual" behavior. Perhaps we should also consider removing the check we have for nodeReuse.

I don't agree with that, then what is the use of

// don't synchronize AutomatedCleaningMode between metal3MachineTemplate
// and metal3Machine if unset in metal3MachineTemplate.
if m.Metal3MachineTemplate.Spec.Template.Spec.AutomatedCleaningMode != nil {
    m3m.Spec.AutomatedCleaningMode = m.Metal3MachineTemplate.Spec.Template.Spec.AutomatedCleaningMode
    if err := m.client.Update(ctx, m3m); err != nil {
        return errors.Wrapf(err, "failed to update metal3Machine: %s", m3m.Name)
    }
    m.Log.Info("Synchronized automatedCleaningMode between ", "Metal3MachineTemplate", fmt.Sprintf("%v/%v", m.Metal3MachineTemplate.Namespace, m.Metal3MachineTemplate.Name), "Metal3Machine", fmt.Sprintf("%v/%v", m3m.Namespace, m3m.Name))
}

in the CAPM3 code? I mean, we already have that unusualness, as you describe it, in other places in the code base.

To be honest, to me this feels like abusing the API. We are treating the M3MT more like a Deployment, which it is not intended to be. Good that you brought it up though! I think we should consider carefully if this is something that should be changed in the future.

I'm actually wondering if a solution would not be to ensure that Metal3Machine would keep track of the intention of having nodeReuse expressed in the template, for instance by having a spec.nodeReuse field (?).

This sounds worth investigating. It would be good to hear from those that designed the current API if this was considered already, and if it was, why it was not implemented this way.

I don't recall the exact specifics, but I believe the M3MT was considered the source of truth since it is the object that is created first and cloned to create the M3M afterward, meaning the intention was to have the user's intent (the nodeReuse field) set right at the higher-level object.

Still sounds strange to me since the "normal flow" would anyway take the source of truth (M3MT) and use it when creating the M3M. The only reason to go back to the template would be if the value changes on either the M3M or M3MT. These kinds of changes are generally not allowed though (instead a new M3MT would be needed), so it is strange that we have implemented it this way.

@tmmorin
Author

tmmorin commented Nov 27, 2023

Yes, this is how we intend to use nodeReuse: along with disabling automated cleaning. The use case is explicitly to preserve PV data.

Interesting, how are you using the disabling of the automated cleaning feature then? AFAICT it also uses the M3MT to sync the value to the M3M.

Well, we are only at a stage where we want to ensure that all the pieces are aligned to allow using nodeReuse and disabling automated cleaning. I hadn't identified that automatedCleaningMode was currently implemented with the same reliance on the M3MT. If so, then it's worth retitling this issue.

(For automated cleaning, my understanding is that we can also benefit from it by setting it directly in the BMH resource, although this comes with some implications.)

@furkatgofurov7
Member

Still sounds strange to me since the "normal flow" would anyway take the source of truth (M3MT) and use it when creating the M3M. The only reason to go back to the template would be if the value changes on either the M3M or M3MT.

That was the design decision agreed on by the community in the past, and of course it might not always be ideal, since it is hard to consider all use cases in advance; node reuse had to take a lot of them into account due to upgrade scenarios.

These kinds of changes are generally not allowed though (instead a new M3MT would be needed), so it is strange that we have implemented it this way.

Strange, but why? That is the case with all metal3-objectX-template objects in the project, which we patch in some way or another.

@furkatgofurov7
Member

(For automated cleaning, my understanding is that we can also benefit from it by setting it directly in the BMH resource, although this comes with some implications.)

Yes, that is right. For that feature, you can directly set it in the BMH and it should take precedence over other options.
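
For example, a minimal sketch of setting it directly on the BareMetalHost with a controller-runtime client, assuming the baremetal-operator v1alpha1 API (the wiring is illustrative; a kubectl patch of the BMH spec achieves the same):

// Sketch only: disable cleaning directly on the BareMetalHost, which takes
// precedence over the value synced from the Metal3MachineTemplate.
package sketch

import (
    "context"

    bmov1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

func disableAutomatedCleaning(ctx context.Context, c client.Client, name, namespace string) error {
    bmh := &bmov1alpha1.BareMetalHost{}
    if err := c.Get(ctx, client.ObjectKey{Name: name, Namespace: namespace}, bmh); err != nil {
        return err
    }
    bmh.Spec.AutomatedCleaningMode = bmov1alpha1.CleaningModeDisabled
    return c.Update(ctx, bmh)
}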

@lentzi90
Member

Still sounds strange to me since the "normal flow" would anyway take the source of truth (M3MT) and use it when creating the M3M. The only reason to go back to the template would be if the value changes on either the M3M or M3MT.

That was the design decision agreed on by the community in the past, and of course it might not always be ideal, since it is hard to consider all use cases in advance; node reuse had to take a lot of them into account due to upgrade scenarios.

I understand, and I don't suggest changing it without thinking things through. All I'm saying is that there is a risk when we do things in "weird" ways. I think it would be worth considering alternatives, especially since there were unforeseen consequences that came from this design. Perhaps we could find a much better long term implementation now that we know more 🙂

These kinds of changes are generally not allowed though (instead a new M3MT would be needed), so it is strange that we have implemented it this way.

Strange, but why? That is the case with all metal3-objectX-template objects in the project, which we patch in some way or another.

In general, changes cannot be applied without a rolling update to the Machines. This is why the general rule is that the templates are immutable and should be updated by adding a new template and rolling over to it. However, CAPI does allow exceptions, so we are not directly breaking the contract here. It does complicate things though, and makes it much harder to reason about the state of the system, which can easily lead to more bugs.

Ref: https://cluster-api.sigs.k8s.io/tasks/updating-machine-templates

@tmmorin tmmorin changed the title nodeReuse implementation, support for removal of a Metal3MachineTemplate nodeReuse and automatedCleaningMode implementations: support for removal of a Metal3MachineTemplate Nov 28, 2023
@tmmorin
Author

tmmorin commented Nov 28, 2023

In general, changes cannot be applied without a rolling update to the Machines. This is why the general rule is that the templates are immutable and should be updated by adding a new template and rolling over to it. However, CAPI does allow exceptions, so we are not directly breaking the contract here. It does complicate things though, and makes it much harder to reason about the state of the system, which can easily lead to more bugs.

Yes.

Note that https://cluster-api.sigs.k8s.io/tasks/updating-machine-templates provides a kind of framing for the exceptions to this loose rule:

Some infrastructure providers may, at their discretion, choose to support in-place modifications of certain infrastructure machine template fields. This may be useful if an infrastructure provider is able to make changes to running instances/machines, such as updating allocated memory or CPU capacity. In such cases, however, Cluster API will not trigger a rolling update.

I think the idea is that the things that are relevant to make mutable in the template are those that the infra provider can implement on existing machines without a rolling update (like memory or CPU allocation for a virtual machine).

@tmmorin
Author

tmmorin commented Nov 29, 2023

Let me push an update here.

In the context of Sylva, while working on rolling updates, we realized that:

The conclusion for us is that we're going to ensure that all these resources (Infra machine template and bootstrap provider config resource) are not deleted when we update the CAPI+providers resources.

Thorough inventory work would be required to see how the different infra providers behave, and Kubeadm... but overall, with this kind of MachineSet controller behavior in mind it seems like the wise path is to assume that XXXMachineTemplates need to be around for quite a while and maybe not try to have a "spot fix" here in Metal3 to hold off the actual m3mt deletion with a finalizer.

@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 27, 2024
@Rozzii
Member

Rozzii commented Mar 27, 2024

/remove-lifecycle stale

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 27, 2024
@Rozzii
Member

Rozzii commented May 6, 2024

This is going back and forth between stale and active, so I will just label it as frozen. Feel free to continue the discussion; when an implementation PR is pushed for the ticket, I will remove the label.
/lifecycle-frozen

@Rozzii Rozzii added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label May 6, 2024