
🌱 Allow inplace update of fields related to deletion during Machine deletion #10589

Open
wants to merge 1 commit into base: main

Conversation

davidvossel

Fixes #10588
/area machine

What this PR does / why we need it:

Machines default to nodeDrainTimeout: 0s, which blocks indefinitely if a pod can't be evicted. We can't change the nodeDrainTimeout in place from the MachineDeployment or MachineSet after a machine is marked for deletion.

This results in a machine that is wedged forever but can't be updated using the top-level objects that own the machine.

To fix this, this PR allows fields related to machine deletion to be updated in place even when the machine is marked for deletion.
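
A minimal sketch of the idea (the helper name is hypothetical; the field names follow the v1beta1 Machine API, and this is not the actual diff):

```go
package machineset

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// updateDeletionTimeouts copies only the fields that control how deletion
// proceeds. Because these fields affect nothing beyond teardown, applying
// them to a Machine that already has a deletionTimestamp is safe, and it
// lets a stuck drain be unblocked from the owning MachineSet or
// MachineDeployment.
func updateDeletionTimeouts(desired, current *clusterv1.Machine) {
	current.Spec.NodeDrainTimeout = desired.Spec.NodeDrainTimeout
	current.Spec.NodeDeletionTimeout = desired.Spec.NodeDeletionTimeout
	current.Spec.NodeVolumeDetachTimeout = desired.Spec.NodeVolumeDetachTimeout
}
```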

NOTE: I have not added unit tests for this PR yet. I want confirmation that this is an acceptable approach before investing time into testing.

@k8s-ci-robot k8s-ci-robot added area/machine Issues or PRs related to machine lifecycle management cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 10, 2024
@enxebre
Member

enxebre commented May 13, 2024

Thanks @davidvossel, the change makes sense to me. Smoother deletion is actually one of the supporting use cases for in-place propagation. Let's include some unit tests.

See related #5880 and #9285

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 13, 2024
@davidvossel
Author

> Let's include some unit tests.

@enxebre I extended the existing unit test to cover the case of updating a deleting machine.

@sbueringer sbueringer changed the title Allow inplace update of fields related to deletion during Machine deletion 🌱 Allow inplace update of fields related to deletion during Machine deletion May 14, 2024
@enxebre
Member

enxebre commented May 20, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 20, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 44936fae936d0eab3c39b86c432c24c5e199979d

```diff
@@ -362,8 +362,21 @@ func (r *Reconciler) syncMachines(ctx context.Context, machineSet *clusterv1.Mac
 log := ctrl.LoggerFrom(ctx)
```
Member


Just trying to think through various cases where Machines belonging to MachineSets are deleted

  1. MD is deleted

The following happens:

  • MD goes away
  • ownerRef triggers MS deletion
  • MS goes away
  • ownerRef triggers Machine deletion

=> The current PR doesn't help in this scenario, because the MS will already be gone when the deletionTimestamp is set on the Machines. In this case folks would have to modify the timeouts on each Machine individually.

I recently had a discussion with @vincepri about maybe changing our MD deletion flow: basically adding a finalizer on MD & MS, so that MD & MS stick around until all Machines are gone. If we did this, the MS => Machine propagation of the timeouts implemented here would help in this case as well.

  2. MD is scaled down to 0

The following happens:

  • MD scales down MS to 0
  • MS deletes Machine

=> This PR helps in this case because the timeouts are then propagated from MS to Machine

  3. MD rollout

The following happens:

  • Someone updates the MD (e.g. bump the Kubernetes version)
  • MD creates a new MS and scales it up
  • In parallel MD scales down the old MS to 0

=> In this scenario the current PR won't help, because the MD controller does not propagate the timeouts from MD to all MS (only to the new/current one, not to the old ones)

I see how this PR addresses scenario 2. I'm wondering if we want to solve this problem more holistically. (Maybe I also missed some cases, not sure.)
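
To make scenario 3 concrete, here is a rough sketch of the gap, not the actual controller code (the function and flow are hypothetical; the field paths follow the v1beta1 API):

```go
package machinedeployment

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// syncRollout illustrates why a rollout isn't covered: in-place fields such
// as nodeDrainTimeout are propagated only to the new/current MachineSet,
// while the old MachineSets are merely scaled down, so their deleting
// Machines keep whatever timeouts they were created with.
func syncRollout(md *clusterv1.MachineDeployment, newMS *clusterv1.MachineSet, oldMSs []*clusterv1.MachineSet) {
	// In-place propagation reaches the current MachineSet only.
	newMS.Spec.Template.Spec.NodeDrainTimeout = md.Spec.Template.Spec.NodeDrainTimeout

	for _, oldMS := range oldMSs {
		// Old MachineSets are scaled toward zero without re-syncing the
		// template, so the MS => Machine propagation added in this PR
		// never sees an updated timeout to apply.
		zero := int32(0)
		oldMS.Spec.Replicas = &zero
	}
}
```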

Author


Here's what's going on... the use case is subtle, but an easy one to get trapped by.

  • A MS is created with the default node drain timeout of 0s (wait forever).
  • The MS needs to scale down to zero (but not be deleted). The intent is to bring this MS back online at some point.
  • The user discovers that the default node drain timeout is blocking the scale-down to zero. The user likely only encounters this drain block the first time they scale down to zero, because during normal scale-down operations other nodes are typically available, which allows PDBs to be satisfied.

The outcome is that the user is now trapped. They can't gracefully scale the MS down to zero because the default node drain timeout can't be updated on the machines. So the user is either forced to take some manual action to tear down the machines or delete the MS.

By allowing the node drain timeout to be modified while the machines are marked for deletion, we give the user a path to unblock themselves using the top-level API (either MS or MD) rather than mutating individual machines or performing some other manual operation.
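
For illustration, a minimal sketch of that unblocking path using a controller-runtime client (the helper name and the 5-minute timeout are assumptions, not part of this PR):

```go
package example

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// unblockScaleDown sets a finite nodeDrainTimeout on the MachineSet's
// template. With this PR, the MachineSet controller propagates the new
// timeout in place to Machines that are already marked for deletion,
// letting a stuck drain time out instead of blocking forever.
func unblockScaleDown(ctx context.Context, c client.Client, key client.ObjectKey) error {
	ms := &clusterv1.MachineSet{}
	if err := c.Get(ctx, key, ms); err != nil {
		return err
	}
	patch := client.MergeFrom(ms.DeepCopy())
	ms.Spec.Template.Spec.NodeDrainTimeout = &metav1.Duration{Duration: 5 * time.Minute}
	return c.Patch(ctx, ms, patch)
}
```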

Member


Yup, got it, and it makes sense. I was just saying that there are various cases with MD+MS where a MS is scaled down to zero, and the implementation only covers one of them. But it's fine for me to address the others in separate PRs. It would probably be good to open an issue so we can track that (I can do that).

internal/controllers/machineset/machineset_controller.go (outdated review thread, resolved)
…ine deletion

Signed-off-by: David Vossel <davidvossel@gmail.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 30, 2024
@k8s-ci-robot k8s-ci-robot requested a review from enxebre May 30, 2024 14:45
@enxebre
Member

enxebre commented May 31, 2024

/lgtm
/assign @sbueringer

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 31, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: a7820897401d291e168e73cfc2ea745d5f2c8d87

@sbueringer
Member

All good from my side. I would open a follow-up issue once this PR is merged to track further work to get this behavior across all MD workflows (e.g. MD rollout, deletion).

/approve

/hold
In case someone else wants to take a look (@fabriziopandini @chrischdi @vincepri)

Otherwise let's merge in a few days

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 3, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 3, 2024
Development

Successfully merging this pull request may close these issues.

MachineSet Inplace update does not work during machine deletion