Rework OperatingSystemConfigKey and WorkerPoolHash to allow considering kubeReserved
#9699
How to categorize this issue?
/area robustness
/kind enhancement
Suggested approach for implementing the "Rolling update of the worker pool when critical kubelet configuration changed" step from #2590.
Summary
To roll worker node pools if resource reservations managed via `kubeReserved` change, it becomes necessary to version the calculation of the OperatingSystemConfig key and also the `WorkerPoolHash`. This ensures that worker pools are only rolled when actually changing `kubeReserved`, and not unnecessarily once `kubeReserved` starts to be considered for node rolls.

Motivation
Changes of `kubeReserved` for existing clusters currently happen in-place. They are applied by restarting the kubelet on each node with the new resource reservations. This can cause immediate preemptions on already loaded nodes. In particular, PodDisruptionBudgets are not considered, which can lead to workload disruptions. To upgrade existing workloads to new node resource reservations with minimal disruptions, we want to roll the worker nodes and use the updated reservations only on new nodes. This requires rolling the worker pool and switching to a new OperatingSystemConfig (OSC), which includes the `kubeReserved` value. The new OSC must use a different name to prevent already existing nodes from applying the new `kubeReserved` values.

#2590 introduces a new way to calculate default `kubeReserved` values. Upgrading to these new resource reservations with minimal disruptions requires the previously mentioned mechanism. However, the first attempt in #9465 was unable to handle the initial rollout without disruptions.

Problem
Worker pool rolls are triggered if the `WorkerPoolHash` changes. To consider new fields in the `WorkerPoolHash`, the current approach is to add a new optional field to the `extensionsv1alpha1.Worker` objects. The field is then only included in the `WorkerPoolHash` if it is set. Thereby, a node pool roll is only triggered by using the new feature/field.

A worker pool must only be rolled if required by changed settings of the worker pool, that is, it MUST NOT roll unnecessarily when upgrading the `WorkerPoolHash` calculation.

The optional field approach does not work for `kubeReserved` as it already has a value that may differ from the static defaults used by Gardener. Thus, `kubeReserved` always has a value, and including it in the `WorkerPoolHash` would trigger an immediate node roll.

In addition, the OperatingSystemConfig key (OSCKey) must also change to ensure that only new workers pick up the new configuration. Currently, this requires manually keeping both the OSCKey and the `WorkerPoolHash` in sync, such that each change of the OSCKey also coincides with a node pool roll. Instead, the `WorkerPoolHash` should include an OSC-specific hash as input to trigger a node roll when the OSC key changes.

As the OSC key must also change if `kubeReserved` changes, this shifts the problem of keeping the `WorkerPoolHash` stable to keeping the OSC key stable.

Goals
- Move the `WorkerPoolHash` calculation to gardenlet.
- Roll worker pools when `kubeReserved` changes.

Non-Goals
- Handling changes of `providerConfig` similar to `kubeReserved`.

Proposal
The central idea is to version the `WorkerPoolHash` and the `OSCKey` calculation. Already existing worker pools and OSCs must stick to the old hash version. If `kubeReserved` changes, then the worker pool should be upgraded to the new hash version. The necessary state to track the used hash version is stored in a single secret for each shoot.

As the Worker configuration and therefore the `WorkerPoolHash` are tied to a specific OSC, we'll start with discussing the `OSCKey` calculation and versioning.

OSCKey Hash Calculation
We propose to provide two OSCKey hash versions:

- Version 1: the current calculation based on `worker.Name`, `minorKubernetesVersion`, `worker.CRI` and `worker.Machine.Image.Name`. The resulting value must be identical to the current result.
- Version 2:

  ```
  gardener-node-agent-<worker.Name>-hash(worker.CRI, machineType, volume type+size, worker.Machine.Image.Name+Version, minorKubernetesVersion, credentialsRotationStatus, nodeLocalDNS, kubeReserved)[:16]-<suffix>
  ```

  These are the same inputs as currently used for the `WorkerPoolHash`.
OSCKey Versioning
gardenlet stores a secret called `pool-hashes` in the shoot namespace of the hosting seed. The secret contains the field `data`, which for each pool contains the used OSCKey hash version and stores the values calculated using the current and latest OSCKey hash version supported by Gardener.

The secret is read by gardenlet while reconciling OSCs for a shoot and is updated before writing the updated OSCs. The secret includes an entry for each worker pool in the shoot; worker pools are matched according to their name. An individual entry is updated as follows:

1. Read the entry's `currentVersion`.
2. Recalculate the hash values stored in the `hashes` field. If any of those hash values changes, then set `currentVersion` to the latest supported version.
3. Update the `hashes` field to include the calculated hash value using the `currentVersion` and the latest version supported by Gardener. Remove hashes for other versions.

Currently, secrets with the
`persist` label must also be labeled with `managed-by: secrets-manager` to be migrated during the control plane migration. To migrate the `pool-hashes` secret, the current `managed-by: secrets-manager` filter must be removed from `computeSecretsToPersist`.

For the initial rollout of this secret, on startup gardenlet creates
`pool-hashes` secrets for each shoot based on the currently existing worker pools in the shoot spec. For each worker pool, only the `name` field is included and `currentVersion` is set to `1`. The `hashes` field is not set. The next OSC reconcile will add the missing hash values.

The rationale for the fields is as follows:
- `kubeReserved` is a property of each worker pool and thus must be stored at this granularity.
- The `currentVersion` of the hash must be stored to prevent unnecessary changes of the OSCKey.
- `hashes` must be stored to allow fields that are only included in a new hash version to trigger a node roll. For example, `kubeReserved` is only included in hash version 2. However, changing the value should nevertheless trigger a hash version upgrade along with a node roll. A change of `kubeReserved` can only be considered by storing the hash (or its underlying information) when calculated using version 2.
- `hashes` are only added during OSC reconciliation. Consequently, changes to fields that are only included in the new hash will only trigger a node roll after the first successful OSC reconciliation.

WorkerPoolHash
The `WorkerPool` of an `extensionsv1alpha1.Worker` is extended with an `oscHash` field. This field is set to the current hash value of the corresponding OSC, unless the OSC still uses hash version 1.

The `WorkerPoolHash` calculation works differently depending on whether `oscHash` is set or not:

- `oscHash` is empty: continue using the current `WorkerPoolHash` calculation.
- `oscHash` is set: the `WorkerPoolHash` calculation only uses the `oscHash` and provider-extension-specific additional fields as input. The latter have to be passed in explicitly by the extension; the raw value of `workerPool.ProviderConfig.Raw` is no longer added to the hash. All other inputs are already covered by the `oscHash`.
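The two cases above could be sketched as follows. The function name, the hash function, and the 5-character truncation are illustrative assumptions of this sketch, not the actual Gardener implementation:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// workerPoolHash sketches the proposed two-branch calculation: if oscHash is
// empty, the legacy (version 1) inputs are hashed unchanged; otherwise only
// the oscHash plus any additional data explicitly passed in by the provider
// extension is used, and workerPool.ProviderConfig.Raw is no longer included.
func workerPoolHash(oscHash string, legacyInputs []string, additionalData ...string) string {
	var inputs []string
	if oscHash == "" {
		// Version 1: keep the current calculation so existing pools do not roll.
		inputs = legacyInputs
	} else {
		// Version 2: the OSC hash already covers all OSC-relevant fields.
		inputs = append([]string{oscHash}, additionalData...)
	}
	digest := sha256.Sum256([]byte(strings.Join(inputs, "-")))
	return hex.EncodeToString(digest[:])[:5]
}

func main() {
	fmt.Println(workerPoolHash("", []string{"1.29", "m5.large"}))        // legacy path
	fmt.Println(workerPoolHash("0e7a21cf6b3efd0c", nil, "extra-field")) // oscHash path
}
```

Note how the legacy inputs are ignored entirely once `oscHash` is set: everything that should roll the pool is expected to flow in via the OSC hash or the extension's explicitly passed additional data.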
The OSC for previously existing worker pools uses hash version 1. Thereby, the `WorkerPoolHash` remains unchanged when initially rolling out this change.

Removal of Legacy Hashes
Legacy hash versions can only be removed once we can guarantee that there are no more users. The only way to ensure that is by waiting until all currently supported Kubernetes versions are no longer supported by Gardener. Then it is guaranteed that a node roll has happened since introducing the new hash version and thereby the hash version of all OSCs has been upgraded.
OSCKey Label for Shoots
The shoot health checks in botanist currently have to calculate the OSCKey based on information annotated at each node. This will no longer work with the aforementioned changes. As a replacement, each node is labelled with `worker.gardener.cloud/operatingsystemconfig`, which contains the name of the corresponding OSC. Thereby, the health checks no longer require knowledge of how to calculate the OSC name/key.

The label is included in the `Worker` extension object and therefore will be added to all nodes on the next reconciliation. For a smooth migration, the health check initially has to fall back to the current approach of calculating the OSCKey itself. This fallback can be removed after a transition period of a few Gardener versions.

Alternatives
- Track whether `kubeReserved` still uses the default value (ignored by the `WorkerPoolHash` calculation). This is rather ugly as it requires keeping an additional field for each worker pool.
- Apply `kubeReserved` in-place. Changing `kubeReserved` requires a restart of `kubelet` and results in immediate preemptions of pods if not enough resources are available. Existing mechanisms like maxSurge or PDBs would be ignored.
- Include `kubeReserved` in the `WorkerPoolHash` starting from K8s >= 1.30. This would take more than a year to roll out this change to all clusters.

Implementation Steps
Draft:

- Introduce the `pool-hashes` secret, but only implement version 1 of the hash
- `WorkerPoolHash`
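For illustration, the `pool-hashes` secret introduced in the first step might look as follows. The proposal only names the `name`, `currentVersion`, and `hashes` fields; the exact encoding of the `data` field, the namespace, and the hash values are placeholder assumptions of this sketch:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: pool-hashes
  namespace: shoot--my-project--my-shoot  # shoot namespace of the hosting seed
  labels:
    persist: "true"  # so the secret survives control plane migration
type: Opaque
stringData:
  data: |
    pools:
    - name: worker-a        # matched against the worker pool name
      currentVersion: 1     # hash version currently in use by this pool
      hashes:               # values for the current and latest supported version
        1: "75a4f21c1c33f1a2"
        2: "0e7a21cf6b3efd0c"
```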