
In place pod resizing should be designed into the kubelet config state loop, not alongside it #116971

Open
smarterclayton opened this issue Mar 28, 2023 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/node Categorizes an issue or PR as relevant to SIG Node.


@smarterclayton
Contributor

smarterclayton commented Mar 28, 2023

What happened?

#102884 in its alpha state introduces a new challenge to the kubelet state machine: partial and deferred acceptance of spec changes to a pod in the kubelet. At the same time, we are realizing that the kubelet is not properly feeding actual state to components (#116970).

The last-minute 1.27 fix #116702 works around, but does not resolve, a key side effect of this problem: the pod state being reverted incorrectly / temporarily. The pod worker receives new state (in an ordered fashion from admission, and before that from podManager) and then invokes SyncPod with the "latest". In-place resizing then mutates podManager directly and continues executing SyncPod with a different pod than the one it was invoked with. However, once we fix #116970 this will be impossible, because other components will have to consult the pod worker for the latest state, not podManager.
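To make the ordering concrete, here is a minimal, self-contained sketch of the flow being described; every type and function name here is illustrative only and is not the actual kubelet code:

```go
package main

import "fmt"

// Illustrative stand-ins, not the real kubelet structures.
type Pod struct{ CPUMilli int64 }

type podManager struct{ latest Pod } // what other components read today
type podWorker struct{ syncing Pod } // the spec SyncPod was invoked with

// syncPod sketches the alpha in-place-resize behavior this issue describes:
// if the requested resize does not fit, it falls back to the previously
// accepted request and writes to the pod manager mid-sync, so the pod it
// acts on diverges from the pod the pod worker handed it.
func syncPod(pm *podManager, pw *podWorker, allocatableMilli int64) {
	pod := pw.syncing
	if pod.CPUMilli > allocatableMilli {
		pod.CPUMilli = pm.latest.CPUMilli // keep the old request
		pm.latest = pod                   // mutate pod manager directly
	}
	fmt.Printf("pod worker was handed %dm CPU, but SyncPod acted on %dm CPU\n",
		pw.syncing.CPUMilli, pod.CPUMilli)
}

func main() {
	pm := &podManager{latest: Pod{CPUMilli: 500}}
	pw := &podWorker{syncing: Pod{CPUMilli: 2000}} // requested resize
	syncPod(pm, pw, 1000)                          // resize does not fit right now
}
```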

Visually this is described at https://docs.google.com/presentation/d/1wbPFANXVbk-e-5BvVHaZScqoqJsfcU_FQVmG-N7pv3c/edit#slide=id.g1de8a1ca1a4_0_1673 (shared with kubernetes-sig-node google group)

To graduate to beta, in-place resizing needs a new capability added to the kubelet state machine: the ability to decide when the requested spec change is "acted on" by the kubelet. There are a couple of ways we could design this, but they need to fit harmoniously into the larger changes we're making to kubelet state rather than work around the kubelet state machine.

To do that, we need to make sure we understand the rules that any "level-driven update" to pod spec must follow, and then implement a mechanism in the kubelet. This issue covers articulating those rules and getting answers as an input to the change we should make to kubelet admission.

Questions:

  • What future spec changes to pods have been discussed that would have similar requirements?
  • Can spec changes to a pod be rejected permanently (the kubelet never accepts them)?
  • When a spec change to a pod is rejected temporarily, what other aspects of the pod spec must be updated?
    • Jordan identified the container hash as a potential problem
    • Is the spec generation "locked" at the last accepted spec change?
  • Does a temporarily rejected pod spec change block other pod spec changes? What should the generation be reported as in the kubelet?
  • If multiple failed spec changes happen in a row, are there any observable side effects?
  • Should we just support full conditional in-place pod changes? What value comes from preventing in-place updates if we have to implement a full "spec admission" mechanism for a single set of fields? We already have to support full in-place updates for static pods anyway.
  • ??? (not a complete list)

In general, my preference is for all config spec changes to happen before a change is accepted by the pod worker, and for the pod worker's state to reflect "admitted spec changes". That means we would remove the code from SyncPod that alters the pod and move it up to just before admission. It also means we may need to abstract the code so that the kubelet resync loop (HandlePodCleanups) can recheck admission decisions. A final option may be a soft admission handler inside SyncPod, but I don't like that because we would have to atomically record which version of the spec the kubelet is acting on while inside a config loop, which is racy.
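A minimal, self-contained sketch of that shape, assuming a hypothetical "spec change admission" step ahead of the pod worker (none of these names are existing kubelet APIs):

```go
package sketch

// PodSpec stands in for the parts of the pod spec the kubelet must accept.
type PodSpec struct{ CPUMilli int64 }

type kubeletSketch struct {
	allocatableCPUMilli int64
	acceptedSpec        map[string]PodSpec // the spec the pod worker acts on, per pod UID
	deferred            map[string]PodSpec // spec changes not admitted yet
}

func newKubeletSketch(allocatableCPUMilli int64) *kubeletSketch {
	return &kubeletSketch{
		allocatableCPUMilli: allocatableCPUMilli,
		acceptedSpec:        map[string]PodSpec{},
		deferred:            map[string]PodSpec{},
	}
}

// canAdmitSpecChange is a stand-in for the decision this issue asks us to
// define: can the node act on the requested change right now?
func (kl *kubeletSketch) canAdmitSpecChange(desired PodSpec) bool {
	return desired.CPUMilli <= kl.allocatableCPUMilli
}

// handlePodUpdate admits (or defers) a spec change before the pod worker
// sees it, so the pod worker's state always reflects admitted spec changes
// and SyncPod never has to rewrite the pod it was handed.
// (First-time admission of a brand-new pod is elided from this sketch.)
func (kl *kubeletSketch) handlePodUpdate(uid string, desired PodSpec) {
	if !kl.canAdmitSpecChange(desired) {
		kl.deferred[uid] = desired     // remember what was asked for
		desired = kl.acceptedSpec[uid] // keep acting on the last admitted spec
	}
	kl.acceptedSpec[uid] = desired
	// ... dispatch the pod worker / SyncPod with kl.acceptedSpec[uid] ...
}
```

The key property is that everything downstream of admission only ever observes the accepted spec; the deferred map is what a resync loop could re-check.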

The outcome of this will be a KEP update to https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/1287-in-place-update-pod-resources/README.md#kubelet-and-api-server-interaction covering these design details more fully before beta.

/sig-node
/priority important-soon

@bobbypage @vinaykul I'll use this for tracking the inputs we need to come up with a design change

What did you expect to happen?

Kubelet spec changes and rejection are a fundamental part of the kubelet state machine, not part of SyncPod (which is driven by the state machine).

How can we reproduce it (as minimally and precisely as possible)?

N/A

Anything else we need to know?

No response

Kubernetes version

1.27+


@smarterclayton smarterclayton added the kind/bug Categorizes issue or PR as related to a bug. label Mar 28, 2023
@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 28, 2023
@smarterclayton
Contributor Author

/assign @vinaykul
/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 28, 2023
@smarterclayton
Contributor Author

A part of me is asking why we don't just allow all spec updates (we have to support them correctly for static pods anyway) and treat them as delete->create. It would be confusing for users not to know which changes would cause a full restart, but we could always preserve volumes or something. Will take that up separately.

@vinaykul
Contributor

In general, my preference is for all config spec changes to happen before a change is accepted by the pod worker, and for the pod worker's state to reflect "admitted spec changes". That means we would remove the code from SyncPod that alters the pod and move it up to just before admission. It also means we may need to abstract the code so that the kubelet resync loop (HandlePodCleanups) can recheck admission decisions. A final option may be a soft admission handler inside SyncPod, but I don't like that because we would have to atomically record which version of the spec the kubelet is acting on while inside a config loop, which is racy.

IIUC, your suggestion is closer to my original KEP proposal of making the 'admit resize' decision in HandlePodUpdates and re-evaluating Deferred resize decisions in HandlePodRemoves. The concern with that approach was about lack of reliable retries. Will adding this to HandlePodCleanups adequately address that concern? I'd love to solve this in the channel and get rid of that lock.

@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Mar 29, 2023
@SergeyKanzhelev
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 29, 2023
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Triaged in SIG Node Bugs Mar 29, 2023
@smarterclayton
Contributor Author

Will adding this to HandlePodCleanups adequately address that concern?

HandlePodCleanups is the "backstop" to all retries, so yes, in a sense. I think a question is whether those retries are latency sensitive or can wait the average ~1s before a cleanup pass is invoked. One advantage of having HandlePodCleanups handle it is that we can batch up all the retries (for this or other cases) and execute them at once, which is easier to reason about. A downside is that a new pod coming in might grab those extra resources before the deferred pod can take them, but there are other ways to solve that.
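A rough, self-contained illustration of that batching idea (continuing the illustrative kubeletSketch naming from the earlier comment; this is not the real HandlePodCleanups signature):

```go
package sketch

// Illustrative stand-ins only, not kubelet APIs.
type PodSpec struct{ CPUMilli int64 }

type kubeletSketch struct {
	allocatableCPUMilli int64
	acceptedSpec        map[string]PodSpec // the spec the pod worker acts on
	deferred            map[string]PodSpec // spec changes not admitted yet
}

// retryDeferredSpecChanges would run from the periodic cleanup pass (the ~1s
// backstop described above): re-check every deferred spec change in one
// batch, admit whatever now fits, and keep deferring the rest.
func (kl *kubeletSketch) retryDeferredSpecChanges() {
	for uid, desired := range kl.deferred {
		if desired.CPUMilli <= kl.allocatableCPUMilli {
			delete(kl.deferred, uid)
			kl.acceptedSpec[uid] = desired
			// ... trigger a pod worker sync with the newly admitted spec ...
		}
	}
}
```

Batching the re-checks in one place could also make it possible to order them (for example, deferred resizes ahead of brand-new pods), which might mitigate the downside noted above of a new pod grabbing the resources a deferred pod was waiting for.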

Summarizing some requirements:

  • Static pods take priority (implicitly) over apiserver pods, but we don't model that explicitly in the kubelet today; it just happens "accidentally" due to choices in the code:
  • static pods have a fullname exclusion rule - we can't start a new static pod with the same fullname as a previous pod until it completes (even if UID is different), in order to allow updated static pods to behave the same way that regular pods do
    • that means that static pods that are admitted may not actually be running "yet", and thus not be consuming resources, but they intend to consume the resources of the previous pod
    • depending on how we handle the previous issue, we could model this as "delayed admission" (where fullname exclusion is just another kind of admission)
    • there is no guarantee the new pod has the same resource requirements as the old one, so at admission time we're double-counting the resources needed for that update if the UID changes
  • terminating pods consume resources (runtime, in CRI) that admission plugins can't track (because they're looking at the pod manager, not the pod worker); a sketch of that accounting follows at the end of this comment
    • if we have terminating pods holding exclusive resources that haven't been released, we might want to allow new pods to enter but not start

We also have to deal with user expectations: would users prefer that pods take a bit longer to start before being rejected at admission (what we do today because we don't consider terminating pods, though termination could take seconds or hours), or is it better to aggressively reject pods so they can land on other nodes?
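As a sketch of the accounting point in the list above (illustrative names only; this is not the real admission plugin interface), admission would have to ask the pod worker, not the pod manager, which pods still hold resources:

```go
package sketch

// trackedPod is an illustrative view of what the pod worker knows about a pod.
type trackedPod struct {
	CPUMilli    int64
	Terminating bool // still holds runtime (CRI) resources until teardown completes
}

type podWorkerView struct {
	pods map[string]trackedPod // everything the pod worker is still acting on
}

// cpuInUse counts running *and* terminating pods, because terminating pods
// keep consuming runtime resources until they are fully torn down.
func (v *podWorkerView) cpuInUse() int64 {
	var total int64
	for _, p := range v.pods {
		total += p.CPUMilli
	}
	return total
}

// canAdmit reframes the admission question against pod-worker state: does the
// new pod fit once terminating pods are counted? The policy choice discussed
// above (reject now vs. admit but delay start) still has to be made on top.
func canAdmit(v *podWorkerView, newPodCPUMilli, allocatableMilli int64) bool {
	return v.cpuInUse()+newPodCPUMilli <= allocatableMilli
}
```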

@pacoxu pacoxu moved this from Triaged to High Priority in SIG Node Bugs Apr 11, 2023
@k8s-triage-robot

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jul 6, 2023
@SergeyKanzhelev
Member

/remove-priority important-soon
/priority important-longterm

Since it is for an alpha feature, changing priority.

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 15, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 13, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 14, 2024
@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 14, 2024
@yujuhong
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 14, 2024