
In place pod resizing should be designed into the kubelet config state loop, not alongside it #116971

Open
smarterclayton opened this issue Mar 28, 2023 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/node Categorizes an issue or PR as relevant to SIG Node.


@smarterclayton
Contributor

smarterclayton commented Mar 28, 2023

What happened?

#102884 in its alpha state introduces a new challenge to the kubelet state machine: partial and deferred acceptance of spec changes to a pod in the kubelet. At the same time, we are realizing that the kubelet is not properly feeding actual state to components (#116970).

The last-minute 1.27 fix #116702 works around, but does not resolve, a key side effect of this problem: the pod state being reverted incorrectly / temporarily. The pod worker receives new state (in an ordered fashion from admission, and before that from podManager) and then invokes SyncPod with the "latest". In-place resizing then mutates podManager directly and continues executing SyncPod with a different pod than the one it was invoked with. However, once we fix #116970 this will be impossible, because other components will have to consult the pod worker for the latest state, not podManager.
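To make the ordering concrete, here is a minimal, self-contained sketch of the flow being described; every type and function name here is illustrative only and is not the actual kubelet code:

```go
package main

import "fmt"

// Illustrative stand-ins, not the real kubelet structures.
type Pod struct{ CPUMilli int64 }

type podManager struct{ latest Pod } // what other components read today
type podWorker struct{ syncing Pod } // the spec SyncPod was invoked with

// syncPod sketches the alpha in-place-resize behavior this issue describes:
// if the requested resize does not fit, it falls back to the previously
// accepted request and writes to the pod manager mid-sync, so the pod it
// acts on diverges from the pod the pod worker handed it.
func syncPod(pm *podManager, pw *podWorker, allocatableMilli int64) {
	pod := pw.syncing
	if pod.CPUMilli > allocatableMilli {
		pod.CPUMilli = pm.latest.CPUMilli // keep the old request
		pm.latest = pod                   // mutate pod manager directly
	}
	fmt.Printf("pod worker was handed %dm CPU, but SyncPod acted on %dm CPU\n",
		pw.syncing.CPUMilli, pod.CPUMilli)
}

func main() {
	pm := &podManager{latest: Pod{CPUMilli: 500}}
	pw := &podWorker{syncing: Pod{CPUMilli: 2000}} // requested resize
	syncPod(pm, pw, 1000)                          // resize does not fit right now
}
```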

Visually this is described at https://docs.google.com/presentation/d/1wbPFANXVbk-e-5BvVHaZScqoqJsfcU_FQVmG-N7pv3c/edit#slide=id.g1de8a1ca1a4_0_1673 (shared with kubernetes-sig-node google group)

To graduate to beta, in-place resizing needs a new capability added to the kubelet state machine: the ability to decide when the requested spec change is "acted on" by the kubelet. There are a couple of ways we could design this, but they need to fit harmoniously into the larger changes we're making to kubelet state rather than work around the kubelet state machine.

To do that, we need to make sure we understand the rules that any "level-driven update" to pod spec must follow, and then implement a mechanism in the kubelet. This issue covers articulating those rules and getting answers as an input to the change we should make to kubelet admission.

Questions:

  • What future spec changes to pods have been discussed that would have similar requirements?
  • Can spec changes to a pod be rejected permanently (the kubelet never accepts them)?
  • When a spec change to a pod is rejected temporarily, what other aspects of the pod spec must be updated?
    • Jordan identified the container hash as a potential problem
    • Is the spec generation "locked" at the last accepted spec change?
  • Does a temporarily rejected pod spec change block other pod spec changes? What should the generation be reported as in the kubelet?
  • If multiple failed spec changes happen in a row, are there any observable side effects?
  • Should we just support full conditional in-place pod changes? What value comes from preventing in-place updates if we have to implement a full "spec admission" mechanism for a single set of fields? We already have to support full in-place updates for static pods anyway.
  • ??? (not a complete list)

In general, my preference is for all config spec changes to happen before a change is accepted by the pod worker, and for the pod worker's state to reflect "admitted spec changes". That means we would remove the code from SyncPod that alters the pod and move it up to just before admission. It also means we may need to abstract the code so that the kubelet resync loop (HandlePodCleanups) can recheck admission decisions. A final option may be a soft admission handler inside SyncPod, but I don't like that because we would have to atomically record which version of the spec the kubelet is acting on while inside a config loop, which is racy.
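A minimal, self-contained sketch of that shape, assuming a hypothetical "spec change admission" step ahead of the pod worker (none of these names are existing kubelet APIs):

```go
package sketch

// PodSpec stands in for the parts of the pod spec the kubelet must accept.
type PodSpec struct{ CPUMilli int64 }

type kubeletSketch struct {
	allocatableCPUMilli int64
	acceptedSpec        map[string]PodSpec // the spec the pod worker acts on, per pod UID
	deferred            map[string]PodSpec // spec changes not admitted yet
}

func newKubeletSketch(allocatableCPUMilli int64) *kubeletSketch {
	return &kubeletSketch{
		allocatableCPUMilli: allocatableCPUMilli,
		acceptedSpec:        map[string]PodSpec{},
		deferred:            map[string]PodSpec{},
	}
}

// canAdmitSpecChange is a stand-in for the decision this issue asks us to
// define: can the node act on the requested change right now?
func (kl *kubeletSketch) canAdmitSpecChange(desired PodSpec) bool {
	return desired.CPUMilli <= kl.allocatableCPUMilli
}

// handlePodUpdate admits (or defers) a spec change before the pod worker
// sees it, so the pod worker's state always reflects admitted spec changes
// and SyncPod never has to rewrite the pod it was handed.
// (First-time admission of a brand-new pod is elided from this sketch.)
func (kl *kubeletSketch) handlePodUpdate(uid string, desired PodSpec) {
	if !kl.canAdmitSpecChange(desired) {
		kl.deferred[uid] = desired     // remember what was asked for
		desired = kl.acceptedSpec[uid] // keep acting on the last admitted spec
	}
	kl.acceptedSpec[uid] = desired
	// ... dispatch the pod worker / SyncPod with kl.acceptedSpec[uid] ...
}
```

The key property is that everything downstream of admission only ever observes the accepted spec; the deferred map is what a resync loop could re-check.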

The outcome of this will be a KEP update to https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/1287-in-place-update-pod-resources/README.md#kubelet-and-api-server-interaction covering these design details more fully before beta.

/sig-node
/priority important-soon

@bobbypage @vinaykul I'll use this for tracking the inputs we need to come up with a design change

What did you expect to happen?

Kubelet spec changes and rejection are a fundamental part of the kubelet state machine, not part of SyncPod (which is driven by the state machine).

How can we reproduce it (as minimally and precisely as possible)?

N/A

Anything else we need to know?

No response

Kubernetes version

1.27+


@smarterclayton smarterclayton added the kind/bug Categorizes issue or PR as related to a bug. label Mar 28, 2023
@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 28, 2023
@smarterclayton
Contributor Author

/assign @vinaykul
/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 28, 2023
@smarterclayton
Contributor Author

A part of me is asking why we don't just allow all spec updates (we have to support them correctly for static pods anyway) and treat them as delete->create. It would be confusing for users not to know which changes would cause a full restart, but we could always preserve volumes or something. Will take that up separately.

@vinaykul
Contributor

In general, my preference is for all config spec changes to happen before a change is accepted by the pod worker, and for the pod worker's state to reflect "admitted spec changes". That means we would remove the code from SyncPod that alters the pod and move it up to just before admission. It also means we may need to abstract the code so that the kubelet resync loop (HandlePodCleanups) can recheck admission decisions. A final option may be a soft admission handler inside SyncPod, but I don't like that because we would have to atomically record which version of the spec the kubelet is acting on while inside a config loop, which is racy.

IIUC, your suggestion is closer to my original KEP proposal of making the 'admit resize' decision in HandlePodUpdates and re-evaluating Deferred resize decisions in HandlePodRemoves. The concern with that approach was about lack of reliable retries. Will adding this to HandlePodCleanups adequately address that concern? I'd love to solve this in the channel and get rid of that lock.

@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Mar 29, 2023
@SergeyKanzhelev
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 29, 2023
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Triaged in SIG Node Bugs Mar 29, 2023
@smarterclayton
Contributor Author

Will adding this to HandlePodCleanups adequately address that concern?

HandlePodCleanups is the "backstop" to all retries, so yes, in a sense. I think a question is whether those retries are latency sensitive or can wait the average ~1s before a cleanup pass is invoked. One advantage of having HandlePodCleanups handle it is that we can batch up all the retries (for this or other cases) and execute them at once, which is easier to reason about. A downside is that a new pod coming in might grab those extra resources before the deferred pod can take them, but there are other ways to solve that.
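A rough, self-contained illustration of that batching idea (continuing the illustrative kubeletSketch naming from the earlier comment; this is not the real HandlePodCleanups signature):

```go
package sketch

// Illustrative stand-ins only, not kubelet APIs.
type PodSpec struct{ CPUMilli int64 }

type kubeletSketch struct {
	allocatableCPUMilli int64
	acceptedSpec        map[string]PodSpec // the spec the pod worker acts on
	deferred            map[string]PodSpec // spec changes not admitted yet
}

// retryDeferredSpecChanges would run from the periodic cleanup pass (the ~1s
// backstop described above): re-check every deferred spec change in one
// batch, admit whatever now fits, and keep deferring the rest.
func (kl *kubeletSketch) retryDeferredSpecChanges() {
	for uid, desired := range kl.deferred {
		if desired.CPUMilli <= kl.allocatableCPUMilli {
			delete(kl.deferred, uid)
			kl.acceptedSpec[uid] = desired
			// ... trigger a pod worker sync with the newly admitted spec ...
		}
	}
}
```

Batching the re-checks in one place could also make it possible to order them (for example, deferred resizes ahead of brand-new pods), which might mitigate the downside noted above of a new pod grabbing the resources a deferred pod was waiting for.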

Summarizing some requirements:

  • Static pods take priority (implicitly) over apiserver pods, but we don't model that explicitly in the kubelet today; it just happens "accidentally" due to choices in the code:
  • static pods have a fullname exclusion rule - we can't start a new static pod with the same fullname as a previous pod until it completes (even if UID is different), in order to allow updated static pods to behave the same way that regular pods do
    • that means that static pods that are admitted may not actually be running "yet", and thus not be consuming resources, but they intend to consume the resources of the previous pod
    • depending on how we handle the previous issue, we could model this as "delayed admission" (where fullname exclusion is just another kind of admission)
    • there is no guarantee the new pod has the same resource requirements as the old one, so at admission time we're double-counting the resources needed for that update if the UID changes
  • terminating pods consume resources (runtime, in CRI) that admission plugins can't track (because they're looking at the pod manager, not the pod worker); a sketch of that accounting follows at the end of this comment
    • if we have terminating pods holding exclusive resources that haven't been released, we might want to allow new pods to enter but not start

We also have to deal with user expectations: would users prefer that pods take a bit longer to start before being rejected at admission (what we do today because we don't consider terminating pods, though termination could take seconds or hours), or is it better to aggressively reject pods so they can land on other nodes?
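As a sketch of the accounting point in the list above (illustrative names only; this is not the real admission plugin interface), admission would have to ask the pod worker, not the pod manager, which pods still hold resources:

```go
package sketch

// trackedPod is an illustrative view of what the pod worker knows about a pod.
type trackedPod struct {
	CPUMilli    int64
	Terminating bool // still holds runtime (CRI) resources until teardown completes
}

type podWorkerView struct {
	pods map[string]trackedPod // everything the pod worker is still acting on
}

// cpuInUse counts running *and* terminating pods, because terminating pods
// keep consuming runtime resources until they are fully torn down.
func (v *podWorkerView) cpuInUse() int64 {
	var total int64
	for _, p := range v.pods {
		total += p.CPUMilli
	}
	return total
}

// canAdmit reframes the admission question against pod-worker state: does the
// new pod fit once terminating pods are counted? The policy choice discussed
// above (reject now vs. admit but delay start) still has to be made on top.
func canAdmit(v *podWorkerView, newPodCPUMilli, allocatableMilli int64) bool {
	return v.cpuInUse()+newPodCPUMilli <= allocatableMilli
}
```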

@pacoxu pacoxu moved this from Triaged to High Priority in SIG Node Bugs Apr 11, 2023
@k8s-triage-robot

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jul 6, 2023
@SergeyKanzhelev
Member

/remove-priority important-soon
/priority important-longterm

Since it is for an alpha feature, changing priority.

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 15, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 13, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 14, 2024
@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 14, 2024
@yujuhong
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 14, 2024