In place pod resizing should be designed into the kubelet config state loop, not alongside it #116971
Comments
/assign @vinaykul |
A part of me is asking why we don't just allow all spec updates (we have to support it correctly for static pods anyway) and treat it as delete->create. Would be confusing for users not to know what changes would cause full restart, but we could always preserve volumes or something. Will take that separately. |
IIUC, your suggestion is closer to my original KEP proposal of making the 'admit resize' decision in HandlePodUpdates and re-evaluating |
/triage accepted |
HandlePodCleanups is the "backstop" to all retries, so yes, in a sense. I think a question is whether those retries are latency sensitive or can wait the average 1s before a cleanup is invoked. One advantage of having HandlePodCleanups handle it is we can batch up all the retries (for this or other cases) and execute them at once, which is easier to reason about. A downside is that a new pod coming in might grab those extra resources before the deferred pod can take it, but there are other ways to solve that. Summarizing some requirements:
We also have to deal with user expectations - would users prefer pods take a bit longer to start before being rejected for admission (what we do today because we don't consider terminating pods, but termination could take seconds or hours), or is it better to aggressively reject pods so they can end up on other nodes. |
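To make the batching point concrete, here is a minimal sketch of what deferring resize retries to the periodic cleanup pass could look like. All names (deferredResizes, retryAll, the admit callback) are illustrative assumptions, not the actual kubelet API.

```go
// Illustrative sketch only: a queue of pods whose resize was deferred because
// resources were not available when the update arrived. The real kubelet
// types and call chain differ.
package kubeletsketch

import "sync"

type deferredResizes struct {
	mu   sync.Mutex
	pods map[string]struct{} // pod UIDs waiting for a resize retry
}

func newDeferredResizes() *deferredResizes {
	return &deferredResizes{pods: map[string]struct{}{}}
}

// markDeferred records that a pod's resize could not be admitted right now.
func (d *deferredResizes) markDeferred(uid string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.pods[uid] = struct{}{}
}

// retryAll would be driven from the periodic cleanup pass (roughly the 1s
// HandlePodCleanups cadence mentioned above), so all pending retries are
// re-evaluated together against the node's current free resources instead of
// one at a time.
func (d *deferredResizes) retryAll(admit func(uid string) bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	for uid := range d.pods {
		if admit(uid) {
			delete(d.pods, uid)
		}
	}
}
```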
This issue is labeled with `triage/accepted`. You can:
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted |
/remove-priority important-soon
Since this is for the alpha feature, changing the priority. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten |
/remove-lifecycle rotten |
What happened?
#102884, in its alpha state, introduces a new challenge to the kubelet state machine: partial and deferred acceptance of spec changes to a pod in the kubelet, at the same time that we are realizing the kubelet is not properly feeding actual state to other components (#116970).
The last-minute 1.27 fix #116702 works around, but does not resolve, a key side effect of this problem: reverting the pod state incorrectly / temporarily. The pod worker receives new state (in an ordered fashion from admission, and before that from podManager) and then invokes SyncPod with the "latest" spec. In-place resizing then mutates podManager directly, and the sync continues executing with a spec that differs from the one SyncPod was invoked with. However, once we fix #116970, this will be impossible: other components will have to consult the pod worker for the latest state, not podManager.
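As a rough illustration of the ordering problem, here is a sketch in Go with made-up types; the real kubelet structures and call chain are more involved, and these names are assumptions for illustration only.

```go
// Illustrative only: shows why mutating podManager from inside the sync is at
// odds with "the pod worker is the source of truth". Types are stand-ins.
package kubeletsketch

type PodSpec struct {
	CPU    string
	Memory string
}

type Pod struct {
	UID  string
	Spec PodSpec
}

type podManager struct{ pods map[string]*Pod }
type podWorker struct{ admitted map[string]*Pod }

// Alpha behavior, roughly: the pod worker hands the sync the "latest" admitted
// pod, but the resize path writes a different spec straight into podManager,
// so the rest of the sync runs against state the pod worker never recorded.
func syncPodWithInPlaceResize(m *podManager, admitted *Pod, resized *Pod) *Pod {
	m.pods[admitted.UID] = resized // out-of-band mutation during the sync
	return resized                 // remainder of the sync uses this, not `admitted`
}

// After #116970 the rule becomes: read the pod worker, not podManager.
func latestFor(w *podWorker, uid string) *Pod { return w.admitted[uid] }
```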
Visually this is described at https://docs.google.com/presentation/d/1wbPFANXVbk-e-5BvVHaZScqoqJsfcU_FQVmG-N7pv3c/edit#slide=id.g1de8a1ca1a4_0_1673 (shared with kubernetes-sig-node google group)
To graduate to beta, in-place resizing needs a new capability added to the kubelet state machine: the ability to decide when the requested spec change is "acted on" by the kubelet. There are a couple of ways we could design this, but they need to fit harmoniously into the larger changes we're making to kubelet state, rather than working around the kubelet state machine.
To do that, we need to make sure we understand the rules that any "level-driven update" to a pod spec must follow, and then implement a mechanism in the kubelet. This issue covers articulating those rules and getting answers as input to the change we should make to kubelet admission.
Question:
In general, my preference is for all config spec changes to happen before a change is accepted by the pod worker, and for the pod worker's state to reflect "admitted spec changes". That means we would remove the code from SyncPod that alters the pod and move it up to just before admission. It also means we may need to abstract that code so the kubelet resync loop (HandlePodCleanups) can recheck admission decisions. A final option may be the soft admission handler inside SyncPod, but I don't like that because we would have to atomically record which version of the spec the kubelet is acting on while within a config loop, which is racy.
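A minimal sketch of the preferred shape, continuing the illustrative Pod/podWorker stand-ins from the sketch above and assuming a hypothetical helper that both update handling and the HandlePodCleanups resync loop could call; names and signatures are not the real kubelet code.

```go
// Illustrative only: decide the resize *before* the pod worker accepts the
// change, so the worker's state always reflects "admitted spec changes" and
// SyncPod never mutates podManager itself.
package kubeletsketch

type ResizeDecision int

const (
	ResizeAdmitted ResizeDecision = iota
	ResizeDeferred
)

// admitResize picks the spec the kubelet will act on: the desired spec if the
// node can accommodate it now, otherwise the previously admitted spec (the
// resize stays pending and can be re-checked from the resync loop).
func admitResize(admitted, desired *Pod, fits func(*Pod) bool) (*Pod, ResizeDecision) {
	if fits(desired) {
		return desired, ResizeAdmitted
	}
	return admitted, ResizeDeferred
}

// onPodUpdate shows the ordering: admission first, then the pod worker records
// the admitted spec and drives the sync from that.
func onPodUpdate(w *podWorker, admitted, desired *Pod, fits func(*Pod) bool) ResizeDecision {
	next, decision := admitResize(admitted, desired, fits)
	w.admitted[next.UID] = next
	return decision
}
```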
The outcome of this will be a KEP update to https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/1287-in-place-update-pod-resources/README.md#kubelet-and-api-server-interaction to cover this design in greater detail before beta.
/sig-node
/priority important-soon
@bobbypage @vinaykul I'll use this for tracking the inputs we need to come up with a design change
What did you expect to happen?
Kubelet spec-change handling and rejection are a fundamental part of the kubelet state machine, not part of SyncPod (which is driven by the state machine).
How can we reproduce it (as minimally and precisely as possible)?
N/A
Anything else we need to know?
No response
Kubernetes version
1.27+
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)