Configurable restart behavior #127

Closed · bgrant0607 opened this issue Jun 16, 2014 · 36 comments
Labels: area/kubelet, sig/node
@bgrant0607 (Member)

Right now we assume that all containers run forever. We should support configurable restart behavior for the following modes:

  1. run forever (e.g., for services)
  2. run until successful termination (e.g., for batch workloads)
  3. run once (e.g., for tests)
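
These three modes are essentially what later shipped in the Kubernetes API as the pod-level restartPolicy field, with the values Always, OnFailure, and Never. A minimal sketch of mode 2, run until successful termination (the pod name and workload command are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  restartPolicy: OnFailure       # restart the container until it exits 0
  containers:
  - name: worker
    image: busybox
    command: ["sh", "-c", "run-batch-job"]   # hypothetical workload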

The main tricky issues are:

  1. what to do for multi-container pods
  2. what to do for replicationController

We should also think about how to facilitate implementation of custom policies outside the system. See also:
googlearchive/container-agent#9

@bgrant0607 (Member, Author)

Thoughts on multi-container pods:

First of all, I think restart behavior should be specified at the pod level rather than the container level. It wouldn't make sense for one container to terminate and another to restart forever, for example.

Run forever is obviously easy -- we're doing it now.

Run once is fairly easy, too, I think. As soon as one container terminates, probably all should be terminated (call this policy "any").

For run until success, we could restart individual containers until each succeeds (call this policy "all").

We should make all vs. any a separate policy from forever vs. success vs. once. Another variant people would probably want is a "leader" container, to which the other containers' lifetimes would be tied. Since we start containers in order, the leader would need to be the first one in the list. To play devil's advocate, if we had event hooks (#140), the user could probably implement "any" and "leader" policies if we only provided "all" semantics.

Now, set-level behavior:

replicationController at least needs to know the conditions under which it should replace terminated/lost instances. It's hard to provide precise success semantics since containers can be lost with indeterminate exit status, but that's technically true even for a single pod. replicationController should be able to see the restart policies and termination reasons of the pods it controls. If a pod terminates and should not be restarted, I think replicationController should just automatically reduce its desired replica count by one.

I could also imagine users wanting any/all/leader behavior at the set level. However, I don't think we should do that, which leads me to believe we shouldn't do it at the pod level for now, either. If we were to provide the functionality at the set level, it shouldn't be tied to replicationController. Instead, it would be a separate policy resource associated with the pods via its own label selector. This would allow it to work at the service level, the replicationController level, or over any other grouping the user desired. We should ensure that it's not too hard to implement these types of policies using event hooks.
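
For what it's worth, set-level run-until-success semantics eventually landed not in replicationController but in a separate controller, the Job, which matches the "separate policy resource with its own label selector" direction sketched above. A minimal sketch (names and the placeholder command are illustrative):

apiVersion: batch/v1
kind: Job
metadata:
  name: run-until-success
spec:
  completions: 1          # done once one pod terminates successfully
  backoffLimit: 4         # replace failed pods at most 4 times
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "run-batch-job"]   # hypothetical workload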

@bgrant0607 (Member, Author)

FWIW, two people have recommended the Erlang supervisor model/spec:
http://www.erlang.org/doc/design_principles/sup_princ.html
http://www.erlang.org/doc/man/supervisor.html

IIUC, Erlang calls my "any" policy "one for all" and my "all" policy "one for one".

@smarterclayton (Contributor)

How would an API client know that the individual container "succeeded"? (for a definition of success)

@bgrant0607 (Member, Author)

@smarterclayton If you're asking about how Kubernetes will detect successful termination, we need machine-friendly, actionable status codes from Docker and from libcontainer (#137). Every process management and workflow orchestration system in the universe is going to need that. Normal termination with exit code 0 should indicate success.

If you're asking how Kubernetes's clients would detect termination, they could poll the currentState of the pod. We don't really have per-container status information there yet -- we'd need to add that. A Watch API would be better than polling -- that's worth an issue of its own. We could also provide a special status communication mechanism to containers and/or to event hooks (e.g., tmpfs file or environment variable). On top of these primitives, we could build a library and command-line operation to wait for termination and return the status.
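
Both primitives described here did eventually materialize: per-container status appears under the pod's currentState (status in the modern API), and a Watch API replaced polling. With today's kubectl, for example (pod name illustrative):

# read the exit code of a terminated container
kubectl get pod mypod -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'

# watch for status changes instead of polling
kubectl get pod mypod --watch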

@aspyker commented Jun 25, 2014

Ran into this with Acme Air for case 3 (run-once), for initial database loader processes.

@smarterclayton (Contributor)

@bgrant0607 Probably fair to restate my question as whether you have a model in mind (based on previous experience at Google) that defines what you feel is a scalable and reliable mechanism for communicating fault and exit information to an API consumer - for instance, to implement run-once or run-at-least-once containers that are expected to exit and not restart as aspyker mentioned.

For instance, Mesos defines a protocol between master and slave that attempts to provide some level of guarantees for communicating the exit status, subject to the same limitations you noted above about not being truly deterministic. That model assumes bidirectional communication between master and slave, which Kubernetes does not assume.

Agree that watch from client->slave or client->master->slave is better than polling, although it seems more difficult to scale the master when the number of watchers+tasks grows. Do you see the master recording exit status for run-once containers in a central store, or that being a separate subsystem that could scale orthogonally to the api server / replication server and aggregate events generated by the minions? I could imagine that transient failures of containers with a "restart-always" policy would be useful to know to an api consumer - to be able to see that container X restarted at time T1, T2, and T3.

@bgrant0607 (Member, Author)

@smarterclayton First, I think the master should delegate the basic restart policy to the slave: always restart, restart on failure, never restart. The master should only handle cross-node restarts directly. And, yes, the master should pull status from the slaves and store it (#156), as well as periodically check their health (#193). As scaling demands grow, that responsibility could indeed be split out to a separate component or set of components.

Reason for last termination (#137), termination message from the application (#139), time of last termination, and number of terminations should be among the information collected. State of terminated containers/pods should be kept around on the slaves long enough for the master to observe it the vast majority of the time (e.g., 10x the normal polling interval, or 2x the period after which an unresponsive node would be considered failed anyway; explicit decoupling of stop vs. delete would also enable the master to control deletion of observed terminated containers/pods), but the master would record unobserved terminations as having failed, ideally with as much specificity as possible about what happened (node was unresponsive, node failed, etc.).

A monotonically increasing count of restarts could be converted to approximate recency, sliding-window counts, rates, and other useful information by continuous observers. A means of setting or resetting the count is sometimes useful, but non-essential. Termination events could also be buffered (in a bounded-size ring buffer with an event sequence number), streamed off the slave, and logged for debugging, but shouldn't be necessary for correctness, since events could always be lost.

Reasons for system-initiated container stoppages (e.g., due to liveness probe failures - #66) should be explicitly recorded (#137), but can be treated as failures with respect to restart policy. User-initiated cancellation should override the restart policy, as should user-initiated restarts (#159).

With more comprehensive failure information from libcontainer and Docker we could distinguish container setup errors from later-stage execution failures, but if in doubt, the slave (and master) should be conservative about not restarting "run once" containers that may have started execution.

Containers should have unique identifiers so the system doesn't confuse different instances or incarnations (#199).

Overall system architecture for availability, scalability, fault tolerance, etc. should be discussed elsewhere.
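
Much of the information enumerated above ended up in the containerStatuses section of pod status. An abridged, illustrative status block (field names are from the modern API; values are made up):

status:
  containerStatuses:
  - name: worker
    restartCount: 3                # monotonically increasing, as described above
    lastState:
      terminated:
        exitCode: 1
        reason: Error              # machine-friendly termination reason (#137)
        finishedAt: "2014-07-18T10:00:00Z"
    state:
      running:
        startedAt: "2014-07-18T10:01:00Z"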

@bgrant0607 (Member, Author)

Two relevant Docker issues being discussed:
moby/moby#26 auto-restart processes
moby/moby#1311 production-ready process monitoring

The former is converging towards supporting restarts in Docker, with the 3 modes proposed here: always, on failure, never.

The latter has been debating the merits of "docker exec", which would not run the process under supervision of the Docker daemon. The motivation is to facilitate management by process managers such as systemd, supervisord, upstart, monit, runit, etc. This approach is attractive for a number of reasons.
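
Docker did ship the three modes as restart policies on docker run. For example (the image name myjob is illustrative):

docker run --restart=no ubuntu /bin/true     # never restart (the default)
docker run --restart=on-failure:5 myjob      # restart on non-zero exit, at most 5 times
docker run --restart=always nginx            # always restart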

@smarterclayton (Contributor)

While not called out in the latter issue, the blocking dependency is the daemon's ability to continue offering a consistent API for managing containers. This was one of the inspirations for libswarm: allowing the daemon to connect to a process running in the container namespace in order to issue commands that affect the container as a unit (stop, stream logs, execute a new process). A refactored Docker engine that allows this currently exists in one of Michael Crosby's branches, but libchan and swarm are not yet mature enough to provide that behavior.

@thockin (Member) commented Jul 18, 2014

All of this sounds reasonable to me, except the part about it being specified per-pod rather than per-container. I don't think it is far-fetched to have an initial loader container that runs to completion when a pod lands on a host and then exits, while the main server is in "run forever" mode.

I don't think forcing the spec to be per-pod buys any simplicity, either. Containers are the things that actually run, why would I spec the policy on the pod?

@lexlapax

Been following this thread. I hope my comments are welcome, as I/we are trying to figure out a way to contribute actual code, configurations, etc.

The way I was looking at it, it makes sense for the restart behavior to be at the pod level rather than at the container level, in keeping with the abstraction that pods expose a service (composed of one or more containers that may communicate among themselves and may share compute/network/storage resources).

For the behavior around singleton containers, you can always have a pod with just one container, which would get you the same thing.

The notion of pods as service endpoints is much more powerful than the notion of singleton containers as service endpoints.

This again deviates slightly from the original Docker intent - a container as the service encapsulation - which is not entirely true; that's why you have Docker links, and now things like etcd- or DNS-based inter-container linkages, which start breaking down when it comes to dependencies.

The pod abstraction helps in that regard, and as stated, you could always have one-container pods.

@thockin (Member) commented Jul 18, 2014

You don't have to sell me on pods. My concern is that attaching restarts to pods feels artificial for very little gain (it's not much simpler, really) and makes some easy-to-imagine use cases impossible.

@lexlapax

Simplicity-wise, how would this be different conceptually from, say, a Unix kill signal to a group of processes (pod) vs. a kill signal to a singular process (container)? Implementation-wise, it should just cascade down to the individual processes.

@ironcladlou (Contributor)

Per-container policies seem the most flexible to me. Another example of a use for per-container policy would be adding a run-once container to an existing pod.

You can compose pod-level behavior using container-level policy, but the inverse is not true.

One disadvantage I can see to per-container is added complexity to the spec. Maybe defaults can help with this. Related: could a pod-scoped default for containers make sense, or would that add more cognitive overhead than it's worth?
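
For concreteness, a hypothetical sketch of what a per-container policy could look like; this exact field never shipped in this form (Kubernetes kept restartPolicy at the pod level, and only much later allowed a container-level restartPolicy on init containers to support sidecars):

apiVersion: v1
kind: Pod
metadata:
  name: mixed-pod                  # hypothetical
spec:
  containers:
  - name: server
    image: nginx
    restartPolicy: Always          # hypothetical per-container field
  - name: loader
    image: busybox
    restartPolicy: OnFailure       # hypothetical: run-once loader beside a server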

@thockin (Member) commented Jul 18, 2014

Dan, I expect the average number of containers per pod to be low - less than 5 for the vast majority of cases - so I don't think that the logic to support a pod-level default is worthwhile (yet?). It would also set a precedent for the API that we would sort of be expected to follow for other things, and that will just lead to complexity in the code.

If it turns out to be a pain point, we can always add more API later - but getting rid of API is harder.


@ironcladlou (Contributor)

@thockin Points well taken. I agree that the complexity of an additional pod-level API is premature.

@pmorie (Member) commented Jul 18, 2014

I think the policy has to be configurable at the container level, but a pod-level default would be convenient to have in the spec. If the policy is only configurable at the pod level, it seems that would prevent you from running a transient task (run-once) in a pod of run-forever containers.

@pmorie (Member) commented Jul 18, 2014

I only read @thockin's point after posting the above comment. I accept these points; I can live without a pod-level default at the moment.

@lexlapax

I would agree as well, as long as we're open to having a way to extend those APIs to the pod level later, when required. Thanks.

@bgrant0607 (Member, Author)

@thockin @smarterclayton @ironcladlou @lexlapax @pmorie @dchen1107

Regarding per-pod vs. per-container restart policies:

Pods are not intended to be scheduling targets, and containers within a pod are not intended to be used for intra-pod workflows. We have no plans to support arbitrary execution-order dependencies between containers within a pod, for example.

The containers are part of the definition of the pod. The containers associated with a pod should only change via explicit update of the pod definition to add/remove/update its containers. When one container terminates, that container should not be removed from the definition of the pod implicitly, but should have its termination status reported.

The reason for a pod-level restart policy is because it affects the lifetime of the pod as a whole. A pod should only be terminated once all the containers within it terminate.

I have not been able to think of a single valid use case where containers within the pod should have different restart policies. It seems confusing, unnecessarily complex, and likely to promote misuse of the pod abstraction, such as with all proposed use cases in this issue.

The common case for multi-container pods is for services where all containers run forever. The common case for batch and test workloads that terminate is just one container per pod. We should allow multiple containers that terminate, but we need to implement clean semantics for this case.

One-off or periodic operations should be performed using runin, which we should expose as soon as the first-class support in Docker is completed, by forking from within a container, or with entirely new pods.
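
The "runin" capability discussed here ultimately surfaced as docker exec and, at the Kubernetes level, kubectl exec; e.g. (pod, container, and command are illustrative):

kubectl exec mypod -c main-container -- sh -c 'run-one-off-maintenance'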

Things like initial data loaders should be triggered using event hooks #140 . Restart policy is not sufficient to make this work.
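
The mechanism that eventually shipped for initial data loaders was init containers, which run in order, each to completion, before the main containers start. A sketch (names and the loader command are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: service-with-loader
spec:
  restartPolicy: Always
  initContainers:
  - name: load-data                # runs to completion before 'server' starts
    image: busybox
    command: ["sh", "-c", "load-initial-data"]   # hypothetical loader
  containers:
  - name: server
    image: redis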

I also don't want to implement increasingly complex restart policies in the core, but instead provide hooks such that users or higher-level APIs can implement whatever policies they please. In fact, we could entirely punt on restart policies with the right hooks, by giving users the option to nuke their own containers, pods, or sets of containers upon termination, before they restart. However, a simple restart policy would be easier to use for common batch and test use cases, and would convey useful semantic/intent information to the system about the type of workload being run.

@lavalamp (Member)

One-off or periodic operations should be performed using runin, which we should expose as soon as the first-class support in Docker is completed, by forking from within a container, or with entirely new pods.

Can you expand on this? Reusing a pod for a periodic action doesn't seem consistent with our model to me.

@bgrant0607 (Member, Author)

@lavalamp I was thinking runin would be used for the one-off case, mostly, such as for debugging, data recovery, or emergency maintenance.

Continuous background and/or periodic use cases include:

  • cleanup / GC / maintenance / wipeout
  • serving data generation / aggregation / indexing / import
  • defensive analysis (spam, abuse, dos, etc.)
  • logs processing / billing / audit / report generation
  • integrity checking / validation
  • online/offline feedback / adaptation / machine learning
  • data snapshots / copies / backups
  • periodic build/push

"Cloud-native" workloads would store the data to distributed/cloud storage and launch new pods to do the processing, similar to Scheduled Tasks in AppEngine.

Legacy workloads that store data on local storage volumes would need these tasks to run locally, and/or have some way of offloading the data. Some people (e.g., https://news.ycombinator.com/item?id=7258009) argue that one should run cron inside the container to do this, but then that would require a process manager / supervisor. Instead, one could run it in a container by itself, accessing a shared volume, either files in the volume or named sockets or pipes, similar to how log management can be handled.
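
The "launch new pods to do the processing" approach is what Kubernetes later codified as the CronJob resource, avoiding cron-in-a-container entirely. A sketch (the name, schedule, and task command are illustrative):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-cleanup
spec:
  schedule: "0 3 * * *"            # every day at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: cleanup
            image: busybox
            command: ["sh", "-c", "run-cleanup"]   # hypothetical task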

@dchen1107 (Member)

I had a long offline discussion with @bgrant0607 this morning, and I agree with him based on the definition of a pod as a scheduling unit, not a scheduling target. Once you accept that definition, a list of potential use cases involving intra-pod workflows can be ruled out. Enabling intra-pod workflows through hierarchical scheduling is too complicated, error-prone, and unnecessary for most if not all use cases. The use case that has a run-forever controller container in a pod and a bunch of run-until-success batch jobs listening to the controller should be handled at a higher level.

I came up with several possible use cases. One is a pre-config container that runs only once and personalizes the pod for a service. Brian pointed out that it could be handled by an event hook. I agree that an event hook is a clean way to handle this, even if it is much harder for users at the beginning. Another use case is a cron-type job or a debugging process, but the run-in feature should handle those.

Beyond that, I couldn't come up with any more use cases that need different restart policies for containers in a given pod. If a service job wants to run forever, its monitoring and logging-collector containers should also run forever. If a canary version of a service wants to run once, all its helper containers only need to run once.

I actually started a PR to introduce a restart policy at the container level, based on my instinct and my past experience. But I failed to convince myself with a solid, valid use case given the pod definition. That is why I called a meeting with Brian, and he convinced me on this very topic.

@bgrant0607 (Member, Author)

Meanwhile: moby/moby#7226

@maicohjf


Create a Kubernetes Secret as follows:
  Name: super-secret
  Credential: alice or username: bob

Create a Pod named pod-secrets-via-file using the redis image, which mounts a secret named super-secret at /secrets.

Create a second Pod named pod-secrets-via-env using the redis image, which exports credential/username as TOPSECRET/CREDENTIALS.

kubectl create secret generic super-secret --from-literal=Credential=alice

apiVersion: v1
kind: Pod
metadata:
  name: pod-secrets-via-file
spec:
  containers:
  - name: pod-secrets-via-file
    image: redis
    volumeMounts:
    - name: super-secret
      mountPath: "/secrets"
  volumes:
  - name: super-secret
    secret:
      secretName: super-secret

apiVersion: v1
kind: Pod
metadata:
  name: pod-secrets-via-env
spec:
  containers:
  - name: pod-secrets-via-env
    image: redis
    env:
    - name: TOPSECRET
      valueFrom:
        secretKeyRef:
          name: super-secret
          key: Credential
