Configurable restart behavior #127

Closed · bgrant0607 opened this issue Jun 16, 2014 · 36 comments
Labels: area/kubelet, sig/node
@bgrant0607 (Member)

Right now we assume that all containers run forever. We should support configurable restart behavior for the following modes:

  1. run forever (e.g., for services)
  2. run until successful termination (e.g., for batch workloads)
  3. run once (e.g., for tests)
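
These three modes are essentially what later shipped in the Kubernetes API as the pod-level restartPolicy field, with the values Always, OnFailure, and Never. A minimal sketch of mode 2, run until successful termination (the pod name and workload command are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  restartPolicy: OnFailure       # restart the container until it exits 0
  containers:
  - name: worker
    image: busybox
    command: ["sh", "-c", "run-batch-job"]   # hypothetical workload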

The main tricky issues are:

  1. what to do for multi-container pods
  2. what to do for replicationController

We should also think about how to facilitate implementation of custom policies outside the system. See also:
googlearchive/container-agent#9

@bgrant0607 (Member, Author)

Thoughts on multi-container pods:

First of all, I think restart behavior should be specified at the pod level rather than the container level. It wouldn't make sense for one container to terminate and another to restart forever, for example.

Run forever is obviously easy -- we're doing it now.

Run once is fairly easy, too, I think. As soon as one container terminates, probably all should be terminated (call this policy "any").

For run until success, we could restart individual containers until each succeeds (call this policy "all").

We should make all vs. any a separate policy from forever vs. success vs. once. Another variant people would probably want is a "leader" container, to which the other containers' lifetimes would be tied. Since we start containers in order, the leader would need to be the first one in the list. To play devil's advocate, if we had event hooks (#140), the user could probably implement "any" and "leader" policies if we only provided "all" semantics.

Now, set-level behavior:

replicationController at least needs to know the conditions under which it should replace terminated/lost instances. It's hard to provide precise success semantics since containers can be lost with indeterminate exit status, but that's technically true even for a single pod. replicationController should be able to see the restart policies and termination reasons of the pods it controls. If a pod terminates and should not be restarted, I think replicationController should just automatically reduce its desired replica count by one.

I could also imagine users wanting any/all/leader behavior at the set level. However, I don't think we should do that, which leads me to believe we shouldn't do it at the pod level for now, either. If we were to provide the functionality at the set level, it shouldn't be tied to replicationController. Instead, it would be a separate policy resource associated with the pods via its own label selector. This would allow it to work at the service level, the replicationController level, or over any other grouping the user desired. We should ensure that it's not too hard to implement these types of policies using event hooks.
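
For what it's worth, set-level run-until-success semantics eventually landed not in replicationController but in a separate controller, the Job, which matches the "separate policy resource with its own label selector" direction sketched above. A minimal sketch (names and the placeholder command are illustrative):

apiVersion: batch/v1
kind: Job
metadata:
  name: run-until-success
spec:
  completions: 1          # done once one pod terminates successfully
  backoffLimit: 4         # replace failed pods at most 4 times
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "run-batch-job"]   # hypothetical workload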

@bgrant0607 (Member, Author)

FWIW, two people have recommended the Erlang supervisor model/spec:
http://www.erlang.org/doc/design_principles/sup_princ.html
http://www.erlang.org/doc/man/supervisor.html

IIUC, Erlang calls my "any" policy "one for all" and my "all" policy "one for one".

@smarterclayton (Contributor)

How would an API client know that the individual container "succeeded"? (for a definition of success)

@bgrant0607 (Member, Author)

@smarterclayton If you're asking about how Kubernetes will detect successful termination, we need machine-friendly, actionable status codes from Docker and from libcontainer (#137). Every process management and workflow orchestration system in the universe is going to need that. Normal termination with exit code 0 should indicate success.

If you're asking how Kubernetes's clients would detect termination, they could poll the currentState of the pod. We don't really have per-container status information there yet -- we'd need to add that. A Watch API would be better than polling -- that's worth an issue of its own. We could also provide a special status communication mechanism to containers and/or to event hooks (e.g., tmpfs file or environment variable). On top of these primitives, we could build a library and command-line operation to wait for termination and return the status.
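
Both primitives described here did eventually materialize: per-container status appears under the pod's currentState (status in the modern API), and a Watch API replaced polling. With today's kubectl, for example (pod name illustrative):

# read the exit code of a terminated container
kubectl get pod mypod -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'

# watch for status changes instead of polling
kubectl get pod mypod --watch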

@aspyker commented Jun 25, 2014

Ran into this with Acme Air for case 3 (run-once), for initial database loader processes.

@smarterclayton (Contributor)

@bgrant0607 Probably fair to restate my question as whether you have a model in mind (based on previous experience at Google) that defines what you feel is a scalable and reliable mechanism for communicating fault and exit information to an API consumer - for instance, to implement run-once or run-at-least-once containers that are expected to exit and not restart as aspyker mentioned.

For instance, Mesos defines a protocol between master and slave that attempts to provide some level of guarantees for communicating the exit status, subject to the same limitations you noted above about not being truly deterministic. That model assumes bidirectional communication between master and slave, which Kubernetes does not assume.

Agree that watch from client->slave or client->master->slave is better than polling, although it seems more difficult to scale the master when the number of watchers+tasks grows. Do you see the master recording exit status for run-once containers in a central store, or that being a separate subsystem that could scale orthogonally to the api server / replication server and aggregate events generated by the minions? I could imagine that transient failures of containers with a "restart-always" policy would be useful to know to an api consumer - to be able to see that container X restarted at time T1, T2, and T3.

@bgrant0607 (Member, Author)

@smarterclayton First, I think the master should delegate the basic restart policy to the slave: always restart, restart on failure, never restart. The master should only handle cross-node restarts directly. And, yes, the master should pull status from the slaves and store it (#156), as well as periodically check their health (#193). As scaling demands grow, that responsibility could indeed be split out to a separate component or set of components.

Reason for last termination (#137), termination message from the application (#139), time of last termination, and number of terminations should be among the information collected. State of terminated containers/pods should be kept around on the slaves long enough for the master to observe it the vast majority of the time (e.g., 10x the normal polling interval, or 2x the period after which an unresponsive node would be considered failed anyway; explicit decoupling of stop vs. delete would also enable the master to control deletion of observed terminated containers/pods), but the master would record unobserved terminations as having failed, ideally with as much specificity as possible about what happened (node was unresponsive, node failed, etc.).

A monotonically increasing count of restarts could be converted to approximate recency, sliding-window counts, rates, and other useful information by continuous observers. A means of setting or resetting the count is sometimes useful, but non-essential. Termination events could also be buffered (in a bounded-size ring buffer with an event sequence number), streamed off the slave, and logged for debugging, but shouldn't be necessary for correctness, since events could always be lost.

Reasons for system-initiated container stoppages (e.g., due to liveness probe failures - #66) should be explicitly recorded (#137), but can be treated as failures with respect to restart policy. User-initiated cancellation should override the restart policy, as should user-initiated restarts (#159).

With more comprehensive failure information from libcontainer and Docker we could distinguish container setup errors from later-stage execution failures, but if in doubt, the slave (and master) should be conservative about not restarting "run once" containers that may have started execution.

Containers should have unique identifiers so the system doesn't confuse different instances or incarnations (#199).

Overall system architecture for availability, scalability, fault tolerance, etc. should be discussed elsewhere.
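
Much of the information enumerated above ended up in the containerStatuses section of pod status. An abridged, illustrative status block (field names are from the modern API; values are made up):

status:
  containerStatuses:
  - name: worker
    restartCount: 3                # monotonically increasing, as described above
    lastState:
      terminated:
        exitCode: 1
        reason: Error              # machine-friendly termination reason (#137)
        finishedAt: "2014-07-18T10:00:00Z"
    state:
      running:
        startedAt: "2014-07-18T10:01:00Z"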

@bgrant0607 (Member, Author)

Two relevant Docker issues being discussed:
moby/moby#26 auto-restart processes
moby/moby#1311 production-ready process monitoring

The former is converging towards supporting restarts in Docker, with the 3 modes proposed here: always, on failure, never.

The latter has been debating the merits of "docker exec", which would not run the process under supervision of the Docker daemon. The motivation is to facilitate management by process managers such as systemd, supervisord, upstart, monit, runit, etc. This approach is attractive for a number of reasons.
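
Docker did ship the three modes as restart policies on docker run. For example (the image name myjob is illustrative):

docker run --restart=no ubuntu /bin/true     # never restart (the default)
docker run --restart=on-failure:5 myjob      # restart on non-zero exit, at most 5 times
docker run --restart=always nginx            # always restart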

@smarterclayton (Contributor)

While not called out in the latter issue, the blocking dependency is the daemon's ability to continue offering a consistent API for managing containers. This was one of the inspirations for libswarm: allowing the daemon to connect to a process running in the container namespace in order to issue commands that affect the container as a unit (stop, stream logs, execute a new process). A refactored Docker engine that allows this currently exists in one of Michael Crosby's branches, but libchan and swarm are not yet mature enough to provide that behavior.

@thockin (Member) commented Jul 18, 2014

All of this sounds reasonable to me, except the part about it being specified per-pod rather than per-container. I don't think it is far-fetched to have an initial loader container that runs to completion when a pod lands on a host and then exits, while the main server is in "run forever" mode.

I don't think forcing the spec to be per-pod buys any simplicity, either. Containers are the things that actually run, why would I spec the policy on the pod?

@lexlapax

Been following this thread. I hope my comments are welcome, as I/we are trying to figure out a way to contribute actual code, configurations, etc.

The way I was looking at it, it makes sense for the restart behavior to be at the pod level rather than at the container level, in keeping with the abstraction that pods expose a service (composed of one or more containers that may communicate among themselves and may share compute/network/storage resources).

For the behavior around singleton containers, you can always have a pod with just one container, which would get you the same thing.

The notion of pods as service endpoints is much more powerful than the notion of singleton containers as service endpoints.

This again deviates slightly from the original Docker intent - a container as the service encapsulation - which is not entirely true; that's why you have Docker links, and now things like etcd- or DNS-based inter-container linkages, which start breaking down when it comes to dependencies.

The pod abstraction helps in that regard, and as stated, you could always have one-container pods.

@thockin (Member) commented Jul 18, 2014

You don't have to sell me on pods. My concern is that attaching restarts to pods feels artificial for very little gain (it's not much simpler, really) and makes some easy-to-imagine use cases impossible.

@lexlapax

Simplicity-wise, how would this be different conceptually from, say, a Unix kill signal to a group of processes (pod) vs. a kill signal to a singular process (container)? Implementation-wise, it should just cascade down to the individual processes.

@ironcladlou (Contributor)

Per-container policies seem the most flexible to me. Another example of a use for per-container policy would be adding a run-once container to an existing pod.

You can compose pod-level behavior using container-level policy, but the inverse is not true.

One disadvantage I can see to per-container is added complexity to the spec. Maybe defaults can help with this. Related: could a pod-scoped default for containers make sense, or would that add more cognitive overhead than it's worth?
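
For concreteness, a hypothetical sketch of what a per-container policy could look like; this exact field never shipped in this form (Kubernetes kept restartPolicy at the pod level, and only much later allowed a container-level restartPolicy on init containers to support sidecars):

apiVersion: v1
kind: Pod
metadata:
  name: mixed-pod                  # hypothetical
spec:
  containers:
  - name: server
    image: nginx
    restartPolicy: Always          # hypothetical per-container field
  - name: loader
    image: busybox
    restartPolicy: OnFailure       # hypothetical: run-once loader beside a server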

@thockin (Member) commented Jul 18, 2014

Dan, I expect the average number of containers per pod to be low - less than 5 for the vast majority of cases - so I don't think that the logic to support a pod-level default is worthwhile (yet?). It would also set a precedent for the API that we would sort of be expected to follow for other things, and that will just lead to complexity in the code.

If it turns out to be a pain point, we can always add more API later - but getting rid of API is harder.


@ironcladlou (Contributor)

@thockin Points well taken. I agree that the complexity of an additional pod-level API is premature.

@pmorie (Member) commented Jul 18, 2014

I think the policy has to be configurable at the container level, but a pod-level default would be convenient to have in the spec. If the policy is only configurable at the pod level, it seems that would prevent you from running a transient task (run-once) in a pod of run-forever containers.

@pmorie (Member) commented Jul 18, 2014

I only read @thockin's point after posting the above comment. I accept these points; I can live without a pod-level default at the moment.

@lexlapax

I would agree as well, as long as we're open to having a way to extend those APIs to the pod level later, when required. Thanks.

@bgrant0607 (Member, Author)

@thockin @smarterclayton @ironcladlou @lexlapax @pmorie @dchen1107

Regarding per-pod vs. per-container restart policies:

Pods are not intended to be scheduling targets, and containers within a pod are not intended to be used for intra-pod workflows. We have no plans to support arbitrary execution-order dependencies between containers within a pod, for example.

The containers are part of the definition of the pod. The containers associated with a pod should only change via explicit update of the pod definition to add/remove/update its containers. When one container terminates, that container should not be removed from the definition of the pod implicitly, but should have its termination status reported.

The reason for a pod-level restart policy is because it affects the lifetime of the pod as a whole. A pod should only be terminated once all the containers within it terminate.

I have not been able to think of a single valid use case where containers within the pod should have different restart policies. It seems confusing, unnecessarily complex, and likely to promote misuse of the pod abstraction, such as with all proposed use cases in this issue.

The common case for multi-container pods is for services where all containers run forever. The common case for batch and test workloads that terminate is just one container per pod. We should allow multiple containers that terminate, but we need to implement clean semantics for this case.

One-off or periodic operations should be performed using runin, which we should expose as soon as the first-class support in Docker is completed, by forking from within a container, or with entirely new pods.
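
The "runin" capability discussed here ultimately surfaced as docker exec and, at the Kubernetes level, kubectl exec; e.g. (pod, container, and command are illustrative):

kubectl exec mypod -c main-container -- sh -c 'run-one-off-maintenance'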

Things like initial data loaders should be triggered using event hooks #140 . Restart policy is not sufficient to make this work.
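
The mechanism that eventually shipped for initial data loaders was init containers, which run in order, each to completion, before the main containers start. A sketch (names and the loader command are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: service-with-loader
spec:
  restartPolicy: Always
  initContainers:
  - name: load-data                # runs to completion before 'server' starts
    image: busybox
    command: ["sh", "-c", "load-initial-data"]   # hypothetical loader
  containers:
  - name: server
    image: redis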

I also don't want to implement increasingly complex restart policies in the core, but instead provide hooks such that users or higher-level APIs can implement whatever policies they please. In fact, we could entirely punt on restart policies with the right hooks, by giving users the option to nuke their own containers, pods, or sets of containers upon termination, before they restart. However, a simple restart policy would be easier to use for common batch and test use cases, and would convey useful semantic/intent information to the system about the type of workload being run.

@lavalamp (Member)

One-off or periodic operations should be performed using runin, which we should expose as soon as the first-class support in Docker is completed, by forking from within a container, or with entirely new pods.

Can you expand on this? Reusing a pod for a periodic action doesn't seem consistent with our model to me.

@bgrant0607 (Member, Author)

@lavalamp I was thinking runin would be used for the one-off case, mostly, such as for debugging, data recovery, or emergency maintenance.

Continuous background and/or periodic use cases include:

  • cleanup / GC / maintenance / wipeout
  • serving data generation / aggregation / indexing / import
  • defensive analysis (spam, abuse, dos, etc.)
  • logs processing / billing / audit / report generation
  • integrity checking / validation
  • online/offline feedback / adaptation / machine learning
  • data snapshots / copies / backups
  • periodic build/push

"Cloud-native" workloads would store the data to distributed/cloud storage and launch new pods to do the processing, similar to Scheduled Tasks in AppEngine.

Legacy workloads that store data on local storage volumes would need these tasks to run locally, and/or have some way of offloading the data. Some people (e.g., https://news.ycombinator.com/item?id=7258009) argue that one should run cron inside the container to do this, but then that would require a process manager / supervisor. Instead, one could run it in a container by itself, accessing a shared volume, either files in the volume or named sockets or pipes, similar to how log management can be handled.
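
The "launch new pods to do the processing" approach is what Kubernetes later codified as the CronJob resource, avoiding cron-in-a-container entirely. A sketch (the name, schedule, and task command are illustrative):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-cleanup
spec:
  schedule: "0 3 * * *"            # every day at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: cleanup
            image: busybox
            command: ["sh", "-c", "run-cleanup"]   # hypothetical task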

@dchen1107 (Member)

I had a long offline discussion with @bgrant0607 this morning, and I agree with him based on the definition of a pod as a scheduling unit, not a scheduling target. Once you accept that definition, a list of potential use cases involving intra-pod workflows can be ruled out. Enabling intra-pod workflows through hierarchical scheduling is too complicated, error-prone, and unnecessary for most if not all use cases. The use case that has a run-forever controller container in a pod and a bunch of run-until-success batch jobs listening to the controller should be handled at a higher level.

I came up with several possible use cases. One is a pre-config container that runs only once and personalizes the pod for a service. Brian pointed out that it could be handled by an event hook. I agree that an event hook is a clean way to handle this, even if it is much harder for users at the beginning. Another use case is a cron-type job or a debugging process, but the run-in feature should handle those.

Beyond that, I couldn't come up with any more use cases that need different restart policies for containers in a given pod. If a service job wants to run forever, its monitoring and logging-collector containers should also run forever. If a canary version of a service wants to run once, all its helper containers only need to run once.

I actually started a PR to introduce a restart policy at the container level, based on my instinct and my past experience. But I failed to convince myself with a solid, valid use case given the pod definition. That is why I called a meeting with Brian, and he convinced me on this very topic.

@bgrant0607 (Member, Author)

Meanwhile: moby/moby#7226

@maicohjf


Create a Kubernetes Secret as follows:
  Name: super-secret
  Credential: alice or username: bob

Create a Pod named pod-secrets-via-file using the redis image, which mounts a secret named super-secret at /secrets.

Create a second Pod named pod-secrets-via-env using the redis image, which exports credential/username as TOPSECRET/CREDENTIALS.

kubectl create secret generic super-secret --from-literal=Credential=alice

apiVersion: v1
kind: Pod
metadata:
  name: pod-secrets-via-file
spec:
  containers:
  - name: pod-secrets-via-file
    image: redis
    volumeMounts:
    - name: super-secret
      mountPath: "/secrets"
  volumes:
  - name: super-secret
    secret:
      secretName: super-secret

apiVersion: v1
kind: Pod
metadata:
  name: pod-secrets-via-env
spec:
  containers:
  - name: pod-secrets-via-env
    image: redis
    env:
    - name: TOPSECRET
      valueFrom:
        secretKeyRef:
          name: super-secret
          key: Credential
