Configurable restart behavior #127

Right now we assume that all containers run forever. We should support configurable restart behavior for the following modes:

- run forever (the current behavior)
- run until success
- run once, regardless of success or failure

The main tricky issues are what these policies mean for multi-container pods, and what set-level behavior (e.g., replicationController) should do with instances that terminate and should not be restarted.

We should also think about how to facilitate implementation of custom policies outside the system. See also: googlearchive/container-agent#9

Comments
Thoughts on multi-container pods: First of all, I think restart behavior should be specified at the pod level rather than the container level. It wouldn't make sense for one container to terminate and another to restart forever, for example.

Run forever is obviously easy -- we're doing it now. Run once is fairly easy, too, I think. As soon as one container terminates, probably all should be terminated (call this policy "any"). For run until success, we could restart individual containers until each succeeds (call this policy "all"). We should make all vs. any a separate policy from forever vs. success vs. once.

Another variant people would probably want is a "leader" container, to which the other containers' lifetimes would be tied. Since we start containers in order, the leader would need to be the first one in the list. To play devil's advocate, if we had event hooks (#140), the user could probably implement "any" and "leader" policies if we only provided "all" semantics.

Now, set-level behavior: replicationController at least needs to know the conditions under which it should replace terminated/lost instances. It's hard to provide precise success semantics since containers can be lost with indeterminate exit status, but that's technically true even for a single pod. replicationController should be able to see the restart policies and termination reasons of the pods it controls. If a pod terminates and should not be restarted, I think replicationController should just automatically reduce its desired replica count by one.

I could also imagine users wanting any/all/leader behavior at the set level. However, I don't think we should do that, which leads me to believe we shouldn't do it at the pod level for now, either. If we were to provide the functionality at the set level, it shouldn't be tied to replicationController. Instead, it would be a separate policy resource associated with the pods via its own label selector. This would allow it to work at either the service level or the replicationController level, or over any other grouping the user desired. We should ensure that it's not too hard to implement these types of policies using event hooks.
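To make the combinations above concrete, here is a minimal Go sketch of the two orthogonal axes -- forever vs. until-success vs. once per container, and "all" vs. "any" across the pod. Every name here is hypothetical, for illustration only, not a proposed API:

```go
package main

import "fmt"

// When to restart an individual container.
type RestartMode int

const (
	RunForever      RestartMode = iota // always restart
	RunUntilSuccess                    // restart only on failure (nonzero exit)
	RunOnce                            // never restart
)

// How one container's termination affects the rest of the pod.
type PodScope int

const (
	All PodScope = iota // containers restart/terminate independently
	Any                 // first terminal container terminates the whole pod
)

// onExit names the action the runtime would take after a container exits.
func onExit(mode RestartMode, scope PodScope, exitCode int) string {
	switch mode {
	case RunForever:
		return "restart container"
	case RunUntilSuccess:
		if exitCode != 0 {
			return "restart container"
		}
	}
	// The container is terminal (RunOnce, or RunUntilSuccess that succeeded).
	if scope == Any {
		return "terminate all containers in pod"
	}
	return "leave container terminated"
}

func main() {
	fmt.Println(onExit(RunUntilSuccess, Any, 0)) // terminate all containers in pod
	fmt.Println(onExit(RunUntilSuccess, All, 1)) // restart container
}
```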
FWIW, two people have recommended the Erlang supervisor model/spec. IIUC, Erlang calls my "any" policy "one for all" and my "all" policy "one for one".
How would an API client know that the individual container "succeeded"? (for a definition of success)
@smarterclayton If you're asking about how Kubernetes will detect successful termination, we need machine-friendly, actionable status codes from Docker and from libcontainer (#137). Every process management and workflow orchestration system in the universe is going to need that. Normal termination with exit code 0 should indicate success.

If you're asking how Kubernetes's clients would detect termination, they could poll the currentState of the pod. We don't really have per-container status information there yet -- we'd need to add that. A Watch API would be better than polling -- that's worth an issue of its own. We could also provide a special status communication mechanism to containers and/or to event hooks (e.g., tmpfs file or environment variable).

On top of these primitives, we could build a library and command-line operation to wait for termination and return the status.
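As a rough illustration of the polling approach (a watch would replace the loop with a blocking read), here is a sketch in Go; `getPodState` and the `ContainerState` shape are made up, since per-container status doesn't exist in currentState yet:

```go
package main

import (
	"fmt"
	"time"
)

// ContainerState is a hypothetical per-container status entry.
type ContainerState struct {
	Running  bool
	ExitCode int // meaningful only once Running is false
}

// getPodState stands in for an API call reading a pod's currentState.
func getPodState(podID string) map[string]ContainerState {
	return map[string]ContainerState{"loader": {Running: false, ExitCode: 0}}
}

// waitForTermination polls until the named container terminates and
// returns its exit code.
func waitForTermination(podID, container string, interval time.Duration) int {
	for {
		if s, ok := getPodState(podID)[container]; ok && !s.Running {
			return s.ExitCode
		}
		time.Sleep(interval)
	}
}

func main() {
	code := waitForTermination("pod1", "loader", time.Second)
	fmt.Printf("container exited with code %d (0 = success)\n", code)
}
```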
Ran into this for Acme Air for case #3 (run-once): initial database loader processes.
@bgrant0607 Probably fair to restate my question as whether you have a model in mind (based on previous experience at Google) that defines what you feel is a scalable and reliable mechanism for communicating fault and exit information to an API consumer -- for instance, to implement run-once or run-at-least-once containers that are expected to exit and not restart, as aspyker mentioned.

For instance, Mesos defines a protocol between master and slave that attempts to provide some level of guarantees for communicating the exit status, subject to the same limitations you noted above about not being truly deterministic. That model assumes bidirectional communication between master and slave, which Kubernetes does not.

Agree that watch from client->slave or client->master->slave is better than polling, although it seems more difficult to scale the master when the number of watchers+tasks grows. Do you see the master recording exit status for run-once containers in a central store, or that being a separate subsystem that could scale orthogonally to the api server / replication server and aggregate events generated by the minions?

I could imagine that transient failures of containers with a "restart-always" policy would be useful for an api consumer to know about -- to be able to see that container X restarted at time T1, T2, and T3.
@smarterclayton First, I think the master should delegate the basic restart policy to the slave: always restart, restart on failure, never restart. The master should only handle cross-node restarts directly.

And, yes, the master should pull status from the slaves and store it (#156), as well as periodically check their health (#193). As scaling demands grow, that responsibility could indeed be split out to a separate component or set of components. Reason for last termination (#137), termination message from the application (#139), time of last termination, and number of terminations should be among the information collected.

State of terminated containers/pods should be kept around on the slaves long enough for the master to observe it the vast majority of the time (e.g., 10x the normal polling interval, or 2x the period after which an unresponsive node would be considered failed, anyway; explicit decoupling of stop vs. delete would also enable the master to control deletion of observed terminated containers/pods). The master would record unobserved terminations as having failed, ideally with as much specificity as possible about what happened (node was unresponsive, node failed, etc.).

A monotonically increasing count of restarts could be converted to approximate recency, sliding-window counts, rates, and other useful information by continuous observers. A means of setting or resetting the count is sometimes useful, but non-essential. Termination events could also be buffered (in a bounded-size ring buffer with an event sequence number), streamed off the slave, and logged for debugging, but that shouldn't be necessary for correctness, since events could always be lost.

Reasons for system-initiated container stoppages (e.g., due to liveness probe failures -- #66) should be explicitly recorded (#137), but can be treated as failures with respect to restart policy. User-initiated cancellation should override the restart policy, as should user-initiated restarts (#159). With more comprehensive failure information from libcontainer and Docker we could distinguish container setup errors from later-stage execution failures, but if in doubt, the slave (and master) should be conservative about not restarting "run once" containers that may have started execution. Containers should have unique identifiers so the system doesn't confuse different instances or incarnations (#199).

Overall system architecture for availability, scalability, fault tolerance, etc. should be discussed elsewhere.
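To make the collected information concrete, here is a hypothetical sketch of the per-container status record described above; the field names are illustrative only, not proposed API:

```go
package main

import (
	"fmt"
	"time"
)

// ContainerStatus is the kind of record the master would pull from each
// slave and store centrally (#156).
type ContainerStatus struct {
	ID               string    // unique incarnation identifier (#199)
	RestartCount     int       // monotonically increasing across restarts
	LastExitCode     int       // exit code of the most recent termination
	LastReason       string    // machine-friendly reason (#137)
	LastMessage      string    // termination message from the app (#139)
	LastTerminatedAt time.Time // time of last termination
}

// recordUnobserved is how the master would record a termination it never
// saw directly, with whatever specificity is available about the node.
func recordUnobserved(id, nodeReason string) ContainerStatus {
	return ContainerStatus{ID: id, LastExitCode: -1, LastReason: nodeReason}
}

func main() {
	s := recordUnobserved("web-3f2a", "node unresponsive")
	fmt.Printf("%s: exit=%d reason=%q restarts=%d\n",
		s.ID, s.LastExitCode, s.LastReason, s.RestartCount)
}
```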
Two relevant Docker issues are being discussed. The former is converging towards supporting restarts in Docker, with the 3 modes proposed here: always, on failure, never. The latter has been debating the merits of "docker exec", which would not run the process under supervision of the Docker daemon. The motivation is to facilitate management by process managers such as systemd, supervisord, upstart, monit, runit, etc. This approach is attractive for a number of reasons.
While not called out in the latter issue, the blocking dependency is the ability for the daemon to continue to offer a consistent API for managing containers. This was one of the inspirations for libswarm - allowing the daemon to connect to a process running in the container namespace in order to issue commands that affect the container as a unit (stop, stream logs, execute a new process). The refactored Docker engine to allow that currently exists in a branch of Michael Crosby's, but libchan and swarm are not mature enough yet to provide that behavior. |
All of this sounds reasonable to me, except the part about it being specified per-pod rather than per-container. I don't think it is far-fetched to have an initial loader container that runs to completion when a pod lands on a host and then exits, while the main server is in "run forever" mode. I don't think forcing the spec to be per-pod buys any simplicity, either. Containers are the things that actually run; why would I spec the policy on the pod?
Been following this thread. I hope my comments are welcome, as I/we are trying to figure out a way to contribute actual code, configurations, etc. The way I was looking at it, it makes sense for the restart behavior to be at the pod level rather than at the container level, keeping with the abstraction that pods expose a service (composed of one or more containers that may communicate between them and may share compute/network/storage resources). For the behavior around singleton containers, you can always have a pod with just one container, which would get you the same thing. The notion of pods as service endpoints is much more powerful than the notion of singleton containers as service endpoints. This again deviates slightly from the original docker intent -- that a container is a service encapsulation -- which is not entirely true; that's why you have docker links, and now things like etcd or dns-based inter-container linkages, which sort of start breaking down when it comes to dependencies, etc. The pod abstraction helps in that regard, and as stated, you could always have one-container pods.
You don't have to sell me on pods. My concern is that attaching restarts to pods feels artificial for very little gain (it's not much simpler, really) and makes some easy-to-imagine use cases impossible.
Simplicity-wise, how would this be different conceptually in unix from, say, a kill signal to a group of processes (pod) vs. a kill signal to a singular process (container)? Implementation-wise, it should just cascade down to individual processes.
Per-container policies seem the most flexible to me. Another example of a use for per-container policy would be adding a run-once container to an existing pod. You can compose pod-level behavior using container-level policy, but the inverse is not true. One disadvantage I can see to per-container is added complexity to the spec. Maybe defaults can help with this. Related: could a pod-scoped default for containers make sense, or would that add more cognitive overhead than it's worth?
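A small sketch of that pod-scoped default idea, assuming hypothetical types: each container may carry its own policy and falls back to the pod's default otherwise.

```go
package main

import "fmt"

type Policy string

const (
	Always    Policy = "always"
	OnFailure Policy = "on-failure"
	Never     Policy = "never"
)

type Container struct {
	Name    string
	Restart *Policy // nil means "use the pod default"
}

type Pod struct {
	DefaultRestart Policy
	Containers     []Container
}

// effectivePolicy resolves the restart policy for one container.
func effectivePolicy(pod Pod, c Container) Policy {
	if c.Restart != nil {
		return *c.Restart
	}
	return pod.DefaultRestart
}

func main() {
	once := Never
	pod := Pod{
		DefaultRestart: Always,
		Containers: []Container{
			{Name: "server"},                 // inherits "always"
			{Name: "loader", Restart: &once}, // run-once override
		},
	}
	for _, c := range pod.Containers {
		fmt.Println(c.Name, effectivePolicy(pod, c))
	}
}
```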
Dan, I expect the average number of containers per pod to be low. If it turns out to be a pain point, we can always add more API later -- but once we add API, we can't easily take it away.
@thockin Points well taken. I agree that the complexity of an additional pod-level API is premature. |
I think the policy has to be configurable at the container level, but a pod-level default would be convenient to have in the spec. If the policy is only configurable at the pod level, it seems that would prevent you from being able to run a transient task (run-once) in a pod of run-forever containers.
I only read @thockin's point after posting the above comment. I accept these points; I can live without a pod-level default at the moment.
I would agree as well, as long as we're open to having a way to extend those APIs to the pod level later, when required. Thanks.
@thockin @smarterclayton @ironcladlou @lexlapax @pmorie @dchen1107

Regarding per-pod vs. per-container restart policies: Pods are not intended to be scheduling targets, and containers within a pod are not intended to be used for intra-pod workflows. We have no plans to support arbitrary execution-order dependencies between containers within a pod, for example.

The containers are part of the definition of the pod. The containers associated with a pod should only change via explicit update of the pod definition to add/remove/update its containers. When one container terminates, that container should not be removed from the definition of the pod implicitly, but should have its termination status reported.

The reason for a pod-level restart policy is that it affects the lifetime of the pod as a whole. A pod should only be terminated once all the containers within it terminate. I have not been able to think of a single valid use case where containers within the pod should have different restart policies. It seems confusing, unnecessarily complex, and likely to promote misuse of the pod abstraction, as with all proposed use cases in this issue.

The common case for multi-container pods is services where all containers run forever. The common case for batch and test workloads that terminate is just one container per pod. We should allow multiple containers that terminate, but we need to implement clean semantics for this case.

One-off or periodic operations should be performed using runin (which we should expose as soon as the first-class support in Docker is completed), by forking from within a container, or with entirely new pods. Things like initial data loaders should be triggered using event hooks (#140). Restart policy is not sufficient to make this work.

I also don't want to implement increasingly complex restart policies in the core, but instead provide hooks such that users or higher-level APIs can implement whatever policies they please. In fact, we could entirely punt on restart policies with the right hooks, by giving users the option to nuke their own containers, pods, or sets of containers upon termination, before they restart. However, a simple restart policy would be easier to use for common batch and test use cases, and would convey useful semantic/intent information to the system about the type of workload being run.
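For clarity, a minimal sketch of those pod-level semantics -- one policy per pod, with the pod finished only when every container is terminal under that policy. Types and names are illustrative, not actual kubelet code:

```go
package main

import "fmt"

type RestartPolicy string

const (
	Always    RestartPolicy = "always"
	OnFailure RestartPolicy = "on-failure"
	Never     RestartPolicy = "never"
)

// shouldRestart applies the pod's single policy to one exited container.
func shouldRestart(p RestartPolicy, exitCode int) bool {
	switch p {
	case Always:
		return true
	case OnFailure:
		return exitCode != 0
	}
	return false
}

// podFinished reports whether the pod as a whole is done, assuming every
// container has already exited with the given codes: the pod terminates
// only when no container should be restarted.
func podFinished(p RestartPolicy, exitCodes []int) bool {
	for _, code := range exitCodes {
		if shouldRestart(p, code) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(podFinished(OnFailure, []int{0, 0})) // true: batch pod succeeded
	fmt.Println(podFinished(OnFailure, []int{0, 1})) // false: one container retries
}
```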
Can you expand on this? Reusing a pod for a periodic action doesn't seem consistent with our model to me. |
@lavalamp I was thinking runin would be used for the one-off case, mostly, such as for debugging, data recovery, or emergency maintenance. Continuous background and/or periodic use cases include:
"Cloud-native" workloads would store the data to distributed/cloud storage and launch new pods to do the processing, similar to Scheduled Tasks in AppEngine. Legacy workloads that store data on local storage volumes would need these tasks to run locally, and/or have some way of offloading the data. Some people (e.g., https://news.ycombinator.com/item?id=7258009) argue that one should run cron inside the container to do this, but then that would require a process manager / supervisor. Instead, one could run it in a container by itself, accessing a shared volume, either files in the volume or named sockets or pipes, similar to how log management can be handled. |
I had a long offline discussion with @bgrant0607 this morning, and I agreed with him based on the definition of a pod as a scheduling unit, not a scheduling target. Once you accept that definition, a list of potential use cases involving intra-pod workflows can be ruled out. Enabling intra-pod workflows through hierarchical scheduling is too complicated, error-prone, and not necessary for most use cases, if not all. Use cases that have a run_forever controller container in a pod and a bunch of run_til_succeed batch jobs listening to the controller should be handled at a higher level.

I came up with several possible use cases. One is a pre-config container that runs only once and personalizes the pod for a service. Brian pointed out it could be handled by an event hook. I agreed that an event hook is a clean way to handle this, even if it is much harder for users to use at the beginning. Another use case is a cron-type job or debugging process, but the run-in feature should handle that. Beyond these, I couldn't come up with any more use cases that would need different restart policies for containers in a given pod. If a service job wants to run forever, its monitoring and logging collector jobs should also run forever. If a canary version of a service wants to run once, all its helper containers only need to run once.

I actually started a PR to introduce a restart policy at the container level based on my instinct and my past experience. But I failed to convince myself with a solid / valid use case given the pod definition. That is why I called a meeting with Brian, and he obviously convinced me on this very topic.
Meanwhile: moby/moby#7226 |