Support Restart policy in the kubelet (pre-design) #544

Closed
smarterclayton opened this issue Jul 20, 2014 · 22 comments
Assignees
Labels
kind/design Categorizes issue or PR as related to design.

Comments

@smarterclayton
Contributor

In order to begin to enable #127 (as soon as #356 is merged) I'm going to start down the following path:

  • in the sync loop for the kubelet, retrieve the list of completed containers (not just running) and update an in-memory list of tuples (podFullName, containerName, containerAttemptID, exitCode, timeStarted, timeEnded, killed)
  • on container manifest, add a restart policy value for the pod with a default that is equivalent to the current API behavior (restart always)
  • when starting a pod, check the in-memory list for previous attempts and follow the policy
    • if the list is empty, initialize it from the list of completed containers on disk
  • expose the attempt log in the kubelet server (on podInfo or as a watch api or both)
  • tie container events in the kubelet more directly to the attempt list and log consistently

Work not included here:

  • record attempts on disk independent of docker
  • delete completed containers from disk once they've been recorded in the attempt log
  • report completion status on a pod from the apiserver
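The in-memory bookkeeping sketched above could look something like the following. This is a minimal illustration, not the eventual kubelet API; all type and field names (`containerAttempt`, `attemptLog`, etc.) are hypothetical, chosen only to mirror the tuple proposed in the first bullet:

```go
package main

import (
	"fmt"
	"time"
)

// containerAttempt mirrors the tuple proposed above:
// (podFullName, containerName, containerAttemptID, exitCode, timeStarted, timeEnded, killed)
type containerAttempt struct {
	PodFullName   string
	ContainerName string
	AttemptID     string // e.g. the docker container ID for this attempt
	ExitCode      int
	Started       time.Time
	Ended         time.Time
	Killed        bool
}

// attemptLog keys attempts by pod/container so the sync loop can consult
// previous attempts before deciding whether to start a container.
type attemptLog map[string][]containerAttempt

func key(podFullName, containerName string) string {
	return podFullName + "/" + containerName
}

func (l attemptLog) record(a containerAttempt) {
	k := key(a.PodFullName, a.ContainerName)
	l[k] = append(l[k], a)
}

func (l attemptLog) previousAttempts(podFullName, containerName string) []containerAttempt {
	return l[key(podFullName, containerName)]
}

func main() {
	attempts := attemptLog{}
	attempts.record(containerAttempt{PodFullName: "default/pod1", ContainerName: "web", ExitCode: 1})
	// The sync loop would check this before honoring the restart policy.
	fmt.Println(len(attempts.previousAttempts("default/pod1", "web"))) // prints 1
}
```

On kubelet startup the map would be seeded from the completed containers still on disk, per the "if the list is empty, initialize it" step above.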
@pmorie
Member

pmorie commented Jul 21, 2014

Seems like #137 is a dep of this, agree @smarterclayton ?

@ncdc
Member

ncdc commented Jul 21, 2014

@pmorie makes sense to me, yeah

@smarterclayton
Contributor Author

Part of this will be to put in place the framework for #137 - killed in the tuple represents some of that data. Agree this should set the foundation for that.

@smarterclayton
Contributor Author

I'll probably also introduce PodInstanceID at the same time as ContainerAttemptID (which is the DockerID) from https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/identifiers.md

@ironcladlou
Contributor

on container manifest, add a restart policy value for the pod as a whole

Not sure I understand that statement; in #127 there's an open discussion re: per-pod/per-container policy. Which direction are you headed?

@smarterclayton
Contributor Author

I'll update to say per container.

@thockin
Member

thockin commented Jul 21, 2014

I know Dawn was starting to look at this too - might be worth making sure
we don't overlap too much.

I do think restart should be a per-container thing.

I also want to think a bit about HA. Maybe it is not important here but
worth explicitly ruling out. Internally, we had some real problems with
the agent becoming unavailable for various reasons, which meant restarts
were not being processed.

We solved this by moving restarts out of the agent proper and into a third
party (very much like systemd) which was simpler, had a single purpose, and
was updated less frequently.

In the first containervm we solved it by wrapping each container run in a
shell loop which would do restarts no matter what happened on the main
agent.

Like I said - not sure it matters here, but worth thinking about. What is
our target SLA for container restarts?

@smarterclayton
Contributor Author

As I was thinking about this problem, I think I'd argue for a PodControl abstraction that the sync loop invokes to manage the details of the actual pod - that allows a separate subsystem to manage restart policy as well down the road.

Dawn, let me know what you'll cover and whether there are pieces you want to split up.

@dchen1107
Member

Clayton, I started a PR that simply adds a restart policy for the container to the container manifest, with a default equivalent to the current API behavior (restart always).

Initially I planned to put more effort into restart policy, but I am fine with you taking over the whole issue, or handing the entire thing over to me.

@smarterclayton
Contributor Author

Up to you - if you've got other items you'd like to take I'm happy to cover this one.

@dchen1107
Member

If you don't mind, I would like to take over this work item. I am still new to Kubernetes and golang, and want to take this on to get familiar with the entire workflow and codebase. Thanks!

@ironcladlou
Contributor

Two more important pieces:

  1. kubelet.go#GetPodInfo needs to be adapted to do a full container list (not just running containers) using the docker client in order to correctly report currentState for stopped containers. It's also worth noting that the docker container state details aren't even reported by the kubelet unless cadvisor is present.
  2. It becomes very important to delete old containers which are actively removed from the pod (in response to an active container delete from the pod, or the deletion of the pod itself). The current killContainer implementation just stops containers. Due to the naming conventions, old stopped containers from previous pods can become part of the effective state of new pods whose container and pod IDs align. (I could be mistaken on that last example; I thought I encountered it early on and am going from memory.)
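The naming-collision hazard in point 2 can be illustrated with a deliberately simplified, hypothetical version of the docker naming scheme (the real convention included more fields; the point is only that the name carries no pod *instance* identity):

```go
package main

import (
	"fmt"
	"strings"
)

// dockerContainerName is a simplified, hypothetical version of the naming
// convention the kubelet used to recover pod state from docker.
func dockerContainerName(containerName, podFullName string) string {
	return "k8s_" + containerName + "_" + podFullName
}

// belongsToPod matches a docker container name against a pod, with no way
// to tell which creation of that pod produced the container.
func belongsToPod(dockerName, podFullName string) bool {
	return strings.HasSuffix(dockerName, "_"+podFullName)
}

func main() {
	// A stopped container left on disk by a pod that was later deleted...
	stale := dockerContainerName("web", "default/pod1")
	// ...still matches a brand-new pod created with the same name, so it
	// becomes part of the new pod's effective state unless it is removed.
	fmt.Println(belongsToPod(stale, "default/pod1")) // prints true
}
```

This is why stopping containers without removing them (the current killContainer behavior) is not sufficient.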

This branch contains a very quick n' dirty run-once policy implementation and also handles container cleanup. It doesn't implement the more efficient/comprehensive design outlined above, but it at least demonstrates a lot of points of modification for the overall change. The hacky policy implementation is just incidental to the other POC work going on, so mostly just check out the modifications to kubelet.go.

@smarterclayton
Contributor Author

@dchen1107 any updates on where this is?

@thockin
Member

thockin commented Aug 8, 2014

Dawn is out all this week.

We need to try to reconcile with
moby/moby#7226 and
moby/moby#7414


@smarterclayton
Contributor Author

Any updates on this?

@dawnchen

There may be two Dawn Chens on GitHub; sorry, I'm not the one earlier in this thread.

@dchen1107
Member

I have a pending one in a branch, but have not finished it. Will concentrate on this for the rest of the week.

@thockin
Member

thockin commented Aug 23, 2014

In the previous PR you added RestartPolicy to PodState, and you added it as
a struct holding an enum. I did not get to that PR fast enough, but here are
my thoughts.

  1. PodState is wrong - it has to be on Manifest or Container - that is the
    unit that the kubelet consumes right now.

  2. I don't get the enum-in-a-struct model. I would have expected to see
    either a plain enum:

type RestartPolicyType string

const (
    RestartAlways    RestartPolicyType = "Always"
    RestartOnFailure RestartPolicyType = "OnFailure"
    RestartNever     RestartPolicyType = "Never"
)

// on the container:
RestartPolicy RestartPolicyType

or else a struct with optional fields (proto style) for possible extensions
later:

type RestartPolicy struct {
    Always  *RestartPolicyAlways
    OnError *RestartPolicyOnError
    Never   *RestartPolicyNever
}
type RestartPolicyAlways struct{}
type RestartPolicyOnError struct{}
type RestartPolicyNever struct{}

We also might choose to rephrase "never" to "run once".

We also should think about enums and strings: "ON_ERROR" vs "OnError".
With Port protocols we said we do caseless comparisons, which results in
"ONERROR". This is just a convention we should decide on and write down.
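To make the convention question concrete, here is one way a caseless comparison could behave (a sketch only; `matchesPolicy` is not a real kubelet function, and whether underscores get stripped first is exactly the choice that needs writing down):

```go
package main

import (
	"fmt"
	"strings"
)

// matchesPolicy does a caseless comparison after stripping underscores, so
// "ON_ERROR", "OnError", and "onerror" all match the same canonical value.
func matchesPolicy(input, canonical string) bool {
	stripped := strings.ReplaceAll(input, "_", "")
	return strings.EqualFold(stripped, canonical)
}

func main() {
	fmt.Println(matchesPolicy("ON_ERROR", "OnError")) // true
	fmt.Println(matchesPolicy("onerror", "OnError"))  // true
	fmt.Println(matchesPolicy("Always", "OnError"))   // false
}
```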


@dchen1107
Member

@thockin 1) You are right. My current working PR did move RestartPolicy to ContainerManifest. 2) The enum-in-a-struct model is for extension: RestartOnFailure could take extra parameters, for example max failures. Also, every restart policy should include a configurable parameter -- a minimum interval between restarts, etc.

@thockin
Member

thockin commented Aug 26, 2014

So then I think it should look something like LivenessProbe:

type RestartPolicy struct {
    DelaySeconds int
    Always  *RestartPolicyAlways
    OnError *RestartPolicyOnError
    Never   *RestartPolicyNever
}
type RestartPolicyAlways struct{}
type RestartPolicyOnError struct {
    MaxTries int
}
type RestartPolicyNever struct{}
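A sketch of how the kubelet's sync loop might consume this shape: the non-nil pointer selects the policy. The decision logic here is hypothetical, not from any PR in this thread:

```go
package main

import "fmt"

// Types as proposed above.
type RestartPolicy struct {
	DelaySeconds int
	Always       *RestartPolicyAlways
	OnError      *RestartPolicyOnError
	Never        *RestartPolicyNever
}
type RestartPolicyAlways struct{}
type RestartPolicyOnError struct{ MaxTries int }
type RestartPolicyNever struct{}

// shouldRestart decides whether a container that exited with exitCode,
// after `tries` prior attempts, should be started again.
func shouldRestart(p RestartPolicy, exitCode, tries int) bool {
	switch {
	case p.Always != nil:
		return true
	case p.OnError != nil:
		return exitCode != 0 && tries < p.OnError.MaxTries
	default: // Never, or unset
		return false
	}
}

func main() {
	p := RestartPolicy{OnError: &RestartPolicyOnError{MaxTries: 3}}
	fmt.Println(shouldRestart(p, 1, 1)) // true: nonzero exit, under retry cap
	fmt.Println(shouldRestart(p, 0, 1)) // false: clean exit
	fmt.Println(shouldRestart(p, 1, 3)) // false: retry cap reached
}
```

One advantage of this model over the plain enum is visible here: MaxTries rides along with the OnError branch instead of needing a separate, possibly-ignored field.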


@bgrant0607 bgrant0607 added this to the v1.0 milestone Aug 27, 2014
@brendandburns brendandburns modified the milestones: 0.7, v1.0 Sep 24, 2014
@brendandburns
Contributor

I believe this is now fixed. Re-open if it's not.

@smarterclayton
Contributor Author

It was fixed.

vishh pushed a commit to vishh/kubernetes that referenced this issue Apr 6, 2016
mqliang pushed a commit to mqliang/kubernetes that referenced this issue Dec 8, 2016
* Fix interface problem & apiserver bind address
mqliang pushed a commit to mqliang/kubernetes that referenced this issue Mar 3, 2017
* Fix interface problem & apiserver bind address
tkashem pushed a commit to tkashem/kubernetes that referenced this issue Feb 3, 2021
linxiulei pushed a commit to linxiulei/kubernetes that referenced this issue Jan 18, 2024
…ecker

Separate linux/windows health checker files. Build health checker plugin for Windows

9 participants