Support Restart policy in the kubelet (pre-design) #544

Closed
smarterclayton opened this issue Jul 20, 2014 · 22 comments
Assignees
Labels
kind/design Categorizes issue or PR as related to design.

Comments

@smarterclayton
Contributor

In order to begin to enable #127 (as soon as #356 is merged) I'm going to start down the following path:

  • in the sync loop for the kubelet, retrieve the list of completed containers (not just running) and update an in-memory list of tuples (podFullName, containerName, containerAttemptID, exitCode, timeStarted, timeEnded, killed)
  • on container manifest, add a restart policy value for the pod with a default that is equivalent to the current API behavior (restart always)
  • when starting a pod, check the in-memory list for previous attempts and follow the policy
    • if the list is empty, initialize it from the list of completed containers on disk
  • expose the attempt log in the kubelet server (on podInfo or as a watch api or both)
  • tie container events in the kubelet more directly to the attempt list and log consistently

Work not included here:

  • record attempts on disk independent of docker
  • delete completed containers from disk once they've been recorded in the attempt log
  • report completion status on a pod from the apiserver
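The in-memory bookkeeping sketched above could look something like the following. This is a minimal illustration, not the eventual kubelet API; all type and field names (`containerAttempt`, `attemptLog`, etc.) are hypothetical, chosen only to mirror the tuple proposed in the first bullet:

```go
package main

import (
	"fmt"
	"time"
)

// containerAttempt mirrors the tuple proposed above:
// (podFullName, containerName, containerAttemptID, exitCode, timeStarted, timeEnded, killed)
type containerAttempt struct {
	PodFullName   string
	ContainerName string
	AttemptID     string // e.g. the docker container ID for this attempt
	ExitCode      int
	Started       time.Time
	Ended         time.Time
	Killed        bool
}

// attemptLog keys attempts by pod/container so the sync loop can consult
// previous attempts before deciding whether to start a container.
type attemptLog map[string][]containerAttempt

func key(podFullName, containerName string) string {
	return podFullName + "/" + containerName
}

func (l attemptLog) record(a containerAttempt) {
	k := key(a.PodFullName, a.ContainerName)
	l[k] = append(l[k], a)
}

func (l attemptLog) previousAttempts(podFullName, containerName string) []containerAttempt {
	return l[key(podFullName, containerName)]
}

func main() {
	attempts := attemptLog{}
	attempts.record(containerAttempt{PodFullName: "default/pod1", ContainerName: "web", ExitCode: 1})
	// The sync loop would check this before honoring the restart policy.
	fmt.Println(len(attempts.previousAttempts("default/pod1", "web"))) // prints 1
}
```

On kubelet startup the map would be seeded from the completed containers still on disk, per the "if the list is empty, initialize it" step above.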
@pmorie
Member

pmorie commented Jul 21, 2014

Seems like #137 is a dep of this, agree @smarterclayton ?

@ncdc
Member

ncdc commented Jul 21, 2014

@pmorie makes sense to me, yeah

@smarterclayton
Contributor Author

Part of this will be to put in place the framework for #137 - killed in the tuple represents some of that data. Agree this should set the foundation for that.

@smarterclayton
Contributor Author

I'll probably also introduce PodInstanceID at the same time as ContainerAttemptID (which is the DockerID) from https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/identifiers.md

@ironcladlou
Contributor

on container manifest, add a restart policy value for the pod as a whole

Not sure I understand that statement; in #127 there's an open discussion re: per-pod/per-container policy. Which direction are you headed?

@smarterclayton
Contributor Author

I'll update to say per container.

@thockin
Member

thockin commented Jul 21, 2014

I know Dawn was starting to look at this too - might be worth making sure
we don't overlap too much.

I do think restart should be a per-container thing.

I also want to think a bit about HA. Maybe it is not important here but
worth explicitly ruling out. Internally, we had some real problems with
the agent becoming unavailable for various reasons, which meant restarts
were not being processed.

We solved this by moving restarts out of the agent proper and into a third
party (very much like systemd) which was simpler, had a single purpose, and
was updated less frequently.

In the first containervm we solved it by wrapping each container run in a
shell loop which would do restarts no matter what happened on the main
agent.

Like I said - not sure it matters here, but worth thinking about. What is
our target SLA for container restarts?

@smarterclayton
Contributor Author

As I was thinking about this problem, I think I'd argue for a PodControl abstraction that the sync loop invokes to manage the details of the actual pod - that allows a separate subsystem to manage restart policy as well down the road.

Dawn, let me know what you'll cover and whether there are pieces you want to split up.

@dchen1107
Member

Clayton, I started a PR that simply adds a restart policy for the container to the container manifest, with a default equivalent to the current API behavior (restart always).

Initially I planned to put more effort into restart policy, but I am fine with you taking over the whole issue, or handing the entire thing over to me.

@smarterclayton
Contributor Author

Up to you - if you've got other items you'd like to take I'm happy to cover this one.

@dchen1107
Member

If you don't mind, I would like to take over this work item. I am still new to Kubernetes and golang, and want to take this on to get familiar with the entire workflow and codebase. Thanks!

@ironcladlou
Contributor

Two more important pieces:

  1. kubelet.go#GetPodInfo needs to be adapted to do a full container list (not just running containers) using the docker client in order to correctly report currentState for stopped containers. It's also worth noting that the docker container state details aren't even reported by the kubelet unless cadvisor is present.
  2. It becomes very important to delete old containers which are actively removed from the pod (in response to an active container delete from the pod, or the deletion of the pod itself). The current killContainer implementation just stops containers. Due to the naming conventions, old stopped containers from previous pods can become part of the effective state of new pods whose container and pod IDs align. (I could be mistaken on that last example; I thought I encountered it early on and am going from memory.)
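The naming-collision hazard in point 2 can be illustrated with a deliberately simplified, hypothetical version of the docker naming scheme (the real convention included more fields; the point is only that the name carries no pod *instance* identity):

```go
package main

import (
	"fmt"
	"strings"
)

// dockerContainerName is a simplified, hypothetical version of the naming
// convention the kubelet used to recover pod state from docker.
func dockerContainerName(containerName, podFullName string) string {
	return "k8s_" + containerName + "_" + podFullName
}

// belongsToPod matches a docker container name against a pod, with no way
// to tell which creation of that pod produced the container.
func belongsToPod(dockerName, podFullName string) bool {
	return strings.HasSuffix(dockerName, "_"+podFullName)
}

func main() {
	// A stopped container left on disk by a pod that was later deleted...
	stale := dockerContainerName("web", "default/pod1")
	// ...still matches a brand-new pod created with the same name, so it
	// becomes part of the new pod's effective state unless it is removed.
	fmt.Println(belongsToPod(stale, "default/pod1")) // prints true
}
```

This is why stopping containers without removing them (the current killContainer behavior) is not sufficient.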

This branch contains a very quick n' dirty run-once policy implementation and also handles container cleanup. It doesn't implement the more efficient/comprehensive design outlined above, but it at least demonstrates a lot of points of modification for the overall change. The hacky policy implementation is just incidental to the other POC work going on, so mostly just check out the modifications to kubelet.go.

@smarterclayton
Contributor Author

@dchen1107 any updates on where this is?

@thockin
Member

thockin commented Aug 8, 2014

Dawn is out all this week.

We need to try to reconcile with
moby/moby#7226 and
moby/moby#7414


@smarterclayton
Contributor Author

Any updates on this?

@dawnchen

There may be two Dawn Chens on GitHub; sorry, I'm not the one earlier in this thread.

@dchen1107
Member

I have a pending one in a branch, but have not finished it. Will concentrate on this for the rest of the week.

@thockin
Member

thockin commented Aug 23, 2014

In the previous PR you added RestartPolicy to PodState, and you added it as
a struct holding an enum. I did not get to that PR fast enough, but here are
my thoughts.

  1. PodState is wrong - it has to be on Manifest or Container - that is the
    unit that the kubelet consumes right now.

  2. I don't get the enum-in-a-struct model. I would have expected to see
    either a plain enum:

type RestartPolicyType string

const (
    RestartAlways    RestartPolicyType = "Always"
    RestartOnFailure RestartPolicyType = "OnFailure"
    RestartNever     RestartPolicyType = "Never"
)

// on the container:
RestartPolicy RestartPolicyType

or else a struct with optional fields (proto style) for possible extensions
later:

type RestartPolicy struct {
    Always  *RestartPolicyAlways
    OnError *RestartPolicyOnError
    Never   *RestartPolicyNever
}
type RestartPolicyAlways struct{}
type RestartPolicyOnError struct{}
type RestartPolicyNever struct{}

We also might choose to rephrase "never" to "run once".

We also should think about enums and strings: "ON_ERROR" vs "OnError".
With Port protocols we said we do caseless comparisons, which results in
"ONERROR". This is just a convention we should decide on and write down.
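To make the convention question concrete, here is one way a caseless comparison could behave (a sketch only; `matchesPolicy` is not a real kubelet function, and whether underscores get stripped first is exactly the choice that needs writing down):

```go
package main

import (
	"fmt"
	"strings"
)

// matchesPolicy does a caseless comparison after stripping underscores, so
// "ON_ERROR", "OnError", and "onerror" all match the same canonical value.
func matchesPolicy(input, canonical string) bool {
	stripped := strings.ReplaceAll(input, "_", "")
	return strings.EqualFold(stripped, canonical)
}

func main() {
	fmt.Println(matchesPolicy("ON_ERROR", "OnError")) // true
	fmt.Println(matchesPolicy("onerror", "OnError"))  // true
	fmt.Println(matchesPolicy("Always", "OnError"))   // false
}
```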


@dchen1107
Member

@thockin 1) You are right. My current working PR did move RestartPolicy to ContainerManifest. 2) The enum-in-a-struct model is for extension: RestartOnFailure could take extra parameters, for example max failures. Also, every restart policy should include a configurable parameter -- a minimum interval between restarts, etc.

@thockin
Member

thockin commented Aug 26, 2014

So then I think it should look something like LivenessProbe:

type RestartPolicy struct {
    DelaySeconds int
    Always  *RestartPolicyAlways
    OnError *RestartPolicyOnError
    Never   *RestartPolicyNever
}
type RestartPolicyAlways struct{}
type RestartPolicyOnError struct {
    MaxTries int
}
type RestartPolicyNever struct{}
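A sketch of how the kubelet's sync loop might consume this shape: the non-nil pointer selects the policy. The decision logic here is hypothetical, not from any PR in this thread:

```go
package main

import "fmt"

// Types as proposed above.
type RestartPolicy struct {
	DelaySeconds int
	Always       *RestartPolicyAlways
	OnError      *RestartPolicyOnError
	Never        *RestartPolicyNever
}
type RestartPolicyAlways struct{}
type RestartPolicyOnError struct{ MaxTries int }
type RestartPolicyNever struct{}

// shouldRestart decides whether a container that exited with exitCode,
// after `tries` prior attempts, should be started again.
func shouldRestart(p RestartPolicy, exitCode, tries int) bool {
	switch {
	case p.Always != nil:
		return true
	case p.OnError != nil:
		return exitCode != 0 && tries < p.OnError.MaxTries
	default: // Never, or unset
		return false
	}
}

func main() {
	p := RestartPolicy{OnError: &RestartPolicyOnError{MaxTries: 3}}
	fmt.Println(shouldRestart(p, 1, 1)) // true: nonzero exit, under retry cap
	fmt.Println(shouldRestart(p, 0, 1)) // false: clean exit
	fmt.Println(shouldRestart(p, 1, 3)) // false: retry cap reached
}
```

One advantage of this model over the plain enum is visible here: MaxTries rides along with the OnError branch instead of needing a separate, possibly-ignored field.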


@bgrant0607 bgrant0607 added this to the v1.0 milestone Aug 27, 2014
@brendandburns brendandburns modified the milestones: 0.7, v1.0 Sep 24, 2014
@brendandburns
Contributor

I believe this is now fixed. Re-open if it's not.

@smarterclayton
Contributor Author

It was fixed.

vishh pushed a commit to vishh/kubernetes that referenced this issue Apr 6, 2016
mqliang pushed a commit to mqliang/kubernetes that referenced this issue Dec 8, 2016
* Fix interface problem & apiserver bind address
mqliang pushed a commit to mqliang/kubernetes that referenced this issue Mar 3, 2017
* Fix interface problem & apiserver bind address
tkashem pushed a commit to tkashem/kubernetes that referenced this issue Feb 3, 2021
linxiulei pushed a commit to linxiulei/kubernetes that referenced this issue Jan 18, 2024
…ecker

Separate linux/windows health checker files. Build health checker plugin for Windows

9 participants