
Flapping service handling #13225

Closed

evanphx opened this issue Aug 26, 2015 · 6 comments
Labels: priority/backlog (Higher priority than priority/awaiting-more-evidence.), sig/node (Categorizes an issue or PR as relevant to SIG Node.)

Comments

evanphx commented Aug 26, 2015

This is more of a question and perhaps a feature request. Does Kubernetes have support for dealing with flapping services (ones that start and stop very quickly), specifically so that their flapping doesn't have a negative effect on the cluster? And if not, should it?

A negative effect could be:

  • high cpu burn
  • log flooding
  • resource hogging (quick claim and release cycles)
thockin (Member) commented Aug 26, 2015

We do not have any special handling for flapping that I know of.


kamalmarhubi (Contributor) commented Aug 27, 2015

Where would it make the most sense to add this? A couple of options I can think of:

  • a new pod restart policy maxFailures that is allowed to be used with a replication controller (a rough sketch of such a field follows this list)
    • the restart policy would need to become a map to express the value
    • the kubelet would restart the pod up to maxFailures times, then the pod would be permanently failed (perhaps a new pod phase?)
    • the replication controller would enter a failure phase if too many pods fail
    • this may require changes to the phases on both pods and replication controllers, which is non-ideal and perhaps even API-breaking (I'm not sure how to read "New phase values should not be added to existing objects in the future" from the API conventions)
  • introduce a new type of job controller for this so that the API breakage is avoided; it would have differences from the replication controller similar to the above
  • decide that this is outside the scope of Kubernetes, and that the application developer should monitor for flapping and deal with it as part of their normal application monitoring
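
For concreteness, here is a purely hypothetical Go sketch of what the first option might look like if the restart policy became a structured field. The names RestartPolicySpec and MaxFailures are invented for illustration and are not part of the Kubernetes API.

```go
// Hypothetical sketch only: a restart policy carried as a struct so it can
// also hold a failure cap. Neither RestartPolicySpec nor MaxFailures exists
// in the real Kubernetes API.
package api

type RestartPolicySpec struct {
	// Policy keeps the existing Always / OnFailure / Never value.
	Policy string `json:"policy"`
	// MaxFailures is how many times the kubelet may restart the pod's
	// containers before marking the pod permanently failed; 0 means no cap.
	MaxFailures int32 `json:"maxFailures,omitempty"`
}
```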

thockin (Member) commented Aug 27, 2015

I think it might be simpler to start with adding restart backoff to kubelet (or do we have that already?) such that a pod which crashes more than X times in Y seconds starts getting throttled. Don't even make it configurable. A second step could be to add a max-restarts parameter to pod.spec, as a new field, not as a restart policy. How many times you can restart is actually orthogonal to restartPolicy. Setting max-restarts to 0 is roughly equivalent to RestartNever.
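
As a rough illustration of the "more than X crashes in Y seconds gets throttled" idea, here is a minimal Go sketch that assumes nothing about kubelet internals; the type, field names, and thresholds are invented for illustration only.

```go
// Minimal sketch of a crash-count-in-window throttle; not kubelet code.
package backoff

import "time"

type crashThrottle struct {
	window     time.Duration // Y: how far back crashes are counted
	maxCrashes int           // X: crashes tolerated inside the window
	delay      time.Duration // wait imposed once the threshold is crossed
	crashes    []time.Time   // timestamps of recent container exits
}

// RecordCrash notes a container exit and drops entries older than the window.
func (t *crashThrottle) RecordCrash(now time.Time) {
	t.crashes = append(t.crashes, now)
	cutoff := now.Add(-t.window)
	for len(t.crashes) > 0 && t.crashes[0].Before(cutoff) {
		t.crashes = t.crashes[1:]
	}
}

// RestartDelay returns how long to wait before the next restart attempt.
func (t *crashThrottle) RestartDelay() time.Duration {
	if len(t.crashes) > t.maxCrashes {
		return t.delay
	}
	return 0
}
```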


bprashanth (Contributor) commented:

The first part should already work (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/dockertools/manager.go#L1719): we back off based on the container's terminated timestamp (https://github.com/kubernetes/kubernetes/blob/master/pkg/api/types.go#L796) and maxContainerBackoff (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L84), so I think this boils down to the API change.
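
To make that mechanism concrete, here is a rough sketch (not the actual kubelet code) of a capped exponential restart delay keyed on the container's terminated timestamp; the reset threshold and function name are illustrative assumptions.

```go
// Illustrative only: capped exponential backoff that resets once a container
// has run long enough to no longer look like it is crash-looping.
package backoff

import "time"

func nextRestartDelay(startedAt, finishedAt time.Time, prev, base, max time.Duration) time.Duration {
	// The 10-minute "healthy run" threshold is an assumption for this sketch.
	if finishedAt.Sub(startedAt) > 10*time.Minute {
		return base
	}
	next := prev * 2
	if next == 0 {
		next = base
	}
	if next > max {
		next = max // cap the delay, analogous in spirit to maxContainerBackoff
	}
	return next
}
```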

@bprashanth added the sig/node label on Aug 27, 2015
bgrant0607 (Member) commented:

We should not add maxFailures. Backoff is the way to go.

@bgrant0607 added the priority/backlog label on Oct 8, 2015
bgrant0607 (Member) commented:

We implemented backoff in Kubelet a long time ago. Backoff in controllers is #22298 and #33041.

Other related discussions about restart policy knobs were in #13385 and #127.
