
Flapping service handling #13225

Closed

evanphx opened this issue Aug 26, 2015 · 6 comments
Labels: priority/backlog (Higher priority than priority/awaiting-more-evidence.), sig/node (Categorizes an issue or PR as relevant to SIG Node.)

Comments

evanphx commented Aug 26, 2015

This is more of a question and perhaps a feature request. Does Kubernetes have support for dealing with flapping services (ones that start and stop very quickly), specifically so that their flapping doesn't have a negative effect on the cluster? And if not, should it?

A negative effect could be:

  • high cpu burn
  • log flooding
  • resource hogging (quick claim and release cycles)
thockin (Member) commented Aug 26, 2015

We do not have any special handling for flapping that I know of.


kamalmarhubi (Contributor) commented Aug 27, 2015

Where would it make the most sense to add this? A couple of options I can think of:

  • a new pod restart policy maxFailures that is allowed to be used with a replication controller (a rough sketch of such a field follows this list)
    • the restart policy would need to become a map to express the value
    • the kubelet would restart the pod up to maxFailures times, then the pod would be permanently failed (perhaps a new pod phase?)
    • the replication controller would enter a failure phase if too many pods fail
    • this may require changes to the phases on both pods and replication controllers, which is non-ideal and perhaps even API-breaking (I'm not sure how to read "New phase values should not be added to existing objects in the future" from the API conventions)
  • introduce a new type of job controller for this so that the API breakage is avoided; it would have differences from the replication controller similar to the above
  • decide that this is outside the scope of Kubernetes, and that the application developer should monitor for flapping and deal with it as part of their normal application monitoring
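
For concreteness, here is a purely hypothetical Go sketch of what the first option might look like if the restart policy became a structured field. The names RestartPolicySpec and MaxFailures are invented for illustration and are not part of the Kubernetes API.

```go
// Hypothetical sketch only: a restart policy carried as a struct so it can
// also hold a failure cap. Neither RestartPolicySpec nor MaxFailures exists
// in the real Kubernetes API.
package api

type RestartPolicySpec struct {
	// Policy keeps the existing Always / OnFailure / Never value.
	Policy string `json:"policy"`
	// MaxFailures is how many times the kubelet may restart the pod's
	// containers before marking the pod permanently failed; 0 means no cap.
	MaxFailures int32 `json:"maxFailures,omitempty"`
}
```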

thockin (Member) commented Aug 27, 2015

I think it might be simpler to start with adding restart backoff to kubelet (or do we have that already?) such that a pod which crashes more than X times in Y seconds starts getting throttled. Don't even make it configurable. A second step could be to add a max-restarts parameter to pod.spec, as a new field, not as a restart policy. How many times you can restart is actually orthogonal to restartPolicy. Setting max-restarts to 0 is roughly equivalent to RestartNever.
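
As a rough illustration of the "more than X crashes in Y seconds gets throttled" idea, here is a minimal Go sketch that assumes nothing about kubelet internals; the type, field names, and thresholds are invented for illustration only.

```go
// Minimal sketch of a crash-count-in-window throttle; not kubelet code.
package backoff

import "time"

type crashThrottle struct {
	window     time.Duration // Y: how far back crashes are counted
	maxCrashes int           // X: crashes tolerated inside the window
	delay      time.Duration // wait imposed once the threshold is crossed
	crashes    []time.Time   // timestamps of recent container exits
}

// RecordCrash notes a container exit and drops entries older than the window.
func (t *crashThrottle) RecordCrash(now time.Time) {
	t.crashes = append(t.crashes, now)
	cutoff := now.Add(-t.window)
	for len(t.crashes) > 0 && t.crashes[0].Before(cutoff) {
		t.crashes = t.crashes[1:]
	}
}

// RestartDelay returns how long to wait before the next restart attempt.
func (t *crashThrottle) RestartDelay() time.Duration {
	if len(t.crashes) > t.maxCrashes {
		return t.delay
	}
	return 0
}
```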


bprashanth (Contributor) commented:

The first part should already work (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/dockertools/manager.go#L1719): we back off based on the container's terminated timestamp (https://github.com/kubernetes/kubernetes/blob/master/pkg/api/types.go#L796) and maxContainerBackoff (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L84), so I think this boils down to the API change.
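
To make that mechanism concrete, here is a rough sketch (not the actual kubelet code) of a capped exponential restart delay keyed on the container's terminated timestamp; the reset threshold and function name are illustrative assumptions.

```go
// Illustrative only: capped exponential backoff that resets once a container
// has run long enough to no longer look like it is crash-looping.
package backoff

import "time"

func nextRestartDelay(startedAt, finishedAt time.Time, prev, base, max time.Duration) time.Duration {
	// The 10-minute "healthy run" threshold is an assumption for this sketch.
	if finishedAt.Sub(startedAt) > 10*time.Minute {
		return base
	}
	next := prev * 2
	if next == 0 {
		next = base
	}
	if next > max {
		next = max // cap the delay, analogous in spirit to maxContainerBackoff
	}
	return next
}
```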

@bprashanth added the sig/node label on Aug 27, 2015
bgrant0607 (Member) commented:

We should not add maxFailures. Backoff is the way to go.

@bgrant0607 added the priority/backlog label on Oct 8, 2015
bgrant0607 (Member) commented:

We implemented backoff in Kubelet a long time ago. Backoff in controllers is #22298 and #33041.

Other related discussions about restart policy knobs were in #13385 and #127.
