Pods that fail health checks always restarting on the same minion instead of others? #13385

Closed
joshm1 opened this issue Aug 31, 2015 · 16 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@joshm1

joshm1 commented Aug 31, 2015

Over the weekend the skydns container in the kube-dns pod died. I'm not sure of the exact reason because I couldn't find much detail in the logs, but watching the etcd and skydns logs suggested the root issue could've been etcd. My theory is that the /mnt/ephemeral/kubernetes filesystem was full (it's only 3.75GB and holds a few large empty-dir volumes). kube-dns was showing 3/4 ready.

This caused all of my application pods across 4 minions to go down. I had to manually delete the kube-dns pod; when it launched on another minion it was fine and everything came back online.

By the same token, I had one minion that would never consider any of my pods "ready", even though the other 3 minions did. I didn't find out why and my logs weren't helpful, so I just had to manually terminate that minion (EC2 instance) and auto-scale a new one (which happened to work fine).

In both of these cases, if k8s had automatically moved the constantly failing pods to other minions, I think the cluster would've healed itself. Is the fact that failing pods always restart on the same minion intentional, or is moving them elsewhere something in the works?

I'm sorry I don't have logs to show. I'm not sure how to retrieve them from 2 days ago after so many pods have been restarted.

@lavalamp
Member

lavalamp commented Sep 1, 2015

There are two things here: the first is to figure out what was wrong with your node and start detecting it. The second is the meta-problem of noticing that something is wrong with a node even when we don't have a detection mechanism for that specific thing.

@lavalamp lavalamp added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Sep 1, 2015
@joshm1
Author

joshm1 commented Sep 8, 2015

@lavalamp certainly, it's critical to be able to detect issues with nodes. Is this something on the roadmap that will be built into kubernetes/kubelets? In the meantime, I need some way to detect this internally and either automatically handle it and/or send alerts. What are some ways you'd advise to do this?

This issue bit me again over the weekend. I have a simple 3-node cluster in AWS, provisioned with cluster/kube-up, running only 3 non kube-system pods. Everything was healthy on Friday, and without any changes over the weekend, when I checked it again a few days later all pods on that particular node were failing [1]. Had they been restarted on another node, everything would've been fine.

[1] Is this another GitHub issue I should create?

kubectl get pods --all-namespaces -o wide
NAMESPACE     NAME                                                 READY     STATUS      RESTARTS   AGE       NODE
default       xxx-xxxxxxxx-prd-06c055d7-7neib                      1/1       Running     0          52m       ip-172-20-0-141.ec2.internal
default       xxx-xxxxxxxx-prd-06c055d7-sxkyo                      0/1       Image: xxxxxxxx/xxx-xxxxxxxx:prd-06c055d7 is not ready on the node   0   30m   ip-172-20-0-84.ec2.internal
kube-system   elasticsearch-logging-v1-61a90                       1/1       Running     0          5d        ip-172-20-0-84.ec2.internal
kube-system   fluentd-elasticsearch-ip-172-20-0-141.ec2.internal   1/1       Running     2          5d        ip-172-20-0-141.ec2.internal
kube-system   fluentd-elasticsearch-ip-172-20-0-183.ec2.internal   1/1       Running     0          5d        ip-172-20-0-183.ec2.internal
kube-system   fluentd-elasticsearch-ip-172-20-0-84.ec2.internal    1/1       Running     0          5d        ip-172-20-0-84.ec2.internal
kube-system   kibana-logging-v1-mldpo                              1/1       Running     0          5d        ip-172-20-0-84.ec2.internal
kube-system   kube-dns-v8-3zel5                                    4/4       Running     1          4d        ip-172-20-0-141.ec2.internal
kube-system   kube-dns-v8-ud0oq                                    1/4       API error (500): Cannot start container 060b46a4a91716cecc1e4cbe60a66450d33a8f7947db95577c9104cc849d744b: [8] System error: too many open files in system   33   5d   ip-172-20-0-84.ec2.internal
kube-system   kube-ui-v1-yq9an                                     1/1       Running     0          4d        ip-172-20-0-183.ec2.internal
kube-system   monitoring-heapster-v6-ckogk                         0/1       API error (500): Cannot start container 0e7daac3182af32b2867072b72b5d324b7e4177d1136df9d389fb671e8f280bf: [8] System error: too many open files in system   11   5d   ip-172-20-0-84.ec2.internal
kube-system   monitoring-influx-grafana-v1-4ubv9                   2/2       Running     2          4d        ip-172-20-0-84.ec2.internal

kubectl get events shows, repeatedly for the affected pods:
Error syncing pod, skipping: API error (500): Cannot start container 8b987eaa17cade98a4ba702381d88b52e7344e03af2f6f9f157ad4049ef35c2f: [8] System error: too many open files in system
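
(A minimal interim check, assuming a kubectl new enough to support -o jsonpath and treating any pod with more than a handful of restarts as suspect, could be run from cron and wired to whatever alerting is already in place; a sketch, not an endorsed mechanism:)

# Pods whose first container has restarted more than 5 times (threshold is arbitrary)
kubectl get pods --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
  | awk -F'\t' '$3 > 5'

# Nodes whose Ready condition is anything other than True
kubectl get nodes \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
  | awk -F'\t' '$2 != "True"'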

@lavalamp
Member

lavalamp commented Sep 8, 2015

@joshm1 We already detect various problems with nodes (disk full, docker down, etc).

It looks like you're running out of file handles, so something is leaking them or you have that system setting too low (we raise it for master, but I'm not sure about nodes).

@dchen1107 Can we detect out-of-FDs and make the node not ready?
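
(For context, the kernel already exposes the numbers such a check would need via /proc/sys/fs/file-nr; a minimal sketch of what out-of-FD detection could look like on a node, using an arbitrary 90% threshold:)

# /proc/sys/fs/file-nr reports: allocated handles, free handles, system-wide maximum
read allocated free max < /proc/sys/fs/file-nr
# Flag the node when more than ~90% of fs.file-max is in use
if [ $((allocated * 100 / max)) -gt 90 ]; then
  echo "low on file handles: $allocated of $max in use"
fi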

@joshm1
Author

joshm1 commented Sep 9, 2015

The minion that ran out of file handles didn't have any running pods on it at the time I saw that error. It's using the AWS auto-scaling group on Ubuntu created by kube-up.

@vishh
Contributor

vishh commented Sep 9, 2015

Detecting the total number of open fds should be possible. In addition, we should impose fd limits on containers to prevent leaks. #3595
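
(At the docker level, docker run does expose a per-container open-file ulimit, though, as the next comment notes, Kubernetes did not plumb this through at the time; a sketch with arbitrary limits:)

# Soft limit 1024 / hard limit 4096 open files, for this container only
docker run --rm --ulimit nofile=1024:4096 busybox sh -c 'ulimit -n'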

@dchen1107
Member

@lavalamp We can detect out-of-FDs and mark the node as not ready.

@vishh I don't think we can impose fd limits on containers yet, given docker's current implementation. I will explain in #3595.

@davidopp
Member

@dchen1107 Should we rename this issue to "detect out-of-FDs and mark node not ready when it happens"? I thought maybe we already had an issue open for that, but I can't find one.

@caesarxuchao
Member

How should the user work around the error? There is a new report of hitting this issue on Stack Overflow: http://stackoverflow.com/questions/37067434/kubernetes-cant-start-due-to-too-many-open-files-in-system.
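
(The usual workaround is at the node level rather than in Kubernetes itself: find what is leaking file handles, then raise the kernel and docker limits. A rough sketch with arbitrary values; exact flags depend on the distro and docker version:)

# How close is the node to the system-wide limit?
cat /proc/sys/fs/file-nr

# Which commands hold the most open files?
sudo lsof 2>/dev/null | awk '{print $1}' | sort | uniq -c | sort -rn | head

# Raise the system-wide limit (persist it in /etc/sysctl.conf or /etc/sysctl.d/)
sudo sysctl -w fs.file-max=1000000

# Raise the per-container default handed out by the docker daemon,
# e.g. via DOCKER_OPTS on Ubuntu: --default-ulimit nofile=64000:64000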

@bobintornado
Contributor

Also reported in issue #26246

@GreatSUN

GreatSUN commented Aug 8, 2016

Hi all, in addition to this, there can be other problems. For example, in a virtualized environment the resource sizes we detect might not be what we can actually work with; we might in fact be running on swap, so services might not react in time and should be moved to other hosts.
I suggest adding the ability to define a number of health-check-related restarts after which the pod should be rescheduled.

@Bekt

Bekt commented Feb 9, 2017

We experience this often. All nodes report healthy, but a pod gets stuck in a restart loop (for whatever reason). However, if I delete the pod, it is recreated just fine (it's managed by an RC or Deployment).

Is there any way to kill a pod after some threshold of restarts?
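
(A blunt stopgap, assuming the pods are managed by an RC or Deployment so deleting them gets a fresh copy scheduled, is to periodically delete pods whose restart count crosses an arbitrary threshold; a sketch, not an endorsed mechanism:)

# Delete pods whose first container has restarted more than 20 times (threshold is arbitrary)
kubectl get pods --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
  | awk -F'\t' '$3 > 20 {print $1, $2}' \
  | while read ns pod; do kubectl delete pod "$pod" --namespace="$ns"; done

Note that deleting the pod doesn't guarantee its replacement lands on a different node; that gap is exactly what the rescheduler/descheduler discussion below is about.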

@davidopp
Member

This seems like a reasonable request, though it's tricky to pick the right policy.

@kubernetes/sig-node-feature-requests

@bgrant0607
Member

Original thread was in #127.

We previously discussed moving anomalously crashlooping pods in the rescheduler (if all pods of a controller are crashlooping across multiple nodes, there's no point in moving any of them):
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/rescheduling.md

@bgrant0607 bgrant0607 added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed area/usability sig/node Categorizes an issue or PR as relevant to SIG Node. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Feb 23, 2017
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 21, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 20, 2018
@bgrant0607
Member

Closing in favor of kubernetes-sigs/descheduler#62
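
(For anyone landing here later: the descheduler linked above grew a strategy along these lines, RemovePodsHavingTooManyRestarts. A sketch of a v1alpha1 policy file enabling it; field names may differ between descheduler releases:)

# Write a descheduler policy that evicts pods stuck in restart loops
cat > descheduler-policy.yaml <<'EOF'
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsHavingTooManyRestarts":
    enabled: true
    params:
      podsHavingTooManyRestarts:
        # Evict pods whose containers have restarted at least this many times
        podRestartThreshold: 100
        includingInitContainers: true
EOF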
