Integrate node-problem-detector with e2e test infrastructure #30811
cc @dchen1107
@justinsb We built node-problem-detector to make kernel and filesystem issues visible. We also enabled node-problem-detector in our Jenkins jobs for AWS through https://github.com/kubernetes/test-infra/pull/392/files. Support specifically for kernel panics like #30706 is addressed by kubernetes/node-problem-detector#22.
@justinsb I re-titled the issue to reflect what we really want here. @Random-Liu To finish this, we need:
Switching this to 1.6 as it's too late for 1.5. OK? (Please switch it right back if you disagree.)
In the 1.6 release, we have done the following:
I am punting this to 1.7.
Initially we planned to integrate node-problem-detector with the e2e framework to: 1) surface node problems and their logs for debugging test failures, and 2) verify that NPD itself works properly.
In the 1.6 release, for 1) we collected the NPD log, which is good enough for debugging (#41949); for 2) we added a real-case NPD e2e test to make sure it is working properly (#42454). I think it is safe to punt this to 1.7.
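For illustration, here is a minimal sketch of what such a condition check could look like with client-go. The `checkNodeProblems` helper and its wiring into the e2e framework are hypothetical, though `KernelDeadlock` is one of the default conditions NPD reports:

```go
package e2e

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// checkNodeProblems is a hypothetical helper that scans every node for the
// KernelDeadlock condition set by node-problem-detector and returns an error
// so the calling e2e test can fail the run.
func checkNodeProblems(ctx context.Context, c kubernetes.Interface) error {
	nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type == "KernelDeadlock" && cond.Status == v1.ConditionTrue {
				return fmt.Errorf("node %s reports %s: %s",
					node.Name, cond.Type, cond.Message)
			}
		}
	}
	return nil
}
```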
In the 1.7 release, we added /dev/kmsg support in NPD 0.4 and upgraded the e2e tests to that release. Moving this to 1.8 to better integrate NPD with the testing infrastructure.
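A rough sketch of the /dev/kmsg-based detection idea, assuming a single illustrative panic pattern rather than NPD's actual rule set:

```go
package main

import (
	"bufio"
	"log"
	"os"
	"regexp"
)

// panicPattern is illustrative only; NPD ships its own pattern set in its
// kernel monitor configuration.
var panicPattern = regexp.MustCompile(`Kernel panic - not syncing: .*`)

func main() {
	// /dev/kmsg exposes the kernel log; reading it requires root.
	f, err := os.Open("/dev/kmsg")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if panicPattern.MatchString(scanner.Text()) {
			// In NPD this would be reported as a node condition or
			// event; here we just log the match.
			log.Printf("kernel problem detected: %s", scanner.Text())
		}
	}
}
```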
[MILESTONENOTIFIER] Milestone Removed @Random-Liu @dchen1107 @justinsb Important: This issue was missing the following labels for the milestone: kind: Must specify exactly one of [kind/bug, kind/cleanup, kind/feature]. Removing it from the milestone.
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Prevent issues from auto-closing with an /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close
Do the existing e2e tests fail if a node panics during the test?
We are seeing issues like #30706, which makes me think we might not be detecting this.
It is actually surprisingly difficult to tell that a node has panicked with k8s, because k8s self-heals very quickly. Also, I haven't yet figured out how to get a kernel panic into journald (on AWS), so the only place to see one is the AWS console output; on ContainerVM they would hopefully just be in the syslog. A simple way to detect this is to look at the uptime of the nodes, as in the sketch below.
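A sketch of that uptime heuristic, assuming the check runs on the node itself (a real e2e check would fetch uptime via SSH or compare the node's reported boot time against the test start):

```go
package main

import (
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
	"time"
)

// bootTime derives the node's boot time from /proc/uptime, whose first
// field is seconds since boot (e.g. "12345.67 23456.78").
func bootTime() (time.Time, error) {
	data, err := os.ReadFile("/proc/uptime")
	if err != nil {
		return time.Time{}, err
	}
	fields := strings.Fields(string(data))
	if len(fields) == 0 {
		return time.Time{}, fmt.Errorf("unexpected /proc/uptime format")
	}
	up, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return time.Time{}, err
	}
	return time.Now().Add(-time.Duration(up * float64(time.Second))), nil
}

func main() {
	testStart := time.Now().Add(-30 * time.Minute) // hypothetical test start
	bt, err := bootTime()
	if err != nil {
		log.Fatal(err)
	}
	// If the node booted after the test started, it rebooted mid-test,
	// which likely means it panicked or was otherwise restarted.
	if bt.After(testStart) {
		fmt.Printf("node rebooted mid-test (booted at %s)\n", bt)
	}
}
```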