Integrate node-problem-detector with e2e test infrastructure #30811

Closed
justinsb opened this issue Aug 17, 2016 · 11 comments
Labels
area/node-e2e, area/test-infra, lifecycle/rotten, milestone/removed, priority/backlog, sig/node

Comments

@justinsb
Member

Do the existing e2e tests fail if a node panicked during the test?

We are seeing issues like #30706, which makes me think we might not be detecting this.

It is surprisingly difficult to tell that a node has panicked with k8s, because k8s self-heals very quickly. Also, I haven't yet figured out how to get a kernel panic into journald (on AWS), so the only place to see one is the AWS console output. Hopefully panics would simply show up in the syslog on containervm. A simple way to detect a panic is to look at the uptime of the nodes.
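To illustrate the uptime idea, here is a minimal sketch in Python. It is not part of the e2e framework; the function names are hypothetical, and it assumes the test can read each node's `/proc/uptime` and take a timestamp on the node itself, so that clock skew between the runner and the node does not matter. A node whose boot time falls after the test start must have rebooted during the test.

```python
def parse_uptime_seconds(proc_uptime_text: str) -> float:
    """Return the first field of /proc/uptime: seconds since boot."""
    return float(proc_uptime_text.split()[0])


def rebooted_since(test_start_epoch: float,
                   uptime_seconds: float,
                   now_epoch: float) -> bool:
    """Return True if the node booted after the test started.

    All three values are assumed to come from the same clock
    (for example, all sampled on the node under test).
    """
    boot_epoch = now_epoch - uptime_seconds
    return boot_epoch > test_start_epoch


# Example: the test started at t=1000, and at t=1600 the node reports
# only 120 seconds of uptime, so it booted at t=1480, mid-test.
uptime = parse_uptime_seconds("120.00 95.30\n")
print(rebooted_since(1000.0, uptime, 1600.0))
```

This only catches panics that end in a reboot; a node that hangs without rebooting would need a different signal.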

@justinsb
Member Author

cc @dchen1107

@dchen1107
Member

@justinsb We built node-problem-detector to make kernel / filesystem issues visible. Also, in our Jenkins for AWS, we enabled node-problem-detector through https://github.com/kubernetes/test-infra/pull/392/files

Support specifically for kernel panics like #30706 is addressed by kubernetes/node-problem-detector#22

dchen1107 added the area/test-infra and sig/node labels and removed the team/control-plane label Aug 19, 2016
dchen1107 added the priority/backlog label Aug 19, 2016
dchen1107 added this to the v1.4 milestone Aug 19, 2016
dchen1107 changed the title from "e2e tests should check whether nodes panicked" to "Integrate node-problem-detector with e2e test infrastructure" Aug 19, 2016
@dchen1107
Member

dchen1107 commented Aug 19, 2016

@justinsb I retitled the issue to reflect what we really want here.

@Random-Liu To finish this, we need

k8s-github-robot added and removed the sig/node label Aug 25, 2016
goltermann modified the milestones: v1.5, v1.4 Sep 6, 2016
@dims
Member

dims commented Nov 15, 2016

switching this to 1.6 as it's too late for 1.5. ok? (please switch it right back if you disagree)

@dchen1107
Member

In the 1.6 release, we have done the following:

  1. Deploy the node-problem-detector daemonset with the e2e tests by default.
  2. Collect the node-problem-detector log for debugging.

I am punting this to 1.7.

dchen1107 modified the milestones: v1.7, v1.6 Mar 9, 2017
@Random-Liu
Member

Random-Liu commented Mar 9, 2017

Initially we planned to integrate node problem detector with the e2e framework to:

  1. Detect node problems during e2e test.
  2. Make sure NPD is actually working.

In the 1.6 release, for 1) we collected the NPD log, which is good enough for debugging #41949. For 2) we added a real-case NPD e2e test to make sure it is working properly #42454.

I think it is safe to punt this to 1.7.

@dchen1107
Member

In the 1.7 release, we added /dev/kmsg support in NPD 0.4 and upgraded e2e to that release.

Moving this to 1.8 to better integrate NPD with the testing infrastructure.
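For readers unfamiliar with `/dev/kmsg`, the following Python sketch shows roughly what scanning it for panics involves. It is not NPD's implementation, and the names `KmsgRecord`, `parse_kmsg_line`, and `find_panics` are hypothetical. It assumes the record layout documented for the kernel's `/dev/kmsg` interface (`prefix,sequence,timestamp_usec,flags;message`) and matches only the standard `Kernel panic - not syncing` message; a real monitor would use a configurable pattern set.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class KmsgRecord(NamedTuple):
    level: int          # syslog severity (0 = emergency ... 7 = debug)
    sequence: int       # kernel record sequence number
    timestamp_us: int   # microseconds since boot
    message: str


# Illustrative pattern only; not taken from any NPD configuration.
PANIC_PATTERN = re.compile(r"Kernel panic - not syncing")


def parse_kmsg_line(line: str) -> Optional[KmsgRecord]:
    """Parse one /dev/kmsg record, or return None if it is not a record."""
    header, sep, message = line.rstrip("\n").partition(";")
    if not sep:
        return None  # continuation line such as " SUBSYSTEM=pci"
    fields = header.split(",")
    if len(fields) < 3:
        return None
    try:
        prefix, seq, ts = int(fields[0]), int(fields[1]), int(fields[2])
    except ValueError:
        return None
    # The low three bits of the prefix carry the syslog severity.
    return KmsgRecord(prefix & 7, seq, ts, message)


def find_panics(lines: Iterable[str]) -> List[KmsgRecord]:
    """Return every parsed record whose message looks like a kernel panic."""
    hits = []
    for line in lines:
        record = parse_kmsg_line(line)
        if record is not None and PANIC_PATTERN.search(record.message):
            hits.append(record)
    return hits


sample = [
    "6,100,5000,-;eth0: link up",
    "0,101,6000,-;Kernel panic - not syncing: Fatal exception",
    " SUBSYSTEM=pci",
]
for hit in find_panics(sample):
    print(hit.sequence, hit.message)
```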

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Removed

@Random-Liu @dchen1107 @justinsb

Important:
This issue was missing labels required for the v1.8 milestone for more than 7 days:

kind: Must specify exactly one of [kind/bug, kind/cleanup, kind/feature].
priority: Must specify exactly one of [priority/critical-urgent, priority/important-longterm, priority/important-soon].

Removing it from the milestone.

Additional instructions available here. The commands available for adding these labels are documented here.

k8s-github-robot removed this from the v1.8 milestone Sep 9, 2017
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Jan 5, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 9, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

9 participants