Integrate node-problem-detector with e2e test infrastructure #30811
cc @dchen1107
@justinsb We built node-problem-detector to make kernel and filesystem issues visible. We also enabled node-problem-detector in our Jenkins jobs for AWS through https://github.com/kubernetes/test-infra/pull/392/files. Support specifically for kernel panics like #30706 is addressed by kubernetes/node-problem-detector#22.
@justinsb I re-titled the issue to reflect what we really want here. @Random-Liu To finish this, we need:
Switching this to 1.6 as it's too late for 1.5. OK? (Please switch it right back if you disagree.)
In the 1.6 release, we have done the following:
I am punting this to 1.7.
Initially we planned to integrate node-problem-detector with the e2e framework to: 1) surface node problems and their logs for debugging test failures, and 2) verify that NPD itself works properly.
In the 1.6 release, for 1) we collected the NPD log, which is good enough for debugging (#41949); for 2) we added a real-case NPD e2e test to make sure it is working properly (#42454). I think it is safe to punt this to 1.7.
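For illustration, here is a minimal sketch of what such a condition check could look like with client-go. The `checkNodeProblems` helper and its wiring into the e2e framework are hypothetical, though `KernelDeadlock` is one of the default conditions NPD reports:

```go
package e2e

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// checkNodeProblems is a hypothetical helper that scans every node for the
// KernelDeadlock condition set by node-problem-detector and returns an error
// so the calling e2e test can fail the run.
func checkNodeProblems(ctx context.Context, c kubernetes.Interface) error {
	nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type == "KernelDeadlock" && cond.Status == v1.ConditionTrue {
				return fmt.Errorf("node %s reports %s: %s",
					node.Name, cond.Type, cond.Message)
			}
		}
	}
	return nil
}
```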
In the 1.7 release, we added /dev/kmsg support in NPD 0.4 and upgraded the e2e tests to that release. Moving this to 1.8 to better integrate NPD with the testing infrastructure.
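A rough sketch of the /dev/kmsg-based detection idea, assuming a single illustrative panic pattern rather than NPD's actual rule set:

```go
package main

import (
	"bufio"
	"log"
	"os"
	"regexp"
)

// panicPattern is illustrative only; NPD ships its own pattern set in its
// kernel monitor configuration.
var panicPattern = regexp.MustCompile(`Kernel panic - not syncing: .*`)

func main() {
	// /dev/kmsg exposes the kernel log; reading it requires root.
	f, err := os.Open("/dev/kmsg")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if panicPattern.MatchString(scanner.Text()) {
			// In NPD this would be reported as a node condition or
			// event; here we just log the match.
			log.Printf("kernel problem detected: %s", scanner.Text())
		}
	}
}
```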
[MILESTONENOTIFIER] Milestone Removed @Random-Liu @dchen1107 @justinsb Important: This issue was missing the following labels for the milestone: kind: Must specify exactly one of [kind/bug, kind/cleanup, kind/feature]. Removing it from the milestone.
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Prevent issues from auto-closing with an /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close
Do the existing e2e tests fail if a node panics during the test?
We are seeing issues like #30706, which makes me think we might not be detecting this.
It is actually surprisingly difficult to tell that a node has panicked with k8s, because k8s self-heals very quickly. Also, I haven't yet figured out how to get a kernel panic into journald (on AWS), so the only place to see one is the AWS console output; on ContainerVM they would hopefully just be in the syslog. A simple way to detect this is to look at the uptime of the nodes, as in the sketch below.
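A sketch of that uptime heuristic, assuming the check runs on the node itself (a real e2e check would fetch uptime via SSH or compare the node's reported boot time against the test start):

```go
package main

import (
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
	"time"
)

// bootTime derives the node's boot time from /proc/uptime, whose first
// field is seconds since boot (e.g. "12345.67 23456.78").
func bootTime() (time.Time, error) {
	data, err := os.ReadFile("/proc/uptime")
	if err != nil {
		return time.Time{}, err
	}
	fields := strings.Fields(string(data))
	if len(fields) == 0 {
		return time.Time{}, fmt.Errorf("unexpected /proc/uptime format")
	}
	up, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return time.Time{}, err
	}
	return time.Now().Add(-time.Duration(up * float64(time.Second))), nil
}

func main() {
	testStart := time.Now().Add(-30 * time.Minute) // hypothetical test start
	bt, err := bootTime()
	if err != nil {
		log.Fatal(err)
	}
	// If the node booted after the test started, it rebooted mid-test,
	// which likely means it panicked or was otherwise restarted.
	if bt.After(testStart) {
		fmt.Printf("node rebooted mid-test (booted at %s)\n", bt)
	}
}
```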