e2e flake: pull-kubernetes-e2e-gce-etcd3 fails [sig-apps] Deployment and others with dial tcp (a node addr):10250: getsockopt: connection refused #50695
Comments
/kind flake
/remove-kind bug
@kubernetes/sig-apps-test-failures
Seeing kubelets (and possibly the entire node) restarting during e2e runs, which disrupts any log/exec/scheduling tests using that node at the time. The test fails with:
From the apiserver log:
From the kubelet log:
You can see the almost 90-second gap and the startup logging occur in the kubelet during that window.
Interesting-looking things from the logs on that kubelet around that time:
Docker shows a gap in its logs:
cc @kubernetes/sig-node-test-failures @kubernetes/sig-node-bugs for ideas on chasing down the kernel issue
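For anyone triaging similar runs, here is a minimal sketch of how one might spot these restart gaps in a downloaded kubelet.log. It assumes klog-style timestamp prefixes (e.g. `I0814 17:08:32.123456`), ignores midnight rollover, and uses an illustrative 60-second threshold, so treat it as a rough filter rather than part of the test harness:

```bash
#!/usr/bin/env bash
# Rough triage helper: print points in a kubelet.log where consecutive klog
# timestamps are more than 60 seconds apart (the restarts above show ~90s).
LOG=${1:-kubelet.log}

awk '
  /^[IWEF][0-9][0-9][0-9][0-9] / {
    split($2, t, "[:.]")                      # hh, mm, ss, microseconds
    secs = t[1] * 3600 + t[2] * 60 + t[3]     # seconds since midnight
    if (prev != "" && secs - prev > 60)
      printf "gap of %d s before line %d: %s\n", secs - prev, NR, $0
    prev = secs
  }
' "$LOG"
```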
Found the same panic in one of the kubelet logs in the original linked failure in this issue:
Kernel panic with the same stack mentioned in #45368.
@liggitt also seen here: moby/moby#30402
It looks like it is a known kernel bug, and it seems there is a fix, but I do not think it was pushed upstream. See this post and the link to the proposed diff. If somebody has access to the test bed where this issue happens, it would be interesting to patch the kernel and rerun this test to reconfirm.
This continues to affect us in myriad ways. https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/51915/pull-kubernetes-e2e-gce-etcd3/52532/ failed namespace cleanup because the node hosting an add-on server crashed with this bug. https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/51915/pull-kubernetes-e2e-gce-etcd3/52532/artifacts/e2e-52532-minion-group-7rs9/kubelet.log shows that was the kubelet hosting the metrics apiserver pod, and https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/51915/pull-kubernetes-e2e-gce-etcd3/52532/artifacts/e2e-52532-minion-group-7rs9/serial-1.log shows the null pointer dereference error and reboot hit that node.
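A quick way to check other runs for the same signature, reusing the artifact layout from the links above; the grep patterns are the usual kernel oops/panic markers and are only a guess at what each serial log contains:

```bash
# Pull the serial console log for a node from the job artifacts and look for
# the crash markers; paths mirror the storage.googleapis.com links above.
RUN=https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/51915/pull-kubernetes-e2e-gce-etcd3/52532
NODE=e2e-52532-minion-group-7rs9

curl -s "$RUN/artifacts/$NODE/serial-1.log" \
  | grep -nE 'unable to handle kernel NULL pointer dereference|Kernel panic|Call Trace' \
  | head -n 20
```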
What might help is to change the default testing OS image to COS, since CVM is being deprecated (part of #51487). We can still keep the CVM test coverage but make it non-blocking until it's officially retired.
Yep, agreed. We should flip to COS. Even @mtaufen agreed. I might try and flip it this week, although I'm pretty swamped, so anyone else should feel free to do it. A more surgical fix you could do right now would be to amend
Automatic merge from submit-queue (batch tested with PRs 52227, 52120)

Use COS for nodes in testing clusters by default, and bump COS.

Addresses part of issue #51487. May assist with #51961 and #50695. CVM is being deprecated, and falls out of support on 2017/10/01. We shouldn't run test jobs on it, so start using COS for all test jobs. The default value of `KUBE_NODE_OS_DISTRIBUTION` for clusters created for testing will now be gci. Test jobs that do not specify this value will now run on clusters using COS (aka GCI) as the node OS, instead of CVM, the previous default. This change only affects testing; non-testing clusters already use COS by default.

In addition, bump the version of COS from `cos-stable-60-9592-84-0` to `cos-stable-60-9592-90-0`.

```release-note
NONE
```

/cc @yujuhong, @mtaufen, @fejta, @krzyzacy
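For jobs that want to opt in before (or independently of) the default flip, a hedged sketch of pinning test-cluster nodes to COS: `KUBE_NODE_OS_DISTRIBUTION=gci` and the image name are taken from the PR text above, while the image/project variable names and the e2e invocation follow the usual cluster/gce conventions and should be verified against your checkout:

```bash
# Pin GCE test-cluster nodes to COS rather than relying on the new default.
export KUBE_NODE_OS_DISTRIBUTION=gci                 # COS (formerly GCI) instead of CVM
export KUBE_GCE_NODE_IMAGE=cos-stable-60-9592-90-0   # the bumped image from this PR
export KUBE_GCE_NODE_PROJECT=cos-cloud               # project hosting COS images (assumed)

# Bring the cluster up, run the e2e suite, and tear it down.
go run hack/e2e.go -- --up --test --down
```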
Was #52120 intended to switch all jobs to a known good image? http://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/53158/pull-kubernetes-e2e-gce-bazel/32563
This issue hasn't been active in 61 days. It will be closed in 28 days (Dec 28, 2017). You can add the 'keep-open' label to prevent this from happening, or add a comment to keep it open for another 90 days.
The panicking kernel image has been retired.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
A version of PR #47262 failed one run of pull-kubernetes-e2e-gce-etcd3 and passed another. Earlier versions also got varied results. See the whole testing history at https://k8s-gubernator.appspot.com/pr/47262 .
For the 9a64d88 commit, the failed run included this in the build log:
It is worth noting that the build log also showed 6 minutes earlier that all the nodes were up:
And artifacts/nodes.yaml showed that e2e-46505-minion-group-qqjs had address 10.128.0.4.
What you expected to happen:
Consistent test results for a given commit.
How to reproduce it (as minimally and precisely as possible):
I have no good suggestion here.
Anything else we need to know?:
Environment:
Kubernetes version (use `kubectl version`): master
Kernel (e.g. `uname -a`):