kubernetes-e2e-gci-gke-slow: broken test run #34665
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/711/ Run so broken it didn't make JUnit output!
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/714/ Run so broken it didn't make JUnit output!
Not test-infra
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/722/ Run so broken it didn't make JUnit output!
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/740/ Run so broken it didn't make JUnit output!
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/776/ Run so broken it didn't make JUnit output!
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/781/ Run so broken it didn't make JUnit output!
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/799/ Run so broken it didn't make JUnit output!
[FLAKE-PING] @apelisse This flaky-test issue would love to have more attention.
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/801/ Run so broken it didn't make JUnit output!
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/813/ Run so broken it didn't make JUnit output!
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/816/ Run so broken it didn't make JUnit output!
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/837/ Run so broken it didn't make JUnit output!
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/843/ Run so broken it didn't make JUnit output!
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/853/ Run so broken it didn't make JUnit output!
[FLAKE-PING] @apelisse This flaky-test issue would love to have more attention.
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/856/ Run so broken it didn't make JUnit output!
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/866/ Run so broken it didn't make JUnit output!
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/872/ Run so broken it didn't make JUnit output!
There's a good chance that's the same issue with missing image layers that was causing failures on the 1.4.5 kubelet. cc @dchen1107
Hmm, the node-unhealthy issue is basically not debuggable without a live repro; there are no logs in the artifacts: http://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/1391 (I took the links Saad posted and replaced the prefix with http://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow, unless we have a new situation for log aggregation)
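For anyone repeating that, the prefix swap is a one-line substitution (a sketch; the build number is taken from the failing link):

```sh
# Turn a gubernator results link into the matching gcsweb artifacts link.
echo "https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/1391/" \
  | sed 's#https://k8s-gubernator.appspot.com/build#http://gcsweb.k8s.io/gcs#'
```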
I'm seeing
@dchen1107 @yujuhong Can you help confirm?
Also potentially related,
I'm going to see if I can hunt down a PR that is responsible and revert it.
Fwiw, from those logs (#34665 (comment)), the node is definitely created by the MIG, but we see no Kubernetes node object. That probably means the VM didn't come up, kept rebooting, was disconnected from the master, or kubelet had some startup issue. The first three sound like general GCE bugs. Importantly, we don't see a NotReady node.
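That mismatch can be checked directly by comparing what the cloud provider and the apiserver each report; a sketch (the MIG name and zone here are illustrative, not from the failing run):

```sh
# The MIG reports the VM as created...
gcloud compute instance-groups managed list-instances gke-e2e-minion-group --zone us-central1-f

# ...but the apiserver never saw a corresponding node object.
kubectl get nodes -o wide
```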
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/1396/ Run so broken it didn't make JUnit output!
@kubernetes/test-infra-admins Is there a good way to identify whether this is a GCE issue or a k8s one? Whenever this issue is hit, the logs are not copied from the node machines.
1.4 tests do not seem to be hitting this ( |
The submit queue has been red most of the day due to this issue. I'm trying to cut the 1.5 branch and I can't find a green build after Nov 1. If we are not able to resolve this by tomorrow morning, we will convene a "war room" to try and tackle it. @kubernetes/test-infra-admins Is there something we can do to get logs off the node and master machines when this issue is hit?
This seems to be a startup problem on GCI. @kubernetes/goog-node @vishh I think the crash is in one of these functions: https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/gci/configure-helper.sh#L1211-L1216 Here's a bunch of debug info to support the theory (a sketch of the underlying checks follows the list):
- The unit file looks sane.
- The defaults are messed up.
- The health monitors are working, at least.
- Node installation passed.
- Node configuration failed.
- It's suspicious that it stopped at kube-proxy, but the tar file exists, and so does its kubeconfig.
- There are only three calls between that line and the next in a successful run.
- Hence the theory about the pre-warmed mounter: the mounter exists.
- So just executing the configure helper by hand works, and kubelet starts up with the right defaults.
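The outputs were captured on the node; a sketch of the kind of commands behind these checks (the unit and file names are my reading of the GCI node setup and may differ):

```sh
# Inspect the kubelet unit file and the env file it reads its flags from.
systemctl cat kubelet.service
cat /etc/default/kubelet

# Installation vs. configuration: the second unit runs configure-helper.sh.
systemctl status kube-node-installation.service
systemctl status kube-node-configuration.service
journalctl -u kube-node-configuration.service --no-pager | tail -n 50
```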
Anyway, I have #36202 for a couple of lines of logging.
Excellent debugging @bprashanth! The evidence you provided strongly indicates PR #35821 is the culprit. The merge queue was blocked most of the day due to this issue (currently 58 outstanding PRs in the queue), so unless we have some other quick fix, we'll need to revert that PR. @vishh @mtaufen @jingxu97 thoughts? CC @gmarek (Warsaw on-call)
This failed pulling the mounter image, which is something you definitely want to retry, especially during node bringup.
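A minimal sketch of the sort of retry that would avoid this (a hypothetical helper, not the actual configure-helper.sh code; the image name is illustrative):

```sh
#!/bin/bash
# Retry a docker pull with exponential backoff instead of letting one
# transient registry error fail node bringup.
pull_with_retry() {
  local image="$1" attempts=5 delay=2
  for ((i = 1; i <= attempts; i++)); do
    docker pull "${image}" && return 0
    echo "pull of ${image} failed (attempt ${i}/${attempts}); retrying in ${delay}s" >&2
    sleep "${delay}"
    delay=$((delay * 2))
  done
  return 1
}

pull_with_retry "gcr.io/google_containers/mounter:v1"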
Submit queue is red again and we have 55 unmerged PRs. I'm going to revert PR #35821. |
Thanks.
Fwiw, an unretried image pull during node bringup is Russian roulette; that's one of the reasons we docker-load the master components/kube-proxy from a .gz.
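That approach amounts to loading a tarball already baked into the node image instead of pulling over the network; roughly (the path here is illustrative):

```sh
# No registry round-trip: docker load reads a (possibly gzipped) image tarball.
docker load -i /home/kubernetes/kube-proxy.tar.gz
```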
Now regarding the weird timeouts.
but they should all run in parallel, so I don't think that's a big deal. We should probably optimize anything that isn't specifically soaking and still takes > 15m, though. Anyway, I've strictified some timeouts in the nodePort test based on a previous guess by Jeff (#34665 (comment)); let's see if they produce more isolated failures: #36208
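The stricter timeouts amount to bounding each curl so a wedged connection fails fast and in isolation instead of eating the suite timeout; roughly (standard curl flags; the values and address are illustrative):

```sh
# Fail within 10s total (5s to connect) rather than hanging forever on a
# dead keepalive connection.
curl --connect-timeout 5 --max-time 10 "http://${node_ip}:${node_port}/"
```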
Found this thread in docker, so it seems other users also experience this random timeout issue. Is it possible to build the docker image locally instead of using a remote repository, to make it more reliable?
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/1415/ Run so broken it didn't make JUnit output!
@jingxu97 The i/o timeout is probably just normal network-connectivity flake during node startup. There are two issues conflated into this bug: one is the mounter problem, for which we reverted the PR; the second is a suite timeout, which is probably one of our tests or the framework itself doing something weird. The first problem was hitting more often and across more suites than the second, so I'd expect overall flakiness to go down now.
Automatic merge from submit-queue. Stricter timeouts for nodePort curling: if the timeouts are indeed because of #34665 (comment), stricter timeouts will probably surface as a more isolated failure.
I was able to grab a strace on a hanging curl. I think we're hitting an established keepalive connection that avoids NAT to the endpoints when we toggle the nodePort on/off. My timeout PR should at least kill the connection; we should try to get https://github.com/kubernetes/kubernetes/pull/36130/files in. Debug info (a reproducible sketch of these steps follows the list):
- Toggle the nodePort off.
- Toggle the nodePort back on.
- On the node, strace the hanging curl.
- What's behind fd 3?
- What's behind inode 32557556?
- What does netfilter say about that connection?
- What about iptables for the nodePort?
- Does a new curl work?
- Is the old curl sending anything?

So that's a keepalive probe.
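A sketch of how that chain of checks can be reproduced (standard Linux tooling; conntrack-tools must be installed, and the PID, fd, inode, and port number here are illustrative):

```sh
# Find the hanging curl and see what it's blocked on.
pid=$(pgrep -n curl)
strace -p "${pid}"                 # e.g. blocked in poll/read on fd 3

# Resolve fd 3 to a socket inode, then the inode to a TCP connection.
ls -l /proc/"${pid}"/fd/3          # -> socket:[32557556]
grep 32557556 /proc/net/tcp        # local/remote address of the stuck socket

# What do conntrack and iptables think about that nodePort?
conntrack -L | grep 31000
iptables-save | grep 31000

# Does a fresh connection work while the old one hangs?
curl --max-time 5 "http://localhost:31000/"
```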
I don't think this has failed since the revert and my last PR, so I'm closing (https://k8s-testgrid.appspot.com/google-gke#gci-gke-slow). The autofiler will reopen it anyway.
Failed: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/710/ Run so broken it didn't make JUnit output!