Cluster setup in e2e tests not reliable enough nor transparent enough #31273
This has been very frustrating because it conflates a bunch of underlying issues into a single flake that gets assigned to one person. It's also pretty common that the underlying issue that caused a particular failure isn't even discernible from the leftovers (GCS artifacts).
/cc @kubernetes/sig-testing
Is this really a P0 item that is going to get fixed properly in 1.4? This seems more like a P0 item for 1.5.
The work here is combing through the O(10) issues that have the same failure mode, bucketing them into classes, and adding clarity, e.g. #20916, #28641, #22819. Probably the most important piece is the startup log from the master (#27551 (comment)), which is a cinch to get via ssh (see the sketch below). This is also hardest to debug on GKE, because we need to context-switch to BigQuery and follow up after the fact, versus having the master logs and startup script right there. Agree that this is not 1.4; we've lived with it for so long.
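For illustration, a minimal sketch of pulling that startup log over ssh on GCE; the master name, zone, and log paths here are assumptions and vary by image and deployment:

```sh
# Hypothetical master name and zone; substitute your cluster's values.
MASTER="e2e-test-master"
ZONE="us-central1-b"

# Older Debian-based GCE images write the startup script's output to
# /var/log/startupscript.log; newer images route it through journald.
gcloud compute ssh "${MASTER}" --zone="${ZONE}" --command \
  "sudo cat /var/log/startupscript.log 2>/dev/null || \
   sudo journalctl -u google-startup-scripts --no-pager"
```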
I saw @mikedanese's kube-anywhere project and am wondering if that is the future of things?
I plan on making
OK, we have a finer signal when cluster-up fails compared to other failure causes. However, it's still very hard to know what to do when that 300-second barrier is hit. I don't think @rmmh and I are the appropriate assignees here.
Suggest grabbing all of this into one dump if you can ssh into a bad node: #34665 (comment). A sketch of such a dump follows.
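A rough sketch of what that one-shot dump could look like; the node name, zone, and log list are illustrative assumptions, not a fixed recipe:

```sh
# Hypothetical node name and zone; the log list is illustrative only.
NODE="e2e-test-minion-group-abcd"
ZONE="us-central1-b"

# Bundle the interesting logs on the node into a single archive.
gcloud compute ssh "${NODE}" --zone="${ZONE}" --command \
  "sudo tar czf /tmp/node-dump.tgz \
     /var/log/kubelet.log /var/log/kube-proxy.log /var/log/docker.log \
     2>/dev/null || true"

# Pull the archive back so it can be attached to the run's artifacts
# (older gcloud releases use 'gcloud compute copy-files' instead of scp).
gcloud compute scp "${NODE}:/tmp/node-dump.tgz" . --zone="${ZONE}"
```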
This needs to be triaged as a release-blocker or not for 1.5. @bprashanth @alex-mohr
An infra change is probably not going to block the release, though it's probably slowing down debugging of various cluster-up failures. Oddly, this release we haven't seen as many of those as we did in previous releases (he says, as they enter the stabilization period).
@alex-mohr Is it appropriate to move this to the next milestone or to clear the 1.5 milestone (and remove the non-release-blocker tag as well)?
@fejta has been bugging us about cluster creation failures with his recent test flake reports, so I think we can safely close this issue (which hasn't been updated this calendar year).
There are a large number of e2e failures where the cluster fails to come up in 300 seconds, e.g. https://storage.googleapis.com/kubernetes-jenkins/logs/kubernetes-e2e-gce-examples/14127/build-log.txt
"""
Waiting up to 300 seconds for cluster initialization.
This will continually check to see if the API for kubernetes is reachable.
This may time out if there was some uncaught error during start up.
...........................................Cluster failed to initialize within 300 seconds.
2016/08/22 07:55:20 e2e.go:453: Error running up: exit status 2
2016/08/22 07:55:20 e2e.go:449: Step 'up' finished in 9m43.290764216s
"""
A bunch of issues:
(1) How do we know if 300 seconds is enough time? Should it add another post-failure check period so we can tell if it's slowness vs. complete failure?
(2) Printing a bunch of dots is insufficient transparency -- what's the status of various components it's checking for?
(3) Given that the master VM isn't sshable after 300 seconds, it might be useful to print the master's serial console output after failure (see https://cloud.google.com/compute/docs/reference/beta/instances/getSerialPortOutput) to see what's going on. Is it the network not getting set up? sshd not running? The account setup script not working? And if the existing kernel logs that are dumped to the serial console aren't enough, maybe the setup process should dump more of its logs there too. A sketch of fetching the serial output follows this list.
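For example, the serial console output is retrievable via gcloud even when ssh is broken; the instance name and zone below are placeholders:

```sh
# Fetch the master's serial console (port 1) without needing ssh or a
# healthy network path to the VM; instance name and zone are placeholders.
gcloud compute instances get-serial-port-output e2e-test-master \
  --zone us-central1-b --port 1 > master-serial.log
```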
Anyone from @kubernetes/sig-cluster-lifecycle interested in helping out here?