Cluster setup in e2e tests not reliable enough nor transparent enough #31273

Closed
alex-mohr opened this issue Aug 23, 2016 · 13 comments
Labels
area/kubelet area/test area/test-infra priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.

Comments

@alex-mohr
Contributor

There are a large number of e2e failures where the cluster fails to come up in 300 seconds, e.g. https://storage.googleapis.com/kubernetes-jenkins/logs/kubernetes-e2e-gce-examples/14127/build-log.txt

"""
Waiting up to 300 seconds for cluster initialization.

This will continually check to see if the API for kubernetes is reachable.
This may time out if there was some uncaught error during start up.

...........................................Cluster failed to initialize within 300 seconds.
2016/08/22 07:55:20 e2e.go:453: Error running up: exit status 2
2016/08/22 07:55:20 e2e.go:449: Step 'up' finished in 9m43.290764216s
"""

A bunch of issues:
(1) How do we know whether 300 seconds is enough time? Should it add another post-failure check period so we can tell whether it's slowness vs. complete failure?
(2) Printing a bunch of dots is insufficient transparency -- what's the status of the various components it's checking for?
(3) Given that the master VM isn't sshable after 300 seconds, it might be useful to print the master's serial console output after a failure (see https://cloud.google.com/compute/docs/reference/beta/instances/getSerialPortOutput) to see what's going on. Is it the network not getting set up? sshd not running? The account setup script not working? And if the existing kernel logs that are dumped to the serial console aren't enough, maybe the setup process should dump more of its logs there too. (A minimal sketch of that API call follows this list.)
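For item (3), a minimal sketch of fetching the master's serial console output with the Go client for the Compute API; the project, zone, and instance names are placeholders, and credentials are assumed to come from Application Default Credentials:

```go
package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	// Assumes Application Default Credentials with compute read access.
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatalf("creating compute client: %v", err)
	}
	// Placeholder project/zone/instance; after a failure this would be the e2e master VM.
	out, err := svc.Instances.GetSerialPortOutput("my-gce-project", "us-central1-b", "e2e-test-master").Context(ctx).Do()
	if err != nil {
		log.Fatalf("fetching serial port output: %v", err)
	}
	fmt.Println(out.Contents)
}
```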

Is anyone from @kubernetes/sig-cluster-lifecycle interested in helping out here?

@alex-mohr alex-mohr added help-wanted priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Aug 23, 2016
@alex-mohr alex-mohr added this to the v1.4 milestone Aug 23, 2016
@mikedanese
Member

This has been very frustrating because it conflates a bunch of underlying issues into a single flake that gets assigned to one person. It's also pretty common that the underlying issue that caused a particular failure isn't even discernible from the leftovers (GCS artifacts).

@timothysc
Member

/cc @kubernetes/sig-testing

@timothysc
Member

Is this really a P0 item that is going to get fixed properly in 1.4? This seems more like a P0 item for 1.5.

@bprashanth
Contributor

bprashanth commented Sep 7, 2016

The work here is combing through the O(10) issues that have the same failure mode, bucketing them into classes, and adding clarity, e.g. #20916, #28641, #22819.

Probably the most important piece is the startup log from the master: #27551 (comment), which is a cinch to get via ssh. This is also the hardest to debug on GKE because we need to context-switch to BigQuery and follow up after the fact, vs. having the master logs and startup script right there.

Agreed that this is not 1.4; we've lived with it for so long.
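On the startup-log point above, a minimal sketch of pulling those logs from the master over ssh from the test harness; the instance name, zone, and log paths are assumptions about what the kube-up scripts write on a GCE master, not the harness's actual behavior:

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

func main() {
	// Placeholder instance and zone; the log paths are assumptions about
	// what the startup scripts leave on the master.
	cmd := exec.Command("gcloud", "compute", "ssh", "e2e-test-master",
		"--zone", "us-central1-b",
		"--command", "sudo cat /var/log/startupscript.log /var/log/kube-apiserver.log 2>/dev/null")
	out, err := cmd.CombinedOutput()
	if err != nil {
		// The ssh failure itself is useful signal when the master never came up.
		log.Printf("ssh to master failed: %v", err)
	}
	fmt.Printf("%s", out)
}
```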

@ixdy ixdy modified the milestones: v1.5, v1.4 Sep 7, 2016
@jayunit100
Member

I saw @mikedanese's kube-anywhere project and am wondering whether that is the future of things?

@fejta
Contributor

fejta commented Sep 9, 2016

@rmmh @spxtr I seem to recall one of you saying you planned to add more detailed info to testgrid about where a suite failure occurs? I believe that is the relevant test-infra piece.

@spxtr
Contributor

spxtr commented Sep 10, 2016

I plan on making hack/e2e.go produce JUnit output. kubernetes/test-infra#76
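For reference, a minimal sketch of the kind of JUnit record such a change could emit for the "Up" step; the struct fields and values here are assumptions for illustration, not the actual kubernetes/test-infra#76 implementation:

```go
package main

import (
	"encoding/xml"
	"os"
)

// Minimal JUnit shapes; field names are assumptions, not the test-infra schema.
type junitFailure struct {
	Message string `xml:"message,attr"`
	Text    string `xml:",chardata"`
}

type junitCase struct {
	Name    string        `xml:"name,attr"`
	Time    float64       `xml:"time,attr"`
	Failure *junitFailure `xml:"failure,omitempty"`
}

type junitSuite struct {
	XMLName  xml.Name    `xml:"testsuite"`
	Tests    int         `xml:"tests,attr"`
	Failures int         `xml:"failures,attr"`
	Cases    []junitCase `xml:"testcase"`
}

func main() {
	suite := junitSuite{
		Tests:    1,
		Failures: 1,
		Cases: []junitCase{{
			Name: "Up",
			Time: 583.29, // "Step 'up' finished in 9m43s" from the quoted log
			Failure: &junitFailure{
				Message: "Error running up: exit status 2",
				Text:    "Cluster failed to initialize within 300 seconds.",
			},
		}},
	}
	enc := xml.NewEncoder(os.Stdout)
	enc.Indent("", "  ")
	if err := enc.Encode(suite); err != nil {
		panic(err)
	}
}
```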

@spxtr
Contributor

spxtr commented Sep 28, 2016

OK, we now have a finer signal when cluster-up fails compared to other failure causes. However, it's still very hard to know what to do when that 300-second barrier is hit.

I don't think @rmmh and I are the appropriate assignees here.

@spxtr spxtr unassigned rmmh and spxtr Oct 3, 2016
@bprashanth
Contributor

I suggest grabbing all of this into one dump, if you can ssh into a bad node: #34665 (comment)

@dims
Member

dims commented Nov 16, 2016

This needs to be triaged as a release-blocker (or not) for 1.5. @bprashanth @alex-mohr

@bprashanth
Contributor

An infra change is probably not going to block the release, though it's probably slowing down debugging of various cluster-up failures. Oddly, this release we haven't seen as many of those as we did in previous releases (he says, as we enter the stabilization period).

@saad-ali saad-ali added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Nov 18, 2016
@dims
Member

dims commented Dec 9, 2016

@alex-mohr Is it appropriate to move this to the next milestone or clear the 1.5 milestone? (and remove the non-release-blocker tag as well)

@mikedanese mikedanese removed this from the v1.5 milestone Jan 3, 2017
@roberthbailey
Contributor

@fejta has been bugging us about cluster creation failures with his recent test flake reports, so I think we can safely close this issue (which hasn't been updated this calendar year).
