Cluster setup in e2e tests not reliable enough nor transparent enough #31273

Closed
alex-mohr opened this issue Aug 23, 2016 · 13 comments
Labels
area/kubelet area/test area/test-infra priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.

Comments

@alex-mohr
Contributor

There are a large number of e2e failures where the cluster fails to come up in 300 seconds, e.g. https://storage.googleapis.com/kubernetes-jenkins/logs/kubernetes-e2e-gce-examples/14127/build-log.txt

"""
Waiting up to 300 seconds for cluster initialization.

This will continually check to see if the API for kubernetes is reachable.
This may time out if there was some uncaught error during start up.

...........................................Cluster failed to initialize within 300 seconds.
2016/08/22 07:55:20 e2e.go:453: Error running up: exit status 2
2016/08/22 07:55:20 e2e.go:449: Step 'up' finished in 9m43.290764216s
"""

A bunch of issues:
(1) How do we know whether 300 seconds is enough time? Should it add another post-failure check period so we can tell whether it's slowness vs. complete failure?
(2) Printing a bunch of dots is insufficient transparency -- what's the status of the various components it's checking for?
(3) Given that the master VM isn't sshable after 300 seconds, it might be useful to print the master's serial console output after a failure (see https://cloud.google.com/compute/docs/reference/beta/instances/getSerialPortOutput) to see what's going on. Is it the network not getting set up? sshd not running? The account setup script not working? And if the existing kernel logs that are dumped to the serial console aren't enough, maybe the setup process should dump more of its logs there too. (A minimal sketch of that API call follows this list.)
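For item (3), a minimal sketch of fetching the master's serial console output with the Go client for the Compute API; the project, zone, and instance names are placeholders, and credentials are assumed to come from Application Default Credentials:

```go
package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	// Assumes Application Default Credentials with compute read access.
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatalf("creating compute client: %v", err)
	}
	// Placeholder project/zone/instance; after a failure this would be the e2e master VM.
	out, err := svc.Instances.GetSerialPortOutput("my-gce-project", "us-central1-b", "e2e-test-master").Context(ctx).Do()
	if err != nil {
		log.Fatalf("fetching serial port output: %v", err)
	}
	fmt.Println(out.Contents)
}
```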

Is anyone from @kubernetes/sig-cluster-lifecycle interested in helping out here?

@alex-mohr alex-mohr added help-wanted priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Aug 23, 2016
@alex-mohr alex-mohr added this to the v1.4 milestone Aug 23, 2016
@mikedanese
Member

This has been very frustrating because it conflates a bunch of underlying issues into a single flake that gets assigned to one person. It's also pretty common that the underlying issue that caused a particular failure isn't even discernible from the leftovers (GCS artifacts).

@timothysc
Member

/cc @kubernetes/sig-testing

@timothysc
Member

Is this really a P0 item that is going to get fixed properly in 1.4? This seems more like a P0 item for 1.5.

@bprashanth
Contributor

bprashanth commented Sep 7, 2016

The work here is combing through the O(10) issues that have the same failure mode, bucketing them into classes, and adding clarity, e.g. #20916, #28641, #22819.

Probably the most important piece is the startup log from the master: #27551 (comment), which is a cinch to get via ssh. This is also the hardest to debug on GKE because we need to context-switch to BigQuery and follow up after the fact, vs. having the master logs and startup script right there.

Agreed that this is not 1.4; we've lived with it for so long.
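On the startup-log point above, a minimal sketch of pulling those logs from the master over ssh from the test harness; the instance name, zone, and log paths are assumptions about what the kube-up scripts write on a GCE master, not the harness's actual behavior:

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

func main() {
	// Placeholder instance and zone; the log paths are assumptions about
	// what the startup scripts leave on the master.
	cmd := exec.Command("gcloud", "compute", "ssh", "e2e-test-master",
		"--zone", "us-central1-b",
		"--command", "sudo cat /var/log/startupscript.log /var/log/kube-apiserver.log 2>/dev/null")
	out, err := cmd.CombinedOutput()
	if err != nil {
		// The ssh failure itself is useful signal when the master never came up.
		log.Printf("ssh to master failed: %v", err)
	}
	fmt.Printf("%s", out)
}
```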

@ixdy ixdy modified the milestones: v1.5, v1.4 Sep 7, 2016
@jayunit100
Member

I saw @mikedanese's kube-anywhere project and am wondering whether that is the future of things?

@fejta
Contributor

fejta commented Sep 9, 2016

@rmmh @spxtr I seem to recall one of you saying you planned to add more detailed info to testgrid about where a suite failure occurs? I believe that is the relevant test-infra piece.

@spxtr
Contributor

spxtr commented Sep 10, 2016

I plan on making hack/e2e.go produce JUnit output. kubernetes/test-infra#76
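For reference, a minimal sketch of the kind of JUnit record such a change could emit for the "Up" step; the struct fields and values here are assumptions for illustration, not the actual kubernetes/test-infra#76 implementation:

```go
package main

import (
	"encoding/xml"
	"os"
)

// Minimal JUnit shapes; field names are assumptions, not the test-infra schema.
type junitFailure struct {
	Message string `xml:"message,attr"`
	Text    string `xml:",chardata"`
}

type junitCase struct {
	Name    string        `xml:"name,attr"`
	Time    float64       `xml:"time,attr"`
	Failure *junitFailure `xml:"failure,omitempty"`
}

type junitSuite struct {
	XMLName  xml.Name    `xml:"testsuite"`
	Tests    int         `xml:"tests,attr"`
	Failures int         `xml:"failures,attr"`
	Cases    []junitCase `xml:"testcase"`
}

func main() {
	suite := junitSuite{
		Tests:    1,
		Failures: 1,
		Cases: []junitCase{{
			Name: "Up",
			Time: 583.29, // "Step 'up' finished in 9m43s" from the quoted log
			Failure: &junitFailure{
				Message: "Error running up: exit status 2",
				Text:    "Cluster failed to initialize within 300 seconds.",
			},
		}},
	}
	enc := xml.NewEncoder(os.Stdout)
	enc.Indent("", "  ")
	if err := enc.Encode(suite); err != nil {
		panic(err)
	}
}
```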

@spxtr
Contributor

spxtr commented Sep 28, 2016

OK, we now have a finer signal when cluster-up fails compared to other failure causes. However, it's still very hard to know what to do when that 300-second barrier is hit.

I don't think @rmmh and I are the appropriate assignees here.

@spxtr spxtr unassigned rmmh and spxtr Oct 3, 2016
@bprashanth
Contributor

I suggest grabbing all of this into one dump, if you can ssh into a bad node: #34665 (comment)

@dims
Member

dims commented Nov 16, 2016

This needs to be triaged as a release-blocker (or not) for 1.5. @bprashanth @alex-mohr

@bprashanth
Contributor

An infra change is probably not going to block the release, though it's probably slowing down debugging of various cluster-up failures. Oddly, this release we haven't seen as many of those as we did in previous releases (he says, as we enter the stabilization period).

@saad-ali saad-ali added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Nov 18, 2016
@dims
Member

dims commented Dec 9, 2016

@alex-mohr Is it appropriate to move this to the next milestone or clear the 1.5 milestone? (and remove the non-release-blocker tag as well)

@mikedanese mikedanese removed this from the v1.5 milestone Jan 3, 2017
@roberthbailey
Contributor

@fejta has been bugging us about cluster creation failures with his recent test flake reports, so I think we can safely close this issue (which hasn't been updated this calendar year).
