
"master upgrade should maintain a functioning cluster" failing #103697

Closed
liggitt opened this issue Jul 14, 2021 · 34 comments
Closed

"master upgrade should maintain a functioning cluster" failing #103697

liggitt opened this issue Jul 14, 2021 · 34 comments
Assignees
Labels
kind/failing-test · priority/important-soon · sig/testing · triage/accepted
Milestone

Comments

liggitt (Member) commented Jul 14, 2021

Which jobs are failing:

ci-kubernetes-e2e-gce-stable1-beta-upgrade-master

Which test(s) are failing:

"master upgrade should maintain a functioning cluster"

Since when has it been failing:

Since #99857 merged

On 2021-07-09, the extract step also started failing.

Testgrid link:

The testgrid titles are misleading; these jobs are upgrading from 1.21 to 1.22.

Reason for failure:

Refactor broke ginkgo usage.

/assign @wojtek-t
/cc @zshihang

Anything else we need to know:

liggitt added the kind/failing-test label Jul 14, 2021
k8s-ci-robot added the needs-sig and needs-triage labels Jul 14, 2021
liggitt added this to the v1.22 milestone Jul 14, 2021
liggitt added the sig/testing label Jul 14, 2021
k8s-ci-robot removed the needs-sig label Jul 14, 2021
liggitt (Member, Author) commented Jul 14, 2021

The current blocking failure is "Failed to set release from https://storage.googleapis.com/k8s-release-dev/ci/k8s-beta.txt (Unexpected HTTP status code: 404)".

liggitt (Member, Author) commented Jul 14, 2021

that looks related to kubernetes/release@908c081

cc @spiffxp

liggitt (Member, Author) commented Jul 14, 2021

For reference, the auth upgrade jobs (which upgrade from ci/latest-1.20 to ci/latest) are currently green, currently upgrading from v1.20.9-rc.0.20+66e6d5ee1fa946 to v1.22.0-beta.1.157+e375563732a6f5.

https://testgrid.k8s.io/sig-auth-gce#upgrade-tests

liggitt added the priority/important-soon and triage/accepted labels Jul 14, 2021
k8s-ci-robot removed the needs-triage label Jul 14, 2021
wojtek-t (Member):

This is super strange. From an example failure:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-master/1413528078591725568

Kubernetes e2e suite: [sig-cloud-provider-gcp] Upgrade [Feature:Upgrade] master upgrade should maintain a functioning cluster [Feature:MasterUpgrade] (7m25s)

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/cloud/gcp/cluster_upgrade.go:57
You may only call BeforeEach from within a Describe, Context or When
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:185

So clearly, the problem is connected to the framework.NewDefaultFramework call.

Now, this is how the auth upgrade job looks:

33 var _ = SIGDescribe("ServiceAccount admission controller migration [Feature:BoundServiceAccountTokenVolume]", func() {
34	f := framework.NewDefaultFramework("serviceaccount-admission-controller-migration")
35	testFrameworks := upgrades.CreateUpgradeFrameworks(upgradeTests)
36
37	ginkgo.Describe("master upgrade", func() {
38		ginkgo.It("should maintain a functioning cluster", func() {
39 			upgCtx, err := common.GetUpgradeContext(f.ClientSet.Discovery())
                        ...

vs the failing cluster upgrade:

51 var _ = SIGDescribe("Upgrade [Feature:Upgrade]", func() {
52 	f := framework.NewDefaultFramework("cluster-upgrade")
53 	testFrameworks := upgrades.CreateUpgradeFrameworks(upgradeTests)
54 
55 	// Create the frameworks here because we can only create them
56	// in a "Describe".
57	ginkgo.Describe("master upgrade", func() {
58		ginkgo.It("should maintain a functioning cluster [Feature:MasterUpgrade]", func() {
59			upgCtx, err := common.GetUpgradeContext(f.ClientSet.Discovery())
                        ...

They look literally identical to me, yet one fails and the other passes. I will keep looking...
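
For context, the ginkgo rule behind that error message: setup nodes such as BeforeEach may only be registered while ginkgo is still building the spec tree, i.e. inside a Describe/Context/When body, and not once specs are already running. framework.NewDefaultFramework registers a BeforeEach internally, so it is subject to the same rule. A minimal self-contained sketch of the legal shape (assuming ginkgo v1, as vendored by kubernetes at the time; this is illustrative only, not the e2e framework code):

package upgrade_test

import (
	"testing"

	"github.com/onsi/ginkgo"
	"github.com/onsi/gomega"
)

var _ = ginkgo.Describe("master upgrade", func() {
	// OK: registered during tree construction, inside a Describe.
	// This is what framework.NewDefaultFramework does internally.
	ginkgo.BeforeEach(func() {
		// per-spec setup (client creation, namespace, etc.) runs here
	})

	ginkgo.It("should maintain a functioning cluster", func() {
		// Calling ginkgo.BeforeEach from here, while specs are already
		// running, would fail with "You may only call BeforeEach from
		// within a Describe, Context or When".
		gomega.Expect(1 + 1).To(gomega.Equal(2))
	})
})

func TestUpgrade(t *testing.T) {
	gomega.RegisterFailHandler(ginkgo.Fail)
	ginkgo.RunSpecs(t, "upgrade sketch")
}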

wojtek-t (Member):

Ironically, the tests seem to actually be running, e.g. from the logs of the run above:

...
I0709 16:06:37.901] [sig-cloud-provider-gcp] Upgrade [Feature:Upgrade] master upgrade
I0709 16:06:37.901]   should maintain a functioning cluster [Feature:MasterUpgrade]
I0709 16:06:37.901]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/cloud/gcp/cluster_upgrade.go:57
I0709 16:06:37.901] [BeforeEach] [sig-cloud-provider-gcp] Upgrade [Feature:Upgrade]
I0709 16:06:37.902]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:185
I0709 16:06:37.902] STEP: Creating a kubernetes client
I0709 16:06:37.902] Jul  9 16:06:37.892: INFO: >>> kubeConfig: /workspace/.kube/config
I0709 16:06:37.902] STEP: Building a namespace api object, basename cluster-upgrade
I0709 16:06:38.042] Jul  9 16:06:38.042: INFO: No PodSecurityPolicies found; assuming PodSecurityPolicy is disabled.
I0709 16:06:38.042] W0709 16:06:38.042033   10421 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
...
I0709 16:14:00.610] Jul  9 16:14:00.610: INFO: Checking control plane version
I0709 16:14:00.719] Jul  9 16:14:00.718: INFO: Control plane is at version 1.21.3-rc.0.10+5f8f5ab3268b41
I0709 16:14:00.719] STEP: Disruption complete; stopping async validations
I0709 16:14:00.719] STEP: Waiting for async validations to complete
I0709 16:14:00.719] [AfterEach] [sig-cloud-provider-gcp] Upgrade [Feature:Upgrade]
I0709 16:14:00.719]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:186

So it seems that the tests are actually running (and passing); it's just that at the end this strange error is reported...

spiffxp (Member) commented Jul 15, 2021

Failed to set release from https://storage.googleapis.com/k8s-release-dev/ci/k8s-beta.txt (Unexpected HTTP status code: 404)

I can't find a job that sets k8s-beta, so I will copy it, but it will be pointing to a stale build anyway:

$ gsutil cat gs://kubernetes-release-dev/ci/k8s-beta.txt
v1.21.3-rc.0.10+5f8f5ab3268b41
$ gsutil cat gs://kubernetes-release-dev/ci/latest-1.21.txt
v1.21.3-rc.0.26+f872796713459c
$ gsutil cat gs://k8s-release-dev/ci/latest-1.21.txt; echo
v1.21.3-rc.0.26+f872796713459c
$ gsutil cp gs://kubernetes-release-dev/ci/k8s-beta.txt gs://k8s-release-dev/ci
Copying gs://kubernetes-release-dev/ci/k8s-beta.txt [Content-Type=text/plain]...
/ [1 files][   31.0 B/   31.0 B]
Operation completed over 1 objects/31.0 B.

This is our usual fun dance of k8s-beta not meaning anything until a release branch gets cut, ref: kubernetes/sig-release#850
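
For context, resolving a version marker amounts to an HTTP GET of a one-line text file in GCS, and a missing marker is exactly the 404 reported above. A minimal Go sketch of the idea (illustrative only, not kubetest's actual implementation; resolveMarker is a hypothetical helper):

package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// resolveMarker fetches a CI version marker (e.g. "k8s-beta") and returns
// the build version it points at (e.g. "v1.21.3-rc.0.10+5f8f5ab3268b41").
func resolveMarker(marker string) (string, error) {
	url := fmt.Sprintf("https://storage.googleapis.com/k8s-release-dev/ci/%s.txt", marker)
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		// A deleted or never-written marker surfaces like the failure
		// above: "Unexpected HTTP status code: 404".
		return "", fmt.Errorf("unexpected HTTP status code: %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(body)), nil
}

func main() {
	version, err := resolveMarker("k8s-beta")
	if err != nil {
		fmt.Println("Failed to set release:", err)
		return
	}
	fmt.Println("resolved:", version)
}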

spiffxp (Member) commented Jul 15, 2021

$ for m in latest latest-1.22 k8s-master; do echo $m: $(gsutil cat gs://k8s-release-dev/ci/$m.txt); done
latest: v1.22.0-beta.2.3+f5bc129a9916a1
latest-1.22: v1.22.0-beta.2.3+f5bc129a9916a1
k8s-master: v1.22.0-beta.2.3+f5bc129a9916a1

I forget exactly how this all plays with job rotation once release-branch jobs are cut. I would personally be fine with moving away from the k8s-foo markers towards hardcoded numbers, but I would rather defer to @kubernetes/release-engineering

/assign @cpanato @puerco @justaugustus

spiffxp (Member) commented Jul 15, 2021

It looks like kubernetes/test-infra#22790 is what stopped writing gs://kubernetes-release-dev/ci/k8s-beta.txt

spiffxp (Member) commented Jul 15, 2021

that looks related to kubernetes/release@908c081

cc @spiffxp

That wasn't it, but thanks for tagging me in, because I was definitely the final straw on this particular camel's back. The extract step failing on 2021-07-09 means it's either kubernetes/test-infra#22840 or one of its followup PRs mentioned in kubernetes/k8s.io#2318 (comment)

spiffxp (Member) commented Jul 15, 2021

Maybe #99857 (comment) is the culprit for the first problem?

wojtek-t (Member):

See #103697 (comment) above [that change was reverted in one of the subsequent PRs].

wojtek-t (Member):

Namely - this one: #101118

wojtek-t (Member):

I guess I know what's happening: #101118 has to be cherry-picked back to 1.21.
Will open a cherry-pick later today [need to run now].

wojtek-t (Member):

@liggitt - #103712 is out for review

liggitt (Member, Author) commented Jul 15, 2021

I can't find a job that sets k8s-beta, so I will copy it, but it will be pointing to a stale build anyway

Thanks, latest run passed the extract step and we're back to the ginkgo describe error. That should be fixed by #103712.

Once https://storage.googleapis.com/k8s-release-dev/ci/k8s-stable1.txt updates to v1.21.3-rc.0.28+4aa451e8458a7c, can you push that to k8s-beta.txt as a stopgap while we figure out whether we should start pushing k8s-beta from CI again or update the job config?

neolit123 (Member) commented Jul 15, 2021 via email

liggitt (Member, Author) commented Jul 15, 2021

it looks like the ci/k8s-stable1.txt marker has the HEAD build for the most recent released minor version branch... isn't that what you want?

neolit123 (Member):

I guess it would. I may have been misled by the 'stable' in the name. In practice the tooling can now do these calculations and it's easy to say 'this upgrade job does latest-1 to latest upgrades', but one still needs to PR test-infra. And IIRC the original intent of the *stable markers was mainly to avoid PRing test-infra on each release.

liggitt (Member, Author) commented Jul 15, 2021

looks like CI build v1.21.3-rc.0.28+4aa451e8458a7c is available now

spiffxp (Member) commented Jul 15, 2021

can you push that to k8s-beta.txt as a stopgap while we figure out whether we should start pushing k8s-beta from CI again or update the job config?

Done

$ gsutil cat gs://k8s-release-dev/ci/latest-1.21.txt; echo
v1.21.3-rc.0.28+4aa451e8458a7c
$ gsutil cp gs://k8s-release-dev/ci/latest-1.21.txt gs://k8s-release-dev/ci/k8s-beta.txt
Copying gs://k8s-release-dev/ci/latest-1.21.txt [Content-Type=text/plain]...
/ [1 files][   30.0 B/   30.0 B]
Operation completed over 1 objects/30.0 B.

liggitt (Member, Author) commented Jul 15, 2021

awesome, thanks

spiffxp (Member) commented Jul 15, 2021

And IIRC the original intent of the *stable markers was mainly to not PR test infra on each release.

If we're going to make a decision on what we think version markers should look like, that's what kubernetes/sig-release#850 is for. IMO the contributor clarity gained by hardcoding version numbers would far outweigh the toil of updating them every N months.

neolit123 (Member):

IMO the contributor clarity gained by hardcoding version numbers would far outweigh the toil of updating them every N months.

I agree since some markers have been a bit cryptic. Tooling can help with the regular updates too.
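
As an illustration of the "tooling can help" point: deriving the latest-N marker names from a single hardcoded current minor version is trivial, so a hardcoded-number scheme need not mean hand-editing every job. Everything below is a hypothetical sketch, not an existing tool:

package main

import "fmt"

// markerForOffset returns the CI marker N minors behind the given
// current release: offset 0 => "latest-1.22", offset 1 => "latest-1.21".
func markerForOffset(major, minor, offset int) string {
	return fmt.Sprintf("latest-%d.%d", major, minor-offset)
}

func main() {
	// e.g. a "latest-1 to latest" upgrade job at the time of this issue:
	fmt.Println("upgrade from:", markerForOffset(1, 22, 1)) // latest-1.21
	fmt.Println("upgrade to:  ", markerForOffset(1, 22, 0)) // latest-1.22
}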

spiffxp (Member) commented Jul 16, 2021

/assign
keeping an eye on the latest runs

wojtek-t (Member):

OK, so the cluster upgrade tests became green again after the change:
https://testgrid.k8s.io/google-gce-upgrade#gce-1.13-1.14-upgrade-cluster&width=25

But it seems that the master upgrade is panicking (I don't know how it worked for me before). It's a typo; going to send out a fix soon.

wojtek-t (Member):

fix in master branch: #103733
cherrypick to 1.21: #103734

liggitt (Member, Author) commented Jul 16, 2021

#103734 is merged and should resolve the last failure

as long as the job points to ci/k8s-beta.txt, I guess we'll need one more bump of that file once https://storage.googleapis.com/k8s-release-dev/ci/k8s-stable1.txt updates to v1.21.4-rc.0.3+0e1bd6ab564...

liggitt (Member, Author) commented Jul 16, 2021

Also opened kubernetes/test-infra#22915 to fix up the testgrid tab names and make these tests use ci/latest instead of ci/k8s-beta (happy to rework that in the future if k8s-beta starts being automatically populated again).

liggitt (Member, Author) commented Jul 16, 2021

I guess we'll need one more bump of that file once https://storage.googleapis.com/k8s-release-dev/ci/k8s-stable1.txt updates to v1.21.4-rc.0.3+0e1bd6ab564...

looks like it's ready now

spiffxp (Member) commented Jul 16, 2021

I got pulled away from the keyboard; I'll sync shortly, though the test-infra PR will probably obviate the need.

spiffxp (Member) commented Jul 16, 2021

$ gsutil cp gs://kubernetes-release-dev/ci/latest-1.21.txt gs://k8s-release-dev/ci/k8s-beta.txt
Copying gs://kubernetes-release-dev/ci/latest-1.21.txt [Content-Type=text/plain]...
/ [1 files][   30.0 B/   30.0 B]
Operation completed over 1 objects/30.0 B.

$ for m in k8s-beta latest-1.21 k8s-stable1; do echo $m: $(gsutil cat gs://k8s-release-dev/ci/$m.txt); done
k8s-beta: v1.21.4-rc.0.3+0e1bd6ab564a0a
latest-1.21: v1.21.4-rc.0.3+0e1bd6ab564a0a
k8s-stable1: v1.21.4-rc.0.3+0e1bd6ab564a0a

wojtek-t (Member):

OK, so the upgrade itself works now.

However, there are gazillions of storage tests that started failing after Jordan switched the job to use ci/latest.

@liggitt - should we close this one and open a separate bug for the failing storage tests?

liggitt (Member, Author) commented Jul 21, 2021

@liggitt - should we close this one and open a separate bug for the failing storage tests?

probably so

wojtek-t (Member):

Opened #103822

Closing this one as resolved.
