Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Fixed a bug in newGVKFixupWatcher which caused the metadata informer to hang #1790

Conversation

wallrj
Copy link
Contributor

@wallrj wallrj commented Feb 3, 2022

Fixes: #1789

I copied some tests from client-go and observed that the original implementation of gvkFixupWatcher fails those tests:

go test ./pkg/cache/internal/... -timeout 2s
panic: test timed out after 2s

goroutine 34 [running]:
testing.(*M).startAlarm.func1()
	/home/richard/sdk/go1.17/src/testing/testing.go:1788 +0x8e
created by time.goFunc
	/home/richard/sdk/go1.17/src/time/sleep.go:180 +0x31

goroutine 1 [chan receive]:
testing.(*T).Run(0xc0003f49c0, {0x128a598, 0x46c533}, 0x1327bb0)
	/home/richard/sdk/go1.17/src/testing/testing.go:1307 +0x375
testing.runTests.func1(0xc0003f49c0)
	/home/richard/sdk/go1.17/src/testing/testing.go:1598 +0x6e
testing.tRunner(0xc0003f49c0, 0xc000187d18)
	/home/richard/sdk/go1.17/src/testing/testing.go:1259 +0x102
testing.runTests(0xc0003a9980, {0x1d4e810, 0x1, 0x1}, {0x48986d, 0x1285e22, 0x1d6a3c0})
	/home/richard/sdk/go1.17/src/testing/testing.go:1596 +0x43f
testing.(*M).Run(0xc0003a9980)
	/home/richard/sdk/go1.17/src/testing/testing.go:1504 +0x51d
main.main()
	_testmain.go:43 +0x14b

goroutine 19 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x0)
	/home/richard/go/pkg/mod/k8s.io/klog/v2@v2.30.0/klog.go:1181 +0x6a
created by k8s.io/klog/v2.init.0
	/home/richard/go/pkg/mod/k8s.io/klog/v2@v2.30.0/klog.go:420 +0xfb

goroutine 20 [sleep]:
time.Sleep(0x6fc23ac00)
	/home/richard/sdk/go1.17/src/runtime/time.go:193 +0x12e
sigs.k8s.io/controller-runtime/pkg/log.init.0.func1()
	/home/richard/projects/kubernetes-sigs/controller-runtime/pkg/log/log.go:63 +0x38
created by sigs.k8s.io/controller-runtime/pkg/log.init.0
	/home/richard/projects/kubernetes-sigs/controller-runtime/pkg/log/log.go:62 +0x25

goroutine 21 [chan receive]:
sigs.k8s.io/controller-runtime/pkg/cache/internal.TestGVKFixupWatcher.func1({0x141ab60, 0xc00045ad80})
	/home/richard/projects/kubernetes-sigs/controller-runtime/pkg/cache/internal/informers_map_test.go:105 +0x85
sigs.k8s.io/controller-runtime/pkg/cache/internal.TestGVKFixupWatcher(0xc0003f4b60)
	/home/richard/projects/kubernetes-sigs/controller-runtime/pkg/cache/internal/informers_map_test.go:121 +0x52b
testing.tRunner(0xc0003f4b60, 0x1327bb0)
	/home/richard/sdk/go1.17/src/testing/testing.go:1259 +0x102
created by testing.(*T).Run
	/home/richard/sdk/go1.17/src/testing/testing.go:1306 +0x35a
FAIL	sigs.k8s.io/controller-runtime/pkg/cache/internal	2.026s
FAIL

I've now re-implemented the gvkFixupWatcher as a watch.FilterFunc along with watch.Filter which seems to have the desired behaviour.

go test ./pkg/cache/internal/... -timeout 2s
ok  	sigs.k8s.io/controller-runtime/pkg/cache/internal	0.023s

This change allows the reflector to know when the ResultChannel is closed at which point it breaks from its watch loop and establishes a new watch:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 3, 2022
@k8s-ci-robot
Copy link
Contributor

Hi @wallrj. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Feb 3, 2022
@@ -434,12 +434,12 @@ func (w *gvkFixupWatcher) run() {
w.ch <- e
}
w.wg.Done()
close(w.ch)
Copy link
Member

@joelanford joelanford Feb 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be moved above w.wg.Done() so that there's a guarantee that w.ch is closed by the time w.Stop() returns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found watch.Filter which has the correct channel closing behaviour so I've re-implemented. See what you think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! TIL about watch.Filter.

@wallrj wallrj force-pushed the 1789-close-metadata-watch-result-channel branch from 0359498 to cab8dd1 Compare February 3, 2022 21:17
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 3, 2022
@wallrj wallrj changed the title 🐛 WIP: Close the metadata watch result channel earlier 🐛 Close the metadata watch result channel earlier Feb 3, 2022
watcher,
func(in watch.Event) (out watch.Event, keep bool) {
in.DeepCopyInto(&out)
out.Object.GetObjectKind().SetGroupVersionKind(gvk)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original implementation mutated the watch event.
I've used DeepCopy here, but I'm not sure if it's necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think DeepCopy is necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The in watch.Event is passed in by value, but it has a pointer to a runtime.Object (e.g. &metav1.PartialObjectMetadata ) which might be mutated elsewhere, so I thought it was safest to DeepCopy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, no strong objections to DeepCopy from me!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think it is necessary, but no strong objections from me too :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to remove deepcopy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Done.

pkg/cache/internal/informers_map_test.go Outdated Show resolved Hide resolved
f.Modify(newTestType("bar"))
f.Delete(newTestType("bar"))
f.Error(newTestType("error: blah"))
f.Stop()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: The test sends events to the wrapped watcher and then checks the results that come from the wrapper.

@joelanford
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 4, 2022
// TestGVKFixupWatcher tests that gvkFixupWatcher behaves like the watch that it
// wraps and that it overrides the GVK.
// Adapted from https://github.com/kubernetes/kubernetes/blob/adbda068c1808fcc8a64a94269e0766b5c46ec41/staging/src/k8s.io/apimachinery/pkg/watch/watch_test.go#L33-L78
func TestGVKFixupWatcher(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, the rest of controller-runtime's tests use Ginkgo/Gomega. Would you mind refactoring this test to be consistent with the rest of the repo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've re-implemented the tests using Gomega Expect , but tried to retain the spirit of the original tests so that they can be easily compared to the original.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also replaced the convoluted fake type with actual metav1.PartialObjectMetadata

Signed-off-by: Richard Wall <richard.wall@jetstack.io>
Signed-off-by: Richard Wall <richard.wall@jetstack.io>
@wallrj wallrj force-pushed the 1789-close-metadata-watch-result-channel branch from cab8dd1 to 5f4904e Compare February 4, 2022 09:54
@wallrj wallrj changed the title 🐛 Close the metadata watch result channel earlier 🐛 Fixed a bug in newGVKFixupWatcher which caused the metadata informer to hang Feb 4, 2022
@wallrj
Copy link
Contributor Author

wallrj commented Feb 4, 2022

Some evidence that the problem has been fixed:

Test fails with the old implementation
 richard   0fbdc4b  ~  projects  kubernetes-sigs  controller-runtime  1  go test ./pkg/cache/internal/... -v --ginkgo.v
=== RUN   TestSource
Running Suite: Cache Internal Suite
===================================
Random Seed: 1643975186
Will run 1 of 1 specs

gvkFixupWatcher 
  behaves like watch.FakeWatcher
  /home/richard/projects/kubernetes-sigs/controller-runtime/pkg/cache/internal/informers_map_test.go:35
STEP: Fixing up watch.EventType: ADDED and passing it on
STEP: Fixing up watch.EventType: MODIFIED and passing it on
STEP: Fixing up watch.EventType: MODIFIED and passing it on
STEP: Fixing up watch.EventType: DELETED and passing it on
STEP: Fixing up watch.EventType: ERROR and passing it on

• Failure [1.002 seconds]
gvkFixupWatcher
/home/richard/projects/kubernetes-sigs/controller-runtime/pkg/cache/internal/informers_map_test.go:34
  behaves like watch.FakeWatcher [It]
  /home/richard/projects/kubernetes-sigs/controller-runtime/pkg/cache/internal/informers_map_test.go:35

  Timed out after 1.001s.
  Expected
      <<-chan watch.Event | len:0, cap:0>: 0xc0000be540
  to be closed

  /home/richard/projects/kubernetes-sigs/controller-runtime/pkg/cache/internal/informers_map_test.go:79
------------------------------



Summarizing 1 Failure:

[Fail] gvkFixupWatcher [It] behaves like watch.FakeWatcher 
/home/richard/projects/kubernetes-sigs/controller-runtime/pkg/cache/internal/informers_map_test.go:79

Ran 1 of 1 Specs in 1.002 seconds
FAIL! -- 0 Passed | 1 Failed | 0 Pending | 0 Skipped
--- FAIL: TestSource (1.00s)
FAIL
FAIL	sigs.k8s.io/controller-runtime/pkg/cache/internal	1.039s
FAIL
Test succeeds with the new implementation
richard   1789-close-metadata-watch-result-channel  ~  projects  kubernetes-sigs  controller-runtime  1  go test ./pkg/cache/internal/... -v --ginkgo.v
=== RUN   TestSource
Running Suite: Cache Internal Suite
===================================
Random Seed: 1643975277
Will run 1 of 1 specs

gvkFixupWatcher 
  behaves like watch.FakeWatcher
  /home/richard/projects/kubernetes-sigs/controller-runtime/pkg/cache/internal/informers_map_test.go:35
STEP: Fixing up watch.EventType: ADDED and passing it on
STEP: Fixing up watch.EventType: MODIFIED and passing it on
STEP: Fixing up watch.EventType: MODIFIED and passing it on
STEP: Fixing up watch.EventType: DELETED and passing it on
STEP: Fixing up watch.EventType: ERROR and passing it on
•

Ran 1 of 1 Specs in 0.010 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestSource (0.01s)
PASS
ok  	sigs.k8s.io/controller-runtime/pkg/cache/internal	0.027s
A sample informer now closes and re-establishes watches regularly

https://github.com/wallrj/tmp-controller-runtime-partial-object-experiment/blob/44e2c5d088d891252edf36056d9297bd66a17d7d/cmd/ctrl-metadata-informer/main.go

diff --git a/go.mod b/go.mod
index fb929dc..99e989c 100644
--- a/go.mod
+++ b/go.mod
@@ -14,6 +14,8 @@ require (
        sigs.k8s.io/controller-runtime v0.11.0
 )
 
+replace sigs.k8s.io/controller-runtime => ../../kubernetes-sigs/controller-runtime
+
 require (
        cloud.google.com/go v0.81.0 // indirect
        github.com/Azure/go-autorest v14.2.0+incompatible // indirect
richard   master  ~  projects  wallrj  partial-object-watch  1  go run ./cmd/ctrl-metadata-informer --zap-log-level 8 --zap-time-encoding iso8601 --namespace test1

2022-02-04T10:21:54.413Z        LEVEL(-6)       Config loaded from file:  /home/richard/.kube/config
...


2022-02-04T11:38:59.523Z        INFO    add     {"name": "test1/foo-9673"}
2022-02-04T11:39:23.998Z        LEVEL(-4)       pkg/mod/k8s.io/client-go@v0.23.0/tools/cache/reflector.go:167: Watch close - *v1.PartialObjectMetadata total 15 items received

2022-02-04T11:39:23.998Z        INFO    GET https://127.0.0.1:46255/api/v1/namespaces/test1/secrets?allowWatchBookmarks=true&resourceVersion=78217&timeout=8m3s&timeoutSeconds=483&watch=true

2022-02-04T11:39:23.998Z        INFO    Request Headers:

2022-02-04T11:39:23.998Z        INFO        Accept: application/vnd.kubernetes.protobuf;as=PartialObjectMetadata;g=meta.k8s.io;v=v1,application/json;as=PartialObjectMetadata;g=meta.k8s.io;v=v1,application/json

2022-02-04T11:39:23.998Z        INFO        User-Agent: ctrl-metadata-informer/v0.0.0 (linux/amd64) kubernetes/$Format

2022-02-04T11:39:24.000Z        INFO    Response Status: 200 OK in 1 milliseconds

2022-02-04T11:39:24.000Z        INFO    Response Headers:

2022-02-04T11:39:24.000Z        INFO        Content-Type: application/vnd.kubernetes.protobuf;stream=watch

2022-02-04T11:39:24.000Z        INFO        Date: Fri, 04 Feb 2022 11:39:24 GMT

2022-02-04T11:39:24.000Z        INFO        Cache-Control: no-cache, private

2022-02-04T11:39:59.621Z        INFO    add     {"name": "test1/foo-25053"}
2022-02-04T11:40:59.744Z        INFO    add     {"name": "test1/foo-10768"}
2022-02-04T11:41:59.849Z        INFO    add     {"name": "test1/foo-7385"}
2022-02-04T11:42:59.958Z        INFO    add     {"name": "test1/foo-24315"}
2022-02-04T11:44:00.062Z        INFO    add     {"name": "test1/foo-26661"}
2022-02-04T11:45:00.151Z        INFO    add     {"name": "test1/foo-21396"}
2022-02-04T11:46:00.299Z        INFO    add     {"name": "test1/foo-23017"}
2022-02-04T11:47:00.451Z        INFO    add     {"name": "test1/foo-21648"}
2022-02-04T11:47:27.000Z        LEVEL(-4)       pkg/mod/k8s.io/client-go@v0.23.0/tools/cache/reflector.go:167: Watch close - *v1.PartialObjectMetadata total 17 items received

2022-02-04T11:47:27.001Z        INFO    GET https://127.0.0.1:46255/api/v1/namespaces/test1/secrets?allowWatchBookmarks=true&resourceVersion=79033&timeout=8m2s&timeoutSeconds=482&watch=true

2022-02-04T11:47:27.001Z        INFO    Request Headers:

2022-02-04T11:47:27.001Z        INFO        Accept: application/vnd.kubernetes.protobuf;as=PartialObjectMetadata;g=meta.k8s.io;v=v1,application/json;as=PartialObjectMetadata;g=meta.k8s.io;v=v1,application/json

2022-02-04T11:47:27.001Z        INFO        User-Agent: ctrl-metadata-informer/v0.0.0 (linux/amd64) kubernetes/$Format

2022-02-04T11:47:27.004Z        INFO    Response Status: 200 OK in 2 milliseconds

2022-02-04T11:47:27.004Z        INFO    Response Headers:

2022-02-04T11:47:27.004Z        INFO        Cache-Control: no-cache, private

2022-02-04T11:47:27.004Z        INFO        Content-Type: application/vnd.kubernetes.protobuf;stream=watch

2022-02-04T11:47:27.004Z        INFO        Date: Fri, 04 Feb 2022 11:47:27 GMT

2022-02-04T11:48:00.596Z        INFO    add     {"name": "test1/foo-207"}
2022-02-04T11:49:00.692Z        INFO    add     {"name": "test1/foo-5199"}

Copy link
Member

@joelanford joelanford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Thanks for all the help on this!

Comment on lines +412 to +428
// newGVKFixupWatcher adds a wrapper that preserves the GVK information when
// events come in.
//
// This works around a bug where GVK information is not passed into mapping
// functions when using the OnlyMetadata option in the builder.
// This issue is most likely caused by kubernetes/kubernetes#80609.
// See kubernetes-sigs/controller-runtime#1484.
//
// This was originally implemented as a cache.ResourceEventHandler wrapper but
// that contained a data race which was resolved by setting the GVK in a watch
// wrapper, before the objects are written to the cache.
// See kubernetes-sigs/controller-runtime#1650.
//
// The original watch wrapper was found to be incompatible with
// k8s.io/client-go/tools/cache.Reflector so it has been re-implemented as a
// watch.Filter which is compatible.
// See kubernetes-sigs/controller-runtime#1789.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 Thanks for capturing all of that history!

watcher,
func(in watch.Event) (out watch.Event, keep bool) {
in.DeepCopyInto(&out)
out.Object.GetObjectKind().SetGroupVersionKind(gvk)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, no strong objections to DeepCopy from me!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 4, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joelanford, wallrj

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@joelanford joelanford removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 4, 2022
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 4, 2022
@wallrj
Copy link
Contributor Author

wallrj commented Feb 5, 2022

/cc @vincepri because I see that you introduced the metadata only watches in ClusterAPI and which may have been affected by this bug since upgrading to controller-runtime v0.10.1 in kubernetes-sigs/cluster-api#5249

Copy link
Contributor

@FillZpp FillZpp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LG. Copyright should be 2022.

@@ -0,0 +1,94 @@
/*
Copyright 2018 The Kubernetes Authors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/2018/2022/

@@ -0,0 +1,31 @@
/*
Copyright 2018 The Kubernetes Authors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/2018/2022/

watcher,
func(in watch.Event) (out watch.Event, keep bool) {
in.DeepCopyInto(&out)
out.Object.GetObjectKind().SetGroupVersionKind(gvk)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think it is necessary, but no strong objections from me too :)

Signed-off-by: Richard Wall <richard.wall@jetstack.io>
@wallrj wallrj requested a review from FillZpp February 7, 2022 18:39
Signed-off-by: Richard Wall <richard.wall@jetstack.io>
@wallrj wallrj requested a review from vincepri February 7, 2022 21:03
@alvaroaleman alvaroaleman added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Feb 7, 2022
@FillZpp
Copy link
Contributor

FillZpp commented Feb 8, 2022

/lgtm

Thanks for all the help on this!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2022
@k8s-ci-robot k8s-ci-robot merged commit 273e608 into kubernetes-sigs:master Feb 8, 2022
@k8s-ci-robot k8s-ci-robot added this to the v0.10.x milestone Feb 8, 2022
@wallrj wallrj deleted the 1789-close-metadata-watch-result-channel branch February 8, 2022 07:38
@joelanford
Copy link
Member

Let's see if this works:

/cherry-pick release-0.11

@k8s-infra-cherrypick-robot

@joelanford: new pull request created: #1801

In response to this:

Let's see if this works:

/cherry-pick release-0.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reconciler stops responding after some minutes when using the builder.OnlyMetadata watch option
7 participants