
source.Kind and source.Channel interpret the ctx arg to their Start method differently #1346

Closed
alvaroaleman opened this issue Jan 19, 2021 · 10 comments · Fixed by #1428
Labels
kind/bug Categorizes issue or PR as related to a bug.


@alvaroaleman
Member

source.Channel will shut down when the context.Context passed to its Start gets cancelled, whereas source.Kind uses that arg only to get an Informer and then ignores it.

Because of that, it is currently not possible to implement a proper cacheSyncTimeout without breaking source.Channel.

We should:

  • Clarify in the interface definition what the context.Context arg to Start means
  • Update either source.Kind or source.Channel accordingly


@alvaroaleman
Member Author

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 19, 2021
@charith-elastic
Contributor

Hi @alvaroaleman, I am most likely missing something; I don't see why #1345 breaks waitForCacheSync. Isn't that case being handled here:

for _, watch := range c.startWatches {
	syncingSource, ok := watch.src.(source.SyncingSource)
	if !ok {
		continue
	}
	if err := func() error {
		// use a context with timeout for launching sources and syncing caches.
		sourceStartCtx, cancel := context.WithTimeout(ctx, c.CacheSyncTimeout)
		defer cancel()
		// WaitForSync waits for a definitive timeout, and returns if there
		// is an error or a timeout
		if err := syncingSource.WaitForSync(sourceStartCtx); err != nil {
			err := fmt.Errorf("failed to wait for %s caches to sync: %w", c.Name, err)
			c.Log.Error(err, "Could not wait for Cache to sync")
			return err
		}
		return nil
	}(); err != nil {
		return err
	}
}

@alvaroaleman
Member Author

@charith-elastic because if the cache is already started, which is usually the case as the manager does that, we will:

@charith-elastic
Contributor

I see. Thanks for the explanation.

@charith-elastic
Contributor

I have a couple of ideas about supporting cache sync timeout:

  • Before the controller starts any watches, get the cache and call WaitForCacheSync on it so that we know any subsequent use of it by the different Source implementations will work.
  • Pass the cache sync timeout as a context value (context.WithValue). Whoever calls WaitForCacheSync should check for the existence of the timeout key and construct a proper context using that value to ensure that the call times out eventually.

I am not very familiar with the codebase so there might be a more elegant solution that I am missing. Just wanted to give my $0.02 here. WDYT?

@alvaroaleman
Member Author

alvaroaleman commented Jan 21, 2021

Regarding the first option, the problem is that we have to first get an informer for the kind we are interested in; otherwise "synced" doesn't mean "synced for the object kind you care about".

The second option sounds like a great idea to me.

Today I also realized that when you changed the context to not have a timeout, you did not remove the test case that verifies that we error when the timeout is set too short. So that means at least one of:

  • That test case is broken
  • What I wrote above is not correct

If you have a bit of time to dig into that, that would be great!

@charith-elastic
Contributor

The test cases only check sources that are SyncingSources, so my change wouldn't affect them. I should have added a case for Channel, though; I didn't notice it was missing at that point.

I am a bit busy at the moment, so it might take a little while for me to start working on this. If anybody else wants to take it on in the meantime, leave a comment here so that I know you're working on it.

@varshaprasad96
Member

I have been following this thread. I can help with this and update the findings here.

@varshaprasad96
Member

varshaprasad96 commented Feb 6, 2021

I have created a PR which adds test cases to verify the controller with source.Channel. Digging into the code, the observations discussed here seem to be right. When a source.Kind is passed, we get the informer, during which we wait for caches to sync. If we don't pass a context with a timeout, we would still not be solving issue #1219. When source.Channel is used and cacheSyncTimeout is very short, we error while starting source.Channel.
To verify this: when I remove the cacheSyncTimeout implementation, the test case here does fail. And if we instead pass a context with a timeout while starting the event source, the other test in the PR fails.

@alvaroaleman
Member Author

alvaroaleman commented Feb 16, 2021

> I have created a PR which adds test cases to verify the controller with source.Channel. Digging into the code, the observations discussed here seem to be right. When a source.Kind is passed, we get the informer, during which we wait for caches to sync. If we don't pass a context with a timeout, we would still not be solving issue #1219. When source.Channel is used and cacheSyncTimeout is very short, we error while starting source.Channel.
> To verify this: when I remove the cacheSyncTimeout implementation, the test case here does fail. And if we instead pass a context with a timeout while starting the event source, the other test in the PR fails.

The initial reason for wanting the cache sync is to make binaries crash if they cannot establish a watch, which in most cases is caused by misconfigured RBAC. Right now there are two reasons why the test doesn't cover this properly:

  • Contrary to the Manager, it doesn't start the Cache before the controller starts the Source, so we never hit the code that blocks in *kind.Start
  • Even if we add that, the test would still succeed, because it times out in syncingSource.WaitForSync, presumably because establishing a watch takes more than 10 nanoseconds and we have no way of making source.Start block without a setup with RBAC

I didn't manage to get the "misconfigured RBAC" scenario running with the envtest-based tests in 10 minutes, because they use the insecure port, which implies the AlwaysAllow authorizer, and changing that and getting a kubeconfig that is valid but does not have authz to list deployments is not trivial. The issue can be reproduced quickly via kind though by:

  1. applying this patch:
--- a/pkg/internal/controller/controller_test.go
+++ b/pkg/internal/controller/controller_test.go
@@ -32,6 +32,7 @@ import (
        "k8s.io/client-go/util/workqueue"
        "sigs.k8s.io/controller-runtime/pkg/cache"
        "sigs.k8s.io/controller-runtime/pkg/cache/informertest"
+       "sigs.k8s.io/controller-runtime/pkg/client/config"
        "sigs.k8s.io/controller-runtime/pkg/controller/controllertest"
        "sigs.k8s.io/controller-runtime/pkg/handler"
        ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/internal/controller/metrics"
@@ -122,12 +123,20 @@ var _ = Describe("controller", func() {
                        close(done)
                })
 
-               It("should error when cache sync timeout occurs", func(done Done) {
+               FIt("should error when cache sync timeout occurs", func(done Done) {
                        ctrl.CacheSyncTimeout = 10 * time.Nanosecond
 
+                       cfg := config.GetConfigOrDie()
                        c, err := cache.New(cfg, cache.Options{})
                        Expect(err).NotTo(HaveOccurred())
 
+                       go func() {
+                               Expect(c.Start(context.Background())).NotTo(HaveOccurred())
+                       }()
+
+                       c.WaitForCacheSync(context.Background())
+                       println("Inital cache sync in test succeeded")
+
                        ctrl.startWatches = []watchDescription{{
                            
  2. Using kind to start a cluster: kind create cluster
  3. Getting a kubeconfig for the default SA that does not have perms to list deployments: oc serviceaccounts create-kubeconfig default >/tmp/default.kk; export KUBECONFIG=/tmp/default.kk
  4. Watching the test time out:
$ go test -v ./pkg/internal/controller/
=== RUN   TestSource
Running Suite: Controller internal Suite
========================================
Random Seed: 1613448182
Will run 1 of 33 specs

SSSInital cache sync in test succeeded
E0215 23:03:09.060780  499651 reflector.go:138] pkg/mod/k8s.io/client-go@v0.21.0-alpha.2/tools/cache/reflector.go:167: Failed to watch *v1.Deployment: failed to list *v1.Deployment: deployments.apps is forbidden: User "system:serviceaccount:default:default" cannot list resource "deployments" in API group "apps" at the cluster scope

------------------------------
•... Timeout [1.000 seconds]
controller
/home/alvaro/git/golang/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller_test.go:45
  Start
  /home/alvaro/git/golang/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller_test.go:89
    should error when cache sync timeout occurs [It]
    /home/alvaro/git/golang/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller_test.go:126

    Timed out

    /home/alvaro/git/golang/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller_test.go:126
