
Webhooks are not guaranteed to start before cache sync is started #1685

Closed

randomvariable opened this issue Oct 7, 2021 · 13 comments

@randomvariable
Member

randomvariable commented Oct 7, 2021

While debugging kubernetes-sigs/cluster-api-provider-aws#2834 we spotted the following behaviour:

If resources are present in the API server/etcd at an older storage version AND --leader-election is set to false when starting controllers with a newer CRD version, then there is a high probability that the conversion webhooks will not start before the cache syncs begin. Because the conversion webhooks are not available and a mutex is held, this indefinitely delays the start of the webhook server. On some infrastructures the failure rate is closer to 90% (e.g. Amazon EC2 t3.large, 2 vCPU, 8 GB RAM).
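
For reference, a minimal sketch of the kind of manager setup involved (names and scheme setup are illustrative, not taken from the affected project):

```go
// Minimal sketch of a manager started with leader election disabled -- the
// configuration under which we see the failure. Scheme setup and webhook
// registration are elided; names are illustrative.
package main

import (
	"os"

	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	scheme := runtime.NewScheme()
	// Both the old (v1alpha3) and new (v1beta1) API types would be added here.

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Scheme:         scheme,
		LeaderElection: false, // i.e. --leader-election=false
	})
	if err != nil {
		os.Exit(1)
	}

	// Conversion/validating/mutating webhooks and reconcilers are registered here...

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```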

When leader election is enabled, the probability drops to around 20%, and when health checks are configured for the webhooks, enough restarts by the kubelet eventually lead to the conversion webhooks running early enough for the cache syncs to complete.

4e548fa was intended to ensure that webhooks start first, but this doesn't seem to have had the desired effect.

We believe the occurrence is lower when leader election is enabled because the election process delays the call to cm.startLeaderElectionRunnables() until after the election has completed.

Example logs are here: capi-logs.txt

Controller-runtime-version: v0.10.2
Affected consumers: Cluster API v1.0.0

cc @fabriziopandini @sbueringer

@sbueringer
Member

sbueringer commented Oct 7, 2021

A concrete example of how we end up with a deadlock:

  • we have a CRD with v1alpha3 and v1beta1 versions including a conversion webhook
  • the old controller uses the v1alpha3 version of the CRD
  • the new controller uses the v1beta1 version of the CRD
  • the controllers have leader election disabled

Now the following happens:

  • we delete the old controller (and wait until the Pod is gone)
  • we deploy the new controller

In the new controller:

  • webhooks and reconcilers are set up and mgr.Start() is called
  • The interesting part is l.479++:
    • a goroutine for cm.startNonLeaderElectionRunnables() is started
    • a goroutine for l.482++ is started; in our case (leader election is disabled) this runs cm.startLeaderElectionRunnables()
  • Both cm.startLeaderElectionRunnables() and cm.startNonLeaderElectionRunnables() try to cm.mu.Lock()
  • In almost all cases cm.startLeaderElectionRunnables() is faster (we had pods with >22 restarts and always hit the deadlock)
  • cm.startLeaderElectionRunnables() then tries to start the controllers, but calls cm.waitForCache(cm.internalCtx) first
  • cm.waitForCache(cm.internalCtx) fails with
    Failed to watch *v1beta1.Machine: failed to list *v1beta1.Machine: conversion webhook for cluster.x-k8s.io/v1alpha3,
    Kind=Machine failed: Post "https://capi-webhook-service.capi-system.svc:443/convert?timeout=30s": dial tcp
    10.96.138.92:443: connect: connection refused
    because the webhook is not started yet, and it never will be, because of the mutex

>> deadlock (see the simplified sketch below)
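
A paraphrased, simplified sketch of the startup path described above (not the actual controller-runtime source; the names follow the discussion, everything else is elided):

```go
package manager

import (
	"context"
	"sync"
)

// Paraphrased sketch of the v0.10 controllerManager startup -- an
// illustration of the race only, not the real implementation.
type controllerManager struct {
	mu          sync.Mutex
	internalCtx context.Context
	// runnables, cache, webhook server, ...
}

func (cm *controllerManager) Start(ctx context.Context) error {
	// l.479++: the webhook server and the other non-leader-election
	// runnables are started in their own goroutine.
	go cm.startNonLeaderElectionRunnables()

	// l.482++: with leader election disabled, this also runs immediately.
	go cm.startLeaderElectionRunnables()

	// ... wait for stop signal / errors ...
	return nil
}

func (cm *controllerManager) startLeaderElectionRunnables() {
	cm.mu.Lock() // almost always wins the race for the mutex
	defer cm.mu.Unlock()

	// Blocks until the cache has synced -- but syncing needs the conversion
	// webhook, which is started by startNonLeaderElectionRunnables, which is
	// itself stuck on cm.mu.Lock() => deadlock.
	cm.waitForCache(cm.internalCtx)

	// start the controllers ...
}

func (cm *controllerManager) startNonLeaderElectionRunnables() {
	cm.mu.Lock() // blocked for as long as startLeaderElectionRunnables holds the lock
	defer cm.mu.Unlock()

	// start the webhook server, then the remaining non-leader-election runnables ...
}

func (cm *controllerManager) waitForCache(ctx context.Context) {
	// starts the cache (if not already started) and blocks until it has synced ...
}
```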

(xref: log of a CAPI controller stuck in that deadlock: manager.log (notably absent: a "starting webhook server" log line), Job)

Enabling leader election is a valid workaround as it delays cm.startLeaderElectionRunnables() enough so that cm.startNonLeaderElectionRunnables() is able to lock the mutex first. As a result the webhooks are started and the cache can be filled.
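
In manager options terms, the workaround is just flipping leader election on (same setup as the sketch in the issue description; the election ID below is illustrative):

```go
// Workaround sketch: with leader election enabled, cm.startLeaderElectionRunnables()
// only runs after the election completes, so the webhook server wins the race
// for the mutex. The LeaderElectionID value is illustrative.
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme:           scheme,
	LeaderElection:   true,
	LeaderElectionID: "capi-controller-leader-election",
})
if err != nil {
	os.Exit(1)
}
```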

@alvaroaleman
Member

That is bad. Can we try to find a way to reliably start the conversion webhooks before anything else?

@sbueringer
Member

sbueringer commented Oct 7, 2021

Yeah, it would be good to find a way. Two naive ideas (I don't really know enough about CR to assess whether they would lead to new issues):

  1. Do we have to call cm.startNonLeaderElectionRunnables() asynchronously in a separate goroutine? Why not just call it synchronously? (link) As far as I can see it shouldn't be necessary. If that is only safe for the webhook server, we could add a separate func which starts just the webhook server synchronously (rough sketch below).
  2. I wonder if starting the webhook has to be in the critical section, i.e. could we move the Lock()/defer Unlock() to l.591 (link)?
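
A rough sketch of what option 1 could look like, building on the paraphrased Start() above (startWebhookServer is a made-up helper name, not existing controller-runtime API):

```go
// Hypothetical sketch of idea 1: start the webhook server synchronously
// before anything else, so the conversion endpoints are already serving by
// the time the cache begins its initial list/watch.
func (cm *controllerManager) Start(ctx context.Context) error {
	// Block until the webhook server is listening (made-up helper).
	if err := cm.startWebhookServer(cm.internalCtx); err != nil {
		return err
	}

	// Only then start the remaining non-leader-election runnables and the
	// (leader-election) controllers, which may wait on the cache.
	go cm.startNonLeaderElectionRunnables()
	go cm.startLeaderElectionRunnables()

	// ... wait for stop signal / errors ...
	return nil
}
```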

@alvaroaleman
Member

  1. I think the main reason is performance
  2. Yeah, I was just going over that lock as well. It looks to me as if the lock actually protects two somewhat different things:
  • waitForCache checking whether the cache was already started and doing nothing in that case
  • Add() and the range over cm.nonLeaderElectionRunnables racing

We could probably add a new lock just for starting the cache and use an RW lock for the two runnable starter funcs and Add(), where only the latter needs the write lock (see the sketch below)? Caveat: I only looked for five minutes; I might be missing something. The lock seems to be used generally for anything that could race with anything else, which means there is a good chance we already have, or will add, locking issues if we don't use more granular locks and describe clearly what each lock protects.
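
To illustrate the direction (all names hypothetical; only meant to show the split between the two concerns, not a concrete proposal):

```go
package manager

import (
	"context"
	"sync"
)

// Runnable stands in for manager.Runnable here.
type Runnable interface {
	Start(ctx context.Context) error
}

type controllerManager struct {
	// runnablesMu protects the runnables slices: Add() takes the write lock,
	// the two starter funcs only need the read lock while ranging over them.
	runnablesMu sync.RWMutex

	// startCacheOnce guards starting/waiting for the cache, decoupled from the
	// runnables lock so the webhook server can never be blocked behind it.
	startCacheOnce sync.Once

	nonLeaderElectionRunnables []Runnable
	leaderElectionRunnables    []Runnable
}

func (cm *controllerManager) Add(r Runnable) {
	cm.runnablesMu.Lock() // writer: mutates the slices
	defer cm.runnablesMu.Unlock()
	cm.nonLeaderElectionRunnables = append(cm.nonLeaderElectionRunnables, r)
}

func (cm *controllerManager) waitForCache(ctx context.Context) {
	cm.startCacheOnce.Do(func() {
		// start the cache and block until it has synced ...
	})
}

func (cm *controllerManager) startNonLeaderElectionRunnables() {
	cm.runnablesMu.RLock() // reader: only ranges over the slice
	defer cm.runnablesMu.RUnlock()
	for range cm.nonLeaderElectionRunnables {
		// start each runnable ...
	}
}
```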

@fabriziopandini
Member

I have played around a little bit with @alvaroaleman's idea in #1689; I hope this can help.

@FillZpp
Contributor

FillZpp commented Oct 15, 2021

However, some webhooks need to get/list resources from the cache while handling requests. For example, a webhook that injects fields into, or validates, a CR may depend on another resource (e.g. a ConfigMap). So if we start webhooks before the cache has synced, some requests will fail with ErrCacheNotStarted on get or list (see the example below).

The ideal sequence might be conversion webhooks started -> cache synced -> validating/mutating webhooks started -> controllers started?
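
For example, something like this hypothetical mutating webhook (not real code from any project) would fail if the webhook server starts before the cache:

```go
package webhooks

import (
	"context"
	"net/http"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// podDefaulter is a hypothetical mutating webhook that reads a ConfigMap
// through the manager's client (which serves reads from the cache). If the
// webhook server is started before the cache, this Get returns
// ErrCacheNotStarted and the admission request fails.
type podDefaulter struct {
	client client.Client // typically mgr.GetClient()
}

func (d *podDefaulter) Handle(ctx context.Context, req admission.Request) admission.Response {
	var defaults corev1.ConfigMap
	key := types.NamespacedName{Namespace: "kruise-system", Name: "pod-defaults"} // illustrative
	if err := d.client.Get(ctx, key, &defaults); err != nil {
		return admission.Errored(http.StatusInternalServerError, err)
	}

	// ... use defaults.Data to mutate the object in req ...
	return admission.Allowed("")
}
```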

WDYT @fabriziopandini @alvaroaleman @vincepri

@randomvariable
Member Author

randomvariable commented Oct 18, 2021

That would probably need a bigger refactor to separate the conversion webhooks from the other ones and register the validators & mutators after the webhook server has started. The ErrCacheNotStarted condition should be transitory, right? In which case solving the deadlock is the most important thing right now.

@alvaroaleman
Member

I would also like to mention that webhooks (both conversion and validating/mutating) have different availability requirements than controllers, and I would strongly recommend running multiple replicas of the webhook servers at all times, whereas this isn't needed for controllers. While I am ok with improving the situation for the use case of a resource-constrained environment where everything runs in a single replica, this is by definition never going to be particularly resilient (kube waits six full minutes to reschedule workloads after a node becomes unavailable; that is six full minutes during which your API is entirely unavailable), and I would very much recommend not doing that.

@vincepri
Member

@FillZpp @randomvariable If you have a test environment, could you run some tests on top of #1695 ?

@randomvariable
Member Author

@FillZpp @randomvariable If you have a test environment, could you run some tests on top of #1695 ?

Ah, cool, I missed that PR. I'll give it a go later today.

@FillZpp
Contributor

FillZpp commented Oct 20, 2021

@FillZpp @randomvariable If you have a test environment, could you run some tests on top of #1695 ?

Sure. I have fetched #1695 into a branch in my fork repo and packaged a test tag. Then I upgraded the dependencies in openkruise/kruise, including k8s to 1.22 and controller-runtime to v0.10, and replaced the latter with the test tag.
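
Presumably this amounts to a go.mod replace directive along these lines (excerpt; the module paths are real, the versions and the fork tag below are placeholders, not the exact ones used):

```
module github.com/openkruise/kruise

require (
	k8s.io/apimachinery v0.22.0
	sigs.k8s.io/controller-runtime v0.10.2
)

// Point controller-runtime at the fork branch carrying PR #1695
// (placeholder version, not the real test tag).
replace sigs.k8s.io/controller-runtime => github.com/FillZpp/controller-runtime v0.0.0-00010101000000-000000000000
```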

Turns out all unit tests and e2e tests work fine: https://github.com/FillZpp/kruise/actions?query=branch%3Atest-controller-runtime-pr-1695

@sbueringer
Member

sbueringer commented Nov 10, 2021

Fix on main: #1695
Fix on release-0.10: #1690

Thx, everyone!

/close

@k8s-ci-robot
Contributor

@sbueringer: Closing this issue.

In response to this:

Fix on main: #1695
Fix on release-0.10: #1690

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
