
Webhooks are not guaranteed to start before cache sync is started #1685

Closed

randomvariable opened this issue Oct 7, 2021 · 13 comments

@randomvariable
Member

randomvariable commented Oct 7, 2021

While debugging kubernetes-sigs/cluster-api-provider-aws#2834 we spotted the following behaviour:

If resources are present in the API server/etcd at an older storage version AND --leader-election is set to false when starting controllers with a newer CRD version, then there is a high probability that the conversion webhooks will not start before the cache syncs begin. Because the conversion webhooks are not available and a mutex is held, this indefinitely delays the start of the webhook server. On some infrastructures the failure rate is closer to 90% (e.g. Amazon EC2 t3.large, 2 vCPU, 8 GB RAM).
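
For reference, a minimal sketch of the kind of manager setup involved (names and scheme setup are illustrative, not taken from the affected project):

```go
// Minimal sketch of a manager started with leader election disabled -- the
// configuration under which we see the failure. Scheme setup and webhook
// registration are elided; names are illustrative.
package main

import (
	"os"

	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	scheme := runtime.NewScheme()
	// Both the old (v1alpha3) and new (v1beta1) API types would be added here.

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Scheme:         scheme,
		LeaderElection: false, // i.e. --leader-election=false
	})
	if err != nil {
		os.Exit(1)
	}

	// Conversion/validating/mutating webhooks and reconcilers are registered here...

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```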

When leader election is enabled, the probability drops to around 20%, and when health checks are configured for the webhooks, enough restarts by the kubelet eventually lead to the conversion webhooks running early enough for the cache syncs to complete.

4e548fa was intended to ensure that webhooks start first, but this doesn't seem to have had the desired effect.

We believe the occurrence is lower when leader election is enabled because the election process delays the call to cm.startLeaderElectionRunnables() until after the election has completed.

Example logs are here: capi-logs.txt

Controller-runtime-version: v0.10.2
Affected consumers: Cluster API v1.0.0

cc @fabriziopandini @sbueringer

@sbueringer
Member

sbueringer commented Oct 7, 2021

A concrete example of how we end up with a deadlock:

  • we have a CRD with v1alpha3 and v1beta1 versions including a conversion webhook
  • the old controller uses the v1alpha3 version of the CRD
  • the new controller uses the v1beta1 version of the CRD
  • the controllers have leader election disabled

Now the following happens:

  • we delete the old controller (and wait until the Pod is gone)
  • we deploy the new controller

In the new controller:

  • webhooks and reconcilers are set up and mgr.Start() is called
  • The interesting part is l.479++:
    • a goroutine for cm.startNonLeaderElectionRunnables() is started
    • a goroutine for l.482++ is started; in our case (leader election is disabled) this runs cm.startLeaderElectionRunnables()
  • Both cm.startLeaderElectionRunnables() and cm.startNonLeaderElectionRunnables() try to cm.mu.Lock()
  • In almost all cases cm.startLeaderElectionRunnables() is faster (we had pods with >22 restarts and always hit the deadlock)
  • cm.startLeaderElectionRunnables() then tries to start the controllers, but calls cm.waitForCache(cm.internalCtx) first
  • cm.waitForCache(cm.internalCtx) fails with
    Failed to watch *v1beta1.Machine: failed to list *v1beta1.Machine: conversion webhook for cluster.x-k8s.io/v1alpha3,
    Kind=Machine failed: Post "https://capi-webhook-service.capi-system.svc:443/convert?timeout=30s": dial tcp
    10.96.138.92:443: connect: connection refused
    because the webhook is not started yet, and it never will be, because of the mutex

>> deadlock (see the simplified sketch below)
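
A paraphrased, simplified sketch of the startup path described above (not the actual controller-runtime source; the names follow the discussion, everything else is elided):

```go
package manager

import (
	"context"
	"sync"
)

// Paraphrased sketch of the v0.10 controllerManager startup -- an
// illustration of the race only, not the real implementation.
type controllerManager struct {
	mu          sync.Mutex
	internalCtx context.Context
	// runnables, cache, webhook server, ...
}

func (cm *controllerManager) Start(ctx context.Context) error {
	// l.479++: the webhook server and the other non-leader-election
	// runnables are started in their own goroutine.
	go cm.startNonLeaderElectionRunnables()

	// l.482++: with leader election disabled, this also runs immediately.
	go cm.startLeaderElectionRunnables()

	// ... wait for stop signal / errors ...
	return nil
}

func (cm *controllerManager) startLeaderElectionRunnables() {
	cm.mu.Lock() // almost always wins the race for the mutex
	defer cm.mu.Unlock()

	// Blocks until the cache has synced -- but syncing needs the conversion
	// webhook, which is started by startNonLeaderElectionRunnables, which is
	// itself stuck on cm.mu.Lock() => deadlock.
	cm.waitForCache(cm.internalCtx)

	// start the controllers ...
}

func (cm *controllerManager) startNonLeaderElectionRunnables() {
	cm.mu.Lock() // blocked for as long as startLeaderElectionRunnables holds the lock
	defer cm.mu.Unlock()

	// start the webhook server, then the remaining non-leader-election runnables ...
}

func (cm *controllerManager) waitForCache(ctx context.Context) {
	// starts the cache (if not already started) and blocks until it has synced ...
}
```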

(xref: log of a CAPI controller stuck in that deadlock: manager.log (notably absent: a "starting webhook server" log line), Job)

Enabling leader election is a valid workaround as it delays cm.startLeaderElectionRunnables() enough so that cm.startNonLeaderElectionRunnables() is able to lock the mutex first. As a result the webhooks are started and the cache can be filled.
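
In manager options terms, the workaround is just flipping leader election on (same setup as the sketch in the issue description; the election ID below is illustrative):

```go
// Workaround sketch: with leader election enabled, cm.startLeaderElectionRunnables()
// only runs after the election completes, so the webhook server wins the race
// for the mutex. The LeaderElectionID value is illustrative.
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme:           scheme,
	LeaderElection:   true,
	LeaderElectionID: "capi-controller-leader-election",
})
if err != nil {
	os.Exit(1)
}
```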

@alvaroaleman
Member

That is bad. Can we try to find a way to reliably start the conversion webhooks before anything else?

@sbueringer
Member

sbueringer commented Oct 7, 2021

Yeah, it would be good to find a way. Two naive ideas (I don't really know enough about CR to assess whether they would lead to new issues):

  1. Do we have to call cm.startNonLeaderElectionRunnables() asynchronously in a separate goroutine? Why not just call it synchronously? (link) As far as I can see it shouldn't be necessary. If that is only safe for the webhook server, we could add a separate func which starts just the webhook server synchronously (rough sketch below).
  2. I wonder if starting the webhook has to be in the critical section, i.e. could we move the Lock()/defer Unlock() to l.591 (link)?
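
A rough sketch of what option 1 could look like, building on the paraphrased Start() above (startWebhookServer is a made-up helper name, not existing controller-runtime API):

```go
// Hypothetical sketch of idea 1: start the webhook server synchronously
// before anything else, so the conversion endpoints are already serving by
// the time the cache begins its initial list/watch.
func (cm *controllerManager) Start(ctx context.Context) error {
	// Block until the webhook server is listening (made-up helper).
	if err := cm.startWebhookServer(cm.internalCtx); err != nil {
		return err
	}

	// Only then start the remaining non-leader-election runnables and the
	// (leader-election) controllers, which may wait on the cache.
	go cm.startNonLeaderElectionRunnables()
	go cm.startLeaderElectionRunnables()

	// ... wait for stop signal / errors ...
	return nil
}
```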

@alvaroaleman
Member

  1. I think the main reason is performance
  2. Yeah, I was just going over that lock as well. It looks to me as if the lock actually protects two somewhat different things:
  • waitForCache checking whether the cache was already started and doing nothing in that case
  • Add() and the range over cm.nonLeaderElectionRunnables racing

We could probably add a new lock just for starting the cache and use an RW lock for the two runnable starter funcs and Add(), where only the latter needs the write lock (see the sketch below)? Caveat: I only looked for five minutes; I might be missing something. The lock seems to be used generally for anything that could race with anything else, which means there is a good chance we already have, or will add, locking issues if we don't use more granular locks and describe clearly what each lock protects.
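
To illustrate the direction (all names hypothetical; only meant to show the split between the two concerns, not a concrete proposal):

```go
package manager

import (
	"context"
	"sync"
)

// Runnable stands in for manager.Runnable here.
type Runnable interface {
	Start(ctx context.Context) error
}

type controllerManager struct {
	// runnablesMu protects the runnables slices: Add() takes the write lock,
	// the two starter funcs only need the read lock while ranging over them.
	runnablesMu sync.RWMutex

	// startCacheOnce guards starting/waiting for the cache, decoupled from the
	// runnables lock so the webhook server can never be blocked behind it.
	startCacheOnce sync.Once

	nonLeaderElectionRunnables []Runnable
	leaderElectionRunnables    []Runnable
}

func (cm *controllerManager) Add(r Runnable) {
	cm.runnablesMu.Lock() // writer: mutates the slices
	defer cm.runnablesMu.Unlock()
	cm.nonLeaderElectionRunnables = append(cm.nonLeaderElectionRunnables, r)
}

func (cm *controllerManager) waitForCache(ctx context.Context) {
	cm.startCacheOnce.Do(func() {
		// start the cache and block until it has synced ...
	})
}

func (cm *controllerManager) startNonLeaderElectionRunnables() {
	cm.runnablesMu.RLock() // reader: only ranges over the slice
	defer cm.runnablesMu.RUnlock()
	for range cm.nonLeaderElectionRunnables {
		// start each runnable ...
	}
}
```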

@fabriziopandini
Member

I have played around a little bit with @alvaroaleman's idea in #1689; I hope this can help.

@FillZpp
Contributor

FillZpp commented Oct 15, 2021

However, some webhooks need to get/list resources from the cache while handling requests. For example, a webhook that injects fields into, or validates, a CR may depend on another resource (e.g. a ConfigMap). So if we start webhooks before the cache has synced, some requests will fail with ErrCacheNotStarted on get or list (see the example below).

The ideal sequence might be conversion webhooks started -> cache synced -> validating/mutating webhooks started -> controllers started?
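
For example, something like this hypothetical mutating webhook (not real code from any project) would fail if the webhook server starts before the cache:

```go
package webhooks

import (
	"context"
	"net/http"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// podDefaulter is a hypothetical mutating webhook that reads a ConfigMap
// through the manager's client (which serves reads from the cache). If the
// webhook server is started before the cache, this Get returns
// ErrCacheNotStarted and the admission request fails.
type podDefaulter struct {
	client client.Client // typically mgr.GetClient()
}

func (d *podDefaulter) Handle(ctx context.Context, req admission.Request) admission.Response {
	var defaults corev1.ConfigMap
	key := types.NamespacedName{Namespace: "kruise-system", Name: "pod-defaults"} // illustrative
	if err := d.client.Get(ctx, key, &defaults); err != nil {
		return admission.Errored(http.StatusInternalServerError, err)
	}

	// ... use defaults.Data to mutate the object in req ...
	return admission.Allowed("")
}
```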

WDYT @fabriziopandini @alvaroaleman @vincepri

@randomvariable
Member Author

randomvariable commented Oct 18, 2021

That would probably need a bigger refactor to separate the conversion webhooks from the other ones and register the validators & mutators after the webhook server has started. The ErrCacheNotStarted condition should be transitory, right? In which case solving the deadlock is the most important thing right now.

@alvaroaleman
Member

I would also like to mention that webhooks (both conversion and validating/mutating) have different availability requirements than controllers, and I would strongly recommend running multiple replicas of the webhook servers at all times, whereas this isn't needed for controllers. While I am ok with improving the situation for the use case of a resource-constrained environment where everything runs in a single replica, this is by definition never going to be particularly resilient (kube waits six full minutes to reschedule workloads after a node becomes unavailable; that is six full minutes during which your API is entirely unavailable), and I would very much recommend not doing that.

@vincepri
Member

@FillZpp @randomvariable If you have a test environment, could you run some tests on top of #1695 ?

@randomvariable
Member Author

@FillZpp @randomvariable If you have a test environment, could you run some tests on top of #1695 ?

Ah, cool, I missed that PR. I'll give it a go later today.

@FillZpp
Contributor

FillZpp commented Oct 20, 2021

@FillZpp @randomvariable If you have a test environment, could you run some tests on top of #1695 ?

Sure. I have fetched #1695 into a branch in my fork repo and packaged a test tag. Then I upgraded the dependencies in openkruise/kruise, including k8s to 1.22 and controller-runtime to v0.10, and replaced the latter with the test tag.
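
Presumably this amounts to a go.mod replace directive along these lines (excerpt; the module paths are real, the versions and the fork tag below are placeholders, not the exact ones used):

```
module github.com/openkruise/kruise

require (
	k8s.io/apimachinery v0.22.0
	sigs.k8s.io/controller-runtime v0.10.2
)

// Point controller-runtime at the fork branch carrying PR #1695
// (placeholder version, not the real test tag).
replace sigs.k8s.io/controller-runtime => github.com/FillZpp/controller-runtime v0.0.0-00010101000000-000000000000
```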

Turns out all unit tests and e2e tests work fine: https://github.com/FillZpp/kruise/actions?query=branch%3Atest-controller-runtime-pr-1695

@sbueringer
Member

sbueringer commented Nov 10, 2021

Fix on main: #1695
Fix on release-0.10: #1690

Thx, everyone!

/close

@k8s-ci-robot
Contributor

@sbueringer: Closing this issue.

In response to this:

Fix on main: #1695
Fix on release-0.10: #1690

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
