
🌱 use cluster level lock instead of global lock for cluster accessor initialization #6380

Closed
fgutmann wants to merge 2 commits from the lock-splitting branch

Conversation

fgutmann
Contributor

@fgutmann fgutmann commented Apr 5, 2022

What this PR does / why we need it:

Currently, initialization of a cluster accessor requires a global lock to be held. Initializing an accessor includes creating the dynamic REST mapper for the workload cluster and waiting for its caches to populate. Over a high-latency connection to a workload cluster this can take a significant amount of time, because tens of requests are sent to the API server to initialize the dynamic REST mapper and populate the caches. During this time, all reconciliation loops that require an accessor for any workload cluster are fully blocked, effectively blocking reconciliation of all clusters.

This PR allows accessors for multiple clusters to be initialized in parallel by splitting the global lock into one lock per cluster. The implemented locking mechanism ensures that:

  1. only one cluster accessor can exist for a particular cluster
  2. initialization of cluster accessors for different clusters does not block each other (see the keyed-mutex sketch below)
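
To make these guarantees concrete, here is a minimal sketch of a per-cluster keyed mutex in Go. It assumes a TryLock-style API; the type and method names are illustrative, not necessarily what this PR's keyedmutex.go implements.

```go
package mutex

import "sync"

// KeyedMutex serializes access per key: only one caller can hold the
// lock for a given key at a time, while callers using different keys
// proceed in parallel.
type KeyedMutex struct {
	mu     sync.Mutex
	locked map[string]struct{}
}

func NewKeyedMutex() *KeyedMutex {
	return &KeyedMutex{locked: map[string]struct{}{}}
}

// TryLock acquires the lock for key if it is free and reports whether
// it succeeded. Returning false instead of blocking lets a reconciler
// requeue rather than stall while another worker initializes the same
// cluster's accessor.
func (k *KeyedMutex) TryLock(key string) bool {
	k.mu.Lock()
	defer k.mu.Unlock()
	if _, held := k.locked[key]; held {
		return false
	}
	k.locked[key] = struct{}{}
	return true
}

// Unlock releases the lock for key.
func (k *KeyedMutex) Unlock(key string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	delete(k.locked, key)
}
```

Guarantee 1 follows from the map entry being unique per key; guarantee 2 holds because the inner mutex is only held for the brief map update, never across accessor initialization.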

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 5, 2022
@linux-foundation-easycla

linux-foundation-easycla bot commented Apr 5, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Apr 5, 2022
@k8s-ci-robot
Contributor

Welcome @fgutmann!

It looks like this is your first PR to kubernetes-sigs/cluster-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 5, 2022
@k8s-ci-robot
Contributor

Hi @fgutmann. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign neolit123 after the PR has been reviewed.
You can assign the PR to them by writing /assign @neolit123 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 5, 2022
@sbueringer
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 5, 2022
@fgutmann fgutmann marked this pull request as ready for review April 6, 2022 01:12
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 6, 2022
@fabriziopandini
Member

this PR includes a separate commit to update docker/distribution from v2.8.0 to v2.8.1. This minor version upgrade was required to make the project build in my environment

Would it be a problem to move this commit to a separate PR, so we can get this merged ASAP / without waiting for discussion on the other changes?

controllers/remote/cluster_cache.go (outdated review thread, resolved)
util/sync/mutex/keyedmutex.go (outdated review thread, resolved)
util/sync/mutex/keyedmutex.go (outdated review thread, resolved)
util/sync/mutex/keyedmutex.go (outdated review thread, resolved)
controllers/remote/cluster_cache.go (outdated review thread, resolved)
go.mod (outdated review thread, resolved)
@fgutmann
Contributor Author

Thank you @fabriziopandini and @sbueringer very much for taking a look at this PR and providing feedback!

I will address the suggested changes next week. I'm currently on vacation and don't have access to my regular work environment.

@sbueringer
Member

sbueringer commented Apr 27, 2022

> Thank you @fabriziopandini and @sbueringer very much for taking a look at this PR and providing feedback!
>
> I will address the suggested changes next week. I'm currently on vacation and don't have access to my regular work environment.

Sure, no rush! Enjoy your vacation

@vincepri
Member

@fgutmann Thanks for the PR! I looked through the code, and the keyed lock is definitely a good improvement over the existing global mutex. Have we also thought about adding some sort of timeout for getting the cluster accessor when populating the caches? If there is no timeout today, we can still run into the same issue regardless of key-level locks, given that there is always a fixed number of workers that can reconcile requests.

@fgutmann
Contributor Author

@vincepri The dynamic REST mapper used by the discovery client gets a timeout from the rest config, which is 10 seconds per request. The discovery phase thus has a sensible timeout.

However, the cache.WaitForCacheSync(cacheCtx) call currently uses an unbounded context and can get stuck forever. What is a sensible timeout for initially syncing the caches? If my understanding is correct, initially only Nodes are synced (coming from the remote.DefaultIndexes list). Maybe 5 minutes? That should be on the safe side even for clusters with lots of nodes.
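
For illustration, here is a minimal sketch of what bounding the initial cache sync could look like. waitForCacheSync, initialCacheSyncTimeout, and the surrounding package are assumed names for this sketch; cache.Cache's WaitForCacheSync(ctx) bool is controller-runtime's actual API.

```go
package remote

import (
	"context"
	"fmt"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// initialCacheSyncTimeout bounds the initial sync so a slow or
// unreachable workload cluster cannot block accessor initialization forever.
const initialCacheSyncTimeout = 5 * time.Minute

func waitForCacheSync(ctx context.Context, c cache.Cache, cluster client.ObjectKey) error {
	syncCtx, cancel := context.WithTimeout(ctx, initialCacheSyncTimeout)
	defer cancel()
	// WaitForCacheSync returns false if the context is cancelled
	// before all informers have synced.
	if !c.WaitForCacheSync(syncCtx) {
		return fmt.Errorf("failed to sync caches for cluster %s within %s", cluster, initialCacheSyncTimeout)
	}
	return nil
}
```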

@fgutmann fgutmann force-pushed the lock-splitting branch 2 times, most recently from 14c7cc2 to 17cbd0d Compare May 26, 2022 00:13
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 26, 2022
Before this commit, workload cluster client initialization required
a global lock to be held. If initialization of a single workload cluster
client took time, all other reconcile loops that require a workload
cluster connection were blocked until initialization finished.
Initialization of a workload cluster client can take a significant
amount of time, because it requires initializing the discovery client,
which sends multiple requests to the API server.

With this change, initialization of a workload cluster client only
requires holding a lock for the specific cluster. This means
reconciliation for other clusters is not affected by a long-running
workload cluster client initialization.
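
As a hedged sketch of how per-cluster locking could wrap the slow path: all names here (getClusterAccessor, loadAccessor, newClusterAccessor, the field layout) are illustrative guesses based on the diff excerpts in this review, and KeyedMutex refers to the sketch shown earlier, assumed colocated in this package for brevity.

```go
package remote

import (
	"context"
	"errors"
	"sync"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

type clusterAccessor struct {
	// remote client, cache, and watch bookkeeping would live here
}

type ClusterCacheTracker struct {
	accessorsLock sync.RWMutex
	accessors     map[client.ObjectKey]*clusterAccessor
	clusterLock   *KeyedMutex // per-cluster lock, as sketched earlier
}

func (t *ClusterCacheTracker) getClusterAccessor(ctx context.Context, cluster client.ObjectKey) (*clusterAccessor, error) {
	// Fast path: the accessor already exists.
	if a := t.loadAccessor(cluster); a != nil {
		return a, nil
	}

	// Lock only this cluster's key, so slow initialization of one
	// cluster's accessor does not block reconciliation of other clusters.
	if !t.clusterLock.TryLock(cluster.String()) {
		return nil, errors.New("cluster accessor is being initialized by another worker, requeue")
	}
	defer t.clusterLock.Unlock(cluster.String())

	// Re-check after acquiring the lock: another worker may have finished
	// initialization between the fast-path check and TryLock succeeding.
	if a := t.loadAccessor(cluster); a != nil {
		return a, nil
	}

	// Slow path: discovery requests and the initial cache sync happen here.
	a, err := t.newClusterAccessor(ctx, cluster)
	if err != nil {
		return nil, err
	}
	t.accessorsLock.Lock()
	t.accessors[cluster] = a
	t.accessorsLock.Unlock()
	return a, nil
}

func (t *ClusterCacheTracker) loadAccessor(cluster client.ObjectKey) *clusterAccessor {
	t.accessorsLock.RLock()
	defer t.accessorsLock.RUnlock()
	return t.accessors[cluster]
}

func (t *ClusterCacheTracker) newClusterAccessor(ctx context.Context, cluster client.ObjectKey) (*clusterAccessor, error) {
	// Placeholder for the expensive work: building the REST mapper and
	// waiting for the remote cache to sync (see the timeout discussion above).
	return &clusterAccessor{}, nil
}
```

The re-check after TryLock is the standard double-checked pattern: it covers the window between the fast-path read and acquiring the key lock.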
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 26, 2022
@fgutmann
Contributor Author

Pushed an updated version with the changes discussed above. It also now contains a timeout of 5 minutes for initially synchronizing the cache.

Comment on lines +21 to +23
// keyedMutex is a mutex locking on the key provided to the Lock function.
// Only one caller can hold the lock for a specific key at a time.
type keyedMutex struct {
Member

Was this copied from somewhere else? Can we reuse a library?

Contributor Author

No, this was written for this PR. I did some research but didn't find any library that provides this functionality.

controllers/remote/cluster_cache.go (outdated review thread, resolved)
a, err := t.newClusterAccessor(ctx, cluster, indexes...)
if err != nil {
log.V(4).Info("error creating new cluster accessor")
Member

Remove this? If it's an error, it shouldn't be an info?

controllers/remote/cluster_cache.go (outdated review thread, resolved)
@sbueringer
Member

@fgutmann Do you have time to address the findings?

Co-authored-by: Vince Prignano <vince@vincepri.com>
@fgutmann
Contributor Author

fgutmann commented Jul 5, 2022

Updated the log messages and replied to the comments by @vincepri.

@k8s-ci-robot
Contributor

@fgutmann: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 27, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 25, 2022
@fgutmann
Contributor Author

Superseded by #6380

@fgutmann fgutmann closed this Nov 14, 2022
@fgutmann fgutmann deleted the lock-splitting branch November 14, 2022 19:23
@fabriziopandini
Member

@fgutmann thanks for this PR, it was really valuable to get #6380 merged.
Sorry again about this PR falling off the radar...
