Merge pull request #58 from kube-rs/leader-election-stub
Scaling and availability docs with leader election advice
clux committed Apr 24, 2024
2 parents 18d9d40 + e3e5e6a commit 8628d19
Showing 4 changed files with 174 additions and 5 deletions.
88 changes: 88 additions & 0 deletions docs/controllers/availability.md
@@ -0,0 +1,88 @@
# Availability

This chapter is about strategies for improving controller availability and tail latencies.

## Motivation

Despite the availability goals commonly set for application deployments, most `kube` controllers:

- can run in a single replica (default recommendation)
- can handle being killed, and be shifted to another node
- can handle minor downtime

This is due to a few properties:

- Controllers are queue consumers that do not require 100% uptime to meet a 100% SLO
- Rust images are often very small and will reschedule quickly
- watch streams re-initialise quickly with the current state on boot
- [[reconciler#idempotency]] means multiple repeat reconciliations are not problematic
- parallel execution mode of reconciliations makes restarts fast

Combined, these properties create a low-overhead system that normally catches up quickly after being rescheduled and offers the traditional Kubernetes __eventual consistency__ guarantee.

That said, this setup can struggle under strong consistency requirements. Ask yourself:

- How quickly do you expect your reconciler to **respond** to changes on average?
- Is a `30s` P95 downtime from reschedules acceptable?

## Responsiveness

If you want to improve __average responsiveness__, then traditional [[scaling]] and [[optimization]] strategies can help:

- Configure controller concurrency to avoid waiting for a reconciler slot
- Optimize the reconciler, avoid duplicated work
- Satisfy CPU requirements to avoid cgroup throttling
- Ensure your [[relations]] are set up correctly to avoid waiting for the next [requeue](https://docs.rs/kube/latest/kube/runtime/controller/struct.Action.html)

You can plot heatmaps of reconciliation times in Grafana using the standard [[observability#What Metrics]].

<!--TODO: can we measure time from watch event seen to watch event received by reconciler?-->

## High Availability

Scaling a controller beyond one replica for HA is different from scaling a regular load-balanced application that receives traffic.

A controller is effectively a consumer of Kubernetes watch events, and these are themselves unsynchronised event streams whose watchers are unaware of each other. Adding another pod - without some form of external locking - will result in duplicated work.

To avoid this, most controllers lean into the eventual consistency model and run with a single replica, accepting higher tail latencies due to reschedules. However, once the performance demands are strong enough, these pod reschedules will dominate the tail of your latency metrics, making scaling necessary.

!!! warning "Scaling Replicas"

It is not recommended to set `replicas: 2` for an [[application]] running a normal `Controller` without leaders/shards, as this will cause both controller pods to reconcile the same objects, creating duplicate work and potential race conditions.

To safely operate with more than one pod, you must have __leadership of your domain__ and wait for such leadership to be __acquired__ before commencing. This is the concept of leader election.

## Leader Election

Leader election provides exclusive control over resources managed in Kubernetes by using [Leases](https://kubernetes.io/docs/concepts/architecture/leases/) as a distributed locking mechanism.

The common solution to these downtime-related problems is the `leader-with-lease` pattern: keep another controller replica in "standby mode", ready to take over immediately without stepping on the toes of the active controller pod. We can do this by creating a `Lease`, and gating on the validity of the lease before doing the real work in the reconciler.
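
A minimal sketch of the gating half is shown below. It assumes the lease acquisition and renewal is handled by a background task (e.g. one of the third-party crates listed further down) that flips a shared `is_leader` flag; the `Ctx` type, the flag, and the requeue intervals are illustrative, not a prescribed API.

```rust
use std::{
    sync::{
        atomic::{AtomicBool, Ordering},
        Arc,
    },
    time::Duration,
};
use k8s_openapi::api::core::v1::ConfigMap;
use kube::runtime::controller::Action;

struct Ctx {
    /// Flipped by a background lease acquisition/renewal task.
    is_leader: Arc<AtomicBool>,
}

async fn reconcile(_obj: Arc<ConfigMap>, ctx: Arc<Ctx>) -> Result<Action, kube::Error> {
    if !ctx.is_leader.load(Ordering::Relaxed) {
        // Not the leader: do no work, but check back shortly in case we acquire the lease.
        return Ok(Action::requeue(Duration::from_secs(15)));
    }
    // ...the real reconciliation work goes here...
    Ok(Action::requeue(Duration::from_secs(300)))
}
```

Alternatively, waiting for leadership before even starting the `Controller` achieves the same goal and avoids the no-op reconciliations entirely.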

!!! note "Unsynchronised Rollout Surges"

A single-replica controller deployment without leader election can create short periods of duplicate work and racy writes during rollouts because of how [rolling updates surge](https://docs.rs/k8s-openapi/latest/k8s_openapi/api/apps/v1/struct.RollingUpdateDeployment.html) by default.

The natural expiration of `leases` means that you are required to update them periodically while your main pod (the leader) is active. When your pod is about to be replaced, you can initiate a step down (and expire the lease), ideally on `SIGTERM` after [draining your active work queue](https://docs.rs/kube/latest/kube/runtime/struct.Controller.html#method.shutdown_on_signal). If your pod crashes, then a replacement pod must wait for the scheduled lease expiry.
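
A minimal sketch of wiring this up with [`shutdown_on_signal`](https://docs.rs/kube/latest/kube/runtime/struct.Controller.html#method.shutdown_on_signal) might look as follows. The lease release at the end is left as a comment because it depends on the lease crate you pick, and `ConfigMap` stands in for your real resource type:

```rust
use std::{sync::Arc, time::Duration};

use futures::StreamExt;
use k8s_openapi::api::core::v1::ConfigMap;
use kube::{
    runtime::{controller::Action, watcher, Controller},
    Api, Client,
};

struct Ctx; // shared state (clients, lease handle, ...) would live here

async fn reconcile(_obj: Arc<ConfigMap>, _ctx: Arc<Ctx>) -> Result<Action, kube::Error> {
    Ok(Action::requeue(Duration::from_secs(300)))
}

fn error_policy(_obj: Arc<ConfigMap>, _err: &kube::Error, _ctx: Arc<Ctx>) -> Action {
    Action::requeue(Duration::from_secs(5))
}

#[tokio::main]
async fn main() -> Result<(), kube::Error> {
    let client = Client::try_default().await?;
    let cms: Api<ConfigMap> = Api::default_namespaced(client);

    Controller::new(cms, watcher::Config::default())
        // stop accepting new work and drain in-flight reconciliations on SIGTERM/SIGINT
        .shutdown_on_signal()
        .run(reconcile, error_policy, Arc::new(Ctx))
        .for_each(|_| futures::future::ready(()))
        .await;

    // The queue is now drained; step down / release the Lease here (crate-specific)
    // so that a standby replica can take over immediately.
    Ok(())
}
```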

### Third Party Crates

At the moment, leader election is not supported by `kube` itself and requires third-party crates (see [kube#485](https://github.com/kube-rs/kube/issues/485#issuecomment-1837386565)). A brief list of popular crates:

- [`kube-leader-election`](https://crates.io/crates/kube-leader-election/) via [hendrikmaus](https://github.com/hendrikmaus/kube-leader-election) ([examples](https://github.com/hendrikmaus/kube-leader-election/tree/master/examples) / [docs](https://docs.rs/kube-leader-election/) / [disclaimer](https://github.com/hendrikmaus/kube-leader-election?tab=readme-ov-file#kubernetes-lease-locking))
- [`kube-coordinate`](https://crates.io/crates/kube-coordinate) via [thedodd](https://github.com/thedodd/kube-coordinate) ([docs](https://docs.rs/kube-coordinate/))
- [`kubert`](https://crates.io/crates/kubert) -> [`kubert::lease`](https://docs.rs/kubert/latest/kubert/lease/index.html) via [olix0r](https://github.com/olix0r/kubert) ([example](https://github.com/olix0r/kubert/blob/main/examples/lease.rs) / [linkerd use](https://github.com/linkerd/linkerd2/blob/1f4f4d417c6d06c3bd5a372fc75064f967117886/policy-controller/src/main.rs))

<!-- OTHER ALTERNATIVES???
Know other alternatives? Feel free to raise a PR here with a new list entry.
-->

### Elected Shards

Leader election can, in theory, be used on top of explicit [[scaling#sharding]] to ensure at most one replica manages each shard, by using one lease per shard. This could reduce the number of excess replicas standing by in a sharded scenario.

<!-- Have examples?
Feel free to raise a PR here with information.
-->

--8<-- "includes/abbreviations.md"
--8<-- "includes/links.md"
72 changes: 72 additions & 0 deletions docs/controllers/scaling.md
@@ -0,0 +1,72 @@
# Scaling

This chapter is about strategies for scaling controllers and the tradeoffs these strategies make.

## Motivating Questions

- Why is the reconciler lagging? Are there too many resources being reconciled?
- What happens when your controller manages resource sets so large that they significantly impact your CPU or memory use?

Scaling an efficient Rust application that spends most of its time waiting for network changes might not seem like a complicated affair, and indeed, you can scale a controller in many ways and achieve good outcomes. But in terms of costs, not all solutions are created equal; are you improving your algorithm, or are you throwing more expensive machines at the problem to cover up inefficiencies?

## Scaling Strategies

We recommend trying the following scaling strategies in order:

1. [[#Controller Optimizations]] (minimize expensive work to allow more work)
2. [[#Vertical Scaling]] (more headroom for the single pod)
3. [[#Sharding]] (horizontal scaling)

In other words, try to improve your algorithm first, and once you've reached a reasonable limit of what you can achieve with that approach, allocate more resources to the problem.

### Controller Optimizations

Ensure you look at common controller [[optimization]] to get the most out of your resources:

* minimize network intensive operations
* avoid caching large manifests unnecessarily, and prune unneeded data
* cache/memoize expensive work
* checkpoint progress on `.status` objects to avoid repeating work

When checkpointing, care should be taken to not accidentally break [[reconciler#idempotency]].
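
As an illustration, a reconciler can checkpoint the last fully processed `metadata.generation` on its status and skip the expensive part when nothing has changed. The `Document` CRD, its fields, and the requeue intervals below are hypothetical:

```rust
use std::{sync::Arc, time::Duration};

use kube::{
    api::{Api, Patch, PatchParams},
    runtime::controller::Action,
    Client, CustomResource, ResourceExt,
};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

/// A hypothetical CRD used only to illustrate status checkpointing.
#[derive(CustomResource, Clone, Debug, Deserialize, Serialize, JsonSchema)]
#[kube(group = "example.com", version = "v1", kind = "Document", namespaced, status = "DocumentStatus")]
struct DocumentSpec {
    content: String,
}

#[derive(Clone, Debug, Default, Deserialize, Serialize, JsonSchema)]
#[serde(rename_all = "camelCase")]
struct DocumentStatus {
    /// The `metadata.generation` we last completed the expensive work for.
    observed_generation: Option<i64>,
}

struct Ctx {
    client: Client,
}

async fn reconcile(doc: Arc<Document>, ctx: Arc<Ctx>) -> Result<Action, kube::Error> {
    let generation = doc.metadata.generation;
    let observed = doc.status.as_ref().and_then(|s| s.observed_generation);
    if observed == generation {
        // Nothing changed since the last completed run; skip the expensive work.
        return Ok(Action::requeue(Duration::from_secs(3600)));
    }

    // ...expensive work goes here (it must still be idempotent if re-run)...

    // Checkpoint progress on the status subresource.
    let ns = doc.namespace().unwrap_or_else(|| "default".into());
    let api: Api<Document> = Api::namespaced(ctx.client.clone(), &ns);
    let patch = serde_json::json!({ "status": { "observedGeneration": generation } });
    api.patch_status(&doc.name_any(), &PatchParams::default(), &Patch::Merge(&patch))
        .await?;
    Ok(Action::requeue(Duration::from_secs(3600)))
}
```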

### Vertical Scaling

* increase CPU/memory limits
* configure controller concurrency (as a multiple of CPU limits)

The [controller::Config] currently[**](https://github.com/kube-rs/kube/issues/1473) defaults to __unlimited concurrency__ and may need tuning for large workloads.

It is __possible__ to compute an optimal `concurrency` number based on the CPU `resources` you assign to your container, but this would require specific measurement against your workload.
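
As a sketch (assuming a `kube` version where [controller::Config] and `Controller::with_config` are available), bounding concurrency looks roughly like this; the value `4` and the `ConfigMap` resource are placeholders to be replaced with your own measurements and types:

```rust
use std::{sync::Arc, time::Duration};

use futures::StreamExt;
use k8s_openapi::api::core::v1::ConfigMap;
use kube::{
    runtime::{controller::{self, Action}, watcher, Controller},
    Api, Client,
};

struct Ctx;

async fn reconcile(_obj: Arc<ConfigMap>, _ctx: Arc<Ctx>) -> Result<Action, kube::Error> {
    Ok(Action::requeue(Duration::from_secs(300)))
}

fn error_policy(_obj: Arc<ConfigMap>, _err: &kube::Error, _ctx: Arc<Ctx>) -> Action {
    Action::requeue(Duration::from_secs(5))
}

#[tokio::main]
async fn main() -> Result<(), kube::Error> {
    let client = Client::try_default().await?;
    let cms: Api<ConfigMap> = Api::default_namespaced(client);

    Controller::new(cms, watcher::Config::default())
        // cap concurrent reconciliations; tune against CPU limits and measured durations
        .with_config(controller::Config::default().concurrency(4))
        .run(reconcile, error_policy, Arc::new(Ctx))
        .for_each(|_| futures::future::ready(()))
        .await;
    Ok(())
}
```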

!!! note "Agressiveness meets fairness"

A highly parallel reconciler might eventually be throttled by [apiserver flow-control rules](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/), and this can clearly degrade your controller's performance. Measurements, calculations, and [[observability]] (particularly for error rates) are useful for identifying such scenarios.

### Sharding

If you are unable to meet latency/resource requirements using techniques above, you may need to consider **partitioning/sharding** your resources.

Sharding means splitting your workload into mutually exclusive groups, each handled exclusively by one controller instance. In Kubernetes, shards are commonly seen as a side-effect of certain deployment strategies:

* sidecars :: pods are shards
* daemonsets :: nodes are shards

!!! note "Sidecars and Daemonsets"

Several big agents use daemonsets and sidecars in situations that require higher-than-average performance; this pattern is commonly found in network components, service meshes, and sometimes observability collectors that benefit from co-location with a resource. This choice creates a very broad and responsive sharding strategy, but one that incurs a larger overhead by using more containers than is technically necessary.

Sharding can also be done in a more explicit way:

* 1 controller deployment per namespace (naive sharding)
* 1 controller deployment per labelled shard (precise, but requires labelling work)

Explicitly labelled shards are less common, but they are a powerful option. They are used by [fluxcd](https://fluxcd.io/) via the [sharding.fluxcd.io/key label](https://fluxcd.io/flux/installation/configuration/sharding/) to associate a resource with a shard. Flux's Stefan talks about [scaling flux controllers at KubeCon 2024](https://www.youtube.com/watch?v=JFLNFJT59DY).
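
As a sketch, an explicitly sharded controller can narrow its watcher with a label selector so each deployment only sees its own shard. The `example.com/shard-key` label is made up for illustration (analogous to flux's `sharding.fluxcd.io/key`), and `ConfigMap` stands in for your real resource:

```rust
use k8s_openapi::api::core::v1::ConfigMap;
use kube::{
    runtime::{watcher, Controller},
    Api, Client,
};

/// Build a Controller that only watches objects labelled for this shard.
fn sharded_controller(client: Client, shard: &str) -> Controller<ConfigMap> {
    let cms: Api<ConfigMap> = Api::all(client);
    let selector = format!("example.com/shard-key={shard}");
    // Each deployment is started with its own shard key, e.g. via an env var
    // set in the per-shard Deployment manifest.
    Controller::new(cms, watcher::Config::default().labels(&selector))
    // ...then call .run(reconcile, error_policy, ctx) on the result as usual
}
```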

!!! note "Automatic Labelling"

A mutating admission policy can help automatically assign/label partitions cluster-wide based on constraints and rebalancing needs.

In cases where HA is required, a lease can be used to gate access to a particular shard. See [[availability#Leader Election]].

--8<-- "includes/abbreviations.md"
--8<-- "includes/links.md"
7 changes: 7 additions & 0 deletions includes/abbreviations.md
@@ -9,3 +9,10 @@
*[SBOM]: Software Bill of Materials
*[LTS]: Long Term Support
*[CEL]: Common Expression Language
*[HA]: High Availability
*[PVC]: Persistent Volume Claim
*[RPS]: Requests Per Second
*[SLI]: Service Level Indicator
*[SLA]: Service Level Agreement
*[SLO]: Service Level Objective
*[P95]: 95th Percentile
12 changes: 7 additions & 5 deletions mkdocs.yml
@@ -73,15 +73,17 @@ nav:
- controllers/schemas.md
Advanced:
- controllers/testing.md
- controllers/manifests.md
- controllers/observability.md
- controllers/optimization.md
- controllers/admission.md
- controllers/security.md
- controllers/streams.md
- controllers/generics.md
- controllers/internals.md

Operational:
- controllers/observability.md
- controllers/optimization.md
- controllers/manifests.md
- controllers/security.md
- controllers/scaling.md
- controllers/availability.md
#- comunity.md
#- proposal-process.md
#- mentorship.md
