Kubernetes

From a network architecture perspective, Sidekiq runs as a cluster of worker processes "around" a Redis server. This page provides information on how to operate a Sidekiq cluster using Kubernetes.

Basic Setup

Running and connecting to Redis...

Safe Shutdown

How to configure k8s to get a clean shutdown with TSTP and TERM...

During the termination lifecycle, k8s will send a SIGTERM to the main process (PID 1) in each of the pod's containers:

In practice, this means your application needs to handle the SIGTERM message and begin shutting down when it receives it. This means saving all data that needs to be saved, closing down network connections, finishing any work that is left, and other similar tasks.

Sidekiq uses the TERM signal to start shutting down: it stops fetching new jobs and gives in-progress jobs up to the timeout (the -t option) seconds to finish. (The TSTP signal only quiets Sidekiq, telling it to stop fetching new work without shutting down.)

k8s also has its own concept of a grace period, indicated by the terminationGracePeriodSeconds in the pod's configuration. k8s will wait this amount of time before sending a SIGKILL to the container, forcefully terminating all processes.

It's recommended to set the Sidekiq timeout to a value less than the k8s terminationGracePeriodSeconds. With the default Sidekiq timeout of 25 seconds, the default k8s terminationGracePeriodSeconds of 30 seconds works well.
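
For example, a minimal pod spec pairing the two defaults might look like this (the container name is illustrative):

spec:
  terminationGracePeriodSeconds: 30  # k8s default; keep it above Sidekiq's timeout
  containers:
    - name: sidekiq
      command: ["bundle", "exec", "sidekiq"]
      args: ["-t", "25"]  # Sidekiq's default shutdown timeout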

ATTENTION

Do not run sidekiq with a shell command like this: command: ["/bin/sh","-c", "bundle exec sidekiq -t 100 -e production --config config/sidekiq_custom.yml"]

If you use this approach, the TERM signal is delivered only to the shell process, not to sidekiq itself, so sidekiq never shuts down cleanly; it is only killed by the SIGKILL sent after terminationGracePeriodSeconds (default 30 seconds). If you have a longer terminationGracePeriodSeconds, sidekiq will keep running until it gets SIGKILL-ed!

Use the following approach instead:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sidekiq-deployment
spec:
  ...
  template:
    ...
    spec:
      terminationGracePeriodSeconds: 120 # wait a bit longer to shut it down
      containers:
        - name: sidekiq
          ...
          command: ["bundle", "exec", "sidekiq"]
          args: ["-t", "100", "-e", "production", "-C", "config/custom_sidekiq_config.yml"]
      ...

Because the exec form makes sidekiq the container's main process (PID 1), it receives SIGTERM directly. With this setup, you also don't need the often-mentioned preStop hook:

# NOT NEEDED!
lifecycle:
  preStop:
    exec:
      # SIGTERM triggers a quick exit; quiet Sidekiq gracefully with TSTP instead
      command:
        - /bin/sh
        - -c
        - kill -TSTP $(ps aux | grep sidekiq | grep busy | awk '{print $2}')

Health Checks

Each time k8s brings up a new pod, it needs to assert the pod is "healthy" before considering the action complete. "Healthy" for something serving web requests might mean that the pod responds with 200 OK to a health check endpoint.

Sidekiq Enterprise

Sidekiq Enterprise 7.1.2 added support for a Kubernetes health check HTTP endpoint. Enable it like this:

Sidekiq.configure_server do |config|
  config.health_check(7433)                # pass a port Integer...
  # config.health_check("127.0.0.1:7433")  # ...or an interface String
end

Or in config/sidekiq.yml:

---
health_check: "127.0.0.1:7433"

And then you can do this:

$ curl -v http://localhost:7433/
HTTP/1.0 200 OK
Server: Sidekiq::HttpServer (Ruby 3.2.0)
Content-type: application/json
Content-length: 366
Connection: close
Date: Mon, 11 Sep 2023 17:41:46 GMT

{"quiet":"false","info":{"hostname":"Mikes-MacBook-Pro.local","started_at":1694453901.020213,"pid":36477,"tag":"myapp","concurrency":5,"queues":["default"],"weights":[{"default":0}],"labels":["reliable"],"identity":"Mikes-MacBook-Pro.local:36477:8520a2e76d09","version":"7.1.3","embedded":false},"rss":"127808","beat":"1694453921.079236","busy":"0","rtt_us":"259"}
$

The health check service does not support TLS; it is meant to be used within a private network only and not exposed publicly.
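
To wire this into a pod, a probe along these lines should work. This sketch assumes the health check is bound to an interface the kubelet can reach (e.g. 0.0.0.0:7433 rather than 127.0.0.1, since httpGet probes connect to the pod's IP):

livenessProbe:
  httpGet:
    path: /
    port: 7433
  initialDelaySeconds: 10
  periodSeconds: 10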

Sidekiq

For Sidekiq processes, we need a different approach. Instead of using a web request, you can use a combination of Sidekiq lifecycle hooks and a file-based readinessProbe.

Keep in mind that Sidekiq will begin processing jobs from its Redis instance as soon as it is able to. It will not wait for a signal from k8s or another system to enable it to start processing jobs. Contrast this with k8s pods serving web traffic, where k8s might not route requests to them until a readiness probe has passed.

By writing to a file after the Sidekiq process starts, we're able to signal to k8s "Hey, we're healthy." Here's how that looks:

# config/initializers/sidekiq.rb
SIDEKIQ_WILL_PROCESS_JOBS_FILE = Rails.root.join('tmp/sidekiq_process_has_started_and_will_begin_processing_jobs').freeze

Sidekiq.configure_server do |config|
  # We touch and destroy files in the Sidekiq lifecycle to provide a
  # signal to Kubernetes that we are ready to process jobs or not.
  #
  # Doing this gives us a better sense of when a process is actually
  # alive and healthy, rather than just beginning the boot process.
  config.on(:startup) do
    FileUtils.touch(SIDEKIQ_WILL_PROCESS_JOBS_FILE)
  end

  config.on(:shutdown) do
    FileUtils.rm_f(SIDEKIQ_WILL_PROCESS_JOBS_FILE)
  end
end

And then later in your pod's YAML, assuming your Rails application is deployed to /var/www within the pod and takes about 10 seconds to start:

readinessProbe:
  failureThreshold: 10
  exec:
    command:
    - cat
    - /var/www/tmp/sidekiq_process_has_started_and_will_begin_processing_jobs
  initialDelaySeconds: 10
  periodSeconds: 2
  successThreshold: 2
  timeoutSeconds: 1
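
Since cat exits non-zero when the file is missing, the probe fails until the :startup hook has created the file and begins failing again once the :shutdown hook removes it.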

Autoscaling

One of the primitives k8s provides is the Horizontal Pod Autoscaler (HPA).

This is a mechanism to add capacity to a set of pods "horizontally", i.e. by running more pods. Things can scale up and down based on "observed CPU utilization (or, with custom metrics support, on some other application-provided metrics)".

Sidekiq's one-job-per-thread model, along with Ruby's Global VM Lock (GVL), assumes that Sidekiq workloads are not CPU-bound, so an observed CPU utilization metric is inappropriate for autoscaling.

Autoscaling adds a new dimension of complexity to an application, and should not be deployed lightly. Sidekiq will often be easier and more predictable to operate without autoscaling. Because Sidekiq processes jobs off of a queue, it's usually better to keep a fixed capacity around to ensure consistent progress through the queue.

If you do decide to implement autoscaling, there are several things to keep in mind:

  • Implementing a HorizontalPodAutoscaler is likely best done per queue, based on that queue's latency. When the latency begins to spike, you may choose to scale up that queue's k8s deployment. This assumes each queue has latency thresholds for "okay" and "we need to scale up or we will miss our latency target."
  • Scaling up or down within k8s is not an instant operation. There will be latency between k8s scheduling new pods and the Sidekiq processes starting. Sometimes a scale-up operation will finish only after the queue has already been empty for a few seconds (or minutes).
  • Sidekiq at high levels of concurrency can be dangerous to other systems, like your database or third parties. During a scale-up event, you may start processing jobs faster at the expense of the health of your database. Be careful with write-heavy workloads and more than ~10 k8s pods consuming from a queue.

Your HPA will need to use an "external" metric. The configuration of these metrics will vary from setup to setup, but you'll likely need something reporting the latency of each queue to a system that k8s can read from. A Sidekiq Enterprise Leader with a small snippet of code is a good place to publish latency metrics from.
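
As a rough sketch (the MyMetrics client and job name below are hypothetical, and this assumes Sidekiq Enterprise's Sidekiq.leader? check), a job like this could publish per-queue latency, scheduled to run every minute or so (e.g. via Sidekiq Enterprise's periodic jobs):

# Hypothetical reporter: publishes each queue's latency (the age in
# seconds of its oldest job) to whatever metrics system your HPA reads.
class QueueLatencyReporterJob
  include Sidekiq::Job

  def perform
    # Only the Enterprise leader publishes, so each metric is
    # reported exactly once per cluster.
    return unless Sidekiq.leader?

    Sidekiq::Queue.all.each do |queue|
      MyMetrics.gauge("sidekiq.queue.latency", queue.latency, tags: ["queue:#{queue.name}"])
    end
  end
end

Assuming a metrics adapter exposes that series to the k8s external metrics API, the HPA for a queue's deployment might then look like:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sidekiq-default-queue
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sidekiq-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: sidekiq.queue.latency
          selector:
            matchLabels:
              queue: default
        target:
          type: Value
          value: "30"  # scale up once latency exceeds ~30 seconds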

Cloud Providers

Notes about using AWS, GCP, DO and other KaaS providers...?