Reliability

Mike Perham edited this page Feb 13, 2023 · 72 revisions

Reliability encompasses Sidekiq's ability to withstand network problems without loss of data. There are three aspects of reliability with Sidekiq and Redis:

  1. pushing jobs to Redis, see the client reliability page.
  2. fetching jobs from Redis, see below.
  3. scheduling jobs, see below.

Setup

TL;DR To use the Reliability features in Sidekiq Pro, add this to your initializer:

Sidekiq::Client.reliable_push! unless Rails.env.test?

Sidekiq.configure_server do |config|
  config.super_fetch!
  config.reliable_scheduler!
end

Read on for more detail. The Reliability screencast gives a quick overview.

Using super_fetch

Sidekiq uses BRPOP to fetch a job from the queue in Redis. This is very efficient and simple but it has one drawback: the job is now removed from Redis. If Sidekiq crashes while processing that job, it is lost forever. This is not a problem for many but some businesses need absolute reliability when processing jobs.

Sidekiq does its best to never lose jobs but it can't guarantee it; the only way to guarantee job durability is to leave the job in Redis until it is complete. For instance, if Sidekiq is restarted mid-job, it will try to push the unfinished jobs back to Redis, but networking issues can prevent this.

Sidekiq Pro offers an alternative fetch strategy for job processing, super_fetch, which uses Redis's RPOPLPUSH command to keep jobs in Redis while they are being processed. To enable super_fetch:

Sidekiq.configure_server do |config|
  # This needs to be within the configure_server block
  config.super_fetch!
end

When Sidekiq starts, you should see SuperFetch activated:

INFO: Sidekiq Pro 3.5.0, commercially licensed.  Thanks for your support!
INFO: Booting Sidekiq 5.0.0 with redis options {:url=>nil}
INFO: Starting processing, hit Ctrl-C to stop
INFO: SuperFetch activated
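
To make the difference between the two fetch styles concrete, here is a minimal sketch that simulates both against an in-memory stand-in for Redis. FakeRedis, the key names (queue:default, working:process-a) and the acknowledgement step are assumptions made for illustration; this is not Sidekiq Pro's implementation and no real Redis connection is involved.

```ruby
# In-memory stand-in for a few Redis list commands (illustration only).
class FakeRedis
  def initialize
    @lists = Hash.new { |hash, key| hash[key] = [] }
  end

  def lpush(key, value)
    @lists[key].unshift(value)
  end

  # Pop from the tail: afterwards the value exists nowhere in "Redis".
  def rpop(key)
    @lists[key].pop
  end

  # Pop from the tail of src and push onto the head of dst in one step.
  def rpoplpush(src, dst)
    value = @lists[src].pop
    @lists[dst].unshift(value) if value
    value
  end

  # Remove the first matching element; used here to acknowledge a finished job.
  def lrem(key, value)
    index = @lists[key].index(value)
    @lists[key].delete_at(index) if index
  end

  # Snapshot of a list's contents (simplified; not the real LRANGE signature).
  def list(key)
    @lists[key].dup
  end
end

redis = FakeRedis.new

# Pop-style fetch: the job is gone from Redis the moment it is fetched.
redis.lpush("queue:default", "job-1")
redis.rpop("queue:default")
# ... imagine the process crashes here ...
lost = redis.list("queue:default").empty?

# RPOPLPUSH-style fetch: the job moves to a private working list instead.
redis.lpush("queue:default", "job-2")
job = redis.rpoplpush("queue:default", "working:process-a")
# ... a crash here leaves the job in the working list, where it can be found later ...
survivors = redis.list("working:process-a")

# On success, the worker removes the job from its working list.
redis.lrem("working:process-a", job)
```

In the first case nothing is left to recover after a crash; in the second, the job stays in the working list until it is explicitly acknowledged.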

Recovering Jobs

When a Sidekiq process dies without warning (e.g. kill -9 or a Ruby VM crash), its jobs in progress become orphans. On process startup, super_fetch will look for orphaned jobs when both of these conditions hold:

  1. the process's heartbeat has expired (it takes 60 seconds to expire); AND
  2. an hour has passed since the last orphan check

The orphan check requires a complete SCAN of the Redis database; it can take a substantial amount of time (i.e. over a few seconds) if your Redis database has a lot of keys. As always, I recommend using a separate Redis database or instance for cache data vs job data. The hour buffer prevents Sidekiq from slamming Redis with constant SCANs and ensures that you don't have a continual cycle of process death due to recovered jobs which are poison pills.

In summary, super_fetch might recover jobs in 5 minutes or in 3 hours; there's no guarantee. Restarting a process is the best way to signal Sidekiq Pro to look for orphans.
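
The two recovery conditions can be read as a simple predicate. The sketch below is one interpretation of the rules above, not Sidekiq Pro's code: the method name recover_orphans? and its parameters are hypothetical, and the 60 second and one hour values are taken from the text.

```ruby
# Hypothetical helper: decides whether a starting process should scan for orphans.
# heartbeat_ttl and check_interval default to the values described in the text.
def recover_orphans?(last_heartbeat_at:, last_orphan_check_at:, now: Time.now,
                     heartbeat_ttl: 60, check_interval: 3600)
  heartbeat_expired = (now - last_heartbeat_at) > heartbeat_ttl
  # Treat "never checked" as due (an assumption, not stated in the text).
  check_due = last_orphan_check_at.nil? || (now - last_orphan_check_at) >= check_interval
  heartbeat_expired && check_due
end

now = Time.at(100_000)

# Heartbeat long expired, but the last orphan check ran 10 minutes ago: wait.
too_soon = recover_orphans?(last_heartbeat_at: now - 120,
                            last_orphan_check_at: now - 600, now: now)

# Heartbeat expired and more than an hour since the last check: scan.
ready = recover_orphans?(last_heartbeat_at: now - 120,
                         last_orphan_check_at: now - 4000, now: now)
```

This is why recovery time varies so widely: an orphan created just after a check has to wait out the rest of the hour, and it is only found when some process starts up.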

Notification

As of v5.2, Sidekiq Pro will fire a callback when super_fetch rescues an orphaned job.

config.super_fetch! do |jobstr, pill|
  # jobstr is a raw String of JSON. Sidekiq does not parse the JSON string
  # as this could have been the cause of the crash!
  puts "Uh oh, Sidekiq Pro just recovered this job! #{jobstr}"
end
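
Since the callback receives raw JSON that may itself be the cause of the crash, one cautious approach is to log a bounded preview without parsing it. This is only a sketch: the on_recovery lambda, the 200 character limit and the message format are assumptions, not part of Sidekiq Pro.

```ruby
# Hypothetical handler: builds a log message without parsing the job JSON.
on_recovery = lambda do |jobstr, pill|
  # Truncate so a huge payload cannot flood the log.
  preview = jobstr.length > 200 ? "#{jobstr[0, 200]}..." : jobstr
  status = pill ? "killed as poison pill" : "recovered"
  "super_fetch #{status} (#{jobstr.bytesize} bytes): #{preview}"
end

# Possible wiring inside Sidekiq.configure_server (assumes Sidekiq Pro is loaded):
#   config.super_fetch! { |jobstr, pill| Sidekiq.logger.warn(on_recovery.call(jobstr, pill)) }

message = on_recovery.call('{"jid":"abc"}', nil)
```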

Poison Pills

A job which triggers a process crash is known as a "poison pill". When super_fetch recovers an orphaned job, it notes this recovery. If the same job is recovered three times in 72 hours, it will be classified as a poison pill and automatically killed (i.e. placed in the Dead set). You will need to fix the crash and manually rerun the job. Sidekiq Pro will provide information about the pill in the callback. If the job has not yet been classified as a poison pill, pill will be nil.

config.super_fetch! do |jobstr, pill|
  puts "Killed poison pill: #{pill.jid} #{pill.klass}" if pill
end
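
The "three recoveries in 72 hours" rule can be sketched as a sliding window over recovery timestamps. RecoveryTracker is a hypothetical in-memory illustration; how Sidekiq Pro actually records recoveries is not described here and may differ.

```ruby
# Hypothetical tracker: flags a job as a poison pill after 3 recoveries in 72 hours.
class RecoveryTracker
  WINDOW = 72 * 60 * 60 # seconds
  THRESHOLD = 3

  def initialize
    @recoveries = Hash.new { |hash, jid| hash[jid] = [] }
  end

  # Records one recovery and returns true if the job now counts as a poison pill.
  def record(jid, at: Time.now)
    times = @recoveries[jid]
    times << at
    # Forget recoveries that fell out of the 72 hour window.
    times.reject! { |t| at - t > WINDOW }
    times.size >= THRESHOLD
  end
end

tracker = RecoveryTracker.new
start = Time.at(0)
hour = 3600

first  = tracker.record("jid-1", at: start)
second = tracker.record("jid-1", at: start + hour)
third  = tracker.record("jid-1", at: start + 2 * hour)
```

Here the first two recoveries return false and the third returns true, which is the point at which the job would be moved to the Dead set.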

Fetch algorithms

super_fetch supports the same two queue prioritization mechanisms as Sidekiq's basic fetch: strict priority and weighted random.

Strict ordering

sidekiq -e production -q critical -q default -q bulk

Beware that strict ordering can lead to starvation: bulk jobs will only be processed once the critical and default queues are empty. You can switch ordering for different processes to ensure everyone gets processed:

sidekiq -e production -q critical -q default -q bulk
sidekiq -e production -q bulk -q default -q critical
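
A small simulation shows why strict ordering starves later queues and how reversing the order in a second process helps. The next_job helper and the in-memory queues are assumptions for illustration only.

```ruby
# Hypothetical strict-priority fetch: take from the first non-empty queue in order.
def next_job(queues, order)
  order.each do |name|
    job = queues[name].shift
    return job if job
  end
  nil
end

# Drain all queues using the given order and return the processing sequence.
def drain(queues, order)
  processed = []
  while (job = next_job(queues, order))
    processed << job
  end
  processed
end

def sample_queues
  { "critical" => %w[c1 c2], "default" => %w[d1], "bulk" => %w[b1] }
end

forward  = drain(sample_queues, %w[critical default bulk])
reversed = drain(sample_queues, %w[bulk default critical])
```

With the forward order, b1 runs only after every critical and default job is done; with the reversed order it runs first.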

Weighted random

sidekiq -e production -q critical,3 -q default,2 -q bulk,1

When using weighted ordering, Sidekiq will randomly choose a queue to check, without blocking, using weighted random choice. For example, in the command given above, Sidekiq will sample from the array ["critical", "critical", "critical", "default", "default", "bulk"], so critical will be checked first 50% of the time.
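
One way to picture the weighted choice is to expand the weights into the array above, shuffle it, and drop duplicates to get the order in which queues are checked. This is a sketch under that assumption, with hypothetical helper names; it is not necessarily how super_fetch implements it.

```ruby
# Expand { name => weight } into an array with each name repeated `weight` times.
def weighted_queue_list(weights)
  weights.flat_map { |name, weight| [name] * weight }
end

# Shuffle the weighted array and keep the first occurrence of each queue name.
def queue_check_order(weighted, random: Random.new)
  weighted.shuffle(random: random).uniq
end

weights  = { "critical" => 3, "default" => 2, "bulk" => 1 }
weighted = weighted_queue_list(weights)

# Estimate how often "critical" is checked first over many draws.
rng = Random.new(42)
draws = 10_000
critical_first = Array.new(draws) { queue_check_order(weighted, random: rng).first }
                      .count("critical")
share = critical_first.to_f / draws
```

Because critical occupies 3 of the 6 slots, share should come out close to 0.5, and every queue still appears somewhere in each check order.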

Limitations

Because of Redis limits, super_fetch has to poll the queues in Redis for jobs rather than block. If you have M queues being processed by N processes, you will get M * N RPOPLPUSH calls to Redis per second, which can lead to a lot of Redis traffic and CPU burn.

The solution is to reduce the number of queues or specialize your Sidekiq processes: have each process only handle 3-4 queues.
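
As a rough worked example of the M * N rule of thumb, the sketch below compares every process polling every queue with specialized processes. The helper name and the numbers (12 queues, 8 processes, 3 queues per specialized process) are made up for illustration.

```ruby
# Estimated RPOPLPUSH calls per second, using the M * N rule of thumb from the text.
def polls_per_second(queues_per_process:, processes:)
  queues_per_process * processes
end

# 8 processes that each poll all 12 queues.
unspecialized = polls_per_second(queues_per_process: 12, processes: 8)

# The same 8 processes, each assigned only 3 of the 12 queues.
specialized = polls_per_second(queues_per_process: 3, processes: 8)
```

Under these assumptions the polling load drops from 96 to 24 calls per second, while each queue is still covered by two processes.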

Metrics

super_fetch increments the jobs.poison and jobs.recovered.fetch Statsd metrics when it kills a poison pill job and when it recovers an orphaned job, respectively. When killing a poison pill, super_fetch logs this:

warn("Killed poison pill #{klass} #{jid}")

Scheduler

Sidekiq's default scheduler is not atomic: it pops jobs off the scheduled queue and enqueues them with two network round trips. Sidekiq Pro offers a reliable scheduler which uses Lua to perform the same task atomically:

Sidekiq.configure_server do |config|
  config.reliable_scheduler!
end

This feature is optional, but enabling it is highly recommended. It does have the drawback that client-side middleware is not invoked when enqueuing the scheduled jobs, since the entire operation takes place within Redis. It is not safe to enable if you are running Redis Cluster.
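
The risk of the two-step approach can be shown with an in-memory simulation: if the process fails between the pop and the push, the job is in neither place. Everything here (the method names, the crash_after_pop flag, the arrays standing in for Redis structures) is hypothetical; the atomic variant only mimics the effect of running both steps as a single unit, as a Lua script inside Redis does.

```ruby
# Raised to simulate a process crash or network failure between the two steps.
CrashBetweenSteps = Class.new(StandardError)

# Non-atomic: remove the due job, then push it, as two separate steps.
def enqueue_in_two_steps(scheduled, queue, now, crash_after_pop: false)
  due = scheduled.select { |entry| entry[:at] <= now }
  due.each do |entry|
    scheduled.delete(entry)                     # step 1: pop from the scheduled set
    raise CrashBetweenSteps if crash_after_pop  # failure window between round trips
    queue << entry[:job]                        # step 2: push onto the queue
  end
end

# Atomic stand-in: both changes are applied together, with no failure window.
def enqueue_atomically(scheduled, queue, now)
  due, remaining = scheduled.partition { |entry| entry[:at] <= now }
  scheduled.replace(remaining)
  queue.concat(due.map { |entry| entry[:job] })
end

scheduled = [{ at: 100, job: "job-1" }]
queue = []
begin
  enqueue_in_two_steps(scheduled, queue, 100, crash_after_pop: true)
rescue CrashBetweenSteps
  # The job was popped but never pushed.
end
lost = scheduled.empty? && queue.empty?

safe_scheduled = [{ at: 100, job: "job-2" }, { at: 500, job: "job-3" }]
safe_queue = []
enqueue_atomically(safe_scheduled, safe_queue, 100)
```

After the simulated crash, job-1 is gone from both structures; in the atomic variant job-2 is enqueued and job-3, which is not yet due, stays scheduled.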

Notes

  • super_fetch is more sensitive to Redis network latency than Sidekiq's default basic_fetch, especially if you have lots of queues and high concurrency. This can result in idle processor threads, starved for jobs. Check out Using Redis for tips on measuring Redis latency.
  • Older versions of Sidekiq Pro offered reliable_fetch and timed_fetch. These algorithms are now deprecated and no longer documented.