Error Handling

Mike Perham edited this page Apr 16, 2024 · 103 revisions

I hate to say it but some of your jobs will raise exceptions when executing. It's true.

Sidekiq has a number of features to handle errors of all types.

Best Practices

  1. Use an error service: Honeybadger, Airbrake, Rollbar, BugSnag, Sentry, Exceptiontrap, Raygun, etc. They're all similar in feature set and pricing, so pick one and use it. The error service will email you every time there is an exception in a job (smarter ones like Honeybadger will send email on the 1st, 3rd and 10th identical error so your inbox won't be overwhelmed if thousands of jobs are failing).
  2. Let Sidekiq catch errors raised by your jobs. Sidekiq's built-in retry mechanism will catch those exceptions and retry the jobs regularly. The error service will notify you of the exception. You fix the bug, deploy the fix and Sidekiq will retry your job successfully.
  3. If you don't fix the bug within 25 retries (about 20 days), Sidekiq will stop retrying and move your job to the Dead set. You can fix the bug and retry the job manually anytime within the next 6 months using the Web UI.
  4. After 6 months, Sidekiq will discard the job.

Error Handlers

Gems can attach to Sidekiq's global error handlers so they will be informed any time there is an error inside Sidekiq. Error services should all integrate automatically once you include their gem in your application's Gemfile.

You can create your own error handler by providing something which responds to call(exception, context_hash, config):

Sidekiq.configure_server do |config|
  config.error_handlers << proc {|ex,ctx_hash,config| MyErrorService.notify(ex, ctx_hash) }
end

ex is the actual Exception raised. context_hash is an optional hash with the job payload and any additional context for the error. config gives you access to Sidekiq's configuration.
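
Because a handler only needs to respond to call, it can also be a small class rather than a proc. The following is a minimal sketch, not Sidekiq's own code: LoggingErrorHandler is a hypothetical name, and the sketch assumes the job payload sits under a :job key in the context hash, which you should verify against the context your Sidekiq version passes.

```ruby
require "logger"
require "stringio"

# Hypothetical handler class: anything that responds to
# call(exception, context_hash, config) can be registered.
class LoggingErrorHandler
  def initialize(logger)
    @logger = logger
  end

  # config is accepted to match the handler signature but is unused here.
  def call(exception, context_hash, _config = nil)
    # Assumption: the job payload, when present, is stored under :job.
    job = (context_hash || {})[:job] || {}
    @logger.error(
      "#{exception.class}: #{exception.message} " \
      "(class=#{job['class'].inspect} jid=#{job['jid'].inspect})"
    )
  end
end

# In a real app this would be registered inside Sidekiq.configure_server:
#   config.error_handlers << LoggingErrorHandler.new(Sidekiq.logger)

# Standalone demonstration with a stand-in context hash.
LOG_OUTPUT = StringIO.new
HANDLER = LoggingErrorHandler.new(Logger.new(LOG_OUTPUT))
HANDLER.call(RuntimeError.new("boom"), { job: { "class" => "HardJob", "jid" => "abc123" } })
puts LOG_OUTPUT.string
```

Keeping the handler in a class makes it easy to exercise in a unit test without booting a Sidekiq server.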

Note that error handlers are only relevant to the Sidekiq server process. They aren't active in Rails console, for instance.

Backtrace Logging

Enabling backtrace logging for a job will cause the backtrace to be persisted throughout the lifetime of the job. Beware: each backtrace can take 1-4 KB of memory in Redis, so a large number of failing jobs can significantly increase your Redis memory usage.

sidekiq_options backtrace: true

Use caution when enabling backtraces: either limit them to a fixed number of lines, or use an error service to keep track of failures and their associated backtraces.

sidekiq_options backtrace: 20 # top 20 lines
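
To get a feel for the memory cost, here is a back-of-the-envelope sketch using the 1-4 KB per-backtrace estimate above. The helper name and the figure of 100,000 failing jobs are illustrative assumptions; real usage depends on how deep your backtraces are.

```ruby
# Rough Redis memory used by stored backtraces, based on the
# 1-4 KB per-backtrace estimate. This is an approximation only.
def backtrace_memory_bytes(failing_jobs, per_backtrace_bytes)
  failing_jobs * per_backtrace_bytes
end

MIB = 1024.0 * 1024.0

# Illustrative workload: 100,000 failing jobs with full backtraces.
low_estimate  = backtrace_memory_bytes(100_000, 1 * 1024)
high_estimate = backtrace_memory_bytes(100_000, 4 * 1024)

puts format("100,000 failing jobs: roughly %.0f to %.0f MiB",
            low_estimate / MIB, high_estimate / MIB)
```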

Automatic job retry

Sidekiq will retry failures with an exponential backoff using the formula (retry_count ** 4) + 15 + (rand(10) * (retry_count + 1)) (i.e. 15, 16, 31, 96, 271, ... seconds + a random amount of time). It will perform 25 retries over approximately 20 days. Assuming you deploy a bug fix within that time, the job will get retried and successfully processed. After 25 retries, Sidekiq will move that job to the Dead set, assuming that it will need manual intervention to work.

The maximum number of retries can be globally configured by adding the following to your sidekiq.yml:

:max_retries: 1

This table contains approximate retry waiting times:
 # | Next retry backoff | Total waiting time
 -------------------------------------------
 1 |       0d 0h 0m 20s |       0d 0h 0m 20s
 2 |       0d 0h 0m 26s |       0d 0h 0m 46s
 3 |       0d 0h 0m 46s |       0d 0h 1m 32s
 4 |       0d 0h 1m 56s |       0d 0h 3m 28s
 5 |       0d 0h 4m 56s |       0d 0h 8m 24s
 6 |      0d 0h 11m 10s |      0d 0h 19m 34s
 7 |      0d 0h 22m 26s |       0d 0h 42m 0s
 8 |      0d 0h 40m 56s |      0d 1h 22m 56s
 9 |       0d 1h 9m 16s |      0d 2h 32m 12s
10 |      0d 1h 50m 26s |      0d 4h 22m 38s
11 |      0d 2h 47m 50s |      0d 7h 10m 28s
12 |       0d 4h 5m 16s |     0d 11h 15m 44s
13 |      0d 5h 46m 56s |      0d 17h 2m 40s
14 |      0d 7h 57m 26s |        1d 1h 0m 6s
15 |     0d 10h 41m 46s |     1d 11h 41m 52s
16 |      0d 14h 5m 20s |      2d 1h 47m 12s
17 |     0d 18h 13m 56s |       2d 20h 1m 8s
18 |     0d 23h 13m 46s |     3d 19h 14m 54s
19 |      1d 5h 11m 26s |      5d 0h 26m 20s
20 |     1d 12h 13m 56s |     6d 12h 40m 16s
21 |     1d 20h 28m 40s |       8d 9h 8m 56s
22 |       2d 6h 3m 26s |    10d 15h 12m 22s
23 |      2d 17h 6m 26s |     13d 8h 18m 48s
24 |      3d 5h 46m 16s |      16d 14h 5m 4s
25 |     3d 20h 11m 56s |     20d 10h 17m 0s
Hint: This table was calculated under the assumption that `rand(10)` always returns 5. See `Sidekiq::JobRetry#delay_for` for the current formula.
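
The table can be reproduced with a few lines of Ruby. This is a sketch of the formula as quoted above, not a copy of Sidekiq's implementation; retry_delay and format_duration are hypothetical helper names, and the jitter argument stands in for rand(10) so the output is deterministic.

```ruby
# Delay in seconds before the next retry, following the quoted formula.
# jitter stands in for rand(10); omit it to use a real random value.
def retry_delay(retry_count, jitter: nil)
  jitter ||= rand(10)
  (retry_count**4) + 15 + (jitter * (retry_count + 1))
end

# Human-readable duration in the same style as the table.
def format_duration(seconds)
  days, rest = seconds.divmod(86_400)
  hours, rest = rest.divmod(3_600)
  minutes, secs = rest.divmod(60)
  "#{days}d #{hours}h #{minutes}m #{secs}s"
end

total = 0
# retry_count starts at 0 for the first retry.
(0...25).each do |retry_count|
  delay = retry_delay(retry_count, jitter: 5)
  total += delay
  puts format("%2d | %18s | %18s",
              retry_count + 1, format_duration(delay), format_duration(total))
end
```

With the jitter fixed at 5, the final row shows a cumulative wait of 20d 10h 17m 0s, matching the table.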

Web UI

The Sidekiq Web UI has "Retries" and "Dead" tabs, which list failed jobs and allow you to run, inspect or delete them.

Dead set

The Dead set is a holding pen for jobs which have failed all their retries. Sidekiq will not retry those jobs; you must manually retry them via the UI. The Dead set is limited by default to 10,000 jobs or 6 months so it doesn't grow infinitely. Only jobs configured with 0 or greater retries will go to the Dead set. Use retry: false if you want a particular type of job to be executed only once, no matter what happens.

Configuration

You can specify the number of retries for a particular job if 25 is too many:

class LessRetryableJob
  include Sidekiq::Job
  sidekiq_options retry: 5 # Only five retries and then to the Dead set

  def perform(...)
  end
end

Configure job retries to use a lower priority queue so new jobs take precedence:

class LowPriorityRetryJob
  include Sidekiq::Job
  sidekiq_options queue: 'default', retry_queue: 'bulk' # send retries to the 'bulk' queue

  def perform(...)
  end
end

You can disable retry support for a particular job:

class NonRetryableJob
  include Sidekiq::Job
  sidekiq_options retry: false # job will be discarded if it fails

  def perform(...)
  end
end

Skip retries, send a failed job straight to the Dead set:

class StraightToDeadJob
  include Sidekiq::Job
  sidekiq_options retry: 0

  def perform(...)
  end
end

As of Sidekiq 7.1.3, you can retry for a period of time:

class TimeLimitedRetryJob
  include Sidekiq::Job
  sidekiq_options retry_for: 48.hours

  def perform(...)
  end
end

You can disable a job going to the Dead set:

class NoDeathJob
  include Sidekiq::Job
  sidekiq_options retry: 5, dead: false # will retry 5 times and then disappear

  def perform(...)
  end
end

The retry delay can be dynamically calculated by defining a sidekiq_retry_in block in your job class. Support for :kill and :discard was added in v6.5.2. Support for the third block parameter, jobhash, was added in v7.0.8.

class JobWithCustomRetry
  include Sidekiq::Job
  sidekiq_options retry: 5

  # The current retry count, exception and job hash are yielded. The return value of the
  # block can be an integer to be used as the delay in seconds, :kill to
  # send the job to the Dead set, or :discard to throw away the job. A
  # return value of nil will use the default delay.
  sidekiq_retry_in do |count, exception, jobhash|
    case exception
    when SpecialException
      10 * (count + 1) # (i.e. 10, 20, 30, 40, 50)
    when ExceptionToKillFor
      :kill
    when ExceptionToForgetAbout
      :discard
    end
  end

  def perform(...)
  end
end
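
One way to keep this logic easy to test is to hold it in a plain lambda and hand that to sidekiq_retry_in. The sketch below runs without Sidekiq; RateLimitedError, InvalidPayloadError and RETRY_POLICY are hypothetical names. Since a lambda checks its argument count strictly, this assumes a Sidekiq version that passes all three arguments (v7.0.8 or later, per the note above).

```ruby
# Hypothetical exception classes used only for this sketch.
class RateLimitedError < StandardError; end
class InvalidPayloadError < StandardError; end

# Same shape as the sidekiq_retry_in block: (count, exception, jobhash).
RETRY_POLICY = lambda do |count, exception, _jobhash|
  case exception
  when RateLimitedError
    30 * (count + 1) # linear backoff in seconds: 30, 60, 90, ...
  when InvalidPayloadError
    :discard         # retrying will never fix a bad payload
  end                # any other exception falls through to nil (default delay)
end

# In the job class, the lambda could be passed as the block:
#   sidekiq_retry_in(&RETRY_POLICY)

p RETRY_POLICY.call(0, RateLimitedError.new, {})
p RETRY_POLICY.call(2, InvalidPayloadError.new, {})
p RETRY_POLICY.call(1, RuntimeError.new("other"), {})
```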

After the retries are exhausted, Sidekiq will call the sidekiq_retries_exhausted hook on your Job if you've defined it. The hook receives the queued job hash and the exception as arguments and is called right before Sidekiq moves the job to the Dead set.

class FailingJob
  include Sidekiq::Job

  sidekiq_retries_exhausted do |job, ex|
    Sidekiq.logger.warn "Failed #{job['class']} with #{job['args']}: #{job['error_message']}"
  end
  
  def perform(*args)
    raise "or I don't work"
  end
end

Death Notification

The sidekiq_retries_exhausted callback is specific to a Job class. Starting in v5.1, Sidekiq can also fire a global callback when a job dies:

# this goes in your initializer
Sidekiq.configure_server do |config|
  config.death_handlers << ->(job, ex) do
    puts "Uh oh, #{job['class']} #{job["jid"]} just died with error #{ex.message}."
  end
end
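
A death handler can also be any object that responds to call(job, ex), which keeps the notification logic testable outside Sidekiq. This is a minimal sketch: DeathNotifier and the deliver callable are hypothetical, and the delivery step is a stand-in for whatever email or chat client you actually use.

```ruby
# Hypothetical notifier: any object responding to call(job, exception)
# can be appended to config.death_handlers.
class DeathNotifier
  # deliver is any callable that sends the text somewhere (email, chat, ...).
  def initialize(deliver)
    @deliver = deliver
  end

  def call(job, exception)
    @deliver.call(
      "#{job['class']} #{job['jid']} died: #{exception.class}: #{exception.message}"
    )
  end
end

# In your initializer this would be registered as:
#   config.death_handlers << DeathNotifier.new(my_delivery_callable)

# Standalone demonstration that collects messages in an array.
SENT_MESSAGES = []
NOTIFIER = DeathNotifier.new(->(text) { SENT_MESSAGES << text })
NOTIFIER.call({ "class" => "FailingJob", "jid" => "abc123" }, RuntimeError.new("or I don't work"))
puts SENT_MESSAGES
```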

With this callback, you can email yourself, send a Slack message, etc., so you know there is something wrong.

Process Crashes

Sidekiq uses the exact same Redis logic as Resque for fetching jobs. This has a serious consequence: If the Sidekiq process segfaults or crashes the Ruby VM, any jobs that were executing will be lost. If the Sidekiq process is killed due to CPU or memory limits, any jobs that were executing will be lost. Sidekiq Pro offers a reliable queueing feature which does not lose those jobs.

No More Bike Shedding

Sidekiq's retry mechanism is a set of best practices but many people have suggested various knobs and options to tweak in order to handle their own edge case. This way lies madness. Design your code to work well with Sidekiq's retry mechanism as it exists today or patch the JobRetry class to add your own logic. I'm no longer accepting any functional changes to the retry mechanism unless you make an extremely compelling case for why Sidekiq's thousands of users would want that change.
