I feel like the wiki is lacking information on how to scale Sidekiq effectively and problems that can be encountered along the way. I've drafted a piece that I think might be a good fit, and I'm sharing it here for feedback before adding it to the wiki. It's largely based on the scaling issues we've seen customers face at Judoscale.
Let me know what you think!
(Note: There are a few placeholders for diagrams I haven't made yet.)
Scaling Sidekiq
Sidekiq’s architecture makes it easy to scale up to thousands of jobs per second and millions of jobs per day. Scaling Sidekiq can simply be a matter of “adding more servers”, but how do you optimize each server, how “big” do the servers need to be, and how do you know when to add more? Those are the questions this guide will answer.
Concepts and terms
Let’s start with an overview of Sidekiq’s architecture and the various “levers” we have available to us. We’ll also define some terms we’ll use throughout this guide.
Here’s a diagram that shows the relationship between these concepts:
[Relationship between concurrency, containers, and process swarms]
Sidekiq is of course all about queues, so let’s clarify some terms here.
[Relationship between Sidekiq queue assignments and priority]
And finally we have our connection pools. Yes, multiple connection pools: the database pool configured in `database.yml`, and Sidekiq's own Redis connection pool.
[Diagram of connection pools used by Sidekiq]
In total this is a lot of concepts and configurations. The good news is most of them are handled for us or are straightforward to configure ourselves.
A Sidekiq starting point
These are some general recommendations that will help things run smoothly early in an app's life and prepare you to scale later.
The fewer queues the better. Don’t make your life harder than it needs to be. Two or three queues are plenty for a new app. We’ll talk later about when it makes sense to add more queues, but scaling will generally be more challenging the more queues you have.
Name your queues based on priority or urgency. Some teams name their queues using domain-specific terms that are no help at all when it comes to planning queue priority or latency requirements. "Urgent", "default", and "low" are much easier to work with. You might take this a step further and embrace Gusto's approach of latency-based queue names such as "within_30_seconds", "within_5_minutes", etc. This approach makes it very clear which queues have priority and when queue latency is unacceptable.
Keep your jobs as small as possible! Embrace the fan-out approach [ref?]. [say more here] Smaller jobs are much easier to scale, but we’ll talk later about strategies to use when this isn’t possible.
Run a single Sidekiq process per container. You can add Sidekiq Swarm later, but don’t assume you’ll need it. This is one less variable to juggle when scaling. Keep it simple.
Choose a container size based on memory. If you're working with a lot of large files, such as generating PDFs or importing large CSV files, you'll need more memory. If you're not doing that, you can probably get away with 1GB or less.
Start with five threads per process (concurrency). This is just a starting point—you will need to tweak it. Many teams get too ambitious with their concurrency, saturating their CPU and slowing down all jobs. The good news is five is the Sidekiq default, so if you don’t do anything, you’ll have a good starting point.
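Taken together, a starting configuration might look like this. This is a sketch, not a prescription: the queue names follow the priority-based convention above, and the weights are illustrative.

```
# config/sidekiq.yml -- a minimal starting point (values are illustrative)
concurrency: 5      # the Sidekiq default; tune later based on CPU usage
queues:
  - [urgent, 3]     # weighted: urgent is checked 3x as often as low
  - [default, 2]
  - [low, 1]
```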
These guidelines will get you started, but what about optimizing your configuration and scaling beyond the basics? That’s what we’ll tackle in the following sections.
Find your concurrency sweet spot
Depending on your container CPU and the type of work your jobs are doing (mainly the percentage of time spent in I/O), you'll probably need to tweak your concurrency setting. As a simple rule, you want CPU usage to be high but not 100% when all threads are in use.
If CPU is hitting 100%, reduce your concurrency. If your CPU usage never goes above 50% at max throughput, you probably want to increase your concurrency.
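As a rough sketch of that intuition (an approximation I'm adding here, not an official Sidekiq formula): if a job spends some fraction of its time waiting on I/O, roughly 1 / (1 - I/O fraction) threads will keep one core busy.

```ruby
# Rough heuristic: threads needed to keep one core busy, given the
# fraction of job time spent waiting on I/O (0.0 = pure CPU work).
def threads_per_core(io_fraction)
  (1.0 / (1.0 - io_fraction)).round
end

threads_per_core(0.8) # => 5, mostly-I/O jobs suit the Sidekiq default
threads_per_core(0.5) # => 2, CPU-heavy jobs want lower concurrency
```

In practice, don't calculate; measure. Watch CPU at peak load and adjust as described above.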
Use `RAILS_MAX_THREADS` to tweak concurrency. When you decide to tweak your concurrency, you could configure it with the `-c` CLI flag, but Sidekiq will also respect the `RAILS_MAX_THREADS` environment variable. This is what Rails uses by default to configure your database pool in `database.yml`, so by embracing this convention, your database pool will always be correctly sized for your Sidekiq process.

Autoscale your Sidekiq containers
Don’t waste your energy calculating how many containers you need to run. Sidekiq loads are highly variable by nature, and you don’t want to pay for a cluster of 10 containers when no jobs are enqueued. Autoscaling solves this problem by automatically scaling your containers up and down, but what metric should you use for autoscaling?
Sidekiq workloads are more often I/O-bound than CPU-bound [TODO: back this up somehow], so CPU is an inappropriate (and frustrating) metric to use for autoscaling. There’s always an implicit (or hopefully explicit) expectation that jobs are picked up within a certain amount of time, which makes queue latency the perfect metric for autoscaling. (And if you’re using latency-based queue names, you’ve already identified those latency expectations!)
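Queue latency here means how long the oldest pending job has been waiting. Sidekiq exposes this directly as `Sidekiq::Queue#latency`; as a pure-Ruby sketch of the same calculation (the helper name is mine, not Sidekiq's):

```ruby
# Latency of a queue = age of its oldest pending job, in seconds.
# (Sidekiq::Queue#latency derives this from the oldest job's enqueued_at.)
def queue_latency(oldest_enqueued_at, now: Time.now)
  return 0.0 if oldest_enqueued_at.nil? # empty queue: nothing is waiting
  [now - oldest_enqueued_at, 0.0].max
end

now = Time.now
queue_latency(now - 42, now: now) # => 42.0 (seconds)
queue_latency(nil)                # => 0.0
```

In a real app you'd simply report `Sidekiq::Queue.new("default").latency` to your autoscaler on an interval.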
Several services exist for autoscaling Sidekiq based on queue latency:
(*) You’ll need to measure queue latency yourself and report it to CloudWatch or HPA.
Assign queues to dedicated processes
Sometimes it makes sense to add a queue for a specific job or a particular “shape” of job. Some examples:
These aren’t ideal scenarios, but they’re real-world scenarios that many apps will encounter. It’s best to treat these queues as the anomalies they are and give each its own dedicated Sidekiq process. This way your long-running jobs will only block other long-running jobs, and your memory-hungry jobs won’t require all of your jobs to run on larger, higher-priced containers.
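As a sketch, assuming hypothetical queue names (`exports` for long-running jobs, `imports` for memory-hungry ones), dedicated processes might look like this in a `Procfile`:

```
# Procfile -- one Sidekiq process per workload shape (names are illustrative)
worker:  bundle exec sidekiq -q urgent -q default -q low
exports: bundle exec sidekiq -q exports
imports: RAILS_MAX_THREADS=1 bundle exec sidekiq -q imports
```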
This isolation makes scaling easier because you’re scaling your “special” queues separately from your “normal” queues. In a `Procfile`, you can give each special queue its own `sidekiq` process, setting `RAILS_MAX_THREADS=1` on the memory-hungry one to force those jobs to run single-threaded (reducing memory bloat).

Scaling problems & solutions
The best way to make scaling easy is by keeping it simple: a few queues with small jobs. But of course keeping it simple isn’t always easy, especially in a legacy codebase or a large team. Here are some of the problems or anti-patterns you’ll generally want to avoid:
If you’re seeing `ActiveRecord::ConnectionTimeoutError` in your Sidekiq jobs, chances are you’ve misconfigured your database connection pool. Make sure your `database.yml` is using `RAILS_MAX_THREADS` as the pool size, and use `RAILS_MAX_THREADS` instead of `-c` to configure your concurrency.

Scaling Redis
The short answer here is that Redis is almost never the problem when scaling Sidekiq. But for very high-scale apps, you might hit the limits of what’s possible with a single Redis server. The sharding wiki article walks you through some options here, and now Dragonfly might be an even better option.
Just remember that most apps don’t need this! Make sure you’ve worked through the earlier suggestions and confirmed that Redis is your bottleneck before proceeding down these paths.