
Sidekiq Pro Request: Set SuperFetch recovery destination #4594

Closed
buddhistpirate opened this issue Jun 11, 2020 · 14 comments

Comments

@buddhistpirate

Ruby version: 2.6.6
Sidekiq / Pro / Enterprise version(s): 5.2.7/3.7.1/1.7.2

Sometimes (up to a few times a week), we have jobs that balloon the memory of the worker and eventually get killed by the system with an OOM. We have spent much time tracking down these jobs and either A) fixing them to not use so much memory or B) having them go on a different queue to be processed by a worker with sufficient memory.

Inevitably some other job runs into the same behavior and we go through the cycle again. It has become a fact of life for us. One of the biggest pains is removing a "poison pill" job (#4069). Currently we scale down the worker, wait for the job to show up in the queue again, and then remove it.

A proposal: Be able to set the queue that jobs go into when they are recovered from SuperFetch.

Since these jobs get killed via OOM, they are recovered via SuperFetch and end up back in the regular queue, only to be picked up again and cause the same issue. If we could send all SuperFetch-recovered jobs to a specific queue, it would open up a lot of options for us:

  • We could automate what currently burns a lot of our DevOps time by simply giving the worker that listens on the new queue plenty of memory.
  • We could keep this new worker scaled down if we wanted to, so we could investigate the issue before executing the job again.
  • We could do forensics over all the jobs in this new queue to more easily determine which jobs caused the OOM.

Many options.

Due to the way that SuperFetch is currently written, there is not a good extension point to implement this logic ourselves (via prepend or extend). Also, since Sidekiq-Pro is "closed", we can't maintain our own fork (I think?).

I think the easiest approach would be a Sidekiq config option like "superfetch_recovery_queue" that simply overrides the normal destination if set.

If there were a method that extracted the destination queue from the private queue name (in check_for_orphans and cleanup_the_dead, for example), then we could override that method to always go to a specific queue, or even implement per-source-queue logic if we really wanted to.
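To make the idea concrete, here is a rough sketch of the kind of extension point I have in mind. To be clear, nothing below exists in Sidekiq Pro today; the module and method names are hypothetical and purely illustrate how we would plug in our own routing:

# Hypothetical sketch only -- these hooks do not exist in Sidekiq Pro today.
# If SuperFetch routed every recovered job through one small method, we could
# override it to send recovered jobs wherever we want.
module SuperFetchRecoveryRouting
  # private_queue: the per-process working queue the orphaned job was found in
  # original_queue: the public queue name extracted from it
  def recovery_queue_for(private_queue, original_queue)
    # Route everything to a dedicated queue for inspection, or branch on
    # original_queue for per-source-queue logic.
    "superfetch_recovered"
  end
end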

Thoughts?

@mperham
Collaborator

mperham commented Jun 11, 2020 via email

@mperham
Collaborator

mperham commented Jun 11, 2020 via email

@buddhistpirate
Author

I agree with your concern; however, I think this solution, or similar ones, would let us decide what to do with SuperFetch-recovered jobs. Our most likely approach would be to have one of our larger-memory workers listen to the queue, so no intervention would be required. We could then alert on entries in that queue and investigate why jobs are ending up there. It leaves a lot of options open while reducing a lot of interruption and pain.

Generally our memory limits are 1 GB-3 GB. We have a few workers with limits of 6 GB, 8 GB, and 14 GB for dealing with jobs that need it.

In the past month we've run about 390M Sidekiq jobs. According to our logs, 271 jobs have been recovered by SuperFetch in that time.

Our 99th-percentile duration for Sidekiq jobs is 2.91 seconds, so if Kubernetes gives us the SIGQUIT before the SIGKILL, it seems we have enough time to finish any active jobs and exit gracefully.

Overall, we're looking for an approach to handling OOMing/poison-pill jobs that doesn't take so much manual intervention. My initial hope was that I could find a good extension point in the code so I wouldn't have to bother you with a feature request. Most other functionality we've wanted, we've been able to achieve through middleware.

We're also experimenting with creating a server middleware that checks some filters (class, args, jid, other metadata we attach) before starting a job. That way, if we have identified a poison pill job (which is difficult), we can stop its execution. This will add overhead and Redis calls to every Sidekiq job, but it will at least save us the pain of having to scale the worker down and wait for the job to show up in the queue. Scaling down takes time, but it also interrupts our users, as there are queues whose latency they are sensitive to.
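For reference, here is a minimal sketch of the filtering middleware we're experimenting with. The poison_jids Redis set and the deny-list policy are our own conventions, not anything Sidekiq provides; only the middleware API itself (call(worker, job, queue) plus yield) is standard Sidekiq:

require "sidekiq"

# Server middleware sketch: skip any job whose JID has been flagged as a
# poison pill. The "poison_jids" set name is our own convention.
class PoisonPillFilter
  def call(worker, job, queue)
    flagged = Sidekiq.redis { |conn| conn.sismember("poison_jids", job["jid"]) }
    if flagged
      Sidekiq.logger.warn { "Skipping flagged poison pill #{job["class"]} #{job["jid"]}" }
      return # not yielding means the job body never runs
    end
    yield
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add PoisonPillFilter
  end
end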

@buddhistpirate
Author

In thinking through uses of the "filtering" middleware I mentioned, another approach to this problem presented itself. If we can identify which jobs were recovered by SuperFetch, we could use our "filtering" middleware to take our chosen action on them.

Since SuperFetch uses rpoplpush, there is no opportunity to modify the job to add metadata like superfetched: true. However, rpoplpush does return the item that it is moving. So if we were able to register a "logger" to receive the output of the rpoplpush calls, then we could (a conceptual sketch follows the list below):

  • More accurately log the jobs that are being recovered by superfetch
  • Put them in a list for viewing in our admin console
  • Either automatically or manually add filter rules to move them to a different queue
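To illustrate what I mean by registering a "logger": rpoplpush atomically moves an element and returns it, so whatever code performs the move could hand that raw payload to a user-supplied callback. This is only a conceptual sketch using redis-rb; the queue names are made up and this is not how SuperFetch is actually implemented:

require "redis"

redis = Redis.new

# A user-supplied callback that receives each recovered payload.
recovery_callback = ->(jobstr) { puts "recovered by superfetch: #{jobstr}" }

# rpoplpush moves one element from the orphaned private queue back to the
# public queue and returns the moved payload, so the callback can observe it.
while (jobstr = redis.rpoplpush("queue:default_private_abc123", "queue:default"))
  recovery_callback.call(jobstr)
end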

Thoughts?

@mperham
Collaborator

mperham commented Jun 16, 2020

I think logging them in a more useful form is a good first step. Today Sidekiq Pro does this:

Sidekiq.logger.warn { "SuperFetch recovered job: #{job}" }

Note that job is the String returned from rpoplpush; Sidekiq does not do any processing on it today, because that processing could have been what triggered a process crash, e.g. JSON.parse(job). Should we take the risk of parsing it so we can log something like #{job.class} #{job.jid}?
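One possible middle ground, purely as a sketch and not something Sidekiq Pro does today: parse defensively and fall back to logging the raw string if parsing blows up, so the recovery path itself can never crash on a bad payload:

begin
  parsed = JSON.parse(job)
  Sidekiq.logger.warn { "SuperFetch recovered job: #{parsed["class"]} #{parsed["jid"]}" }
rescue StandardError
  # Payload could not be parsed; log it verbatim rather than risk the recovery path.
  Sidekiq.logger.warn { "SuperFetch recovered job: #{job}" }
end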

@buddhistpirate
Author

The basic logging in 5.0.1 is a good step; another reason for us to do the upgrade. I don't know that the additional parsing would give us much more, since we can probably parse the logged string downstream.

However, it still falls one step short of letting us automate the problem away. The logging in 5.0.1 lets us set up alerts through our log provider, which is helpful. If we could supply a block, or patch a method, to get access to that job string, we could automate in various ways, which would cut down on our toil big time. Other examples of things we could do: send a metric to our metrics system instead of our logger for more real-time alerts, or post a message to Slack.

Just yesterday, for instance, we had a job that would balloon because too much data was being pulled out of the database. This particular job is around syncing, so when the data wasn't successfully synced to the downstream, the job was re-enqueued again. This persisted until we were able to modify the underlying data causing the issue. The additional logging in 5.0.1 would have made it easier to identify the JID and database record causing the issue, but we would still need to take the manual step of adding it to some kind of filter/deny list, possibly each time it got enqueued.

@mperham
Collaborator

mperham commented Jun 16, 2020

Sidekiq Pro already fires a Statsd metric whenever it recovers a job via super_fetch:

https://github.com/mperham/sidekiq/wiki/Pro-Metrics#commercial-metrics

I just modified it to include the queue name as a tag so you can slice the metrics that way.

@mperham
Collaborator

mperham commented Jun 16, 2020

I suspect a lot of this issue boils down to "please upgrade" to pick up the polish I've added over the last 1-2 years.

@buddhistpirate
Author

Thanks, I think having the queue tag on those metrics will help us alert on and track patterns of which queues cause the issue.

However, I still don't see a way to automate dealing with these poison pills. Being able to redirect the jobs to a different queue, or to intercept the job information in code, would unlock that automation for us.

@mperham
Collaborator

mperham commented Jun 16, 2020

What about something like this?

config.super_fetch! do |jobstr|
  job = JSON.parse(jobstr)
  Slack.message("Yo, #{job["class"]} #{job["jid"]} just rescued")
end

@buddhistpirate
Author

I think that would be perfect.
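For example, building on that proposed block (and on the poison_jids deny-list convention from our filtering middleware above, which is our own and not part of Sidekiq), we could flag recovered jobs automatically so the filter refuses to run them the next time they appear:

config.super_fetch! do |jobstr|
  job = JSON.parse(jobstr)
  # Flag the recovered job so our PoisonPillFilter middleware skips it,
  # then let a human decide whether it is safe to re-run.
  Sidekiq.redis { |conn| conn.sadd("poison_jids", job["jid"]) }
  Slack.message("#{job["class"]} #{job["jid"]} was recovered by SuperFetch")
end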

@mperham
Collaborator

mperham commented Jun 17, 2020

Ok, this will be part of #4602.

mperham closed this as completed Jun 17, 2020
@lfv89

lfv89 commented Jun 21, 2021

hey @mperham,

When I try to pass a block to super_fetch!, even though the "SuperFetch activated" log line is shown, my job is not recovered. If I don't pass the block, everything works fine. In the example below, not only is hey never shown, but the job is also never recovered... Any ideas?

config.super_fetch! do |_jobstr|
  puts('>>> hey')
end

@mperham
Collaborator

mperham commented Jun 21, 2021

@lfv89 please open your own issue so we can debug the problem.
