
Sidekiq Pro Request: Set SuperFetch recovery destination #4594

Closed
buddhistpirate opened this issue Jun 11, 2020 · 14 comments

Comments

@buddhistpirate

Ruby version: 2.6.6
Sidekiq / Pro / Enterprise version(s): 5.2.7/3.7.1/1.7.2

Sometimes (up to a few times a week), we have jobs that balloon the memory of the worker and eventually get killed by the system with an OOM. We have spent much time tracking down these jobs and either A) fixing them to not use so much memory or B) having them go on a different queue to be processed by a worker with sufficient memory.

Inevitably some other job runs into the same behavior and we go through the cycle again. It has become a fact of life for us. One of the biggest pains is removing a "poison pill" job (#4069). Currently we scale down the worker, wait for the job to show up in the queue again, and then remove it.

A proposal: Be able to set the queue that jobs go into when they are recovered from SuperFetch.

Since these jobs get killed via OOM, they are recovered via SuperFetch and end up back in the regular queue, only to be picked up again and cause the same issue. If we could send all SuperFetch-recovered jobs to a specific queue, it would open up a lot of options for us:

  • We could automate what currently burns a lot of our DevOps time by simply giving the worker that listens on the new queue plenty of memory.
  • We could keep this new worker scaled down if we wanted to, so we could investigate the issue before executing the job again.
  • We could do forensics over all the jobs in this new queue to more easily determine which jobs caused the OOM.

Many options.

Due to the way that SuperFetch is currently written, there is not a good extension point to implement this logic ourselves (via prepend or extend). Also, since Sidekiq-Pro is "closed", we can't maintain our own fork (I think?).

I think the easiest approach would be a Sidekiq config option like "superfetch_recovery_queue" that simply overrides the normal destination if set.

If there were a method that extracted the destination queue from the private queue name (in check_for_orphans and cleanup_the_dead, for example), then we could override that method to always go to a specific queue, or even implement per-source-queue logic if we really wanted to.
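To make the idea concrete, here is a rough sketch of the kind of extension point I have in mind. To be clear, nothing below exists in Sidekiq Pro today; the module and method names are hypothetical and purely illustrate how we would plug in our own routing:

# Hypothetical sketch only -- these hooks do not exist in Sidekiq Pro today.
# If SuperFetch routed every recovered job through one small method, we could
# override it to send recovered jobs wherever we want.
module SuperFetchRecoveryRouting
  # private_queue: the per-process working queue the orphaned job was found in
  # original_queue: the public queue name extracted from it
  def recovery_queue_for(private_queue, original_queue)
    # Route everything to a dedicated queue for inspection, or branch on
    # original_queue for per-source-queue logic.
    "superfetch_recovered"
  end
end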

Thoughts?

@mperham
Collaborator

mperham commented Jun 11, 2020 via email

@mperham
Collaborator

mperham commented Jun 11, 2020 via email

@buddhistpirate
Author

I agree with your concern; however, I think this solution, or similar ones, would let us decide what to do with SuperFetch-recovered jobs. Our most likely approach would be to have one of our larger-memory workers listen to the queue, so no intervention would be required. We could then alert on entries in that queue and investigate why jobs are ending up there. It leaves a lot of options open while reducing a lot of interruption and pain.

Generally our memory limits are 1 GB-3 GB. We have a few workers with limits of 6 GB, 8 GB, and 14 GB for dealing with jobs that need it.

In the past month we've run about 390M Sidekiq jobs. According to our logs, 271 jobs have been recovered by SuperFetch in that time.

Our 99th-percentile duration for Sidekiq jobs is 2.91 seconds, so if Kubernetes gives us the SIGQUIT before the SIGKILL, it seems we have enough time to finish any active jobs and exit gracefully.

Overall, we're looking for an approach to handling OOMing/poison-pill jobs that doesn't take so much manual intervention. My initial hope was that I could find a good extension point in the code so I wouldn't have to bother you with a feature request. Most other functionality we've wanted, we've been able to achieve through middleware.

We're also experimenting with creating a server middleware that checks some filters (class, args, jid, other metadata we attach) before starting a job. That way, if we have identified a poison pill job (which is difficult), we can stop its execution. This will add overhead and Redis calls to every Sidekiq job, but it will at least save us the pain of having to scale the worker down and wait for the job to show up in the queue. Scaling down takes time, but it also interrupts our users, as there are queues whose latency they are sensitive to.
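For reference, here is a minimal sketch of the filtering middleware we're experimenting with. The poison_jids Redis set and the deny-list policy are our own conventions, not anything Sidekiq provides; only the middleware API itself (call(worker, job, queue) plus yield) is standard Sidekiq:

require "sidekiq"

# Server middleware sketch: skip any job whose JID has been flagged as a
# poison pill. The "poison_jids" set name is our own convention.
class PoisonPillFilter
  def call(worker, job, queue)
    flagged = Sidekiq.redis { |conn| conn.sismember("poison_jids", job["jid"]) }
    if flagged
      Sidekiq.logger.warn { "Skipping flagged poison pill #{job["class"]} #{job["jid"]}" }
      return # not yielding means the job body never runs
    end
    yield
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add PoisonPillFilter
  end
end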

@buddhistpirate
Author

In thinking through uses of the "filtering" middleware I mentioned, another approach to this problem presented itself. If we can identify which jobs were recovered by SuperFetch, we could use our "filtering" middleware to take our chosen action on them.

Since SuperFetch uses rpoplpush, there is no opportunity to modify the job to add metadata like superfetched: true. However, rpoplpush does return the item that it is moving. So if we were able to register a "logger" to receive the output of the rpoplpush calls, then we could (a conceptual sketch follows the list below):

  • More accurately log the jobs that are being recovered by superfetch
  • Put them in a list for viewing in our admin console
  • Either automatically or manually add filter rules to move them to a different queue
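To illustrate what I mean by registering a "logger": rpoplpush atomically moves an element and returns it, so whatever code performs the move could hand that raw payload to a user-supplied callback. This is only a conceptual sketch using redis-rb; the queue names are made up and this is not how SuperFetch is actually implemented:

require "redis"

redis = Redis.new

# A user-supplied callback that receives each recovered payload.
recovery_callback = ->(jobstr) { puts "recovered by superfetch: #{jobstr}" }

# rpoplpush moves one element from the orphaned private queue back to the
# public queue and returns the moved payload, so the callback can observe it.
while (jobstr = redis.rpoplpush("queue:default_private_abc123", "queue:default"))
  recovery_callback.call(jobstr)
end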

Thoughts?

@mperham
Collaborator

mperham commented Jun 16, 2020

I think logging them in a more useful form is a good first step. Today Sidekiq Pro does this:

Sidekiq.logger.warn { "SuperFetch recovered job: #{job}" }

Note that job is the String returned from rpoplpush; Sidekiq does not do any processing on it today, because that processing could have been what triggered a process crash, e.g. JSON.parse(job). Should we take the risk of parsing it so we can log something like #{job.class} #{job.jid}?
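One possible middle ground, purely as a sketch and not something Sidekiq Pro does today: parse defensively and fall back to logging the raw string if parsing blows up, so the recovery path itself can never crash on a bad payload:

begin
  parsed = JSON.parse(job)
  Sidekiq.logger.warn { "SuperFetch recovered job: #{parsed["class"]} #{parsed["jid"]}" }
rescue StandardError
  # Payload could not be parsed; log it verbatim rather than risk the recovery path.
  Sidekiq.logger.warn { "SuperFetch recovered job: #{job}" }
end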

@buddhistpirate
Author

The basic logging in 5.0.1 is a good step; another reason for us to do the upgrade. I don't know that the additional parsing would give us much more, since we can probably parse the logged string downstream.

However, it still falls one step short of letting us automate the problem away. The logging in 5.0.1 lets us set up alerts through our log provider, which is helpful. If we could supply a block, or patch a method, to get access to that job string, we could automate in various ways, which would cut down on our toil big time. Other examples of things we could do: send a metric to our metrics system instead of our logger for more real-time alerts, or post a message to Slack.

Just yesterday, for instance, we had a job that would balloon because too much data was being pulled out of the database. This particular job is around syncing, so when the data wasn't successfully synced to the downstream, the job was re-enqueued again. This persisted until we were able to modify the underlying data causing the issue. The additional logging in 5.0.1 would have made it easier to identify the JID and database record causing the issue, but we would still need to take the manual step of adding it to some kind of filter/deny list, possibly each time it got enqueued.

@mperham
Collaborator

mperham commented Jun 16, 2020

Sidekiq Pro already fires a Statsd metric whenever it recovers a job via super_fetch:

https://github.com/mperham/sidekiq/wiki/Pro-Metrics#commercial-metrics

I just modified it to include the queue name as a tag so you can slice the metrics that way.

@mperham
Collaborator

mperham commented Jun 16, 2020

I suspect a lot of this issue boils down to "please upgrade" to pick up the polish I've added over the last 1-2 years.

@buddhistpirate
Author

Thanks, I think having the queue tag on those metrics will help us alert on and track patterns of which queues cause the issue.

However, I still don't see a way to automate dealing with these poison pills. Being able to redirect the jobs to a different queue, or to intercept the job information in code, would unlock that automation for us.

@mperham
Collaborator

mperham commented Jun 16, 2020

What about something like this?

config.super_fetch! do |jobstr|
  job = JSON.parse(jobstr)
  Slack.message("Yo, #{job["class"]} #{job["jid"]} just rescued")
end

@buddhistpirate
Author

I think that would be perfect.
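For example, building on that proposed block (and on the poison_jids deny-list convention from our filtering middleware above, which is our own and not part of Sidekiq), we could flag recovered jobs automatically so the filter refuses to run them the next time they appear:

config.super_fetch! do |jobstr|
  job = JSON.parse(jobstr)
  # Flag the recovered job so our PoisonPillFilter middleware skips it,
  # then let a human decide whether it is safe to re-run.
  Sidekiq.redis { |conn| conn.sadd("poison_jids", job["jid"]) }
  Slack.message("#{job["class"]} #{job["jid"]} was recovered by SuperFetch")
end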

@mperham
Collaborator

mperham commented Jun 17, 2020

Ok, this will be part of #4602.

mperham closed this as completed Jun 17, 2020
@lfv89

lfv89 commented Jun 21, 2021

hey @mperham,

When I try to pass a block to super_fetch!, even though the "SuperFetch activated" log line is shown, my job is not recovered. If I don't pass the block, everything works fine. In the example below, not only is hey never shown, but the job is also never recovered... Any ideas?

config.super_fetch! do |_jobstr|
  puts('>>> hey')
end

@mperham
Collaborator

mperham commented Jun 21, 2021

@lfv89 please open your own issue so we can debug the problem.
