:until_executed jobs get stuck every now and then #379

Closed · blarralde opened this issue Mar 4, 2019 · 25 comments

@blarralde

Describe the bug
New unique jobs can't be added to the queue despite the last one being complete and no other similar job being in the queue. This seems to be related to the unique keys not being properly cleared.

The jobs in question are scheduled via cron and run every minute. The issue happens at random: sometimes occurrences are a week apart, sometimes it happens multiple times in a day.

Our workaround is to wipe all unique keys when this happens, roughly as sketched below.
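
A minimal sketch of that workaround, run from a console (it relies on the same SidekiqUniqueJobs::Digests API used in the death handler below):

SidekiqUniqueJobs::Digests.all.each do |digest|
  SidekiqUniqueJobs::Digests.del(digest: digest)  # drop every unique key
end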

Worker class

  sidekiq_options queue: :critical, retry: 0, unique: :until_executed

Sidekiq conf

Sidekiq.configure_server do |config|
  config.failures_default_mode = :exhausted
  config.redis = REDIS_MAIN_CONFIG.merge({ size: sidekiq_redis_size, namespace: "sidekiq:hacksterio", network_timeout: 8 })

  config.server_middleware do |chain|
    chain.add Sidekiq::Status::ServerMiddleware, expiration: 1.minute
  end
  config.client_middleware do |chain|
    chain.add Sidekiq::Status::ClientMiddleware
  end

  config.death_handlers << ->(job, _ex) do
    SidekiqUniqueJobs::Digests.del(digest: job['unique_digest']) if job['unique_digest']
  end

  if defined?(ActiveRecord::Base)
    Rails.logger.debug("Setting custom connection pool size of #{sidekiq_pool} for Sidekiq Server")
    config = ActiveRecord::Base.configurations[Rails.env] ||
                Rails.application.config.database_configuration[Rails.env]
    config['reaping_frequency'] = (ENV['DB_REAP_FREQ'] || 10).to_i  # seconds
    config['pool']              = sidekiq_pool + 2  # buffer
    ActiveRecord::Base.establish_connection(config)

    Rails.logger.info("Connection pool size for Sidekiq Server is now: #{ActiveRecord::Base.connection.pool.instance_variable_get('@size')}")
  end
end
@mhenrixon (Owner)

Could it be because of restarts or removal of queues? Some people reported that...

It would be helpful if you could share which versions you are on.

@blarralde (Author)

Workers are scaled depending on load, so yes, this would be possible.

gem 'sidekiq', '5.2.3'
gem 'sidekiq-cron', '0.6.3'
gem 'sidekiq-failures', '1.0.0'
gem 'sidekiq-unique-jobs', '6.0.11'
gem 'sidekiq-status', '0.7.0'

@mhenrixon (Owner)

That is a tough nut to crack @blarralde. I'm not sure I can do much about that. It would be helpful to see what is in your logs around the restart. If the worker is killed right before unlock, it would leave a key behind.

Also, bump the gem one or two more versions; I solved some duplicate-key problems in the last release. Not sure if it is the one you have there.

@mhenrixon (Owner)

You have the correct version there; it should be as solid as I have been able to make it for now.

@blarralde (Author)

Thanks @mhenrixon, I'll try to bump the version...

Shouldn't this normally take care of deleting keys when workers are shut down?

  config.death_handlers << ->(job, _ex) do
    SidekiqUniqueJobs::Digests.del(digest: job['unique_digest']) if job['unique_digest']
  end

@mhenrixon (Owner)

The death handlers are for jobs that run out of retries (in newer versions of Sidekiq they also fire for jobs with retry: 0 or retry: false). Sidekiq should put the job back on the queue if it crashes, but unfortunately in some situations, like when restarting a worker, it can't, for example when the process ultimately gets a SIGKILL (kill -9).

I am sure there are still things I need to improve in the gem. Version 6 really tries hard to prevent duplicates. Previous versions of the gem, unfortunately, did not prevent duplicates as expected. I made the gem super strict for v6 on purpose. I am still getting to some problems like the one you are experiencing.

A couple of things I could do:

  1. One thing that might help would be to present you with a list of keys that don't have a matching job. That way you could rest assured that you are deleting keys with no matching job. We could even create a simple job/middleware to clean them up.
  2. Another approach that I attempted before but discarded for the time being was to allow keys to be garbage collected. I'm still not sure how that would look exactly, but maybe it would work in combination with the first item: when you try to lock, we could check that there are no jobs with that key and claim it. The problem I see with this approach is that it might make a lock like until_executed useless; if the job was popped off the queue to be worked on, then we would unlock invalidly.
  3. Maybe we could use some conflict resolution here as well (see the sketch after this list). The strategies are still rough around the edges and not all of them are compatible with all lock types.
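
For the third item, a conflict strategy is set per worker. Something like this (a sketch only; MyUniqueWorker is a placeholder and the exact option names and available strategies depend on your version):

class MyUniqueWorker
  include Sidekiq::Worker

  # Sketch: on_conflict decides what happens when a duplicate tries to take
  # the lock. :reschedule is illustrative; check the README for the strategies
  # that are compatible with your lock type.
  sidekiq_options queue: :critical,
                  retry: 0,
                  unique: :until_executed,
                  on_conflict: :reschedule

  def perform(*args)
    # ...
  end
end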

@blarralde (Author)

Here's a quick and dirty garbage collector, let me know what you think:

# Deletes unique digests that no longer have a matching job anywhere in
# Sidekiq (busy, retry set, scheduled set or any named queue).
module SidekiqUniqueGarbageCollector
  extend self

  def clean_unused
    unused_digests.each do |digest|
      SidekiqUniqueJobs::Digests.del digest: digest
    end
  end

  private
    # Digests referenced by a job that is currently running or waiting somewhere.
    def all_active_digests
      busy_digests + queued_digests
    end

    # Every digest the gem knows about. SIDEKIQ_NAMESPACE is our app-specific
    # redis-namespace prefix, stripped so the keys compare cleanly.
    def all_digests
      SidekiqUniqueJobs::Digests.all.map { |digest| digest.gsub(SIDEKIQ_NAMESPACE, '') }
    end

    def all_queues
      [
        Sidekiq::RetrySet.new,
        Sidekiq::ScheduledSet.new,
      ] + named_queues
    end

    # Digests of jobs that are currently being worked on.
    def busy_digests
      Sidekiq::Workers.new.map { |_process, _thread, msg| Sidekiq::Job.new(msg['payload'])['unique_digest'] }.compact
    end

    def named_queues
      queue_names.map { |name| Sidekiq::Queue.new(name) }
    end

    # Digests of jobs sitting in the retry set, scheduled set or any queue.
    def queued_digests
      all_queues.flat_map { |queue| queue.map { |job| job['unique_digest'] } }.compact
    end

    def queue_names
      Sidekiq::Queue.all.map(&:name)
    end

    def unused_digests
      all_digests - all_active_digests
    end
end

@ccleung commented Mar 8, 2019

Would adding a lock_expiration help in the interim? I think we may be experiencing the same issue on v6.0.12; the jobs only get stuck intermittently though.
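
Concretely, something like this is what I had in mind (a sketch; assuming the v6 lock_expiration option, in seconds, is the right knob):

class MyFrequentWorker
  include Sidekiq::Worker

  # Sketch: if the unlock is never reached (crash, SIGKILL, ...), the lock
  # at least expires on its own after the TTL instead of sticking around forever.
  sidekiq_options unique: :until_executed, lock_expiration: 10.minutes.to_i

  def perform
    # ...
  end
end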

@mhenrixon (Owner)

@ccleung it will help in some cases but not all. The problem is more to do with never reaching the unlock for a job.

It could be that the job is deleted but the unlock never happens, or, like I said before, Sidekiq crashes (worker restarts) before the unlock is done.


@mhenrixon (Owner)

@blarralde that looks really promising! I didn't even know I could find the busy jobs/digests like that.

The only problem is that it would have to be in Lua, I think. Right now, when you run this code, something might change from one millisecond to the next. That isn't the case with Lua; a Lua script is like a database transaction lock. During the script execution, nothing can be changed by another process.
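
To illustrate, roughly the shape a Lua version of the check could take (a rough sketch with made-up key names, not the gem's actual scripts); the whole check-then-delete runs as one atomic script:

require "redis"

# Sketch only: scan one queue for the digest and delete the digest key
# if nothing references it. Redis runs the script atomically, so no other
# client can enqueue or lock between the check and the delete.
ORPHAN_CHECK = <<~LUA
  -- KEYS[1] = a queue list, KEYS[2] = the digest key
  -- ARGV[1] = the digest value to look for inside queued payloads
  local jobs = redis.call("LRANGE", KEYS[1], 0, -1)
  for _, payload in ipairs(jobs) do
    if string.find(payload, ARGV[1], 1, true) then
      return 0 -- a job still references this digest, keep the lock
    end
  end
  return redis.call("DEL", KEYS[2]) -- nothing references it, safe to drop
LUA

Redis.new.eval(ORPHAN_CHECK,
               keys: ["queue:critical", "uniquejobs:some-digest"],
               argv: ["uniquejobs:some-digest"])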

@ccleung commented Mar 15, 2019

@mhenrixon - just to double check - would implementing something like this help with the issue we have here? https://github.com/mhenrixon/sidekiq-unique-jobs#cleanup-dead-locks

On another note - apparently one of my colleagues downgraded to 6.0.8 and the problem hasn't manifested for their project anymore - could this have been an issue introduced between that version and the latest?

@mhenrixon (Owner) commented Mar 15, 2019 via email

@matiasgarcia
@mhenrixon any news on this? We have just run into this issue in our project.

@maleksiuk
We stopped using sidekiq-unique-jobs for the particular job that we were having problems with and are now using redlock-rb, which is behaving well and provides a lock expiration in case things go wrong. This is a job that runs frequently (once every couple of seconds). It re-enqueues itself, and we use a Sidekiq cron gem to enqueue it every minute in case it happens to stop; a rough sketch follows below.

As far as I can tell, the many other jobs that use sidekiq-unique-jobs have not had a problem. They don't run at this high frequency. Also, I don't think our problem was connected to restarts or removal of queues. We don't remove queues, and I didn't see restarts happening around the time that the job stopped executing.
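
Roughly what that setup looks like (a sketch, not our exact code; FrequentWorker and do_the_work are placeholders):

require "redlock"

class FrequentWorker
  include Sidekiq::Worker

  LOCK_MANAGER = Redlock::Client.new([ENV.fetch("REDIS_URL", "redis://localhost:6379")])

  def perform
    # TTL is in milliseconds; if the process dies mid-job the lock simply
    # expires instead of blocking future runs forever.
    LOCK_MANAGER.lock("locks:frequent_worker", 30_000) do |locked|
      next unless locked  # another run already holds the lock

      do_the_work
      self.class.perform_async  # re-enqueue itself, as described above
    end
  end
end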

@mhenrixon (Owner)

@maleksiuk maybe the performance is the problem for you? I'll see if I can add some performance tests/benchmarks for the locks.

I am also planning on rewriting the internals after adding some better coverage. I can replicate the problem in a console but I cannot for the life of me replicate the problem with a test :(

In your case it could be something with sidekiq-cron causing problems as well.

For the rest of you sidekiq/sidekiq#4141 will most certainly help with the odd issue. The gem still has some weird problem with unlocking in certain conditions. I am really digging into this right now.

@simonoff (Contributor)

> @maleksiuk maybe the performance is the problem for you? I'll see if I can add some performance tests/benchmarks for the locks.
>
> I am also planning on rewriting the internals after adding some better coverage. I can replicate the problem in a console but I cannot for the life of me replicate the problem with a test :(
>
> In your case it could be something with sidekiq-cron causing problems as well.
>
> For the rest of you mperham/sidekiq#4141 will most certainly help with the odd issue. The gem still has some weird problem with unlocking in certain conditions. I am really digging into this right now.

Hello @mhenrixon, I know a very easy way to reproduce it in production: put Redis on another host and periodically cause disconnections between them.

@mhenrixon (Owner)

@simonoff do you mean you have a multi-server Redis setup?

@simonoff (Contributor)

> @simonoff do you mean you have a multi-server Redis setup?

No, we have a multi-server worker deployment, but only one Redis server.

@soma commented May 3, 2019

Don't know if this will help you guys find a fix, but I can +1 this issue. We just found a situation where an app running on Heroku got a maintenance restart (initiated by Heroku/the platform), failed to shut down on SIGTERM, and was then sent SIGKILL.

For some reason, even after receiving the SIGTERM, Sidekiq still had jobs in the queue and continued to schedule jobs, and it seems this ended up with unique-job digests being left in limbo. Since these are jobs we run frequently without parameters, processing stopped completely for those tasks, but no info was available in the Sidekiq admin UI.

Running the code above to clean out unused digests fixed the immediate problem for us, but we now have to consider how to handle this going forward, because job runners getting killed with SIGKILL is most likely a common situation for us; something like the sketch below is what we are considering.
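
A rough idea, scheduling the cleanup module from earlier in this thread with sidekiq-cron (UniqueDigestCleanupWorker is a placeholder name):

class UniqueDigestCleanupWorker
  include Sidekiq::Worker

  def perform
    SidekiqUniqueGarbageCollector.clean_unused  # the module posted above
  end
end

# e.g. in an initializer
Sidekiq::Cron::Job.create(
  name:  "Clean orphaned unique digests - every 10 minutes",
  cron:  "*/10 * * * *",
  class: "UniqueDigestCleanupWorker"
)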

@mhenrixon (Owner)

@soma @simonoff @ccleung @blarralde I am working on fixing it once and for all. It is turning out to be quite an endeavor, but I am getting there ever so slowly.

@ccleung commented May 10, 2019

Just an FYI: we were able to reproduce the issue on 6.0.8.

@mhenrixon (Owner)

I am going to close this issue. It has been fixed on master (version 7) so that no locks will stay locked forever.

I'd like fresh issues opened in the future so I am better able to track and fix things on the new version. I hope you guys don't mind!

@KevinColemanInc commented Nov 29, 2019

I just updated to version 6.0.18 and I am still experiencing this issue where locks are not removed. They seem to be expiring on their own, but not necessarily when the job finishes. This has the effect of "this job can only be run every X seconds".


@mhenrixon (Owner)

@KevinColemanInc is your problem also related to cron? If not could you open a new issue with all the details regarding your specific case?

@dbackeus
> I am going to close this issue. It has been fixed on master (version 7) so that no locks will stay locked forever.

@mhenrixon Could you post a reference to the commit / PR that fixed this issue?
