:until_executed jobs get stuck every now and then #379

Closed · blarralde opened this issue Mar 4, 2019 · 25 comments

@blarralde

Describe the bug
New unique jobs can't be added to the queue despite the last one being complete and no other similar job being in the queue. This seems to be related to the unique keys not being properly cleared.

The jobs in question are scheduled via cron and run every minute. The issue happens at random: sometimes occurrences are a week apart, sometimes it happens multiple times in a day.

Our workaround is to wipe all unique keys when this happens, roughly as sketched below.
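
A minimal sketch of that workaround, run from a console (it relies on the same SidekiqUniqueJobs::Digests API used in the death handler below):

SidekiqUniqueJobs::Digests.all.each do |digest|
  SidekiqUniqueJobs::Digests.del(digest: digest)  # drop every unique key
end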

Worker class

  sidekiq_options queue: :critical, retry: 0, unique: :until_executed

Sidekiq conf

Sidekiq.configure_server do |config|
  config.failures_default_mode = :exhausted
  config.redis = REDIS_MAIN_CONFIG.merge({ size: sidekiq_redis_size, namespace: "sidekiq:hacksterio", network_timeout: 8 })

  config.server_middleware do |chain|
    chain.add Sidekiq::Status::ServerMiddleware, expiration: 1.minute
  end
  config.client_middleware do |chain|
    chain.add Sidekiq::Status::ClientMiddleware
  end

  config.death_handlers << ->(job, _ex) do
    SidekiqUniqueJobs::Digests.del(digest: job['unique_digest']) if job['unique_digest']
  end

  if defined?(ActiveRecord::Base)
    Rails.logger.debug("Setting custom connection pool size of #{sidekiq_pool} for Sidekiq Server")
    config = ActiveRecord::Base.configurations[Rails.env] ||
                Rails.application.config.database_configuration[Rails.env]
    config['reaping_frequency'] = (ENV['DB_REAP_FREQ'] || 10).to_i  # seconds
    config['pool']              = sidekiq_pool + 2  # buffer
    ActiveRecord::Base.establish_connection(config)

    Rails.logger.info("Connection pool size for Sidekiq Server is now: #{ActiveRecord::Base.connection.pool.instance_variable_get('@size')}")
  end
end
@mhenrixon (Owner)

Could it be because of restarts or removal of queues? Some people reported that...

It would be helpful if you could share which versions you are on.

@blarralde (Author)

Workers are scaled depending on load, so yes, this would be possible.

gem 'sidekiq', '5.2.3'
gem 'sidekiq-cron', '0.6.3'
gem 'sidekiq-failures', '1.0.0'
gem 'sidekiq-unique-jobs', '6.0.11'
gem 'sidekiq-status', '0.7.0'

@mhenrixon (Owner)

That is a tough nut to crack @blarralde. I'm not sure I can do much about that. It would be helpful to see what is in your logs around the restart. If the worker is killed right before unlock, it would leave a key behind.

Also, bump the gem one or two more versions; I solved some duplicate-key problems in the last release. Not sure if it is the one you have there.

@mhenrixon (Owner)

You have the correct version there; it should be as solid as I have been able to make it for now.

@blarralde (Author)

Thanks @mhenrixon, I'll try to bump the version...

Shouldn't this normally take care of deleting keys when workers are shut down?

  config.death_handlers << ->(job, _ex) do
    SidekiqUniqueJobs::Digests.del(digest: job['unique_digest']) if job['unique_digest']
  end

@mhenrixon (Owner)

The death handlers are for jobs that run out of retries (in newer versions of Sidekiq they also fire for jobs with retry: 0 or retry: false). Sidekiq should put the job back on the queue if it crashes, but unfortunately in some situations, like when restarting a worker, it can't, for example when the process ultimately gets a SIGKILL (kill -9).

I am sure there are still things I need to improve in the gem. Version 6 really tries hard to prevent duplicates. Previous versions of the gem, unfortunately, did not prevent duplicates as expected. I made the gem super strict for v6 on purpose. I am still getting to some problems like the one you are experiencing.

A couple of things I could do:

  1. One thing that might help would be to present you with a list of keys that don't have a matching job. That way you could rest assured that you are deleting keys with no matching job. We could even create a simple job/middleware to clean them up.
  2. Another approach that I attempted before but discarded for the time being was to allow keys to be garbage collected. I'm still not sure how that would look exactly, but maybe it would work in combination with the first item: when you try to lock, we could check that there are no jobs with that key and claim it. The problem I see with this approach is that it might make a lock like until_executed useless; if the job was popped off the queue to be worked on, then we would unlock invalidly.
  3. Maybe we could use some conflict resolution here as well (see the sketch after this list). The strategies are still rough around the edges and not all of them are compatible with all lock types.
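
For the third item, a conflict strategy is set per worker. Something like this (a sketch only; MyUniqueWorker is a placeholder and the exact option names and available strategies depend on your version):

class MyUniqueWorker
  include Sidekiq::Worker

  # Sketch: on_conflict decides what happens when a duplicate tries to take
  # the lock. :reschedule is illustrative; check the README for the strategies
  # that are compatible with your lock type.
  sidekiq_options queue: :critical,
                  retry: 0,
                  unique: :until_executed,
                  on_conflict: :reschedule

  def perform(*args)
    # ...
  end
end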

@blarralde (Author)

Here's a quick and dirty garbage collector, let me know what you think:

# Deletes unique digests that no longer have a matching job anywhere in
# Sidekiq (busy, retry set, scheduled set or any named queue).
module SidekiqUniqueGarbageCollector
  extend self

  def clean_unused
    unused_digests.each do |digest|
      SidekiqUniqueJobs::Digests.del digest: digest
    end
  end

  private
    # Digests referenced by a job that is currently running or waiting somewhere.
    def all_active_digests
      busy_digests + queued_digests
    end

    # Every digest the gem knows about. SIDEKIQ_NAMESPACE is our app-specific
    # redis-namespace prefix, stripped so the keys compare cleanly.
    def all_digests
      SidekiqUniqueJobs::Digests.all.map { |digest| digest.gsub(SIDEKIQ_NAMESPACE, '') }
    end

    def all_queues
      [
        Sidekiq::RetrySet.new,
        Sidekiq::ScheduledSet.new,
      ] + named_queues
    end

    # Digests of jobs that are currently being worked on.
    def busy_digests
      Sidekiq::Workers.new.map { |_process, _thread, msg| Sidekiq::Job.new(msg['payload'])['unique_digest'] }.compact
    end

    def named_queues
      queue_names.map { |name| Sidekiq::Queue.new(name) }
    end

    # Digests of jobs sitting in the retry set, scheduled set or any queue.
    def queued_digests
      all_queues.flat_map { |queue| queue.map { |job| job['unique_digest'] } }.compact
    end

    def queue_names
      Sidekiq::Queue.all.map(&:name)
    end

    def unused_digests
      all_digests - all_active_digests
    end
end

@ccleung commented Mar 8, 2019

Would adding a lock_expiration help in the interim? I think we may be experiencing the same issue on v6.0.12; the jobs only get stuck intermittently though.
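
Concretely, something like this is what I had in mind (a sketch; assuming the v6 lock_expiration option, in seconds, is the right knob):

class MyFrequentWorker
  include Sidekiq::Worker

  # Sketch: if the unlock is never reached (crash, SIGKILL, ...), the lock
  # at least expires on its own after the TTL instead of sticking around forever.
  sidekiq_options unique: :until_executed, lock_expiration: 10.minutes.to_i

  def perform
    # ...
  end
end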

@mhenrixon (Owner)

@ccleung it will help in some cases but not all. The problem is more to do with never reaching the unlock for a job.

It could be that the job is deleted but the unlock never happens, or, like I said before, Sidekiq crashes (worker restarts) before the unlock is done.


@mhenrixon (Owner)

@blarralde that looks really promising! I didn't even know I could find the busy jobs/digests like that.

The only problem is that it would have to be in Lua, I think. Right now, when you run this code, something might change from one millisecond to the next. That isn't the case with Lua; a Lua script is like a database transaction lock. During the script execution, nothing can be changed by another process.
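
To illustrate, roughly the shape a Lua version of the check could take (a rough sketch with made-up key names, not the gem's actual scripts); the whole check-then-delete runs as one atomic script:

require "redis"

# Sketch only: scan one queue for the digest and delete the digest key
# if nothing references it. Redis runs the script atomically, so no other
# client can enqueue or lock between the check and the delete.
ORPHAN_CHECK = <<~LUA
  -- KEYS[1] = a queue list, KEYS[2] = the digest key
  -- ARGV[1] = the digest value to look for inside queued payloads
  local jobs = redis.call("LRANGE", KEYS[1], 0, -1)
  for _, payload in ipairs(jobs) do
    if string.find(payload, ARGV[1], 1, true) then
      return 0 -- a job still references this digest, keep the lock
    end
  end
  return redis.call("DEL", KEYS[2]) -- nothing references it, safe to drop
LUA

Redis.new.eval(ORPHAN_CHECK,
               keys: ["queue:critical", "uniquejobs:some-digest"],
               argv: ["uniquejobs:some-digest"])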

@ccleung commented Mar 15, 2019

@mhenrixon - just to double check - would implementing something like this help with the issue we have here? https://github.com/mhenrixon/sidekiq-unique-jobs#cleanup-dead-locks

On another note - apparently one of my colleagues downgraded to 6.0.8 and the problem hasn't manifested for their project anymore - could this have been an issue introduced between that version and the latest?

@mhenrixon (Owner) commented Mar 15, 2019 via email

@matiasgarcia
@mhenrixon any news on this? We have just run into this issue in our project.

@maleksiuk
We stopped using sidekiq-unique-jobs for the particular job that we were having problems with and are now using redlock-rb, which is behaving well and provides a lock expiration in case things go wrong. This is a job that runs frequently (once every couple of seconds). It re-enqueues itself, and we use a Sidekiq cron gem to enqueue it every minute in case it happens to stop; a rough sketch follows below.

As far as I can tell, the many other jobs that use sidekiq-unique-jobs have not had a problem. They don't run at this high frequency. Also, I don't think our problem was connected to restarts or removal of queues. We don't remove queues, and I didn't see restarts happening around the time that the job stopped executing.
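
Roughly what that setup looks like (a sketch, not our exact code; FrequentWorker and do_the_work are placeholders):

require "redlock"

class FrequentWorker
  include Sidekiq::Worker

  LOCK_MANAGER = Redlock::Client.new([ENV.fetch("REDIS_URL", "redis://localhost:6379")])

  def perform
    # TTL is in milliseconds; if the process dies mid-job the lock simply
    # expires instead of blocking future runs forever.
    LOCK_MANAGER.lock("locks:frequent_worker", 30_000) do |locked|
      next unless locked  # another run already holds the lock

      do_the_work
      self.class.perform_async  # re-enqueue itself, as described above
    end
  end
end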

@mhenrixon (Owner)

@maleksiuk maybe the performance is the problem for you? I'll see if I can add some performance tests/benchmarks for the locks.

I am also planning on rewriting the internals after adding some better coverage. I can replicate the problem in a console but I cannot for the life of me replicate the problem with a test :(

In your case it could be something with sidekiq-cron causing problems as well.

For the rest of you sidekiq/sidekiq#4141 will most certainly help with the odd issue. The gem still has some weird problem with unlocking in certain conditions. I am really digging into this right now.

@simonoff (Contributor)

> @maleksiuk maybe the performance is the problem for you? I'll see if I can add some performance tests/benchmarks for the locks.
>
> I am also planning on rewriting the internals after adding some better coverage. I can replicate the problem in a console but I cannot for the life of me replicate the problem with a test :(
>
> In your case it could be something with sidekiq-cron causing problems as well.
>
> For the rest of you mperham/sidekiq#4141 will most certainly help with the odd issue. The gem still has some weird problem with unlocking in certain conditions. I am really digging into this right now.

Hello @mhenrixon, I know a very easy way to reproduce it in production: put Redis on another host and periodically cause disconnections between them.

@mhenrixon (Owner)

@simonoff do you mean you have a multi-server Redis setup?

@simonoff (Contributor)

> @simonoff do you mean you have a multi-server Redis setup?

No, we have a multi-server worker deployment, but only one Redis server.

@soma commented May 3, 2019

Don't know if this will help you guys find a fix, but I can +1 this issue. We just found a situation where an app running on Heroku got a maintenance restart (initiated by Heroku/the platform), failed to shut down on SIGTERM, and was then sent SIGKILL.

For some reason, even after receiving the SIGTERM, Sidekiq still had jobs in the queue and continued to schedule jobs, and it seems this ended up with unique-job digests being left in limbo. Since these are jobs we run frequently without parameters, processing stopped completely for those tasks, but no info was available in the Sidekiq admin UI.

Running the code above to clean out unused digests fixed the immediate problem for us, but we now have to consider how to handle this going forward, because job runners getting killed with SIGKILL is most likely a common situation for us; something like the sketch below is what we are considering.
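
A rough idea, scheduling the cleanup module from earlier in this thread with sidekiq-cron (UniqueDigestCleanupWorker is a placeholder name):

class UniqueDigestCleanupWorker
  include Sidekiq::Worker

  def perform
    SidekiqUniqueGarbageCollector.clean_unused  # the module posted above
  end
end

# e.g. in an initializer
Sidekiq::Cron::Job.create(
  name:  "Clean orphaned unique digests - every 10 minutes",
  cron:  "*/10 * * * *",
  class: "UniqueDigestCleanupWorker"
)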

@mhenrixon (Owner)

@soma @simonoff @ccleung @blarralde I am working on fixing it once and for all. It is turning out to be quite an endeavor, but I am getting there ever so slowly.

@ccleung commented May 10, 2019

Just an FYI: we were able to reproduce the issue on 6.0.8.

@mhenrixon (Owner)

I am going to close this issue. It has been fixed on master (version 7) so that no locks will stay locked forever.

I'd like fresh issues opened in the future so I am better able to track and fix things on the new version. I hope you guys don't mind!

@KevinColemanInc commented Nov 29, 2019

I just updated to version 6.0.18 and I am still experiencing this issue where locks are not removed. They seem to be expiring on their own, but not necessarily when the job finishes. This has the effect of "this job can only be run every X seconds".


@mhenrixon (Owner)

@KevinColemanInc is your problem also related to cron? If not could you open a new issue with all the details regarding your specific case?

@dbackeus
> I am going to close this issue. It has been fixed on master (version 7) so that no locks will stay locked forever.

@mhenrixon Could you post a reference to the commit / PR that fixed this issue?
