Race condition in ruby reaper #559

Closed
courtneymiller2010 opened this issue Jan 7, 2021 · 9 comments · Fixed by #563

@courtneymiller2010

courtneymiller2010 commented Jan 7, 2021

Describe the bug
There is a possible race condition in the ruby reaper: a worker starts a job, the reaper runs right after, and it reaps the lock of the active job.

Expected behavior
Active workers shouldn't be reaped.

Current behavior
Active workers are reaped incorrectly.

Worker class

module Debugging
  class ProcessWorker
    include Sidekiq::Worker
    sidekiq_options queue: "default", retry: 0, lock: :until_executed, on_conflict: :log

    def perform
      puts "Starting ProcessWorker"
      sleep(5)
      puts "Finishing ProcessWorker"
    end
  end
end

Config

SidekiqUniqueJobs.configure do |config|
  config.reaper          = :ruby
  config.reaper_interval = 1
end
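
For completeness, the repro just enqueues the worker and lets the reaper fire during the 5-second sleep (assumed driver, not quoted from the original report):

# With reaper_interval = 1, the reaper is all but guaranteed to run
# while the job is still sleeping.
Debugging::ProcessWorker.perform_async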

Additional context

I've added some logging to outline where the problem occurs.

log_info("****** valid #{valid}") and log_info("****** workers #{workers}") were added between lines 129 and 131 in RubyReaper#active?
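
For context, here is a self-contained sketch of what that check does (paraphrased, not the gem's exact code): valid is whether the Sidekiq process key still exists, and workers is the hash of jobs that process is currently running. The race is that right after a job starts, workers can still be {}.

require "sidekiq"

# Hedged sketch of the active-digest check being instrumented above.
def digest_active?(digest)
  Sidekiq.redis do |conn|
    conn.sscan_each("processes").any? do |key|
      valid   = conn.exists?(key)              # process still alive?
      workers = conn.hgetall("#{key}:workers") # jobs it is running
      puts "****** valid #{valid}"
      puts "****** workers #{workers}"
      valid && workers.any? { |_tid, payload| payload.include?(digest) }
    end
  end
end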

I added logging in RubyReaper#belongs_to_job?

def belongs_to_job?(digest)
  scheduled = scheduled?(digest)
  retried   = retried?(digest)
  enqueued  = enqueued?(digest)
  active    = active?(digest)

  log_info("digest #{digest} scheduled #{scheduled}")
  log_info("digest #{digest} retried #{retried}")
  log_info("digest #{digest} enqueued #{enqueued}")
  log_info("digest #{digest} active #{active}")

  scheduled || retried || enqueued || active
end

digests.each { |digest| log_info("digest #{digest} to be deleted") } was added after line 74 in BatchDelete#call

Here's my log output:

2021-01-07T18:22:41.016Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: Nothing to delete; exiting.
2021-01-07T18:22:41.030Z 14493 TID-owkkfzqzt Debugging::ProcessWorker JID-7be382796c87374b89c0cefa INFO: start
2021-01-07T18:22:41.335Z 14493 TID-owkkfzqzt Debugging::ProcessWorker JID-7be382796c87374b89c0cefa uniquejobs-server DIG-uniquejobs:512613aedf8d8099a473a343e0bc352c INFO: Starting ProcessWorker
2021-01-07T18:22:42.053Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: ****** valid true
2021-01-07T18:22:42.053Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: ****** workers {}
2021-01-07T18:22:42.053Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: digest uniquejobs:512613aedf8d8099a473a343e0bc352c scheduled false
2021-01-07T18:22:42.053Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: digest uniquejobs:512613aedf8d8099a473a343e0bc352c retried false
2021-01-07T18:22:42.053Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: digest uniquejobs:512613aedf8d8099a473a343e0bc352c enqueued false
2021-01-07T18:22:42.053Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: digest uniquejobs:512613aedf8d8099a473a343e0bc352c active false
2021-01-07T18:22:42.053Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: Deleting batch with 1 digests
2021-01-07T18:22:42.053Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: digest uniquejobs:512613aedf8d8099a473a343e0bc352c to be deleted
2021-01-07T18:22:42.056Z 14493 TID-owkkfzd4l INFO: (2021-01-07 13:22:42 -0500) Execution successfully returned 1
2021-01-07T18:22:43.063Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: Nothing to delete; exiting.
2021-01-07T18:22:44.068Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: Nothing to delete; exiting.
2021-01-07T18:22:45.077Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: Nothing to delete; exiting.
2021-01-07T18:22:46.088Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: Nothing to delete; exiting.
2021-01-07T18:22:46.336Z 14493 TID-owkkfzqzt Debugging::ProcessWorker JID-7be382796c87374b89c0cefa uniquejobs-server DIG-uniquejobs:512613aedf8d8099a473a343e0bc352c INFO: Finishing ProcessWorker
2021-01-07T18:22:46.338Z 14493 TID-owkkfzqzt Debugging::ProcessWorker JID-7be382796c87374b89c0cefa uniquejobs-server DIG-uniquejobs:512613aedf8d8099a473a343e0bc352c WARN: might need to be unlocked manually
2021-01-07T18:22:46.346Z 14493 TID-owkkfzqzt Debugging::ProcessWorker JID-7be382796c87374b89c0cefa INFO: done: 5.316 sec
2021-01-07T18:22:47.098Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: Nothing to delete; exiting.

Based on the logs, sidekiq logs that it's starting ProcessWorker, then the first puts statement in ProcessWorker is logged, and then the reaper runs. When it checks for active jobs, there are no workers, even though ProcessWorker has clearly started. Thus, active? comes back false and the digest gets deleted.

I've exacerbated this problem by setting the reaper interval to 1 second, but I have had this occur in production with the interval set to the default of 6 minutes.

I've been able to replicate this issue about 50% of the time with the above settings/logging.

This issue doesn't seem to occur with the lua reaper. I'm currently testing the lua reaper in our staging environment to ensure it will work for us. Based on the documentation:

The :ruby job is much slower but the :lua job locks redis while executing. While doing intense processing it is best to avoid locking redis with a lua script.

I was hesitant to enable it because we do some pretty intense processing in sidekiq, and I wasn't sure whether the "intense processing" warning applied to what the reaper is doing or to the workload overall.

Is this expected behavior in the ruby reaper because it doesn't lock redis, and we should use lua instead? Or is this actually a bug in the ruby reaper?

@courtneymiller2010
Author

courtneymiller2010 commented Jan 7, 2021

Actually, upon testing some more with the lua reaper, I've been able to replicate it there too.

Config

SidekiqUniqueJobs.configure do |config|
  config.reaper          = :lua
  config.reaper_interval = 1
  config.debug_lua       = true
end

Log output

2021-01-07T19:11:01.845Z 16010 TID-ox317lqxq INFO: (2021-01-07 14:11:01 -0500) Execution successfully returned 0
2021-01-07T19:11:02.658Z 16010 TID-ox2yluxxm Debugging::ProcessWorker JID-d7cd16b6c72840a3749c64a4 INFO: start
2021-01-07T19:11:02.760Z 16010 TID-ox2yluxxm Debugging::ProcessWorker JID-d7cd16b6c72840a3749c64a4 uniquejobs-server DIG-uniquejobs:512613aedf8d8099a473a343e0bc352c INFO: Starting ProcessWorker
2021-01-07T19:11:02.854Z 16010 TID-ox317lqxq INFO: (2021-01-07 14:11:02 -0500) Execution successfully returned 1
2021-01-07T19:11:03.862Z 16010 TID-ox317lqxq INFO: (2021-01-07 14:11:03 -0500) Execution successfully returned 0
2021-01-07T19:11:04.874Z 16010 TID-ox317lqxq INFO: (2021-01-07 14:11:04 -0500) Execution successfully returned 0
2021-01-07T19:11:05.034Z 16010 TID-ox31cju4m Debugging::ProcessWorker JID-e80fdf33a8a232bcff90d794 INFO: start
2021-01-07T19:11:05.088Z 16010 TID-ox31cju4m Debugging::ProcessWorker JID-e80fdf33a8a232bcff90d794 uniquejobs-server DIG-uniquejobs:512613aedf8d8099a473a343e0bc352c INFO: Starting ProcessWorker
2021-01-07T19:11:05.883Z 16010 TID-ox317lqxq INFO: (2021-01-07 14:11:05 -0500) Execution successfully returned 0
2021-01-07T19:11:06.896Z 16010 TID-ox317lqxq INFO: (2021-01-07 14:11:06 -0500) Execution successfully returned 0
2021-01-07T19:11:07.761Z 16010 TID-ox2yluxxm Debugging::ProcessWorker JID-d7cd16b6c72840a3749c64a4 uniquejobs-server DIG-uniquejobs:512613aedf8d8099a473a343e0bc352c INFO: Finishing ProcessWorker
2021-01-07T19:11:07.765Z 16010 TID-ox2yluxxm Debugging::ProcessWorker JID-d7cd16b6c72840a3749c64a4 uniquejobs-server DIG-uniquejobs:512613aedf8d8099a473a343e0bc352c WARN: might need to be unlocked manually
2021-01-07T19:11:07.769Z 16010 TID-ox2yluxxm Debugging::ProcessWorker JID-d7cd16b6c72840a3749c64a4 INFO: done: 5.111 sec
2021-01-07T19:11:07.908Z 16010 TID-ox317lqxq INFO: (2021-01-07 14:11:07 -0500) Execution successfully returned 0
2021-01-07T19:11:08.916Z 16010 TID-ox317lqxq INFO: (2021-01-07 14:11:08 -0500) Execution successfully returned 0
2021-01-07T19:11:09.931Z 16010 TID-ox317lqxq INFO: (2021-01-07 14:11:09 -0500) Execution successfully returned 0
2021-01-07T19:11:10.092Z 16010 TID-ox31cju4m Debugging::ProcessWorker JID-e80fdf33a8a232bcff90d794 uniquejobs-server DIG-uniquejobs:512613aedf8d8099a473a343e0bc352c INFO: Finishing ProcessWorker
2021-01-07T19:11:10.100Z 16010 TID-ox31cju4m Debugging::ProcessWorker JID-e80fdf33a8a232bcff90d794 INFO: done: 5.066 sec
2021-01-07T19:11:10.941Z 16010 TID-ox317lqxq INFO: (2021-01-07 14:11:10 -0500) Execution successfully returned 0

Lua Debug Logs

1:M 07 Jan 19:11:01.867 # reap_orphans.lua - BEGIN
1:M 07 Jan 19:11:01.868 # reap_orphans.lua - Interating through: uniquejobs:digests for orphaned locks
1:M 07 Jan 19:11:01.868 # reap_orphans.lua - END
1:M 07 Jan 19:11:02.876 # reap_orphans.lua - BEGIN
1:M 07 Jan 19:11:02.876 # reap_orphans.lua - Interating through: uniquejobs:digests for orphaned locks
1:M 07 Jan 19:11:02.876 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in schedule
1:M 07 Jan 19:11:02.876 # reap_orphans.lua - searching in: schedule for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:02.876 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in retry
1:M 07 Jan 19:11:02.876 # reap_orphans.lua - searching in: retry for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:02.876 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in all queues
1:M 07 Jan 19:11:02.876 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:02.876 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:02.876 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:02.877 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:02.877 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:02.877 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:02.877 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:02.877 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:02.877 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:02.877 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:02.877 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:02.877 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:02.877 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:02.877 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in process sets
1:M 07 Jan 19:11:02.877 # reap_orphans.lua - Searching in process list for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:02.877 # reap_orphans.lua - Found number of processes: 1 next cursor: 0
1:M 07 Jan 19:11:02.877 # reap_orphans.lua - searching in process set: MacBook-Pro.local:16010:d427f88abdfd for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:02.877 # reap_orphans.lua - No entries in: MacBook-Pro.local:16010:d427f88abdfd:workers
1:M 07 Jan 19:11:02.877 # reap_orphans.lua - END
1:M 07 Jan 19:11:03.887 # reap_orphans.lua - BEGIN
1:M 07 Jan 19:11:03.887 # reap_orphans.lua - Interating through: uniquejobs:digests for orphaned locks
1:M 07 Jan 19:11:03.887 # reap_orphans.lua - END
1:M 07 Jan 19:11:04.901 # reap_orphans.lua - BEGIN
1:M 07 Jan 19:11:04.901 # reap_orphans.lua - Interating through: uniquejobs:digests for orphaned locks
1:M 07 Jan 19:11:04.901 # reap_orphans.lua - END
1:M 07 Jan 19:11:05.910 # reap_orphans.lua - BEGIN
1:M 07 Jan 19:11:05.910 # reap_orphans.lua - Interating through: uniquejobs:digests for orphaned locks
1:M 07 Jan 19:11:05.910 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in schedule
1:M 07 Jan 19:11:05.910 # reap_orphans.lua - searching in: schedule for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:05.910 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in retry
1:M 07 Jan 19:11:05.910 # reap_orphans.lua - searching in: retry for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:05.910 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in all queues
1:M 07 Jan 19:11:05.910 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:05.910 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:05.910 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:05.910 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:05.910 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:05.910 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:05.910 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:05.910 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:05.910 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:05.910 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:05.911 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:05.911 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:05.911 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:05.911 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in process sets
1:M 07 Jan 19:11:05.911 # reap_orphans.lua - Searching in process list for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:05.911 # reap_orphans.lua - Found number of processes: 1 next cursor: 0
1:M 07 Jan 19:11:05.911 # reap_orphans.lua - searching in process set: MacBook-Pro.local:16010:d427f88abdfd for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:05.911 # reap_orphans.lua - Found digest uniquejobs:512613aedf8d8099a473a343e0bc352c in: MacBook-Pro.local:16010:d427f88abdfd:workers
1:M 07 Jan 19:11:05.911 # reap_orphans.lua - END
1:M 07 Jan 19:11:06.923 # reap_orphans.lua - BEGIN
1:M 07 Jan 19:11:06.923 # reap_orphans.lua - Interating through: uniquejobs:digests for orphaned locks
1:M 07 Jan 19:11:06.923 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in schedule
1:M 07 Jan 19:11:06.923 # reap_orphans.lua - searching in: schedule for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:06.923 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in retry
1:M 07 Jan 19:11:06.923 # reap_orphans.lua - searching in: retry for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:06.923 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in all queues
1:M 07 Jan 19:11:06.923 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:06.923 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:06.923 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:06.924 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:06.924 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:06.924 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:06.924 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:06.924 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:06.924 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:06.924 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:06.924 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:06.924 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:06.924 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:06.924 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in process sets
1:M 07 Jan 19:11:06.924 # reap_orphans.lua - Searching in process list for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:06.924 # reap_orphans.lua - Found number of processes: 1 next cursor: 0
1:M 07 Jan 19:11:06.924 # reap_orphans.lua - searching in process set: MacBook-Pro.local:16010:d427f88abdfd for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:06.925 # reap_orphans.lua - Found digest uniquejobs:512613aedf8d8099a473a343e0bc352c in: MacBook-Pro.local:16010:d427f88abdfd:workers
1:M 07 Jan 19:11:06.925 # reap_orphans.lua - END
1:M 07 Jan 19:11:07.797 # unlock.lua - BEGIN unlock digest: uniquejobs:512613aedf8d8099a473a343e0bc352c (job_id: d7cd16b6c72840a3749c64a4)
1:M 07 Jan 19:11:07.797 # unlock.lua - HEXISTS uniquejobs:512613aedf8d8099a473a343e0bc352c:LOCKED d7cd16b6c72840a3749c64a4
1:M 07 Jan 19:11:07.798 # unlock.lua - ZADD uniquejobs:changelog 1610046667.7652 {"message":"Yielding to: uniquejobs:512613aedf8d8099a473a343e0bc352c:LOCKED (e80fdf33a8a232bcff90d794,)","job_id":"d7cd16b6c72840a3749c64a4","time":1610046667.7652,"script":"unlock.lua","digest":"uniquejobs:512613aedf8d8099a473a343e0bc352c"}
1:M 07 Jan 19:11:07.798 # unlock.lua - Removing 1 entries from changelog (total entries 1001 exceeds max_history: 1000)
1:M 07 Jan 19:11:07.798 # unlock.lua - PUBLISH uniquejobs:changelog {"message":"Yielding to: uniquejobs:512613aedf8d8099a473a343e0bc352c:LOCKED (e80fdf33a8a232bcff90d794,)","job_id":"d7cd16b6c72840a3749c64a4","time":1610046667.7652,"script":"unlock.lua","digest":"uniquejobs:512613aedf8d8099a473a343e0bc352c"}
1:M 07 Jan 19:11:07.798 # unlock.lua - Yielding to uniquejobs:512613aedf8d8099a473a343e0bc352c:LOCKED (e80fdf33a8a232bcff90d794,) uniquejobs:512613aedf8d8099a473a343e0bc352c:LOCKED by job d7cd16b6c72840a3749c64a4
1:M 07 Jan 19:11:07.937 # reap_orphans.lua - BEGIN
1:M 07 Jan 19:11:07.937 # reap_orphans.lua - Interating through: uniquejobs:digests for orphaned locks
1:M 07 Jan 19:11:07.937 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in schedule
1:M 07 Jan 19:11:07.937 # reap_orphans.lua - searching in: schedule for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:07.937 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in retry
1:M 07 Jan 19:11:07.937 # reap_orphans.lua - searching in: retry for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:07.937 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in all queues
1:M 07 Jan 19:11:07.937 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:07.937 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:07.937 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:07.937 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:07.937 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:07.938 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:07.938 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:07.938 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:07.938 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:07.938 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:07.938 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:07.938 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:07.938 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:07.938 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in process sets
1:M 07 Jan 19:11:07.938 # reap_orphans.lua - Searching in process list for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:07.938 # reap_orphans.lua - Found number of processes: 1 next cursor: 0
1:M 07 Jan 19:11:07.938 # reap_orphans.lua - searching in process set: MacBook-Pro.local:16010:d427f88abdfd for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:07.938 # reap_orphans.lua - Found digest uniquejobs:512613aedf8d8099a473a343e0bc352c in: MacBook-Pro.local:16010:d427f88abdfd:workers
1:M 07 Jan 19:11:07.938 # reap_orphans.lua - END
1:M 07 Jan 19:11:08.946 # reap_orphans.lua - BEGIN
1:M 07 Jan 19:11:08.946 # reap_orphans.lua - Interating through: uniquejobs:digests for orphaned locks
1:M 07 Jan 19:11:08.946 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in schedule
1:M 07 Jan 19:11:08.946 # reap_orphans.lua - searching in: schedule for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:08.946 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in retry
1:M 07 Jan 19:11:08.946 # reap_orphans.lua - searching in: retry for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:08.946 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in all queues
1:M 07 Jan 19:11:08.946 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:08.946 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:08.946 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:08.946 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:08.946 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:08.946 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:08.946 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:08.947 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:08.947 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:08.947 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:08.947 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:08.947 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:08.947 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:08.947 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in process sets
1:M 07 Jan 19:11:08.947 # reap_orphans.lua - Searching in process list for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:08.947 # reap_orphans.lua - Found number of processes: 1 next cursor: 0
1:M 07 Jan 19:11:08.947 # reap_orphans.lua - searching in process set: MacBook-Pro.local:16010:d427f88abdfd for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:08.947 # reap_orphans.lua - Found digest uniquejobs:512613aedf8d8099a473a343e0bc352c in: MacBook-Pro.local:16010:d427f88abdfd:workers
1:M 07 Jan 19:11:08.947 # reap_orphans.lua - END
1:M 07 Jan 19:11:09.961 # reap_orphans.lua - BEGIN
1:M 07 Jan 19:11:09.961 # reap_orphans.lua - Interating through: uniquejobs:digests for orphaned locks
1:M 07 Jan 19:11:09.961 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in schedule
1:M 07 Jan 19:11:09.961 # reap_orphans.lua - searching in: schedule for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:09.961 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in retry
1:M 07 Jan 19:11:09.961 # reap_orphans.lua - searching in: retry for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:09.962 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in all queues
1:M 07 Jan 19:11:09.962 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:09.962 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:09.962 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:09.962 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:09.962 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:09.962 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:09.962 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:09.962 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:09.962 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:09.963 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:09.963 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:09.963 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:09.963 # reap_orphans.lua - searching all queues for a matching digest: uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:09.963 # reap_orphans.lua - Searching for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c in process sets
1:M 07 Jan 19:11:09.963 # reap_orphans.lua - Searching in process list for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:09.963 # reap_orphans.lua - Found number of processes: 1 next cursor: 0
1:M 07 Jan 19:11:09.963 # reap_orphans.lua - searching in process set: MacBook-Pro.local:16010:d427f88abdfd for digest: uniquejobs:512613aedf8d8099a473a343e0bc352c cursor: 0
1:M 07 Jan 19:11:09.963 # reap_orphans.lua - Found digest uniquejobs:512613aedf8d8099a473a343e0bc352c in: MacBook-Pro.local:16010:d427f88abdfd:workers
1:M 07 Jan 19:11:09.963 # reap_orphans.lua - END
1:M 07 Jan 19:11:10.094 # unlock.lua - BEGIN unlock digest: uniquejobs:512613aedf8d8099a473a343e0bc352c (job_id: e80fdf33a8a232bcff90d794)
1:M 07 Jan 19:11:10.094 # unlock.lua - HEXISTS uniquejobs:512613aedf8d8099a473a343e0bc352c:LOCKED e80fdf33a8a232bcff90d794
1:M 07 Jan 19:11:10.094 # unlock.lua - LREM uniquejobs:512613aedf8d8099a473a343e0bc352c:QUEUED -1 e80fdf33a8a232bcff90d794
1:M 07 Jan 19:11:10.094 # unlock.lua - LREM uniquejobs:512613aedf8d8099a473a343e0bc352c:PRIMED -1 e80fdf33a8a232bcff90d794
1:M 07 Jan 19:11:10.094 # unlock.lua - ZREM uniquejobs:digests uniquejobs:512613aedf8d8099a473a343e0bc352c
1:M 07 Jan 19:11:10.094 # unlock.lua - UNLINK uniquejobs:512613aedf8d8099a473a343e0bc352c uniquejobs:512613aedf8d8099a473a343e0bc352c:INFO
1:M 07 Jan 19:11:10.095 # unlock.lua - HDEL uniquejobs:512613aedf8d8099a473a343e0bc352c:LOCKED e80fdf33a8a232bcff90d794
1:M 07 Jan 19:11:10.095 # unlock.lua - LPUSH uniquejobs:512613aedf8d8099a473a343e0bc352c:QUEUED 1
1:M 07 Jan 19:11:10.095 # unlock.lua - PEXPIRE uniquejobs:512613aedf8d8099a473a343e0bc352c:QUEUED 5000
1:M 07 Jan 19:11:10.095 # unlock.lua - ZADD uniquejobs:changelog 1610046670.0942 {"message":"Unlocked","job_id":"e80fdf33a8a232bcff90d794","time":1610046670.0942,"script":"unlock.lua","digest":"uniquejobs:512613aedf8d8099a473a343e0bc352c"}
1:M 07 Jan 19:11:10.095 # unlock.lua - Removing 1 entries from changelog (total entries 1001 exceeds max_history: 1000)
1:M 07 Jan 19:11:10.095 # unlock.lua - PUBLISH uniquejobs:changelog {"message":"Unlocked","job_id":"e80fdf33a8a232bcff90d794","time":1610046670.0942,"script":"unlock.lua","digest":"uniquejobs:512613aedf8d8099a473a343e0bc352c"}
1:M 07 Jan 19:11:10.095 # unlock.lua - END unlock digest: uniquejobs:512613aedf8d8099a473a343e0bc352c (job_id: e80fdf33a8a232bcff90d794)
1:M 07 Jan 19:11:10.098 # unlock.lua - BEGIN unlock digest: uniquejobs:512613aedf8d8099a473a343e0bc352c (job_id: e80fdf33a8a232bcff90d794)
1:M 07 Jan 19:11:10.098 # unlock.lua - HEXISTS uniquejobs:512613aedf8d8099a473a343e0bc352c:LOCKED e80fdf33a8a232bcff90d794
1:M 07 Jan 19:11:10.098 # unlock.lua - ZADD uniquejobs:changelog 1610046670.0983 {"message":"Yielding to: uniquejobs:512613aedf8d8099a473a343e0bc352c:LOCKED ()","job_id":"e80fdf33a8a232bcff90d794","time":1610046670.0983,"script":"unlock.lua","digest":"uniquejobs:512613aedf8d8099a473a343e0bc352c"}
1:M 07 Jan 19:11:10.098 # unlock.lua - Removing 1 entries from changelog (total entries 1001 exceeds max_history: 1000)
1:M 07 Jan 19:11:10.098 # unlock.lua - PUBLISH uniquejobs:changelog {"message":"Yielding to: uniquejobs:512613aedf8d8099a473a343e0bc352c:LOCKED ()","job_id":"e80fdf33a8a232bcff90d794","time":1610046670.0983,"script":"unlock.lua","digest":"uniquejobs:512613aedf8d8099a473a343e0bc352c"}
1:M 07 Jan 19:11:10.098 # unlock.lua - Yielding to uniquejobs:512613aedf8d8099a473a343e0bc352c:LOCKED () uniquejobs:512613aedf8d8099a473a343e0bc352c:LOCKED by job e80fdf33a8a232bcff90d794
1:M 07 Jan 19:11:10.940 # reap_orphans.lua - BEGIN
1:M 07 Jan 19:11:10.940 # reap_orphans.lua - Interating through: uniquejobs:digests for orphaned locks
1:M 07 Jan 19:11:10.940 # reap_orphans.lua - END

@mhenrixon
Owner

First of all, thanks for the detailed error report.

I'd like to point out that having the reapers run every second is not a good idea; they are both meant to run much less frequently.

What I can see immediately is that the order of the checks is not optimal. The active jobs should be checked first. It might also be worth checking the active jobs twice, or between each of the other checks, to cover the case where the job wasn't found in the queue because it had already been moved to active/processing.
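
Roughly, reusing the belongs_to_job? shape from the snippet above (a sketch of the idea, not a final fix):

def belongs_to_job?(digest)
  # Check active first and again last, so a job that moves from a queue
  # into processing mid-scan is still caught.
  active?(digest) ||
    scheduled?(digest) ||
    retried?(digest) ||
    enqueued?(digest) ||
    active?(digest) # re-check: the job may have just become active
end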

@courtneymiller2010
Author

I've set it to 1 second in this test to be able to replicate the problem with any consistency and to demonstrate the issue. I'm definitely not proposing to run the reaper at that frequency. In production, we have the reaper running at the default 6 minutes and have seen this issue occur even at that frequency.

I don't think that checking active first, or even multiple times, will fix the issue. Right now, active is checked last, and by then the worker still hasn't made it into the queue. Moving the check earlier would just give the worker even less time to make it into the queue.

Is there somewhere else to check before the job gets into the active queue? Is there something outside of the sidekiq API that has the information? Is there some kind of lock or processing hold that would ensure workers aren't in transition while checking?

That's why I initially thought that the lua reaper worked: it obtains a lock before checking the status of workers. However, with race conditions, there would need to be a lock around adding the item and another around reading it to ensure the data is correct. I doubt sidekiq places a lock when adding the item to redis, as that would create a slowdown. So this might not be something that can be fixed.

@mhenrixon
Owner

Actually, the sidekiq pro and ent versions have different fetchers that do some of the things you suggest.

On that note, I have been intending to create my own sidekiq fetcher that avoids fetching jobs that can't be processed because a running job with an existing lock hasn't been cleared yet.

Maybe it's time to do some more thinking on that. I'll keep you posted.

@courtneymiller2010
Author

Another idea might be to have a config value for a lock reap duration. When the reaper runs, it checks how long the lock has existed and only deletes it once it exceeds that duration and matches the other conditions. I can't imagine a scenario where you would want to reap a lock within milliseconds, or even a second or two, of it being created. Obviously, how long you can stand an abandoned lock staying around depends on the project and its worker/processing load.

Based on the sidekiq web lock UI, it looks like there is a created_at timestamp associated with the lock.
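
A minimal sketch of that idea (the config name and where created_at would be read from are assumptions, not anything the gem provides today):

# Hypothetical "lock reap duration" guard: only locks older than the
# configured duration are eligible for reaping. Where created_at comes
# from (the lock's :INFO data, the web UI's source) is left open and
# simply passed in.
LOCK_REAP_DURATION = 60 # seconds; project-dependent

def old_enough_to_reap?(created_at)
  created_at.to_f < Time.now.to_f - LOCK_REAP_DURATION
end

# e.g. reap the digest only if it is orphaned AND old enough:
#   delete_digest(digest) if orphaned?(digest) && old_enough_to_reap?(created_at)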

@mhenrixon
Owner

@courtneymiller2010 that is an excellent suggestion.

I'll create a failing test for this scenario.

@mhenrixon
Owner

@courtneymiller2010, if at all possible, I'd love it if you could review PR #563 and check whether you think it will work. I basically just treat all jobs that fall within the current reaper interval (reaper_timeout) as active; only locks older than the current time minus reaper_timeout are allowed to be reaped.

I hacked this together quickly, based on gut feeling, so I'd love to get some feedback before a release.
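
In pseudocode, the idea is roughly this (a sketch of the description above, not the PR's actual code):

# Any lock created within the last reaper_timeout seconds is treated as
# active and skipped; assumes the configured value is readable via
# SidekiqUniqueJobs.config.
def considered_active?(created_at)
  created_at.to_f > Time.now.to_f - SidekiqUniqueJobs.config.reaper_timeout
end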

Cheers!

@mhenrixon mhenrixon self-assigned this Jan 16, 2021
@mhenrixon mhenrixon added this to the V7.0 milestone Jan 16, 2021
@courtneymiller2010
Author

Thanks for tackling this! Unfortunately, I don't think this will fix the issue for the ruby reaper. The problem is that because the reaper runs right after the worker starts, before the worker has been added to sidekiq's process data in redis, there is no "item", and thus it never actually hits considered_active?. I think you'd need to exit belongs_to_job? early if the timeout hasn't been reached... but at that point we don't have the "item" to get the timestamp from. It's this weird catch-22 situation.

On the lua reaper front, I'm not sure. I didn't spend a lot of time grokking/debugging that process. I assume it would be the same issue, though: the "job" hasn't been added to redis yet, so it can't pull out the created-at time to compare.
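
To spell the catch-22 out in pseudocode (names assumed, reusing considered_active? from the sketch above):

# The timestamp guard lives inside the per-worker loop, but the race
# happens exactly when the workers hash is still empty -- so there is
# no item to read a timestamp from, and the guard never fires.
def active?(workers)
  workers.each_pair do |_tid, item|
    return true if considered_active?(item) # unreachable when workers == {}
  end
  false # falls through: the digest is reaped even though the job just started
end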

mhenrixon added a commit that referenced this issue Jan 20, 2021
Close #559

This gives a buffer on how many jobs will be considered eligible for reaping.
mhenrixon added a commit that referenced this issue Jan 20, 2021
Close #559

This takes care of an edge case where jobs were just marked as active.
@sfavrinlumint

This issue is still present in 8.0.3; reaper_timeout doesn't fix it.
