Race condition in ruby reaper #559
Actually, upon testing some more with the lua reaper, I've been able to replicate it there too.

Config:

```ruby
SidekiqUniqueJobs.configure do |config|
  config.reaper = :lua
  config.reaper_interval = 1
  config.debug_lua = true
end
```

Log output
Lua Debug Logs
First of all, thanks for the detailed error report. I'd like to point out that having the reaper run every second is not a good idea; both reapers are supposed to run less frequently. What I can see immediately is that the order is not optimal: the active jobs should be checked first. It might also be interesting to check the active jobs twice, or even between each of the other checks, to cover the case where a job that wasn't in the queue had been moved to active/processing.
I've set it to 1 second in this test to be able to replicate the problem with any consistency and to demonstrate the issue; I'm definitely not proposing to run the reaper at that frequency. In production we have the reaper running at the default 6 minutes and have seen this issue occur even at that frequency.

I don't think that checking active first, or even multiple times, will fix the issue. Right now active is checked last, and by then the worker still hasn't made it into the queue; moving the check earlier just gives the worker even less time to appear. Is there somewhere else to check before the job gets into the active queue? Is there somewhere outside of the Sidekiq API that has the information? Is there some kind of lock or processing hold that ensures workers aren't in transition while checking?

That's why I initially thought the lua reaper worked: it obtains a lock before checking the status of workers. However, with race conditions there would need to be a lock on adding the item and another on reading it to ensure the data is correct. I doubt Sidekiq places a lock when adding the item to Redis, as that would create a slowdown. So this might not be something that can be fixed.
Actually, the Sidekiq Pro and Ent versions have different fetchers that do some of the things you suggest. I have been intending to create my own Sidekiq fetcher that avoids fetching jobs that can't be processed because a running job's existing lock hasn't been cleared. Maybe it's time to do some more thinking on that. I'll keep you posted.
Another idea might be to have a config value for a lock reap duration. When the reaper runs, it checks how long the lock has existed and only deletes it once it exceeds the lock reap duration and matches the other conditions. I can't imagine a scenario where you would want to reap a lock within milliseconds, or even a second or two, of its creation. Obviously, how long you can stand to have an abandoned lock stick around depends on the project and its worker/processing load. Based on the Sidekiq web lock UI, it looks like there is a created_at timestamp associated with the lock.
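The suggestion above could be sketched roughly like this. None of these names exist in sidekiq-unique-jobs; the constant, method, and lock hash shape are all illustrative assumptions.

```ruby
# Hypothetical "lock reap duration": a lock is only eligible for reaping
# once it is older than a configurable grace period, so a lock created
# moments before the reaper run is left alone.
REAP_GRACE_PERIOD = 60 # seconds; would come from configuration

def reapable?(lock, now: Time.now.to_f, grace: REAP_GRACE_PERIOD)
  # created_at is the timestamp the Sidekiq web lock UI already shows
  (now - lock.fetch(:created_at)) > grace
end

old_lock = { digest: "uniquejobs:abc", created_at: Time.now.to_f - 300 }
new_lock = { digest: "uniquejobs:def", created_at: Time.now.to_f - 1 }

reapable?(old_lock) # => true
reapable?(new_lock) # => false
```

The other reaper conditions would still apply; this predicate only adds a minimum lock age before deletion is even considered.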
@courtneymiller2010 that is an excellent suggestion. I'll create a failing test for this scenario.
@courtneymiller2010, if at all possible, I'd love it if you could check PR #563 and see whether you think it will work. I basically treat all jobs that are within the current reaper interval (reaper_timeout) as active; only jobs older than the current time minus reaper_timeout are allowed to be reaped. I hacked this together in a really short time based on a hunch, so I'd love to get some feedback on this before a release. Cheers!
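The cutoff described in the comment above can be sketched as follows. This is not the gem's actual internals, just a hedged illustration of the idea; the method name and lock representation are assumptions.

```ruby
# Digests locked within the current reaper window (reaper_timeout seconds)
# are treated as active; only locks older than the cutoff are reapable.
def reapable_digests(locks, reaper_timeout, now: Time.now.to_f)
  cutoff = now - reaper_timeout
  locks.select { |_digest, created_at| created_at < cutoff }.keys
end

locks = {
  "uniquejobs:stale"  => Time.now.to_f - 120, # outside the window: reapable
  "uniquejobs:recent" => Time.now.to_f - 2,   # inside the window: kept
}

reapable_digests(locks, 10) # => ["uniquejobs:stale"]
```

As the next comment points out, this only helps if the lock's creation time is actually visible to the reaper at check time.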
Thanks for tackling this! Unfortunately, I don't think this will fix the issue for the ruby reaper. The problem is that because the reaper runs right after the worker starts, before the worker has been added to Sidekiq's Redis processes, there is no "item", and thus the new check will never actually be hit. On the lua reaper front, I'm not sure; I didn't spend a lot of time grokking/debugging that process. I assume it would be the same issue though: the "job" hasn't been added to Redis, so the reaper can't pull out the created-at time to compare.
Close #559 This gives a buffer on how many jobs will be considered eligible for reaping.
Close #559 This takes care of an edge case where jobs were just marked as active.
This issue is still present in 8.0.3.
Describe the bug
There is a possible race condition in the ruby reaper, where a worker is started, the reaper runs right after and reaps the active job.
Expected behavior
Active workers shouldn't be reaped.
Current behavior
Active workers are reaped incorrectly.
Worker class
Config
Additional context
I've added some logging to outline where the problem occurs.

```ruby
log_info("****** valid #{valid}")
```

and

```ruby
log_info("****** workers #{workers}")
```

were added between lines 129 and 131 in `RubyReaper#active?`. I also added logging in `RubyReaper#belongs_to_job?`, and

```ruby
digests.each { |digest| log_info("digest #{digest} to be deleted") }
```

was added after line 74 in `BatchDelete#call`.

Here's my log output:
Based on the logs, Sidekiq logs that it's starting ProcessWorker, then the first puts statement in ProcessWorker is logged, then the reaper starts running. When it checks for active jobs, there are no workers, even though the ProcessWorker has clearly started. Thus, `active?` comes back false, and the digest gets deleted.

I've exacerbated this problem by setting the reaper interval to 1 second, but I have had this occur in production with the interval set to the default of 6 minutes.
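The sequence described above is a classic check-then-act race. Here is a self-contained simulation of that ordering (no Sidekiq or Redis involved; every name here is illustrative, not the gem's code):

```ruby
# `workers` stands in for Sidekiq's workers set in Redis; `locks` stands in
# for the unique-job digests the reaper inspects.
workers = []
locks   = ["uniquejobs:digest-1"] # lock held by the job that just started

# 1. The job begins executing but has not yet appeared in `workers`.
# 2. The reaper runs: it finds no matching worker, so the equivalent of
#    `active?` is false and the digest is deleted.
active = workers.any? { |entry| entry.include?("digest-1") }
locks.delete("uniquejobs:digest-1") unless active

# 3. The worker registration lands a moment later, too late: the running
#    job has already lost its lock.
workers << "uniquejobs:digest-1"

locks # => [] even though the job is still running
```

Any fix based purely on re-reading the workers set narrows this window but cannot close it, which is why the discussion above turns to lock age instead.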
I've been able to replicate this issue about 50% of the time with the above settings/logging.
This issue doesn't seem to occur with the lua reaper. I'm currently testing our staging environment with the lua reaper to ensure it will work for us. Based on the documentation, I was hesitant to enable it because we do some pretty intense processing in Sidekiq; I wasn't sure if the "intense processing" caveat applied to what the reaper is doing or to the system overall.
Is this expected behavior in the ruby reaper because it doesn't lock Redis (meaning we should use the lua reaper instead), or is this actually a bug in the ruby reaper?