
Problems with reaper and long running jobs #779

Open

RobsonKing opened this issue May 5, 2023 · 11 comments

@RobsonKing
Describe the bug
We have jobs that occasionally take a long time to execute (~20 minutes). These jobs use until_and_while_executing. After about a minute, a duplicate job starts running. The lock appears to be removed by the reaper (our reaper interval is 1 minute).
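
For reference, our reaper configuration looks roughly like this (using the gem's standard SidekiqUniqueJobs.configure options, trimmed to the relevant values):

# config/initializers/sidekiq_unique_jobs.rb (roughly)
SidekiqUniqueJobs.configure do |config|
  config.reaper          = :ruby # which reaper implementation runs
  config.reaper_interval = 60    # seconds between reaper runs (~1 minute)
end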

Expected behavior
Uniqueness should be maintained, even if the job runs for a long time.

Current behavior

  • First job starts executing
  • Second job is enqueued, gets blocked, and is put on the scheduled queue to retry. So far everything is working as expected:
TestingUniqueJob jid=e888a4c9373cceef87a7e834 INFO: start
TestingUniqueJob JID-e888a4c9373cceef87a7e834 With args [31]
TestingUniqueJob JID-e888a4c9373cceef87a7e834 Enqueued TestingUniqueJob with [31] on default (85c4ea4da6a2f6df530a4178)
TestingUniqueJob jid=e888a4c9373cceef87a7e834 elapsed=0.057 INFO: done
  • Reaper runs and deletes the lock:
uniquejobs=reaper INFO: Nothing to delete; exiting.
uniquejobs=reaper INFO: Deleting batch with 1 digests
INFO: (2023-05-04 21:57:00 +0000) Execution successfully returned 1
  • The job that was blocked and rescheduled now starts executing.

Worker class

class TestingUniqueJob < ApplicationSidekiqWorker
  sidekiq_options queue: 'default',
                  lock: :until_and_while_executing,
                  on_conflict: { client: :log, server: :reschedule }

  def perform(id)
    Rails.logger.info("*** [TestingUniqueJob] processing #{id} START")
    sleep(40 * 60) # simulate a long-running job (40 minutes)
    Rails.logger.info("*** [TestingUniqueJob] processing #{id} END")
  end
end

Questions

  • Is this behaviour expected?
  • Is there a way to prevent the reaper from deleting locks when jobs are still executing?
  • How does the reaper determine that a lock can be cleaned up?
@mhenrixon
Owner

First of all, long-running jobs are like DDoSing yourself.

https://twitter.com/nateberkopec/status/1635280301709746179?s=46

https://www.linkedin.com/posts/nate-berkopec-6832b41a8_long-running-background-jobs-60-seconds-activity-7037074319537016832-fite?utm_source=share&utm_medium=member_ios

That said, I'm sure there is something that could be done to improve this.

Unfortunately, I am out of commission for a few weeks, so I won't be able to look at it until the end of May at best.

@RobsonKing
Author

RobsonKing commented May 8, 2023

We are aware that running long jobs is not ideal... In this case we only have one... but when the lock is cleaned up and other jobs start to run, we do DDoS ourselves.

Could you provide any info about how the reaper decides to clean something up? Is it just based on how old the lock is? Or does it try to check whether the job is still running?

Looking through the code, it looks like belongs_to_job? checks whether the digest belongs to a job that still exists, and active? checks whether that job is still active. Is this where my problem is likely to be found?
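
I've also been watching the keys from a console while the job runs, with something like this (the digest value is a placeholder, and the key names are my assumption from reading the gem's source, so treat them as approximate):

# Console check while the long job is running.
digest = 'uniquejobs:your-digest-here' # placeholder; take this from the job payload

Sidekiq.redis do |conn|
  puts conn.exists("#{digest}:LOCKED")           # is the per-lock hash still there?
  puts conn.zscore('uniquejobs:digests', digest) # is it still registered for reaping?
end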

@mhenrixon
Owner

It does something like this (see the sketch after the list):

  1. Considers the lock active if not enough time has passed
  2. Checks what is enqueued
  3. Checks what is scheduled
  4. Checks what the workers are processing
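
In simplified Ruby, the orphan check is roughly the following. This is an illustrative sketch, not the gem's actual code: class, method, and key names (DIGESTS_KEY, GRACE_PERIOD, 'lock_digest') are assumptions.

require 'json'
require 'sidekiq/api'

# Illustrative sketch of the reaper's orphan check; not the real implementation.
class ReaperSketch
  DIGESTS_KEY  = 'uniquejobs:digests' # assumed sorted set of digest => locked_at
  GRACE_PERIOD = 60                   # assumed seconds before a lock is suspect

  def orphaned?(digest)
    return false if recently_locked?(digest) # 1. not enough time has passed
    return false if enqueued?(digest)        # 2. digest found on a queue
    return false if scheduled?(digest)       # 3. digest in schedule/retry sets
    return false if being_processed?(digest) # 4. a worker is processing it

    true # nothing claims the digest, so the reaper may delete the lock
  end

  private

  def recently_locked?(digest)
    locked_at = Sidekiq.redis { |conn| conn.zscore(DIGESTS_KEY, digest) }
    locked_at && (Time.now.to_f - locked_at.to_f) < GRACE_PERIOD
  end

  def enqueued?(digest)
    Sidekiq::Queue.all.any? do |queue|
      queue.any? { |job| job['lock_digest'] == digest }
    end
  end

  def scheduled?(digest)
    [Sidekiq::ScheduledSet.new, Sidekiq::RetrySet.new].any? do |set|
      set.any? { |job| job['lock_digest'] == digest }
    end
  end

  def being_processed?(digest)
    Sidekiq::Workers.new.any? do |_pid, _tid, work|
      payload = work['payload']
      payload = JSON.parse(payload) if payload.is_a?(String) # older Sidekiq
      payload['lock_digest'] == digest
    end
  end
end

The subtlety for long-running jobs is step 4: if the check misses (or races with) the in-progress job, the lock gets reaped even though the job is still running.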

May I ask what version you are on? I just fixed exactly your problem the other week.

@RobsonKing
Author

We are on version 7.1.29

Are you referring to #756?

Did this change get ported back into the version 7 branch?

@mhenrixon
Owner

I don't believe I backported it. Sorry, I've been away clearing my head and meditating in silence for two weeks. I'll see if I can get to this sometime this week.

@jpriollaud

This issue is negating the core function of the gem, because locks are disappearing. Upgrading to 8 isn't really a viable option because other middlewares haven't fully caught up to the latest Sidekiq yet. Looking forward to the backport.

@mhenrixon
Owner

> This issue is negating the core function of the gem, because locks are disappearing. Upgrading to 8 isn't really a viable option because other middlewares haven't fully caught up to the latest Sidekiq yet. Looking forward to the backport.

I know! I’ll try to sort it out this week. It has been a crazy year, and it's been hard to get to the open source work.

Prioritizing this today and tomorrow.

@rafaeelaudibert

@mhenrixon This should be fixed as of #799 and v7.1.30, right?

@mhenrixon
Owner

> @mhenrixon This should be fixed as of #799 and v7.1.30, right?

Yes sir

@democlitos

democlitos commented Oct 10, 2023

Versions: sidekiq 7.1.5, sidekiq-cron 1.9.1, sidekiq-unique-jobs 8.0.3

I'm experiencing a similar issue. I have a worker FetchAssetValueWorker (lock: :until_executed, on_conflict: :reject) that runs every 10 minutes (scheduled using sidekiq-cron). Under specific circumstances, this worker may take more than 30 minutes to run. Every time this worker is added to the dead queue for being rejected, for some reason it is also removed from Redis (via the delete_by_digest command):

2023-10-10T02:00:13.770Z pid=1 tid=hs1 uniquejobs=client until_executed=uniquejobs:bc7b7713595a5fb14e4ed3456bfd9049 INFO: Adding dead FetchAssetValueWorker job 83d9a5a3e96f64bbd7a920db
2023-10-10T02:00:13.775Z pid=1 tid=hs1 uniquejobs=client until_executed=uniquejobs:bc7b7713595a5fb14e4ed3456bfd9049 INFO: delete_by_digest(uniquejobs:bc7b7713595a5fb14e4ed3456bfd9049) completed in 3ms

Because of this, the worker will be duplicated if it takes more than 20 minutes to complete. I didn't see anything about this behavior in the documentation. Is this expected behavior or is it a bug? And how can I circumvent this?
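
For reference, the worker options look like this (a minimal reconstruction from the description above; the perform body is a placeholder):

class FetchAssetValueWorker
  include Sidekiq::Job

  # Options as described above: on_conflict: :reject sends the duplicate to
  # the dead set, which is where the delete_by_digest call shows up in the logs.
  sidekiq_options lock: :until_executed, on_conflict: :reject

  # Scheduled every 10 minutes via sidekiq-cron; can run for 30+ minutes.
  def perform
    # fetch asset values (placeholder)
  end
end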

@mhenrixon
Owner

@democlitos I have a few reaper improvements on the main branch right now. It will only be released once someone verifies that it is working better.

#830 amongst others
