
Support for unique jobs #4

Closed
davidcelis opened this issue Feb 7, 2012 · 14 comments

Comments

@davidcelis

I'd love to use sidekiq in lieu of Resque, but I'm currently also using resque-loner. It would be nice to see a way to create jobs in sidekiq that would only be queued up once per payload.

If I can, I'll look through resque-loner's code and see what would be involved in adapting it for use in sidekiq, but I also wanted to open this issue to get input.

@ryanlecompte
Contributor

Hi David,

Sidekiq currently doesn't support queueing jobs uniquely once per payload as you describe. However, I think support for this could be fairly easily added to the project in a generic way. Normally extensions to Sidekiq would be done via custom middleware, but in this case we would want to perform a uniqueness check before the middleware chain ever gets executed.

We could keep track of a Redis set containing a base64-encoded value of each currently queued payload. If a configuration option to only enqueue unique items is enabled, we would consult this set before adding a new job, to make sure the payload doesn't already exist in the to-be-processed queue.

I went ahead and implemented this approach in a separate branch with tests. @mperham, this may not be the direction we want to go here, so feel free to comment otherwise:

https://github.com/mperham/sidekiq/compare/unique_payloads
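
For readers following along, here is a minimal sketch of that set-based check, not the code in the linked branch: it assumes plain redis-rb, JSON-serialized payloads, and a made-up set key, and it leaves the check-then-add race (which comes up later in the thread) unaddressed.

```ruby
require 'redis'
require 'json'
require 'base64'

REDIS = Redis.new
UNIQUE_SET = 'sidekiq:unique_payloads' # hypothetical key name

# Push a payload onto a queue only if an identical payload isn't already tracked.
def push_unique(queue, payload)
  digest = Base64.strict_encode64(payload.to_json)
  # Consult the set first, as described above (note: not atomic as written).
  return false if REDIS.sismember(UNIQUE_SET, digest)
  REDIS.sadd(UNIQUE_SET, digest)
  REDIS.lpush("queue:#{queue}", payload.to_json)
  true
end
```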

@mperham
Collaborator

mperham commented Feb 7, 2012

Ideally, message processing should be idempotent so duplicate jobs won't matter, aside from the performance overhead of duplicate work. I'm going to let this one sit for a few days so I can think about it.

Ryan, I'd also consider using middleware here for the server half; I want to avoid adding non-core features to the manager. The manager is a singleton, and adding more network traffic to its code path will affect scalability.

@ryanlecompte
Contributor

That definitely sounds reasonable to me. I agree that the server half should be done with middleware. I'll explore that idea later tonight in the topic branch that I created and update this issue! Thanks.

@davidcelis
Author

Well, that was fast. I was considering the same sort of logic as what's implemented here, but this also removes the ability to have both unique jobs and original-style jobs.

As for duplicate jobs not mattering: they may well matter less with Sidekiq than with Resque, but in my use case a duplicate would still be wasted time, since the job itself can be time-consuming depending on the number of users in the system. My own project Recommendable currently uses Resque to enqueue users to have the collaborative filtering logic run on them. With a lot of other users in the system, the job can take a couple of minutes (I've been trying to think of ways to optimize this, but the collaborative filtering is currently O(n^2) and I've been banging my head about it). Because of this, I use resque-loner to prevent any user from being queued twice; both jobs would end up yielding identical results.

Anyway, I'm glad you guys are looking into it, and I'm impressed that it's happening this quickly. Thank you!

@rsim

rsim commented Feb 7, 2012

I'm currently using resque-lock-timeout to ensure that only one job per customer account can run in parallel. It would be good if, instead of just class/arguments uniqueness, it were possible to generate a custom job identifier/lock key that is used to check that only one job with that identifier runs in parallel.

@ryanlecompte
Contributor

I went ahead and moved the server part to an after-style middleware. I also changed the name of the configuration option. When this option (ignore_duplicate_jobs) is enabled, a job with a given payload will not be added to the queue if an identical job is already queued or currently executing.

It doesn't address @rsim's comment regarding being able to specify a unique job identifier. I think you could get around this by simply adding your own identifier to the job's payload.

Most recent diff can be seen here: https://github.com/mperham/sidekiq/compare/unique_payloads
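
To make the "after-style" part concrete, here is a rough sketch of a server middleware that clears a job's uniqueness digest once the job has finished, so the same payload can be enqueued again. This is illustrative only, not the branch's actual code; it assumes the hypothetical set key from the earlier sketch and Sidekiq's call(worker, msg, queue) server-middleware interface. Digesting only class + args (rather than the whole message) keeps the digest stable between enqueue and execution.

```ruby
require 'sidekiq'
require 'json'
require 'base64'

class ClearUniquePayload
  def call(worker, msg, queue)
    yield
  ensure
    # Whether the job succeeded or raised, drop its digest so an
    # identical payload may be enqueued again later.
    digest = Base64.strict_encode64([msg['class'], msg['args']].to_json)
    Sidekiq.redis { |conn| conn.srem('sidekiq:unique_payloads', digest) }
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add ClearUniquePayload
  end
end
```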

@mperham
Collaborator

mperham commented Feb 8, 2012

It came to me last night that Sidekiq::Client should probably have pluggable middleware too, so you can modify the message before sending it to Redis or, in this case, stop the send altogether.

@ryanlecompte
Contributor

I do like the idea of having pluggable middleware for the client, a la Faraday. Nice way to separate the concerns and allow injecting logic from the client side, similar to what can be done now on the server side.
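Client middleware did end up shipping in Sidekiq (see the close note further down). As a hedged illustration of how a uniqueness check could sit on the client side, something along these lines; the four-argument signature shown is the later documented one (the original 2012 version may have differed), and the set key and class name are made up:

```ruby
require 'sidekiq'
require 'json'
require 'base64'

class UniqueJobsClientMiddleware
  # Returning without yielding stops the job from being pushed to Redis.
  def call(worker_class, job, queue, redis_pool)
    digest = Base64.strict_encode64([job['class'], job['args']].to_json)
    queued = redis_pool.with { |conn| conn.sismember('sidekiq:unique_payloads', digest) }
    return false if queued
    redis_pool.with { |conn| conn.sadd('sidekiq:unique_payloads', digest) }
    yield
  end
end

Sidekiq.configure_client do |config|
  config.client_middleware do |chain|
    chain.add UniqueJobsClientMiddleware
  end
end
```

Note that the sismember/sadd pair here is still subject to the race conditions mperham mentions further down; the dedicated unique-jobs implementations he points to later exist to deal with exactly that.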

@rsim

rsim commented Feb 8, 2012

In my use case with resque-lock-timeout I do want the job pushed to the queue; I just don't want any worker to pick it up until the previous job with the same generated job identifier (in my case, the customer account ID) has finished.
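
That pattern (enqueue freely, but let only one job per identifier run at a time) would live on the server side. As a sketch only, neither resque-lock-timeout nor anything from the branch above, a server middleware could take a per-account Redis lock with an expiry and briefly requeue the job when the lock is held; the lock key, TTL, and 30-second retry are illustrative choices, and it assumes the account ID is the worker's first argument. It would be registered via config.server_middleware like the sketch above.

```ruby
class PerAccountLock
  LOCK_TTL = 10 * 60 # seconds; acts as the timeout safety net

  def call(worker, msg, queue)
    account_id = msg['args'].first
    lock_key   = "lock:#{msg['class']}:#{account_id}"
    # SET NX EX: acquire the lock only if nobody else holds it.
    acquired = Sidekiq.redis { |conn| conn.set(lock_key, '1', nx: true, ex: LOCK_TTL) }
    if acquired
      begin
        yield
      ensure
        Sidekiq.redis { |conn| conn.del(lock_key) }
      end
    else
      # Another job for this account is running; try again shortly.
      worker.class.perform_in(30, *msg['args'])
    end
  end
end
```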

@mperham
Collaborator

mperham commented Feb 11, 2012

Fixed by @ryanlecompte's client-side middleware work.

@smathy

smathy commented Jul 31, 2013

Sorry to be a dunce here, but can I confirm (@mperham or @ryanlecompte or anyone else who knows) the fix for how to do this using the middleware would be something like...

In the call method of my middleware class, search through the Sidekiq::Queue I'm passed looking for a job with a matching payload, and only yield if I don't find it.

Do I also have to check Sidekiq::Workers or will the job still be in Sidekiq::Queue even if it's being worked on?

@mperham
Collaborator

mperham commented Jul 31, 2013

Jason, you cannot implement unique jobs using the Sidekiq API. There are many, many race conditions in what you suggest.


@smathy

smathy commented Jul 31, 2013

Hmm, I am confused then, thanks. I thought this was part of the reason to do it on the client; that's why I was asking about Sidekiq::Workers, actually, because I was wondering whether/how my client would have visibility of those.

So then, back to square one: how is the implementation of client middleware a solution to this unique-jobs issue? Do you use some other semaphore/mutex entirely?

@mperham
Collaborator

mperham commented Jul 31, 2013

Please see the various unique-jobs implementations that are already available.

