Implement Sidekiq::Worker.perform_bulk
#5042
Conversation
Co-authored-by: jeffcarbs <jeff.carbonella@gmail.com>
This looks great. Feel free to add an item to the changelog.
@mperham Alright, I've gone ahead and added an entry to the changelog.
@mperham @kellysutton - Thank you for this! I have ideas on a possible improvement. We currently have the ability to schedule bulk jobs for some time in the future, like `Sidekiq::Client.push_bulk("class" => FooJob, "args" => [[1], [2]], "at" => [1.minute.from_now.to_f, 5.minutes.from_now.to_f])` (which was added with #4243) or `Sidekiq::Client.push_bulk("class" => FooJob, "args" => [[1], [2]], "at" => 1.minute.from_now.to_f)`, but with the new implementation of `perform_bulk` we lose this ability. (Note: when I say that we lose this ability, it is still possible via the `Sidekiq::Client.push_bulk` API directly.)

I wonder if we can build an API that supports scheduling even for `perform_bulk`, like, say, a named method:

```ruby
def perform_bulk_in(args, interval, batch_size: 1_000)
  ....
end
```

If I have your interest, I'd be happy to build this 🙂
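As a rough, self-contained sketch of what the proposed method might do, assuming it computes a single `"at"` timestamp from the interval and builds one `push_bulk`-style payload per slice (the `perform_bulk_in_payloads` helper and `"FooJob"` class name are hypothetical, for illustration only):

```ruby
# Hypothetical sketch only -- not part of this PR. Builds the payload
# hashes that a perform_bulk_in(args, interval) could hand to
# Sidekiq::Client.push_bulk, scheduling every slice at now + interval.
def perform_bulk_in_payloads(args, interval, batch_size: 1_000)
  at = Time.now.to_f + interval.to_f
  args.each_slice(batch_size).map do |slice|
    # "FooJob" stands in for the worker class name
    { "class" => "FooJob", "args" => slice, "at" => at }
  end
end

payloads = perform_bulk_in_payloads((1..1500).map { |i| [i] }, 60)
payloads.length                # => 2 (one push_bulk call per slice)
payloads.first["args"].length  # => 1000
```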
@manojmj92 Great call-out. This was a purposeful omission in the original PR, but I'm happy to explore alternatives. The comment for this method hints at using `Sidekiq::Client.push_bulk` directly. Generally at my place of work, we've almost entirely moved away from setting `at`.
I'm happy with it as is; I don't think we need to expose every option at every level.
```diff
@@ -191,6 +191,12 @@ def perform_async(*args)
   @klass.client_push(@opts.merge("args" => args, "class" => @klass))
 end

 def perform_bulk(args, batch_size: 1_000)
   args.each_slice(batch_size).flat_map do |slice|
     Sidekiq::Client.push_bulk(@opts.merge("class" => @klass, "args" => slice))
```
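The `each_slice` + `flat_map` pattern above can be illustrated in plain Ruby, with a stand-in lambda replacing `Sidekiq::Client.push_bulk` (which normally returns one jid per pushed job):

```ruby
# Stand-in for Sidekiq::Client.push_bulk: returns one fake jid per job.
fake_push_bulk = ->(slice) { slice.map { |a| "jid-#{a.first}" } }

args = (1..5).map { |i| [i] }  # five single-argument jobs
# each_slice(2) yields [[[1], [2]], [[3], [4]], [[5]]];
# flat_map merges the per-slice jid arrays into one flat list.
jids = args.each_slice(2).flat_map { |slice| fake_push_bulk.call(slice) }
# jids => ["jid-1", "jid-2", "jid-3", "jid-4", "jid-5"]
```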
@mperham this is a bit different from the example on https://github.com/mperham/sidekiq/wiki/Batches#huge-batches in that the jobs get sequentially added by push_bulk rather than concurrently added via perform_async. On super large sets of jobs, like inserting 300,000 at once in 500-job push_bulks with about 400 worker threads, Redis sometimes times out. Maybe the recommendation for huge batches should be changed?

I suspect that it's just as good performance-wise to post the jobs in bulk sequentially from a single thread.
I don't understand.
Yeah, it’s not a great idea to do lots of concurrent bulk pushes. I think the best thing you can do is lower the bulk size from 1,000 to 100 so each bulk op is faster, but YMMV.
Maybe better to do the bulk inserts sequentially rather than concurrently? Especially if running hundreds of threads.
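Sequential bulk pushing from one thread is essentially what `perform_bulk` does. A rough, runnable sketch of that behavior, with a hypothetical `FakeClient` standing in for `Sidekiq::Client` since real pushes need Redis:

```ruby
# FakeClient is a hypothetical stand-in for Sidekiq::Client so this
# sketch runs without Redis; push_bulk normally returns one jid per job.
class FakeClient
  def self.push_bulk(opts)
    opts["args"].map { "some-jid" }
  end
end

# Sketch of the perform_bulk behavior: slice into batches of batch_size,
# push each slice sequentially, and collect all returned jids.
def perform_bulk(args, batch_size: 1_000)
  args.each_slice(batch_size).flat_map do |slice|
    FakeClient.push_bulk("class" => "FooJob", "args" => slice)
  end
end

jids = perform_bulk((1..2500).map { |i| [i] })
jids.length  # => 2500 (pushed in three batches: 1000, 1000, 500)
```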
@kellysutton Personally, I would be interested in hearing why you avoid setting `at`.
Sure thing. Specifically, … So in our first pass of implementing …
This is the code that implements #5041.

We've got a monkey-patch into `Sidekiq::Worker::ClassMethods` and `Sidekiq::Worker::Setter` that we have been getting a lot of mileage out of and that might be worth upstreaming.

Background: We do a lot of batch processing, needing to enqueue thousands or hundreds of thousands of jobs at once. We were using `Sidekiq::Client.push_bulk` with an `each_slice` to create chunks of 1,000.

Enter what we call `.perform_bulk`. It's an implementation of `Sidekiq::Client.push_bulk` that encodes the best practice of 1,000 jobs at a time. This allows clients to enqueue many jobs without the sharp edge of forgetting to slice their batch up into smaller chunks.

The downside is that the `.push_bulk`-based approach does communicate to the client, "Hey, you're interacting with the network N times (once per loop)."

This PR has the implementation and tests for the `.perform_bulk` method. Please let me know if anything looks amiss, but I tried to keep things as idiomatic as possible.

Notes: