
Implement Sidekiq::Worker.perform_bulk #5042

Merged
merged 7 commits into from Nov 2, 2021

Conversation

kellysutton
Contributor

@kellysutton kellysutton commented Nov 2, 2021

This is the code that implements #5041.

We have a monkey-patch on Sidekiq::Worker::ClassMethods and Sidekiq::Worker::Setter that we've been getting a lot of mileage out of, and it might be worth upstreaming.

Background: We do a lot of batch processing and need to enqueue thousands or hundreds of thousands of jobs at once. We had been using Sidekiq::Client.push_bulk with an each_slice to create chunks of 1,000.

Enter what we call .perform_bulk. It's a wrapper around Sidekiq::Client.push_bulk that encodes the best practice of pushing 1,000 jobs at a time. This lets clients enqueue many jobs without the sharp edge of forgetting to slice their batch into smaller chunks.
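As a sketch of the chunking this encodes (enqueue_in_batches and fake_push_bulk are illustrative stand-ins; the real method delegates each slice to Sidekiq::Client.push_bulk, which makes one network round trip per call and returns one jid per job):

```ruby
# Illustrative stand-in for Sidekiq::Client.push_bulk: one network round
# trip per call, returning one jid per job in the slice.
def fake_push_bulk(slice)
  slice.map { |job_args| "jid-#{job_args.first}" }
end

# The batching pattern .perform_bulk encodes: slice the full arg list
# into chunks and flat_map the per-chunk jids back into one array.
def enqueue_in_batches(args, batch_size: 1_000)
  args.each_slice(batch_size).flat_map { |slice| fake_push_bulk(slice) }
end

args = (1..2_500).map { |i| [i] } # 2,500 single-argument jobs
jids = enqueue_in_batches(args)   # 3 round trips (1,000 + 1,000 + 500)
```

With the wrapper, callers no longer need to remember the each_slice step themselves.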

The downside is that the .push_bulk-based approach does communicate to the client: "Hey, you're interacting with the network N times (once per slice)."

This PR has the implementation and tests for the .perform_bulk method. Please let me know if anything looks amiss, but I tried to keep things as idiomatic as possible.

Notes:

  • Should I update the CHANGELOG or is that for the repo maintainers to update?
  • I took a stab at a comment, but wasn't sure if it was too verbose. Feel free to slim down.

kellysutton and others added 6 commits November 1, 2021 20:20
Co-authored-by: jeffcarbs <jeff.carbonella@gmail.com>
…ulk behavior

@mperham
Collaborator

mperham commented Nov 2, 2021

This looks great. Feel free to add an item to the changelog.

@kellysutton
Contributor Author

@mperham Alright, I've gone ahead and added an entry to Changes.md. Let me know if there are other things that need to be done!

@mperham mperham merged commit 4a04326 into sidekiq:main Nov 2, 2021
@manojmj92
Contributor

manojmj92 commented Nov 2, 2021

@mperham @kellysutton - Thank you for this!

I have ideas on a possible improvement.

We currently have the ability to schedule bulk jobs for some time in the future, so like

Sidekiq::Client.push_bulk("class" => FooJob, "args" => [[1], [2]], "at" => [1.minute.from_now.to_f, 5.minutes.from_now.to_f])

(which was added with #4243)

or

Sidekiq::Client.push_bulk("class" => FooJob, "args" => [[1], [2]], "at" => 1.minute.from_now.to_f)

but with the new implementation of perform_bulk we lose the ability to specify at (which could be a Numeric, or an array of Numerics representing timestamps).

(Note: When I say that we lose this ability, scheduling is still possible via the set interface, e.g. SomeWorker.set(wait_until: 1.minute.from_now.to_f).perform_bulk(...), but even there, wait_until can't be an array of Numerics.)

I wonder if we could build an API that supports scheduling for push_bulk too, say one named perform_bulk_in, which would also accept an interval, like

def perform_bulk_in(args, interval, batch_size: 1_000)
  # ...
end

interval here can be an array of Numerics or a single Numeric. If it is an array, we can slice it the same way batch_size slices args.
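A minimal sketch of that lockstep slicing (the method name and shape are hypothetical, following the proposal above; each resulting pair would become one Sidekiq::Client.push_bulk call with "args" and "at"):

```ruby
# Hypothetical sketch of how perform_bulk_in could slice an interval
# array in lockstep with args. Each [arg_slice, at_slice] pair would
# feed one Sidekiq::Client.push_bulk call as "args" and "at".
def slice_args_with_interval(args, interval, batch_size: 1_000)
  arg_slices = args.each_slice(batch_size).to_a
  at_slices =
    if interval.is_a?(Array)
      interval.each_slice(batch_size).to_a # one timestamp per job
    else
      [interval] * arg_slices.size         # a single Numeric applies to every slice
    end
  arg_slices.zip(at_slices)
end

pairs = slice_args_with_interval([[1], [2], [3]], [60.0, 120.0, 180.0], batch_size: 2)
# pairs[0] => [[[1], [2]], [60.0, 120.0]]
# pairs[1] => [[[3]], [180.0]]
```

Slicing both arrays with the same batch_size keeps each job aligned with its timestamp.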

If I have your interest, I'd be happy to build this 🙂

@kellysutton
Contributor Author

kellysutton commented Nov 2, 2021

@manojmj92 Great call-out. This was a purposeful omission in the original PR, but happy to explore alternatives. The comment for this method hints at using Sidekiq::Client.push_bulk if you need to configure things like that. I'll defer to @mperham to decide what to do here.

Generally at my place of work, we've almost entirely moved away from setting 'at' on jobs, especially when enqueueing them in bulk. If you're interested in hearing more, I can try to whip up a blog post explaining why we've stopped doing that.

@mperham
Collaborator

mperham commented Nov 2, 2021

I'm happy with it as is; I don't think we need to expose every option at every level. Sidekiq::Client.push_bulk is still available if you need to bulk schedule.

@kellysutton kellysutton deleted the perform-bulk branch November 3, 2021 16:24
@@ -191,6 +191,12 @@ def perform_async(*args)
@klass.client_push(@opts.merge("args" => args, "class" => @klass))
end

def perform_bulk(args, batch_size: 1_000)
  args.each_slice(batch_size).flat_map do |slice|
    Sidekiq::Client.push_bulk(@opts.merge("class" => @klass, "args" => slice))
  end
end

@mperham this is a bit different than the example on https://github.com/mperham/sidekiq/wiki/Batches#huge-batches in that the jobs get added sequentially by push_bulk rather than concurrently via perform_async. On very large sets of jobs, like inserting 300,000 at once in 500-job push_bulks with about 400 worker threads, Redis sometimes times out. Maybe the recommendation for huge batches should be changed?

I suspect that sequentially pushing the jobs in bulk from a single thread performs just as well.
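The arithmetic behind that comparison, using the numbers from the comment above:

```ruby
# Round trips for the scenario above: 300,000 jobs pushed sequentially
# in 500-job bulk batches, versus one push per job from ~400 threads.
total_jobs = 300_000
batch_size = 500
bulk_round_trips = (total_jobs.to_f / batch_size).ceil # sequential, single thread
per_job_pushes   = total_jobs                          # individual perform_async calls
# 600 bulk round trips versus 300,000 individual pushes.
```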

Collaborator

I don't understand.

[screenshot]

and got many errors:

[screenshot]

Collaborator

Yeah, it’s not a great idea to do lots of concurrent bulk pushes. I think the best thing you can do is lower the bulk size from 1,000 to 100 so each bulk op is faster, but YMMV.


Maybe better to do the bulk inserts sequentially rather than concurrently? Especially if running hundreds of threads.

@ulyssesrex
Copy link

> Generally at my place of work, we've almost entirely moved away from setting 'at' on jobs, especially when enqueueing them in bulk. If you're interested in hearing more, I can try to whip up a blog post explaining why we've stopped doing that.

@kellysutton Personally, I would be interested in hearing why you avoid setting 'at' on jobs. We're experimenting with a throttling solution at the moment that explicitly sets that argument.

@kellysutton
Copy link
Contributor Author

Sure thing. at is a perfectly fine concept for most uses of Sidekiq, but it has a sharp edge that doesn't fit our expectations of the library.

Specifically, at only specifies when something should be loaded onto a queue, not when it should execute. We see this problem more often when enqueueing jobs in bulk, where expected and actual execution times can diverge.

So in our first pass of implementing .perform_bulk, we didn't include that. The at functionality is still present in the lower-level Sidekiq::Client.push_bulk if needed!
