
perf: speed up clean operation #2326

Merged: 1 commit into OptimalBits:develop on Mar 21, 2022

Conversation

emcsween (Contributor) commented Mar 18, 2022

Hi there,

Thanks for this very useful queue library. We've been using it successfully for a while and it's been very robust. However, we recently managed to freeze our Redis server when we tried to clean around 1 million jobs from a wait queue using the clean operation. This PR is a suggestion for improving the speed of the clean operation so that it takes seconds instead of hours on queues with millions of jobs.

The clean operation on sets backed by lists (wait, active, paused) quickly gets very slow when the list is large. This is because each job deletion scans the whole list in a LREM call, resulting in O(N * M) complexity where N is the number of jobs in the list and M is the number of jobs to delete.

With this change, the deletion is done in two passes. The first pass sets each item that should be deleted to a special value. The second pass deletes all items with that special value in a single LREM call. This results in O(N) complexity.
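The two passes can be sketched outside Redis with a plain array standing in for the list (the helper name and the `''` sentinel here are hypothetical stand-ins for illustration, not the PR's actual Lua script):

```javascript
// Model a Redis list as a plain array to illustrate the two passes.
// markAndSweep is a hypothetical helper, not code from the PR.
function markAndSweep(list, shouldDelete) {
  const MARKER = ''; // sentinel that no real jobId can equal

  // Pass 1: O(N) total. Overwrite each doomed entry in place (like LSET),
  // instead of issuing one O(N) LREM per job.
  for (let i = 0; i < list.length; i++) {
    if (shouldDelete(list[i])) list[i] = MARKER;
  }

  // Pass 2: one sweep removing every marker (like a single LREM key 0 '').
  return list.filter((item) => item !== MARKER);
}

const list = ['job:1', 'job:2', 'job:3', 'job:4'];
const remaining = markAndSweep(list, (id) => id === 'job:2' || id === 'job:4');
console.log(remaining); // ['job:1', 'job:3']
```

The key point is that each pass touches every element at most once, so the total work stays linear in the list length regardless of how many jobs are deleted.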

Benchmarks

I ran some (not super accurate) benchmarks on my laptop to show the effect when cleaning either all jobs or 1000 jobs in queues of different sizes:

| Queue size | Clean 1000 jobs (before) | Clean 1000 jobs (after) | Clean all jobs (before) | Clean all jobs (after) |
| ---------- | ------------------------ | ----------------------- | ----------------------- | ---------------------- |
| 1K         | 27 ms                    | 10 ms                   | 27 ms                   | 16 ms                  |
| 10K        | 331 ms                   | 11 ms                   | 1.7 s                   | 100 ms                 |
| 100K       | 3.7 s                    | 14 ms                   | 3 minutes               | 900 ms                 |
| 1M         | 42.1 s                   | 50 ms                   | not measured (would take hours) | 11.7 s         |

My benchmark script, for reference:

```js
import Queue from 'bull'

const queue = new Queue('test')

await queue.empty()

for (const msgCount of [1000, 10000, 100000, 1000000]) {
  console.log(`Queue size: ${msgCount}`)

  // addBulk expects entries of the form { data, opts? }
  const jobs = new Array(msgCount).fill({ data: { some: 'data' } })
  console.time('add')
  await queue.addBulk(jobs)
  console.timeEnd('add')

  console.time('clean')
  // a grace period of 0 ms cleans every waiting job
  await queue.clean(0, 'wait')
  console.timeEnd('clean')

  console.log()
}

await queue.close()
```

Alternative implementation

Figuring out the index for the LSET that sets the deletion marker is a bit tricky because we traverse the list in batches from the end. This reverse traversal was introduced in #2205 to speed up the clean operation when a limit is given. The script could be slightly simplified if we traversed the list in batches from the beginning. I'd be happy to have a go at it if that makes more sense.
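To illustrate why the index bookkeeping is tricky, here is a sketch (a hypothetical helper, not the PR's Lua) of walking a list in batches from the end while computing the negative index each element would need for an `LSET`. Redis accepts negative indices, where `-1` is the last element:

```javascript
// Walk `list` in batches of `batchSize` starting from the end, the way the
// clean script traverses the queue. For each batch (what LRANGE would return),
// negIndex(j) gives the negative index of batch[j], usable directly in LSET.
function* reverseBatches(list, batchSize) {
  const total = Math.ceil(list.length / batchSize);
  for (let k = 0; k < total; k++) {
    const start = Math.max(list.length - (k + 1) * batchSize, 0);
    const end = list.length - k * batchSize; // exclusive
    const batch = list.slice(start, end);    // stands in for LRANGE
    // batch[j] sits at absolute index start + j, i.e. negative index
    // (start + j) - list.length. This also handles a short final batch.
    const negIndex = (j) => start + j - list.length;
    yield { batch, negIndex };
  }
}

const demo = ['a', 'b', 'c', 'd', 'e'];
for (const { batch } of reverseBatches(demo, 2)) console.log(batch);
// logs ['d', 'e'], then ['b', 'c'], then ['a']
```

The subtlety the comment alludes to is exactly this offset arithmetic: each batch starts at a different distance from the end, and the final batch at the head of the list may be shorter than `batchSize`.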

manast (Member) left a comment:

Thanks, this was a very clever optimization. I wrote a small comment.

```diff
@@ -63,7 +65,14 @@ while ((limit <= 0 or deletedCount < limit) and next(jobIds, nil) ~= nil) do
   jobTS = rcall("HGET", jobKey, "timestamp")
   if (not jobTS or jobTS < maxTimestamp) then
     if isList then
-      rcall("LREM", setKey, 0, jobId)
+      if deletionMarker == nil then
```
manast (Member) commented on the diff:

Ok, so the risk is picking a deletion marker that could collide with a jobId that we are not going to delete, and that is why you pick the first jobId. This is a bit unfortunate: if we could use a short, special value we would 1) skip this "if test" for every item and 2) avoid a potentially very large jobId that takes more time and memory to store and delete. Did you try using the Lua value nil as the marker?

manast (Member) commented:

Alternatively, the value 0 could be used, but then we would need to raise an error if a user tries to use a custom jobId of 0. I am quite sure nobody does this, but the check should be in place to avoid possible errors.

emcsween (Contributor, Author) replied:

That's a good point about possibly large job ids. Since Redis list values are strings, a natural empty value is the empty string. Luckily, it looks like it's already impossible to add a job whose id is the empty string. Testing with:

```js
queue.add({ some: 'data' }, { jobId: '' })
```

silently ignores the `jobId` and generates a numeric id instead. This is a consequence of the way the `addJob` Lua script uses the empty string as a signal that no `jobId` was passed.

I'll make that change right away.
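As a side note, the collision risk discussed above can be sketched in a few lines of JavaScript (illustrative only, with a hypothetical helper, not Bull code): a marker that could equal a real jobId deletes the wrong job, while the empty string is safe precisely because `addJob` never assigns it:

```javascript
// sweepWithMarker is a hypothetical stand-in for the mark-then-LREM pattern.
function sweepWithMarker(list, idsToDelete, marker) {
  const marked = list.map((id) => (idsToDelete.has(id) ? marker : id));
  return marked.filter((id) => id !== marker); // the single LREM-style pass
}

const list = ['0', '1', '2'];   // '0' is a legitimate custom jobId
const toDelete = new Set(['1']);

// Using '0' as the marker also removes the real job '0':
console.log(sweepWithMarker(list, toDelete, '0')); // ['2']

// The empty string cannot collide with any real jobId:
console.log(sweepWithMarker(list, toDelete, ''));  // ['0', '2']
```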

emcsween (Contributor, Author) commented:

Thanks for the review, @manast! I changed the deletion marker to the empty string and reran the tests.

@emcsween emcsween requested a review from manast March 19, 2022 17:03
emcsween (Contributor, Author) commented:

I added a delay to the test I added in the hope that it will help it pass in CI.

@manast manast merged commit ef5f471 into OptimalBits:develop Mar 21, 2022
github-actions bot pushed a commit that referenced this pull request Mar 21, 2022
## [4.8.1](v4.8.0...v4.8.1) (2022-03-21)

### Performance Improvements

* speed up clean operation ([#2326](#2326)) ([ef5f471](ef5f471))
manast (Member) commented Mar 21, 2022:

🎉 This PR is included in version 4.8.1 🎉

The release is available on:

Your semantic-release bot 📦🚀
