
perf(clean): use ZRANGEBYSCORE to improve performance #2363

Merged
merged 1 commit into OptimalBits:develop on Jul 27, 2022

Conversation

@hsource (Contributor) commented May 27, 2022

Motivation

When running clean on large queues, it can take a long time if many entries fall within the grace period. The cleanJobsInSet Lua script never reaches the limit and iterates through every item in the set. Since Redis scripts are atomic, this can lock up the rest of the server while it runs.

Fix

For sorted sets, the timestamp is just the score. As such, if there's a limit, we can fetch all the relevant entries immediately with a single ZRANGEBYSCORE call, which also filters out the timestamps within the grace period.
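A rough sketch of the idea in Lua (setKey, prefix, and the other names here are illustrative, not the exact merged script):

  -- Fetch at most `limit` job ids whose score (the creation timestamp) is
  -- at or below maxTimestamp, in a single call, instead of paging the set.
  local jobIds = redis.call("ZRANGEBYSCORE", setKey, "-inf", maxTimestamp,
    "LIMIT", 0, limit)
  for _, jobId in ipairs(jobIds) do
    redis.call("DEL", prefix .. jobId) -- remove the job's hash
  end
  if #jobIds > 0 then
    redis.call("ZREM", setKey, unpack(jobIds)) -- drop the ids from the set
  end

The real script still inspects each job's own timestamps before deleting; the sketch only shows how the range query bounds the work to at most limit entries.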

Testing

Added automated tests

NODE_ENV=test yarn mocha -- 'test/test_queue*' -g "clean completed"

@manast (Member) commented May 28, 2022

Thanks for the PR.
I am a bit confused about this PR: we already have maxTimestamp and limit parameters that should allow you to execute the script iteratively, and the script itself is pretty fast now, taking just a couple of seconds for 1M jobs. Could you elaborate a bit more on why the current solution does not work?

@hsource (Contributor, Author) commented May 29, 2022

@manast I think PR #2326 only sped up the cases where the cleared job queue is a Redis list. We're clearing the "completed" set, which isn't a list, so that optimization doesn't apply.

limit does work, but it runs into an issue when every job is after maxTimestamp. Let's say I have a sorted set with 1 million or 10 million jobs, all of which are within the grace period. The script would need to iterate through all of those jobs before realizing "I can't delete any jobs" and exiting. This can take tens of seconds or minutes, during which services that depend on Redis would be unresponsive.

This change allows those scripts to exit quickly by taking advantage of the fact that the jobs are ordered by creation time: if we encounter one job that's after maxTimestamp, essentially every remaining job is also after maxTimestamp. We can then stop immediately instead of iterating through all the other jobs that we can't delete anyway.
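The first version of this PR implemented that early exit with a check of roughly this shape inside the scan loop (paraphrased; stopWithinGrace and jobTimestamp are illustrative names):

  -- Jobs are scanned in score order, so the first job newer than
  -- maxTimestamp implies every remaining job is newer too: stop scanning.
  if stopWithinGrace and jobTimestamp > maxTimestamp then
    break
  end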

Let me know if you want me to add more comments in the code! Happy to do that too.

@manast (Member) commented May 29, 2022

When you use limit in the sorted-set case, this is the method used:

  result = cleanSet(KEYS[1], ARGV[1], rangeStart, rangeEnd, ARGV[2], {"processedOn", "timestamp"})

So the range will be limited to the limit parameter you send to the command; there should be no issue as long as this limit is small enough, maybe on the order of 10k to 100k. If that is not the case, could you please provide some code that shows the contrary?

@hsource (Contributor, Author) commented May 29, 2022

Got it. Here's the specific example that I dealt with. I debugged this by manually running the cleanJobsInSet script with limit set to 100, and the detailed steps are in the original PR.

redis-cli --ldb --eval node_modules/bull/lib/commands/cleanJobsInSet-3.lua bull:sendEmail:completed bull:sendEmail:priority bull:sendEmail:limiter , bull:sendEmail: 1653036781370 true 100 completed

To test it without the fix, just change the "true" to "false" in that command. This is equivalent to the JavaScript code:

queue.clean(7 * 24 * 60 * 60 * 1000, 'completed', 100);

With even this very small limit, here's what happens when all of the jobs are after maxTimestamp (line numbers refer to the original cleanJobsInSet-3.lua file; the loop shape is sketched after the list):

  1. The script fetches the first batch of 100 items (line 47)
  2. deletedCount is 0, and limit is 100, so it goes into the loop (line 55)
  3. All of the items are after maxTimestamp, so at the end of the while loop, the deletedCount is still 0
  4. It fetches the next set of items (line 114)
  5. The script then repeats steps 1-4 20k times, reading all 2 million of the jobs that I had stored
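The outer loop has roughly this shape (paraphrased from cleanJobsInSet-3.lua, with illustrative names, not a verbatim excerpt):

  -- When no job is old enough to delete, deletedCount stays at 0 and the
  -- loop pages through the entire sorted set before giving up.
  local deletedCount, offset, batchSize = 0, 0, 100
  local jobIds = redis.call("ZRANGE", setKey, offset, offset + batchSize - 1)
  while #jobIds > 0 and deletedCount < limit do
    for _, jobId in ipairs(jobIds) do
      -- every job's timestamp is > maxTimestamp, so nothing is deleted here
    end
    offset = offset + batchSize
    jobIds = redis.call("ZRANGE", setKey, offset, offset + batchSize - 1)
  end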

I verified that this is the behavior by adding redis.log statements that logged the deletedCount on each iteration.

This behavior still happens with a larger limit and slows down the system, although it's somewhat faster since the batches are larger. I would consider removing the new parameter and just making the stop-after-maxTimestamp behavior the default, since I think it's a better default, but I introduced the new stopWithinGrace parameter to make the change fully backwards-compatible.

@manast (Member) commented May 30, 2022

Ok, so to summarize: the problem arises when you want to delete completed/failed jobs with a grace period and a huge number of jobs fall within that grace period.
I think this can be resolved by using Redis better than we are using it now. The ZRANGE command supports BYSCORE and also LIMIT/OFFSET arguments. So basically what we can do instead is:

  1. Calculate a min-max range based on the grace period (i.e. from -inf to the given timestamp).
  2. Limit the number of jobs using the LIMIT argument.
    It would then only return the jobs that should be deleted, so we can also remove the check in this line (see the sketch below): https://github.com/taskforcesh/bullmq/blob/master/src/commands/cleanJobsInSet-2.lua#L98
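On Redis 6.2+, that is a single call of roughly this form (illustrative names again):

  -- BYSCORE makes ZRANGE a score-range query and LIMIT caps the result,
  -- so only jobs outside the grace period are ever returned.
  local jobIds = redis.call("ZRANGE", setKey, "-inf", maxTimestamp,
    "BYSCORE", "LIMIT", 0, limit)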

@hsource (Contributor, Author) commented May 30, 2022

Got it. @manast - are you sure you'd like to use ZRANGE (or ZRANGEBYSCORE, for backwards compatibility)? I can push a new PR. It's slightly backwards-incompatible, since the score is the creation time, whereas the current limiting is done by any available timestamp (finishedOn and processedOn take precedence).

Also, just to be clear: this PR is for the old bull project that we're currently using, and not the new bullmq project. We may transition to it at some point, but that's not currently in our roadmap.

@manast (Member) commented May 31, 2022

The score for completed/failed jobs should be the same as finishedOn. For delayed jobs it represents the time left to be delayed; I do not think it makes any big difference in which order the delayed jobs are deleted.

@hsource changed the title from "perf(clean): add stopWithinGrace to improve performance" to "perf(clean): use ZRANGEBYSCORE to improve performance" on Jun 6, 2022
@hsource (Contributor, Author) commented Jun 6, 2022

@manast I changed this to use the suggested solution. Take a look!

@hsource (Contributor, Author) commented Jul 25, 2022

@manast - if you have some time, can you take a look and see if this is ready to merge? We'd love to switch off our fork and just install the normal package again. I implemented all your recommended changes and added much better tests.

@manast (Member) left a comment

LGTM

@manast merged commit 3331188 into OptimalBits:develop on Jul 27, 2022
github-actions bot pushed a commit that referenced this pull request Jul 27, 2022
## 4.8.5 (2022-07-27)

### Performance Improvements

* **clean:** use ZRANGEBYSCORE to improve performance (#2363) (3331188)
@manast
Copy link
Member

manast commented Jul 27, 2022

🎉 This PR is included in version 4.8.5 🎉

Your semantic-release bot 📦🚀
