
perf(clean): use ZRANGEBYSCORE to improve performance #2363

Merged
merged 1 commit into OptimalBits:develop on Jul 27, 2022

Conversation

@hsource (Contributor) commented May 27, 2022

Motivation

When running clean on large queues, it can take a long time if many entries fall within the grace period. The cleanJobsInSet Lua script never reaches the limit and iterates through every item in the set. Since Redis scripts are atomic, this can lock up the rest of the server while it runs.

Fix

For sorted sets, the timestamp is just the score. As such, if there's a limit, we can fetch all the relevant entries immediately with a single ZRANGEBYSCORE call, which also filters out the timestamps within the grace period.
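A rough sketch of the idea in Lua (setKey, prefix, and the other names here are illustrative, not the exact merged script):

  -- Fetch at most `limit` job ids whose score (the creation timestamp) is
  -- at or below maxTimestamp, in a single call, instead of paging the set.
  local jobIds = redis.call("ZRANGEBYSCORE", setKey, "-inf", maxTimestamp,
    "LIMIT", 0, limit)
  for _, jobId in ipairs(jobIds) do
    redis.call("DEL", prefix .. jobId) -- remove the job's hash
  end
  if #jobIds > 0 then
    redis.call("ZREM", setKey, unpack(jobIds)) -- drop the ids from the set
  end

The real script still inspects each job's own timestamps before deleting; the sketch only shows how the range query bounds the work to at most limit entries.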

Testing

Added automated tests

NODE_ENV=test yarn mocha -- 'test/test_queue*' -g "clean completed"

@manast (Member) commented May 28, 2022

Thanks for the PR.
I am a bit confused about this PR: we already have maxTimestamp and limit parameters that should allow you to execute the script iteratively, and the script itself is pretty fast now, taking just a couple of seconds for 1M jobs. Could you elaborate a bit more on why the current solution does not work?

@hsource (Contributor, Author) commented May 29, 2022

@manast I think PR #2326 only sped up the cases where the cleared job queue is a Redis list. We're clearing the "completed" set, which isn't a list, so that optimization doesn't apply.

limit does work, but it runs into an issue when every job is after maxTimestamp. Let's say I have a sorted set with 1 million or 10 million jobs, all of which are within the grace period. The script would need to iterate through all of those jobs before realizing "I can't delete any jobs" and exiting. This can take tens of seconds or minutes, during which services that depend on Redis would be unresponsive.

This change allows those scripts to exit quickly by taking advantage of the fact that the jobs are ordered by creation time: if we encounter one job that's after maxTimestamp, essentially every remaining job is also after maxTimestamp. We can then stop immediately instead of iterating through all the other jobs that we can't delete anyway.
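The first version of this PR implemented that early exit with a check of roughly this shape inside the scan loop (paraphrased; stopWithinGrace and jobTimestamp are illustrative names):

  -- Jobs are scanned in score order, so the first job newer than
  -- maxTimestamp implies every remaining job is newer too: stop scanning.
  if stopWithinGrace and jobTimestamp > maxTimestamp then
    break
  end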

Let me know if you want me to add more comments in the code! Happy to do that too.

@manast (Member) commented May 29, 2022

When you use limit in the sorted-set case, this is the method used:

  result = cleanSet(KEYS[1], ARGV[1], rangeStart, rangeEnd, ARGV[2], {"processedOn", "timestamp"})

So the range will be limited to the limit parameter you send to the command; there should be no issue as long as this limit is small enough, maybe on the order of 10k to 100k. If that is not the case, could you please provide some code that shows the contrary?

@hsource (Contributor, Author) commented May 29, 2022

Got it. Here's the specific example that I dealt with. I debugged this by manually running the cleanJobsInSet script with limit set to 100, and the detailed steps are in the original PR.

redis-cli --ldb --eval node_modules/bull/lib/commands/cleanJobsInSet-3.lua bull:sendEmail:completed bull:sendEmail:priority bull:sendEmail:limiter , bull:sendEmail: 1653036781370 true 100 completed

To test it without the fix, just change the "true" to "false" in that command. This is equivalent to the JavaScript code:

queue.clean(7 * 24 * 60 * 60 * 1000, 'completed', 100);

With even this very small limit, here's what happens when all of the jobs are after maxTimestamp (line numbers refer to the original cleanJobsInSet-3.lua file; the loop shape is sketched after the list):

  1. The script fetches the first batch of 100 items (line 47)
  2. deletedCount is 0, and limit is 100, so it goes into the loop (line 55)
  3. All of the items are after maxTimestamp, so at the end of the while loop, the deletedCount is still 0
  4. It fetches the next set of items (line 114)
  5. The script then repeats steps 1-4 20k times, reading all 2 million of the jobs that I had stored
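The outer loop has roughly this shape (paraphrased from cleanJobsInSet-3.lua, with illustrative names, not a verbatim excerpt):

  -- When no job is old enough to delete, deletedCount stays at 0 and the
  -- loop pages through the entire sorted set before giving up.
  local deletedCount, offset, batchSize = 0, 0, 100
  local jobIds = redis.call("ZRANGE", setKey, offset, offset + batchSize - 1)
  while #jobIds > 0 and deletedCount < limit do
    for _, jobId in ipairs(jobIds) do
      -- every job's timestamp is > maxTimestamp, so nothing is deleted here
    end
    offset = offset + batchSize
    jobIds = redis.call("ZRANGE", setKey, offset, offset + batchSize - 1)
  end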

I verified that this is the behavior by adding redis.log statements that logged the deletedCount on each iteration.

This behavior still happens with a larger limit and slows down the system, although it's somewhat faster since the batches are larger. I would consider removing the new parameter and just making the stop-after-maxTimestamp behavior the default, since I think it's a better default, but I introduced the new stopWithinGrace parameter to make the change fully backwards-compatible.

@manast (Member) commented May 30, 2022

Ok, so to summarize: the problem arises when you want to delete completed/failed jobs with a grace period and a huge number of jobs fall within that grace period.
I think this can be resolved by using Redis better than we are using it now. The ZRANGE command supports BYSCORE and also LIMIT/OFFSET arguments. So basically what we can do instead is:

  1. Calculate a min-max range based on the grace period (i.e. from -inf to the given timestamp).
  2. Limit the number of jobs using the LIMIT argument.
    It would then only return the jobs that should be deleted, so we can also remove the check in this line (see the sketch below): https://github.com/taskforcesh/bullmq/blob/master/src/commands/cleanJobsInSet-2.lua#L98
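On Redis 6.2+, that is a single call of roughly this form (illustrative names again):

  -- BYSCORE makes ZRANGE a score-range query and LIMIT caps the result,
  -- so only jobs outside the grace period are ever returned.
  local jobIds = redis.call("ZRANGE", setKey, "-inf", maxTimestamp,
    "BYSCORE", "LIMIT", 0, limit)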

@hsource (Contributor, Author) commented May 30, 2022

Got it. @manast - are you sure you'd like to use ZRANGE (or ZRANGEBYSCORE, for backwards compatibility)? I can push a new PR. It's slightly backwards-incompatible, since the score is the creation time, whereas the current limiting is done by any available timestamp (finishedOn and processedOn take precedence).

Also, just to be clear: this PR is for the old bull project that we're currently using, and not the new bullmq project. We may transition to it at some point, but that's not currently in our roadmap.

@manast (Member) commented May 31, 2022

The score for completed/failed jobs should be the same as finishedOn. For delayed jobs it represents the time left to be delayed; I do not think it makes any big difference in which order the delayed jobs are deleted.

@hsource changed the title from "perf(clean): add stopWithinGrace to improve performance" to "perf(clean): use ZRANGEBYSCORE to improve performance" on Jun 6, 2022
@hsource (Contributor, Author) commented Jun 6, 2022

@manast I changed this to use the suggested solution. Take a look!

@hsource (Contributor, Author) commented Jul 25, 2022

@manast - if you have some time, can you take a look and see if this is ready to merge? We'd love to switch off our fork and just install the normal package again. I implemented all your recommended changes and added much better tests.

@manast (Member) left a comment

LGTM

@manast merged commit 3331188 into OptimalBits:develop on Jul 27, 2022
github-actions bot pushed a commit that referenced this pull request Jul 27, 2022
## 4.8.5 (2022-07-27)

### Performance Improvements

* **clean:** use ZRANGEBYSCORE to improve performance (#2363) (3331188)
@manast
Copy link
Member

manast commented Jul 27, 2022

🎉 This PR is included in version 4.8.5 🎉

Your semantic-release bot 📦🚀
