updateDelaySet may cause very high Redis CPU usage #1893

Closed
amit-gshe opened this issue Oct 28, 2020 · 19 comments

@amit-gshe

We have a very large queue with several million delayed jobs. The queue has tens of workers (pods) in a k8s cluster, and two or three times as many pods during business peak hours. We have run into an issue where Bull causes very high Redis CPU usage; Redis then slows down and stops processing jobs. It has been a pain point for several months. We finally found that updateDelaySet gets triggered very frequently on each worker, which causes the high Redis CPU usage; the consumers then become so slow that they even stop processing waiting jobs, the waiting queue backs up, and eventually the entire Bull queue stops working.

updateDelaySet is expensive when the queue is huge, so we should stop calling updateDelaySet while the waiting queue still has too many jobs to process. We could add an option, say maxWaitingJobs, defaulting to Number.MAX, that users can set according to their needs.
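A rough sketch of how the proposed option might look in the queue configuration; note that maxWaitingJobs does not exist in Bull today (it is only the option suggested above), and the connection details and threshold value are placeholders:

const Queue = require('bull');

// Sketch only: maxWaitingJobs is the option proposed in this issue, not an existing Bull setting.
const foodQueue = new Queue('FOOD_DIGEST_JOB', {
  redis: { host: '127.0.0.1', port: 6379 },  // placeholder connection details
  settings: {
    // Hypothetical: skip updateDelaySet while the wait list already holds this many jobs.
    maxWaitingJobs: 50000
  }
});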

amit-gshe added a commit to amit-gshe/bull that referenced this issue Oct 28, 2020
@manast
Member

manast commented Oct 28, 2020

Before I dig more into this issue, could you tell me more about your queue setup? For instance, why do you have so many delayed jobs, and what is the average delay for these jobs? Are you using rate limiting?

@manast
Member

manast commented Oct 28, 2020

Also, are you using priorities or just normal jobs?

@amit-gshe
Author

amit-gshe commented Oct 28, 2020

My queue config: new Queue('FOOD_DIGEST_JOB', { redis: { host, port }, limiter: { max: 1000, duration: 1000 } }). We have millions of users who can feed fish, and every fish has a delayed job in our setup. We use normal jobs. Each of our jobs can be delayed anywhere from tens of seconds to days.
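For context, delayed jobs in a setup like this would normally be created through Bull's per-job delay option (in milliseconds); the payload and delay below are purely illustrative:

const Queue = require('bull');

const foodQueue = new Queue('FOOD_DIGEST_JOB', {
  redis: { host: '127.0.0.1', port: 6379 },  // placeholder connection details
  limiter: { max: 1000, duration: 1000 }     // at most 1000 jobs per second
});

// One delayed job per fish; in this setup the delay ranges from tens of seconds to days.
foodQueue
  .add({ fishId: 'abc123' }, { delay: 45 * 1000 })  // hypothetical payload, runs ~45 s from now
  .then(job => console.log(`queued delayed job ${job.id}`));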

@manast
Member

manast commented Oct 28, 2020

OK, so you have both delayed jobs and a rate limit of 1000 jobs per second, right?

@amit-gshe
Author

Yes, exactly.

@manast
Member

manast commented Oct 28, 2020

And where are you running your Redis instances?

@amit-gshe
Author

We run our Redis instance on AWS ElastiCache.

@manast
Member

manast commented Oct 28, 2020

Did you consider upgrading the instance type to a faster one? I know that on AWS this is wasteful, since larger instances also have more cores but Redis will only use one of them.

@manast
Member

manast commented Oct 28, 2020

I assume you are using the latest Bull version, 3.18.1, right?

@amit-gshe
Author

amit-gshe commented Oct 28, 2020

Yes, we tried upgrading our instance, but it made no difference. The queue can run fine for a period of time, but this issue still occurs occasionally. We use 3.18.

@manast
Member

manast commented Oct 28, 2020

Yes. Moving the delayed jobs back to wait is not very efficient, especially in this case where your queues are always full of jobs. I think I can implement a solution for this, but it may take a couple of days before it is ready (if everything goes well).
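For readers unfamiliar with Bull's internals: "moving delayed jobs back to wait" means promoting jobs whose delay has expired from the delayed sorted set into the wait list. The sketch below is only a rough conceptual illustration of that work; Bull actually does this atomically inside a Lua script, and its delayed-set scores also encode the job id:

const Redis = require('ioredis');
const redis = new Redis();  // placeholder connection

// Simplified concept only; not Bull's actual implementation.
async function promoteDueDelayedJobs(prefix = 'bull:FOOD_DIGEST_JOB') {
  const now = Date.now();
  // Jobs whose scheduled time has already passed...
  const due = await redis.zrangebyscore(`${prefix}:delayed`, 0, now);
  for (const jobId of due) {
    // ...are moved into the wait list so workers can pick them up.
    await redis.multi()
      .zrem(`${prefix}:delayed`, jobId)
      .lpush(`${prefix}:wait`, jobId)
      .exec();
  }
  // With millions of delayed jobs and many workers triggering this frequently, the cost adds up.
}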

manast added a commit that referenced this issue Oct 30, 2020
@manast
Member

manast commented Oct 30, 2020

I implemented some improvements that should reduce the CPU usage for a case like this. I wonder if you have a chance to verify them before we make a new release? #1897

@amit-gshe
Author

amit-gshe commented Oct 31, 2020

I tried #1897 this morning with the options new Queue('FOOD_DIGEST_JOB', { redis: { host, port }, limiter: { max: 1000, duration: 1000 } }). The CPU usage is lower, but the consuming rate got slower at the same time, so the delayed jobs pile up in the delayed set and are not consumed in time. The rate limiter setting is 1000/s, but the consume rate was only 140+/s, while the add rate was 300+/s (many jobs are being added and consumed at the same time). By the way, the Redis slow log is clean now; with the previous version it grew quickly.

@manast
Member

manast commented Oct 31, 2020

OK, I will try to mimic those numbers in a synthetic test.

@manast
Member

manast commented Oct 31, 2020

I added a small fix to the above-mentioned PR. Would you mind testing whether it works faster?

@amit-gshe
Author

amit-gshe commented Oct 31, 2020

Thanks for your work. I tested again and the consuming speed was very fast, with the rate close to the rate limiter. Jobs move to active and get consumed quickly, and the Redis CPU usage is much lower than before.

@amit-gshe
Author

I found that Queue.prototype.updateDelayTimer gets called very frequently; is that normal? I also inspected the Redis slow log and found entries like the ones below:

127.0.0.1:6371> slowlog get 10
1) Slow log id: 667021
   Start at: 1604160921
   Running time(ms): 10435
   Command: evalsha 451d4221d6a7ff2251cd0864ab0a28bdba5ea099 10 bull:FOOD_DIGEST_JOB:active bull:FOOD_DIGEST_JOB:wait bull:FOOD_DIGEST_JOB:delayed bull:FOOD_DIGEST_JOB:paused bull:FOOD_DIGEST_JOB:completed bull:FOOD_DIGEST_JOB:failed bull:FOOD_DIGEST_JOB:priority bull:FOOD_DIGEST_JOB:5f9d708c0397ad2ac118dcc4 bull:FOOD_DIGEST_JOB:5f9d708c0397ad2ac118dcc4:logs bull:FOOD_DIGEST_JOB:limiter 5f9d708c0397ad2ac118dcc4 aa11c239-d230-4300-9722-283e3851da24
   Client IP and port: 10.5.229.39:59240
   Client name:
2) Slow log id: 667020
   Start at: 1604160921
   Running time(ms): 10390
   Command: LREM bull:FOOD_DIGEST_JOB:wait 0 5f9d708c0397ad2ac118dcc4
   Client IP and port: ?:0
   Client name:
3) Slow log id: 667019
   Start at: 1604160919
   Running time(ms): 10717
   Command: evalsha 451d4221d6a7ff2251cd0864ab0a28bdba5ea099 10 bull:FOOD_DIGEST_JOB:active bull:FOOD_DIGEST_JOB:wait bull:FOOD_DIGEST_JOB:delayed bull:FOOD_DIGEST_JOB:paused bull:FOOD_DIGEST_JOB:completed bull:FOOD_DIGEST_JOB:failed bull:FOOD_DIGEST_JOB:priority bull:FOOD_DIGEST_JOB:5f9d5471569a9534485395e3 bull:FOOD_DIGEST_JOB:5f9d5471569a9534485395e3:logs bull:FOOD_DIGEST_JOB:limiter 5f9d5471569a9534485395e3 9f802678-ca99-4050-a253-78338e6dc3be
   Client IP and port: 10.5.228.117:59884
   Client name:
4) Slow log id: 667018
   Start at: 1604160919
   Running time(ms): 10664
   Command: LREM bull:FOOD_DIGEST_JOB:wait 0 5f9d5471569a9534485395e3
   Client IP and port: ?:0
   Client name:

However, the slow log length was just 4 and did not increase any further.
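For monitoring this over time, the slow log can be inspected and cleared with standard Redis commands (descriptions in parentheses):

SLOWLOG LEN (number of entries currently recorded)
SLOWLOG GET 10 (show the 10 most recent slow commands, as in the output above)
SLOWLOG RESET (clear the log before re-testing)
CONFIG GET slowlog-log-slower-than (logging threshold in microseconds; the Redis default is 10000, i.e. 10 ms)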

@manast
Member

manast commented Oct 31, 2020

Yes, I think it is normal. My fix basically disables the updateDelayTimer as long as there are jobs in the queue to be processed, but sometimes when the queue gets rate limited it will kick in again. It is very difficult to make it perfect using only Lua scripts. I am working on a Redis module that solves all these problems properly, but it will not work on AWS ElastiCache since ElastiCache does not support modules.

@manast
Member

manast commented Nov 24, 2020

Closing this since there are no more actions that I can take.

@manast manast closed this as completed Nov 24, 2020