updateDelaySet may cause very high Redis CPU usage #1893

Closed
amit-gshe opened this issue Oct 28, 2020 · 19 comments

@amit-gshe

We have a very large queue with several million delayed jobs. The queue has tens of workers (pods) in a k8s cluster, and two or three times as many pods during business peak hours. We have run into an issue where Bull causes very high Redis CPU usage; Redis then slows down and stops processing jobs. It has been a pain point for several months. We finally found that updateDelaySet gets triggered very frequently on each worker, which causes the high Redis CPU usage; the consumers then become so slow that they even stop processing waiting jobs, the waiting queue backs up, and eventually the entire Bull queue stops working.

updateDelaySet is expensive when the queue is huge, so we should stop calling updateDelaySet while the waiting queue still has too many jobs to process. We could add an option, say maxWaitingJobs, defaulting to Number.MAX, that users can set according to their needs.
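A rough sketch of how the proposed option might look in the queue configuration; note that maxWaitingJobs does not exist in Bull today (it is only the option suggested above), and the connection details and threshold value are placeholders:

const Queue = require('bull');

// Sketch only: maxWaitingJobs is the option proposed in this issue, not an existing Bull setting.
const foodQueue = new Queue('FOOD_DIGEST_JOB', {
  redis: { host: '127.0.0.1', port: 6379 },  // placeholder connection details
  settings: {
    // Hypothetical: skip updateDelaySet while the wait list already holds this many jobs.
    maxWaitingJobs: 50000
  }
});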

amit-gshe added a commit to amit-gshe/bull that referenced this issue Oct 28, 2020
@manast
Member

manast commented Oct 28, 2020

Before I dig more into this issue, could you tell me more about your queue setup? For instance, why do you have so many delayed jobs, and what is the average delay for these jobs? Are you using rate limiting?

@manast
Member

manast commented Oct 28, 2020

Also, are you using priorities or just normal jobs?

@amit-gshe
Author

amit-gshe commented Oct 28, 2020

My queue config: new Queue('FOOD_DIGEST_JOB', { redis: { host, port }, limiter: { max: 1000, duration: 1000 } }). We have millions of users who can feed fish, and every fish has a delayed job in our setup. We use normal jobs. Each of our jobs can be delayed anywhere from tens of seconds to days.
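For context, delayed jobs in a setup like this would normally be created through Bull's per-job delay option (in milliseconds); the payload and delay below are purely illustrative:

const Queue = require('bull');

const foodQueue = new Queue('FOOD_DIGEST_JOB', {
  redis: { host: '127.0.0.1', port: 6379 },  // placeholder connection details
  limiter: { max: 1000, duration: 1000 }     // at most 1000 jobs per second
});

// One delayed job per fish; in this setup the delay ranges from tens of seconds to days.
foodQueue
  .add({ fishId: 'abc123' }, { delay: 45 * 1000 })  // hypothetical payload, runs ~45 s from now
  .then(job => console.log(`queued delayed job ${job.id}`));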

@manast
Member

manast commented Oct 28, 2020

OK, so you have both delayed jobs and a rate limit of 1000 jobs per second, right?

@amit-gshe
Author

Yes, exactly.

@manast
Member

manast commented Oct 28, 2020

And where are you running your Redis instances?

@amit-gshe
Author

We run our Redis instance on AWS ElastiCache.

@manast
Member

manast commented Oct 28, 2020

Did you consider upgrading the instance type to a faster one? I know that on AWS this is wasteful, since larger instances also have more cores but Redis will only use one of them.

@manast
Member

manast commented Oct 28, 2020

I assume you are using the latest Bull version, 3.18.1, right?

@amit-gshe
Author

amit-gshe commented Oct 28, 2020

Yes, we tried upgrading our instance, but it made no difference. The queue can run fine for a period of time, but this issue still occurs occasionally. We use 3.18.

@manast
Member

manast commented Oct 28, 2020

Yes. Moving the delayed jobs back to wait is not very efficient, especially in this case where your queues are always full of jobs. I think I can implement a solution for this, but it may take a couple of days before it is ready (if everything goes well).
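For readers unfamiliar with Bull's internals: "moving delayed jobs back to wait" means promoting jobs whose delay has expired from the delayed sorted set into the wait list. The sketch below is only a rough conceptual illustration of that work; Bull actually does this atomically inside a Lua script, and its delayed-set scores also encode the job id:

const Redis = require('ioredis');
const redis = new Redis();  // placeholder connection

// Simplified concept only; not Bull's actual implementation.
async function promoteDueDelayedJobs(prefix = 'bull:FOOD_DIGEST_JOB') {
  const now = Date.now();
  // Jobs whose scheduled time has already passed...
  const due = await redis.zrangebyscore(`${prefix}:delayed`, 0, now);
  for (const jobId of due) {
    // ...are moved into the wait list so workers can pick them up.
    await redis.multi()
      .zrem(`${prefix}:delayed`, jobId)
      .lpush(`${prefix}:wait`, jobId)
      .exec();
  }
  // With millions of delayed jobs and many workers triggering this frequently, the cost adds up.
}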

manast added a commit that referenced this issue Oct 30, 2020
@manast
Member

manast commented Oct 30, 2020

I implemented some improvements that should reduce the CPU usage for a case like this. I wonder if you have a chance to verify them before we make a new release? #1897

@amit-gshe
Author

amit-gshe commented Oct 31, 2020

I tried #1897 this morning with the options new Queue('FOOD_DIGEST_JOB', { redis: { host, port }, limiter: { max: 1000, duration: 1000 } }). The CPU usage is lower, but the consuming rate got slower at the same time, so the delayed jobs pile up in the delayed set and are not consumed in time. The rate limiter setting is 1000/s, but the consume rate was only 140+/s, while the add rate was 300+/s (many jobs are being added and consumed at the same time). By the way, the Redis slow log is clean now; with the previous version it grew quickly.

@manast
Member

manast commented Oct 31, 2020

OK, I will try to mimic those numbers in a synthetic test.

@manast
Member

manast commented Oct 31, 2020

I added a small fix to the above-mentioned PR. Would you mind testing whether it works faster?

@amit-gshe
Author

amit-gshe commented Oct 31, 2020

Thanks for your work. I tested again and the consuming speed was very fast, with the rate close to the rate limiter. Jobs move to active and get consumed quickly, and the Redis CPU usage is much lower than before.

@amit-gshe
Author

I found that Queue.prototype.updateDelayTimer gets called very frequently; is that normal? I also inspected the Redis slow log and found entries like the ones below:

127.0.0.1:6371> slowlog get 10
1) Slow log id: 667021
   Start at: 1604160921
   Running time(ms): 10435
   Command: evalsha 451d4221d6a7ff2251cd0864ab0a28bdba5ea099 10 bull:FOOD_DIGEST_JOB:active bull:FOOD_DIGEST_JOB:wait bull:FOOD_DIGEST_JOB:delayed bull:FOOD_DIGEST_JOB:paused bull:FOOD_DIGEST_JOB:completed bull:FOOD_DIGEST_JOB:failed bull:FOOD_DIGEST_JOB:priority bull:FOOD_DIGEST_JOB:5f9d708c0397ad2ac118dcc4 bull:FOOD_DIGEST_JOB:5f9d708c0397ad2ac118dcc4:logs bull:FOOD_DIGEST_JOB:limiter 5f9d708c0397ad2ac118dcc4 aa11c239-d230-4300-9722-283e3851da24
   Client IP and port: 10.5.229.39:59240
   Client name:
2) Slow log id: 667020
   Start at: 1604160921
   Running time(ms): 10390
   Command: LREM bull:FOOD_DIGEST_JOB:wait 0 5f9d708c0397ad2ac118dcc4
   Client IP and port: ?:0
   Client name:
3) Slow log id: 667019
   Start at: 1604160919
   Running time(ms): 10717
   Command: evalsha 451d4221d6a7ff2251cd0864ab0a28bdba5ea099 10 bull:FOOD_DIGEST_JOB:active bull:FOOD_DIGEST_JOB:wait bull:FOOD_DIGEST_JOB:delayed bull:FOOD_DIGEST_JOB:paused bull:FOOD_DIGEST_JOB:completed bull:FOOD_DIGEST_JOB:failed bull:FOOD_DIGEST_JOB:priority bull:FOOD_DIGEST_JOB:5f9d5471569a9534485395e3 bull:FOOD_DIGEST_JOB:5f9d5471569a9534485395e3:logs bull:FOOD_DIGEST_JOB:limiter 5f9d5471569a9534485395e3 9f802678-ca99-4050-a253-78338e6dc3be
   Client IP and port: 10.5.228.117:59884
   Client name:
4) Slow log id: 667018
   Start at: 1604160919
   Running time(ms): 10664
   Command: LREM bull:FOOD_DIGEST_JOB:wait 0 5f9d5471569a9534485395e3
   Client IP and port: ?:0
   Client name:

However, the slow log length was just 4 and did not increase any further.
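For monitoring this over time, the slow log can be inspected and cleared with standard Redis commands (descriptions in parentheses):

SLOWLOG LEN (number of entries currently recorded)
SLOWLOG GET 10 (show the 10 most recent slow commands, as in the output above)
SLOWLOG RESET (clear the log before re-testing)
CONFIG GET slowlog-log-slower-than (logging threshold in microseconds; the Redis default is 10000, i.e. 10 ms)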

@manast
Member

manast commented Oct 31, 2020

Yes, I think it is normal. My fix basically disables the updateDelayTimer as long as there are jobs in the queue to be processed, but sometimes when the queue gets rate limited it will kick in again. It is very difficult to make it perfect using only Lua scripts. I am working on a Redis module that solves all these problems properly, but it will not work on AWS ElastiCache since ElastiCache does not support modules.

@manast
Member

manast commented Nov 24, 2020

Closing this since there are no more actions that I can take.

@manast manast closed this as completed Nov 24, 2020