(DOCS) Update REFERENCE.md to note a performance footgun #2357

Closed
baublet wants to merge 1 commit from patch-1

Conversation

baublet commented May 17, 2022

At my company, we ran `queue.clean(0, "waiting");` on a queue with 1.5 million tasks enqueued (we know this is the original sin, but it happened). Locally, on a couple thousand tasks, this method worked almost instantly. But in our lower environments, the Lua script it runs pinned our Redis server to 100% CPU for over 18 hours while it worked. The database stopped responding to any commands, and we could not shut it off, even through the GCP hosted-database console.

We had to replace the Redis database for that environment (the service stopped responding entirely). Luckily, we did not run this on production, or we would have been EXTRA boned. I only blocked the company for around 3 hours before we were able to recover. 😭

Note: as of 10:40am central US time, the script is still running, and the server won't respond to commands. If this wasn't blocking our developers and QA team, we might be interested to see how danged long it takes.
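
For context, the call was essentially the sketch below (the queue name and Redis connection string are placeholders, not our real setup):

```js
const Queue = require("bull");

// Illustrative setup; the queue name and Redis URL are placeholders.
const queue = new Queue("my-queue", "redis://127.0.0.1:6379");

async function cleanWaitingJobs() {
  // Remove every job in the "waiting" state immediately (grace period of 0 ms).
  // With ~1.5 million jobs enqueued, this single call is what pinned Redis.
  await queue.clean(0, "waiting");

  // clean() also accepts an optional limit argument, so the same cleanup could
  // presumably be run in bounded batches rather than one huge Lua invocation:
  // await queue.clean(0, "waiting", 10000);
}

cleanWaitingJobs().then(() => queue.close());
```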

manast commented May 18, 2022

I am sorry that this happened to you. I am not sure if this is a performance issue or something else. You mentioned the service stopped responding completely and has not recovered yet, so with this information alone it is not possible to conclude what caused the issue. We did have a performance issue with cleaning, but we merged a fix a couple of months ago, so the first thing would be to check which version of Bull is performing this operation. Secondly, you should be able to use the Redis command "MONITOR" to see what is currently happening on the Redis instance; it prints every command as it is executed, in real time.
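
For example, something like this (a rough sketch using ioredis, which Bull uses internally; the connection string is a placeholder) will stream every command the server executes:

```js
const Redis = require("ioredis");

// Point this at the affected instance; the URL here is just a placeholder.
const redis = new Redis("redis://127.0.0.1:6379");

// MONITOR mode: Redis streams back every command it executes, which makes it
// easy to see whether a long-running script or some other workload is the culprit.
redis.monitor((err, monitor) => {
  if (err) throw err;
  monitor.on("monitor", (time, args) => {
    console.log(time, args.join(" "));
  });
});
```

Keep in mind that MONITOR itself adds load, so only leave it running long enough to see what is going on.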

manast commented May 18, 2022

And here is the PR (as you can see, it should take around 10 seconds for 1M jobs):
#2326

baublet commented May 18, 2022

Oh hey, thanks for the heads-up @manast. It's high time we upgraded, then! We're still on 3.14 (fairly old), which still has some of the performance issues that have since been fixed. I'll close this and move to upgrading our internal version.

Many thanks!

baublet closed this May 18, 2022
baublet deleted the patch-1 branch May 18, 2022 14:42