(DOCS) Update REFERENCE.md to note a performance footgun #2357

Closed
baublet wants to merge 1 commit from patch-1

Conversation

baublet commented May 17, 2022

At my company, we ran `queue.clean(0, "waiting");` on a queue with 1.5 million tasks enqueued (we know this is the original sin, but it happened). Locally, on a couple thousand tasks, this method worked almost instantly. But in our lower environments, the Lua script it runs pinned our Redis server to 100% CPU for over 18 hours while it worked. The database stopped responding to any commands, and we could not shut it off, even through the GCP hosted-database console.

We had to replace the Redis database for that environment (the service stopped responding entirely). Luckily, we did not run this on production, or we would have been EXTRA boned. I only blocked the company for around 3 hours before we were able to recover. 😭

Note: as of 10:40am central US time, the script is still running, and the server won't respond to commands. If this wasn't blocking our developers and QA team, we might be interested to see how danged long it takes.
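
For context, the call was essentially the sketch below (the queue name and Redis connection string are placeholders, not our real setup):

```js
const Queue = require("bull");

// Illustrative setup; the queue name and Redis URL are placeholders.
const queue = new Queue("my-queue", "redis://127.0.0.1:6379");

async function cleanWaitingJobs() {
  // Remove every job in the "waiting" state immediately (grace period of 0 ms).
  // With ~1.5 million jobs enqueued, this single call is what pinned Redis.
  await queue.clean(0, "waiting");

  // clean() also accepts an optional limit argument, so the same cleanup could
  // presumably be run in bounded batches rather than one huge Lua invocation:
  // await queue.clean(0, "waiting", 10000);
}

cleanWaitingJobs().then(() => queue.close());
```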

manast commented May 18, 2022

I am sorry that this happened to you. I am not sure if this is a performance issue or something else. You mentioned the service stopped responding completely and has not recovered yet, so with this information alone it is not possible to conclude what caused the issue. We did have a performance issue with cleaning, but we merged a fix a couple of months ago, so the first thing would be to check which version of Bull is performing this operation. Secondly, you should be able to use the Redis command "MONITOR" to see what is currently happening on the Redis instance; it prints every command as it is executed, in real time.
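
For example, something like this (a rough sketch using ioredis, which Bull uses internally; the connection string is a placeholder) will stream every command the server executes:

```js
const Redis = require("ioredis");

// Point this at the affected instance; the URL here is just a placeholder.
const redis = new Redis("redis://127.0.0.1:6379");

// MONITOR mode: Redis streams back every command it executes, which makes it
// easy to see whether a long-running script or some other workload is the culprit.
redis.monitor((err, monitor) => {
  if (err) throw err;
  monitor.on("monitor", (time, args) => {
    console.log(time, args.join(" "));
  });
});
```

Keep in mind that MONITOR itself adds load, so only leave it running long enough to see what is going on.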

manast commented May 18, 2022

And here is the PR (as you can see, it should take around 10 seconds for 1M jobs):
#2326

baublet commented May 18, 2022

Oh hey, thanks for the heads-up @manast. It's high time we upgraded, then! We're still on 3.14 (fairly old), which still has some of the performance issues that have since been fixed. I'll close this and move to upgrading our internal version.

Many thanks!

baublet closed this May 18, 2022
baublet deleted the patch-1 branch May 18, 2022 14:42