Bull process stops consuming new messages #2612

Closed
Shahtaj-Khalid opened this issue Jun 12, 2023 · 14 comments

@Shahtaj-Khalid

Shahtaj-Khalid commented Jun 12, 2023

Description

I'm initialising Bull's process handler at the start of my worker (running via Docker and Kubernetes), and it listens for messages on the configured queue. The problem is that after some time (a few hours, not a fixed interval), Bull stops consuming new messages, even though jobs do exist in the wait queue (I have checked in Redis).
When I restart my worker pod, it starts consuming those jobs again.
No error arrives on the 'error' event right away when the consumer stops processing new jobs, but sometimes I observed the error below in my worker after a few more hours:

 at TCP.onStreamRead (internal/stream_base_commons.js:209:20) {
  errno: -104,
  code: 'ECONNRESET',
  syscall: 'read'
}

Minimal, Working Test code to reproduce the issue.

This is how I'm initialising Bull and registering the process handler:

import Bull from 'bull';

Worker.jobQueue = new Bull(jobName, {
  prefix,
  redis: redisOptions,
  enableReadyCheck: false,
  settings: { maxStalledCount: 30 },
});

Worker.jobQueue.process(flags.concurrency, async (job) => this.runJob(job));
...

async runJob(job: Bull.Job): Promise<IBullJobResponse> {
  // some code
  return {
    success: true,
  };
}

Also, please note: when I restart my Redis pod, Redis throws an unreachable error, the connection recovers, and Bull continues to consume new messages. This issue seems to happen only when Redis throws a Connection reset by peer error, and that error is not caught by the 'error' event right away either. It takes some hours, and once the 'error' event is finally received, Bull starts processing the queued jobs again.

Bull version

"bull": "^4.10.2",
"ioredis": "^4.28.5"
Nodejs: 14.15

Additional information

Since I'm not receiving any error event, this is hard to debug. Kindly let me know what could possibly trigger this issue; it's a severe issue in my case since we are relying on Bull for all data processing. Thank you.

@qlereboursBS

Can you try without an async function please?
I just had the same issue and it seems to work with the done function.

Worker.jobQueue.process(flags.concurrency,(job) =>
		this.runJob(job).then(() => done())
);

@manast
Member

manast commented Jun 25, 2023

@qlereboursBS there would be no difference between using done (without async), returning a promise, or using async. Just do not define an async function (or return a promise) and use done at the same time. In other words, the code you pasted above will not work.
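
For readers following along, here is a minimal sketch of the two processor shapes being contrasted (queue names and job logic are placeholders, not from this issue): register either an async/promise processor or a done-callback processor, never both styles in the same handler.

import Bull from 'bull';

// Two throwaway queues so both valid processor shapes can be shown side by side.
const promiseQueue = new Bull('promise-style');
const callbackQueue = new Bull('callback-style');

// Shape 1: async (or promise-returning) processor; return the result, never call done.
promiseQueue.process(async (job) => {
  return { success: true };
});

// Shape 2: callback processor; accept done as the second argument and do not return a promise.
callbackQueue.process((job, done) => {
  runJob(job)
    .then((result) => done(null, result))
    .catch((err) => done(err));
});

// Hypothetical stand-in for the job logic from the original snippet.
function runJob(job: Bull.Job): Promise<{ success: boolean }> {
  return Promise.resolve({ success: true });
}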

@Shahtaj-Khalid
Author

@qlereboursBS I have tried without async too, and I'm observing the same issue.
I updated the ioredis version to 5.2.4, still observing the same issue.

@manast I came across a similar issue: #890. I'm passing maxRetriesPerRequest: null and enableReadyCheck: false in the redis options as well, and I'm still observing the same issue.
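
For context, a rough sketch of how those two options end up in the Bull constructor (the queue name, prefix, host, and port here are placeholders, not from this issue):

import Bull from 'bull';

const queue = new Bull('my-queue', {
  prefix: 'my-prefix',            // placeholder
  redis: {
    host: '127.0.0.1',            // placeholder
    port: 6379,
    maxRetriesPerRequest: null,   // ioredis: never fail a command, wait until the connection is alive again
    enableReadyCheck: false,      // ioredis: skip the ready check after (re)connecting
  },
  settings: { maxStalledCount: 30 },
});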

Can you please guide me on what to do here?

@manast
Member

manast commented Jun 28, 2023

By any chance could you test with BullMQ instead? If BullMQ does not suffer from this problem, then we know it is something specific to Bull, and it may be easier to spot the reason for it.
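
For anyone who wants to run that comparison, a rough sketch of the equivalent consumer in BullMQ (the queue name, connection details, and concurrency are placeholders):

import { Worker } from 'bullmq';

// Consumes the same kind of jobs as the Bull processor above.
const worker = new Worker(
  'my-queue',
  async (job) => {
    // same job logic as runJob() in the Bull version
    return { success: true };
  },
  {
    connection: { host: '127.0.0.1', port: 6379 },
    concurrency: 5,
  },
);

worker.on('error', (err) => console.error('worker error', err));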

@manast
Member

manast commented Jun 28, 2023

I can also see that both Bull and BullMQ use the same version of ioredis, so if the issue also exists in BullMQ, then it could be a reconnection issue in ioredis; otherwise, the problem must come from the error handling logic in Bull.

@Shahtaj-Khalid
Author

Shahtaj-Khalid commented Jun 28, 2023

I was able to narrow down the issue. It seems to be in Bull's error handling logic; I can reproduce it every time the following scenario occurs:

  1. I ran the MONITOR command on Redis to monitor every call.
  2. Observed that Bull sends a brpoplpush command every 5 seconds against the process queue.
  3. Sent some messages through Bull and they were consumed properly.
  4. After a few hours, redis MONITOR stopped with the error: Error: Connection reset by peer.
  5. Right after this, when I re-ran redis MONITOR, the brpoplpush command was no longer being issued, and Bull stopped consuming new messages as well.
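
The same observation can be made from Node instead of redis-cli; below is a sketch using ioredis's monitor() (callback form as documented for ioredis v4, with placeholder connection details) that logs every brpoplpush the worker issues, so you can see exactly when they stop:

import Redis from 'ioredis';

const redis = new Redis({ host: '127.0.0.1', port: 6379 });

// Log every brpoplpush the Bull worker issues; when these stop appearing
// after a "Connection reset by peer", the worker has stalled.
redis.monitor((err, monitor) => {
  if (err) throw err;
  monitor.on('monitor', (time, args) => {
    if (args[0] && String(args[0]).toLowerCase() === 'brpoplpush') {
      console.log(time, args.join(' '));
    }
  });
});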

Also, no error event is produced by Bull in this case, but after hours, Bull emits the error event below:

at TCP.onStreamRead (internal/stream_base_commons.js:209:20) {
  errno: -104,
  code: 'ECONNRESET',
  syscall: 'read'
}

and after this, the process's event loop resumes, and it starts processing the messages again.

--
Also, please note that I'm using ioredis directly in the same project to perform some other tasks, connected to the same Redis server that my Bull instance is connected to, and that client (which uses ioredis directly) continues to work as expected even after the above error in Redis.
A point to note here is that even ioredis didn't emit any error when Redis encountered the Connection reset by peer error.
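
For completeness, a sketch of the listeners that would normally surface such a failure (in this case none of them fired until hours later); the queue and client names below are stand-ins for the objects described above:

import Bull from 'bull';
import Redis from 'ioredis';

const queue = new Bull('my-queue');   // stands in for Worker.jobQueue
const directClient = new Redis();     // stands in for the separately used ioredis client

// Bull's queue-level events; in this issue the 'error' event only fired hours after the reset.
queue.on('error', (err) => console.error('bull queue error', err));
queue.on('stalled', (job) => console.warn('job stalled', job.id));

// The direct ioredis client kept working, and it also emitted no 'error' at reset time.
directClient.on('error', (err) => console.error('ioredis error', err));
directClient.on('reconnecting', () => console.warn('ioredis reconnecting'));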

Hope this information helps.

cc: @manast

@manast
Member

manast commented Jun 28, 2023

I wonder if this is actually an issue with ioredis. For instance, the setting maxRetriesPerRequest: null means a command will never fail: "Set maxRetriesPerRequest to null to disable this behavior, and every command will wait forever until the connection is alive again (which is the default behavior before ioredis v4)." It is possible that a bug in ioredis prevents the brpoplpush command from being re-executed when the connection is alive again.
Any chance to test with BullMQ instead as suggested earlier?
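
One way to test that hypothesis outside Bull is to run the same blocking command on a bare ioredis client configured like Bull's connection, reset the connection mid-block, and see whether the loop ever resumes. A sketch with placeholder connection details and key names (Bull's own keys look like bull:<queue>:wait and bull:<queue>:active):

import Redis from 'ioredis';

// Same client settings the issue author is passing through Bull.
const client = new Redis({
  host: '127.0.0.1',
  port: 6379,
  maxRetriesPerRequest: null,
  enableReadyCheck: false,
});

async function blockingLoop() {
  while (true) {
    // Blocks for up to 5 seconds, like Bull's fetch loop; resolves to null on timeout.
    const jobId = await client.brpoplpush('test:wait', 'test:active', 5);
    console.log(new Date().toISOString(), 'brpoplpush returned', jobId);
  }
}

blockingLoop().catch((err) => console.error('blocking loop died', err));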

@Shahtaj-Khalid
Author

We are using Bull in our project. If it is absolutely necessary to fix the issue, I can give it a try, but we want to continue using Bull unless there is an absolute need to switch.

@manast
Member

manast commented Jun 29, 2023

I understand. What I meant is that if the same problem does not exist in BullMQ, then we can corner the bug more easily. Right now we do not have a lot to go on, as the issue is not easy to reproduce.

@Shahtaj-Khalid
Author

@manast Sure, I'll try to test it and will share the results.
In the meantime, sharing one more observation; it might help as well.

I noticed that when I run the same worker via npm for local testing, the process is not affected by Redis's connection reset error (it's connected to the same Redis server, running via Docker), but the issue is seen only when I run the worker via Docker (connected to the same Redis instance). It's very strange. Can you tell what could be the reason, since the Bull and ioredis versions are the same in both cases?

@manast
Member

manast commented Jul 2, 2023

I don't know the reason, but clearly the type of connection error is different, and therefore it could be an issue with ioredis not handling it correctly.

@manast
Member

manast commented Jul 2, 2023

But you are still not able to provide a reproducible case, even using Docker, right? If I were you, that's where I would put my efforts...

@stale

stale bot commented Sep 4, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Sep 4, 2023
stale bot closed this as completed Sep 11, 2023
@IgorKnezevicSymphony

IgorKnezevicSymphony commented Dec 4, 2023

@Shahtaj-Khalid Do you have any update on this issue?
My setup is AWS EKS, Redis as ElastiCache (I even tried self-managed Redis as a pod, same issue), and my client app is Node.js with Bull; after around 2 hours I get the same error on the client side.
The Redis server has tcp-keepalive set to 500 seconds, together with timeout set to 0.
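
Not an answer from the thread, but for reference, the client-side counterparts of those server settings can be passed through Bull's redis options; a sketch with placeholder values:

import Bull from 'bull';

const queue = new Bull('my-queue', {
  redis: {
    host: 'my-redis-endpoint',    // placeholder
    port: 6379,
    keepAlive: 30000,             // ioredis: start TCP keepalive probes after 30s of idle
    maxRetriesPerRequest: null,
    enableReadyCheck: false,
  },
});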
