Bull process stops consuming new messages #2612

Closed
Shahtaj-Khalid opened this issue Jun 12, 2023 · 14 comments

@Shahtaj-Khalid

Shahtaj-Khalid commented Jun 12, 2023

Description

I'm initialising Bull's process handler at the start of my worker (running via Docker and Kubernetes), and it listens for messages on the configured queue. The problem is that after some time (a few hours, not a fixed interval), Bull stops consuming new messages, even though jobs do exist in the wait queue (I have checked in Redis).
When I restart my worker pod, it starts consuming those jobs again.
No error arrives on the 'error' event right away when the consumer stops processing new jobs, but sometimes I observed the error below in my worker after a few more hours:

 at TCP.onStreamRead (internal/stream_base_commons.js:209:20) {
  errno: -104,
  code: 'ECONNRESET',
  syscall: 'read'
}

Minimal, Working Test code to reproduce the issue.

This is how I'm initialising Bull and registering the process handler:

import Bull from 'bull';

Worker.jobQueue = new Bull(jobName, {
  prefix,
  redis: redisOptions,
  enableReadyCheck: false,
  settings: { maxStalledCount: 30 },
});

Worker.jobQueue.process(flags.concurrency, async (job) => this.runJob(job));
...

async runJob(job: Bull.Job): Promise<IBullJobResponse> {
  // some code
  return {
    success: true,
  };
}

Also, please note: when I restart my Redis pod, Redis throws an unreachable error, the connection recovers, and Bull continues to consume new messages. This issue seems to happen only when Redis throws a Connection reset by peer error, and that error is not caught by the 'error' event right away either. It takes some hours, and once the 'error' event is finally received, Bull starts processing the queued jobs again.

Bull version

"bull": "^4.10.2",
"ioredis": "^4.28.5"
Nodejs: 14.15

Additional information

Since I'm not receiving any error event, this is hard to debug. Kindly let me know what could possibly trigger this issue; it's a severe issue in my case since we are relying on Bull for all data processing. Thank you.

@qlereboursBS

Can you try without an async function please?
I just had the same issue and it seems to work with the done function.

Worker.jobQueue.process(flags.concurrency,(job) =>
		this.runJob(job).then(() => done())
);

@manast
Member

manast commented Jun 25, 2023

@qlereboursBS there would be no difference between using done (without async), returning a promise, or using async. Just do not define an async function (or return a promise) and use done at the same time. In other words, the code you pasted above will not work.
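
For readers following along, here is a minimal sketch of the two processor shapes being contrasted (queue names and job logic are placeholders, not from this issue): register either an async/promise processor or a done-callback processor, never both styles in the same handler.

import Bull from 'bull';

// Two throwaway queues so both valid processor shapes can be shown side by side.
const promiseQueue = new Bull('promise-style');
const callbackQueue = new Bull('callback-style');

// Shape 1: async (or promise-returning) processor; return the result, never call done.
promiseQueue.process(async (job) => {
  return { success: true };
});

// Shape 2: callback processor; accept done as the second argument and do not return a promise.
callbackQueue.process((job, done) => {
  runJob(job)
    .then((result) => done(null, result))
    .catch((err) => done(err));
});

// Hypothetical stand-in for the job logic from the original snippet.
function runJob(job: Bull.Job): Promise<{ success: boolean }> {
  return Promise.resolve({ success: true });
}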

@Shahtaj-Khalid
Author

@qlereboursBS I have tried without async too, and I'm observing the same issue.
I updated the ioredis version to 5.2.4, still observing the same issue.

@manast I came across a similar issue: #890. I'm passing maxRetriesPerRequest: null and enableReadyCheck: false in the redis options as well, and I'm still observing the same issue.
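
For context, a rough sketch of how those two options end up in the Bull constructor (the queue name, prefix, host, and port here are placeholders, not from this issue):

import Bull from 'bull';

const queue = new Bull('my-queue', {
  prefix: 'my-prefix',            // placeholder
  redis: {
    host: '127.0.0.1',            // placeholder
    port: 6379,
    maxRetriesPerRequest: null,   // ioredis: never fail a command, wait until the connection is alive again
    enableReadyCheck: false,      // ioredis: skip the ready check after (re)connecting
  },
  settings: { maxStalledCount: 30 },
});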

Can you please guide me on what to do here?

@manast
Member

manast commented Jun 28, 2023

By any chance could you test with BullMQ instead? If BullMQ does not suffer from this problem, then we know it is something specific to Bull, and it may be easier to spot the reason for it.
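
For anyone who wants to run that comparison, a rough sketch of the equivalent consumer in BullMQ (the queue name, connection details, and concurrency are placeholders):

import { Worker } from 'bullmq';

// Consumes the same kind of jobs as the Bull processor above.
const worker = new Worker(
  'my-queue',
  async (job) => {
    // same job logic as runJob() in the Bull version
    return { success: true };
  },
  {
    connection: { host: '127.0.0.1', port: 6379 },
    concurrency: 5,
  },
);

worker.on('error', (err) => console.error('worker error', err));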

@manast
Member

manast commented Jun 28, 2023

I can also see that both Bull and BullMQ use the same version of ioredis, so if the issue also exists in BullMQ, then it could be a reconnection issue in ioredis; otherwise, the problem must come from the error handling logic in Bull.

@Shahtaj-Khalid
Author

Shahtaj-Khalid commented Jun 28, 2023

I was able to narrow down the issue. It seems to be in Bull's error handling logic; I can reproduce it every time the following scenario occurs:

  1. I ran the MONITOR command on Redis to monitor every call.
  2. Observed that Bull sends a brpoplpush command every 5 seconds against the process queue.
  3. Sent some messages through Bull and they were consumed properly.
  4. After a few hours, redis MONITOR stopped with the error: Error: Connection reset by peer.
  5. Right after this, when I re-ran redis MONITOR, the brpoplpush command was no longer being issued, and Bull stopped consuming new messages as well.
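
The same observation can be made from Node instead of redis-cli; below is a sketch using ioredis's monitor() (callback form as documented for ioredis v4, with placeholder connection details) that logs every brpoplpush the worker issues, so you can see exactly when they stop:

import Redis from 'ioredis';

const redis = new Redis({ host: '127.0.0.1', port: 6379 });

// Log every brpoplpush the Bull worker issues; when these stop appearing
// after a "Connection reset by peer", the worker has stalled.
redis.monitor((err, monitor) => {
  if (err) throw err;
  monitor.on('monitor', (time, args) => {
    if (args[0] && String(args[0]).toLowerCase() === 'brpoplpush') {
      console.log(time, args.join(' '));
    }
  });
});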

Also, no error event is produced by Bull in this case, but after hours, Bull emits the error event below:

at TCP.onStreamRead (internal/stream_base_commons.js:209:20) {
  errno: -104,
  code: 'ECONNRESET',
  syscall: 'read'
}

and after this, the process's event loop resumes, and it starts processing the messages again.

--
Also, please note that I'm using ioredis directly in the same project to perform some other tasks, connected to the same Redis server that my Bull instance is connected to, and that client (which uses ioredis directly) continues to work as expected even after the above error in Redis.
A point to note here is that even ioredis didn't emit any error when Redis encountered the Connection reset by peer error.
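
For completeness, a sketch of the listeners that would normally surface such a failure (in this case none of them fired until hours later); the queue and client names below are stand-ins for the objects described above:

import Bull from 'bull';
import Redis from 'ioredis';

const queue = new Bull('my-queue');   // stands in for Worker.jobQueue
const directClient = new Redis();     // stands in for the separately used ioredis client

// Bull's queue-level events; in this issue the 'error' event only fired hours after the reset.
queue.on('error', (err) => console.error('bull queue error', err));
queue.on('stalled', (job) => console.warn('job stalled', job.id));

// The direct ioredis client kept working, and it also emitted no 'error' at reset time.
directClient.on('error', (err) => console.error('ioredis error', err));
directClient.on('reconnecting', () => console.warn('ioredis reconnecting'));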

Hope this information helps.

cc: @manast

@manast
Member

manast commented Jun 28, 2023

I wonder if this is actually an issue with ioredis. For instance, the setting maxRetriesPerRequest: null means a command will never fail: "Set maxRetriesPerRequest to null to disable this behavior, and every command will wait forever until the connection is alive again (which is the default behavior before ioredis v4)." It is possible that a bug in ioredis prevents the brpoplpush command from being re-executed when the connection is alive again.
Any chance to test with BullMQ instead as suggested earlier?
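
One way to test that hypothesis outside Bull is to run the same blocking command on a bare ioredis client configured like Bull's connection, reset the connection mid-block, and see whether the loop ever resumes. A sketch with placeholder connection details and key names (Bull's own keys look like bull:<queue>:wait and bull:<queue>:active):

import Redis from 'ioredis';

// Same client settings the issue author is passing through Bull.
const client = new Redis({
  host: '127.0.0.1',
  port: 6379,
  maxRetriesPerRequest: null,
  enableReadyCheck: false,
});

async function blockingLoop() {
  while (true) {
    // Blocks for up to 5 seconds, like Bull's fetch loop; resolves to null on timeout.
    const jobId = await client.brpoplpush('test:wait', 'test:active', 5);
    console.log(new Date().toISOString(), 'brpoplpush returned', jobId);
  }
}

blockingLoop().catch((err) => console.error('blocking loop died', err));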

@Shahtaj-Khalid
Author

We are using Bull in our project. If it is absolutely necessary to fix the issue, I can give it a try, but we want to continue using Bull unless there is an absolute need to switch.

@manast
Member

manast commented Jun 29, 2023

I understand. What I meant is that if the same problem does not exist in BullMQ, then we can corner the bug more easily. Right now we do not have a lot to go on, as the issue is not easy to reproduce.

@Shahtaj-Khalid
Author

@manast Sure, I'll try to test it and will share the results.
In the meantime, sharing one more observation; it might help as well.

I noticed that when I run the same worker via npm for local testing, the process is not affected by Redis's connection reset error (it's connected to the same Redis server, running via Docker), but the issue is seen only when I run the worker via Docker (connected to the same Redis instance). It's very strange. Can you tell what could be the reason, since the Bull and ioredis versions are the same in both cases?

@manast
Member

manast commented Jul 2, 2023

I don't know the reason, but clearly the type of connection error is different, and therefore it could be an issue with ioredis not handling it correctly.

@manast
Member

manast commented Jul 2, 2023

But you are still not able to provide a reproducible case, even using Docker, right? If I were you, that's where I would put my efforts...

@stale

stale bot commented Sep 4, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Sep 4, 2023
stale bot closed this as completed Sep 11, 2023
@IgorKnezevicSymphony

IgorKnezevicSymphony commented Dec 4, 2023

@Shahtaj-Khalid Do you have any update on this issue?
My setup is AWS EKS, Redis as ElastiCache (I even tried self-managed Redis as a pod, same issue), and my client app is Node.js with Bull; after around 2 hours I get the same error on the client side.
The Redis server has tcp-keepalive set to 500 seconds, together with timeout set to 0.
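
Not an answer from the thread, but for reference, the client-side counterparts of those server settings can be passed through Bull's redis options; a sketch with placeholder values:

import Bull from 'bull';

const queue = new Bull('my-queue', {
  redis: {
    host: 'my-redis-endpoint',    // placeholder
    port: 6379,
    keepAlive: 30000,             // ioredis: start TCP keepalive probes after 30s of idle
    maxRetriesPerRequest: null,
    enableReadyCheck: false,
  },
});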
