
Jobs get stuck after Redis reconnect #1873

Closed
royalpinto opened this issue Sep 28, 2020 · 64 comments · Fixed by #2185

@royalpinto

Description

When Redis gets disconnected and then reconnected, the queue doesn't pick up any jobs. (Easy to reproduce.)

Minimal, working test code to reproduce the issue:

const Queue = require('bull')
const Redis = require('ioredis')

const client = new Redis({
  host: 'localhost',
  port: 6379
})

const queue = new Queue('test', client)

client.on('ready', () => {
  console.log('client ready')
})

client.on('connect', () => {
  console.log('client connect')
})

client.on('close', () => {
  console.log('client close')
})

client.on('end', () => {
  console.log('client end')
})

client.on('reconnecting', (delayMs) => {
  console.log('client reconnecting', delayMs)
})

client.on('error', error => {
  console.log('client error', error.message)
})

queue
  .on('completed', (job, results) => {
    console.log('queue completed a job')
  })
  .on('error', async (error) => {
    console.error('queue error', error.message)
  })
  .on('failed', async (job, error) => {
    console.error('queue failed', error.message)
  })

queue.process(async (job) => {
  console.log('running task')
})

setInterval(function () {
  queue.add({}).then(() => console.log('pushed task')).catch((e) => console.log('failed to push task'))
}, 10000)

Bull version

Bull 3.18.0
Node 12 and 14
Redis 3.2.8

Steps to reproduce:

  • Run the above script
  • Bull starts accepting and processing jobs
  • Stop Redis
  • Bull will keep retrying to connect
  • Start Redis
  • Bull will reconnect within a few milliseconds
  • At this stage, the queue can still push jobs, but it won't process any. :-(

Things we tried

@manast
Member

manast commented Sep 28, 2020

I tested the code above (I just changed the timeout to 1 second for faster testing) and it works perfectly on macOS. I tried stopping and restarting the Redis server many times, and processing always continues as soon as the Redis server is back online. Is there anything else you can share to help reproduce the issue?

@royalpinto
Author

@manast It's strange: I could reproduce it on only one instance; everywhere else I could not. I will try to investigate this more.

However, we found a similar issue that is reproducible everywhere. The scenario is below:

  • Stop Redis Server
  • Start above snippet
  • The snippet will keep trying to connect
  • Start the Redis server
  • The snippet will eventually connect in a few seconds.
    => Now it cannot push or process any jobs. Instead it throws "UnhandledPromiseRejectionWarning: Error: connect ECONNREFUSED 127.0.0.1:6379". (This is for the client, but something similar happens for the subscriber too. Even if we push from another process, the subscriber won't pick up the jobs.)

I tried analyzing the code and found that inside the Queue class, the this._initializing promise is created only once, in the constructor, and it gets rejected because Redis is not running during startup. When the connection eventually succeeds, this._initializing still holds the old rejected state, so it throws. It should instead be re-initialized.

https://github.com/OptimalBits/bull/blob/develop/lib/queue.js#L148
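
A minimal sketch of that re-initialization idea (hypothetical and heavily simplified; QueueSketch, _arm, and _init are illustrative stand-ins, not Bull's actual internals):

const Redis = require('ioredis')

// Hypothetical sketch: if the first _initializing promise rejects because
// Redis was down at startup, later callers should not keep awaiting that
// stale rejection; re-arm it whenever ioredis reports 'ready'.
class QueueSketch {
  constructor(client) {
    this.client = client
    this._arm() // first attempt; may reject if Redis is down at startup
    this.client.on('ready', () => this._arm()) // re-initialize on every (re)connect
  }

  _arm() {
    this._initializing = this._init()
    this._initializing.catch(() => {}) // avoid unhandled rejection warnings
  }

  _init() {
    // Stand-in for the real setup work (loading Lua scripts, keys, etc.).
    return this.client.ping()
  }

  isReady() {
    // Callers always await the latest attempt, not a stale rejected one.
    return this._initializing.then(() => this)
  }
}

const sketch = new QueueSketch(new Redis({ host: 'localhost', port: 6379 }))
sketch.isReady().then(() => console.log('initialized'), (err) => console.error(err.message))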

@manast
Member

manast commented Sep 29, 2020

OK, so you mean the case where Redis is not running when Bull starts. I will look into it.

@ruscon

ruscon commented Oct 13, 2020

@royalpinto @manast yep, I have the same issue.

@mbieser

mbieser commented Oct 17, 2020

+1

@manast
Member

manast commented Oct 20, 2020

I marked this as an enhancement because I consider it a new requirement, i.e. being able to start without an available Redis connection and begin working once that connection becomes available later. It may require some non-trivial refactoring.

@elghazal-a

elghazal-a commented Nov 18, 2020

Actually, I can reproduce this scenario:

  • Stop Redis
  • Wait a moment and start Redis again
  • Bull doesn't catch up on the tasks that should have run while Redis was down
  • Restart Bull, and it catches up on all tasks

I think this is a bug. Do you have any workaround for that, @manast?

@koiszzz

koiszzz commented Mar 10, 2021

I have hit the same issue. When the Redis server is down for a long time, longer than the Redis reconnect window, jobs just get added to the queue and stay waiting; they never start.

@manast
Member

manast commented Mar 10, 2021

@koiszzz I cannot find how many times ioredis reconnects before it gives up; maybe it retries forever, maybe not. This should be investigated. However, I tested reconnects extensively in the latest version of Bull and could not reproduce any issue. If anybody can, please post instructions on how to reproduce it, ideally a piece of code I can run, so I can take a deeper look.

@koiszzz

koiszzz commented Mar 10, 2021

I posted a Bull test project, bull-test.
Start the Redis server and then the Express server, then browse to localhost:3000/add; the console will print "job done".
Then stop the Redis server and wait for a moment.
Restart the Redis server and browse to localhost:3000/add again: you will see the job added to the queue, but its state stays stuck in 'waiting'. @manast

@manast
Member

manast commented Mar 10, 2021

@koiszzz I tested your code. I guess you are testing with a Redis instance inside Docker? I experience a bug with ioredis, reported here: redis/ioredis#1285, where reconnect does not work if Redis is running inside a Docker container, at least on macOS. Your repo works well for me with a local Redis instance, but it does not work with a Redis Docker instance. So if this is your case, I suggest you ping Luin to see if he can reproduce it; he has failed to reproduce it so far, but if more people than me can reproduce it, maybe it will get more attention. The alternative is that I debug ioredis myself, but I am not sure when I will find time for that.

@koiszzz

koiszzz commented Mar 10, 2021

Thanks for replying, @manast. The Redis server is installed in Docker. Maybe I should install a Redis server without Docker to test.

@givorra-ls

givorra-ls commented Mar 29, 2021

Lol... I had the same issue. While reading your comments and doing some tests, I realised that when running Redis via Docker, the Redis data has no persistence, so Bull jobs are deleted when the container is stopped and restarted... The "solution" is to start Redis via docker-compose using a volume :)

@miwnwski

miwnwski commented Jun 2, 2021

I encountered this error and have been investigating; I'll share our findings here in case they help. As far as I can tell, when the client reconnects, Queue#processJobs is not run. Queue.prototype.process calls Queue#start, which calls Queue#run, which calls Queue#processJobs, which puts the queue into the forever-loop state by chaining promises.

The problem (as I understand it) is that Queue#processJobs can resolve if the queue is closing or if an error is thrown. This creates a state where the connection exists but the forever promise loop is no longer running.

Our patch for this (at least locally; not tested in production yet) is basically to call queue.run(queue.concurrency) on 'ready'. Using the example above, which I could use to reproduce the problem, this would be the "fix":

const Queue = require('bull')
const Redis = require('ioredis')

const client = new Redis({
  host: 'localhost',
  port: 6379
})

const queue = new Queue('test', client)
let reconnectingEvent = false 

client.on('ready', () => {
  console.log('client ready')
  if (reconnectingEvent) { // this is actually the fix, we just set a flag to not kick run unless we are restarting
    console.log('rerun queue run')
    reconnectingEvent = false
    queue.run(queue.concurrency)
  }
})

client.on('connect', () => {
  console.log('client connect')
})

client.on('close', () => {
  console.log('client close')
})

client.on('end', () => {
  console.log('client end')
})

client.on('reconnecting', (delayMs) => {
  console.log('client reconnecting', delayMs)
  reconnectingEvent = true
})

client.on('error', error => {
  console.log('client error', error.message)
})

queue
  .on('completed', (job, results) => {
    console.log('queue completed a job')
  })
  .on('error', async (error) => {
    console.error('queue error', error.message)
  })
  .on('failed', async (job, error) => {
    console.error('queue failed', error.message)
  })

queue.process(async (job) => {
  console.log('running task')
})

setInterval(function () {
  queue.add({}).then(() => console.log('pushed task')).catch((e) => console.log('failed to push task'))
}, 10000)

I'm not sure if this is a "good" solution, but it works for now. Maybe the maintainers have a better idea of how to fix this cleanly and properly, or can describe an approach for which I can submit a PR.

@tarunbatra do you have anything to add or do you feel this sufficiently describes our findings?

@manast
Member

manast commented Jun 3, 2021

This actually sounds like a good solution. We just need to make sure that the old "loop" is not still running when the "ready" event is fired by ioredis; basically, we should proactively "cancel" the previous loop when starting the new one.

@miwnwski

miwnwski commented Jun 8, 2021

@manast How would you approach this? Is there something we can check today (without monkey patching the lib) for an existing running loop?

@gabrielrtomio

I have the same issue using Heroku: when it restarts, all queues get stuck in waiting.

@manast
Member

manast commented Jun 9, 2021

@miwnwski yeah, it is a bit tricky; you need to somehow refactor the "run" method so that it is cancellable: https://github.com/OptimalBits/bull/blob/develop/lib/queue.js#L864
I could do it myself, but I have a lot on my plate right now, so it will take some time...
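
For reference, a minimal sketch of one way to make such a loop cancellable (hypothetical; LoopRunner is an illustration, not Bull's actual code): each new loop bumps a generation counter, and the old loop exits once it notices it has been superseded.

// Hypothetical sketch of a cancellable processing loop. Calling startLoop()
// again (e.g. from the ioredis 'ready' handler) invalidates the previous
// loop; the old loop notices on its next iteration and stops quietly.
class LoopRunner {
  constructor() {
    this.generation = 0
  }

  startLoop(processOne) {
    const myGeneration = ++this.generation // supersede any earlier loop
    const tick = () => {
      if (myGeneration !== this.generation) {
        return Promise.resolve() // a newer loop has taken over; stop
      }
      return processOne().then(tick)
    }
    return tick()
  }
}

// Usage: the second call cancels the loop started by the first one.
const runner = new LoopRunner()
runner.startLoop(() => new Promise(resolve => setTimeout(resolve, 1000)))
runner.startLoop(() => new Promise(resolve => setTimeout(resolve, 1000)))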

@anhnhatuit

I have the same issue using GCP. Is there any quick solution for this?
Thanks a lot for the help.

@batrudinych
Contributor

Faced the same issue using AWS ElastiCache. It looks like there was a connection interruption or some kind of failover. There were no error logs or failed jobs; we discovered the issue through missing data in the database.

@nirgilboa

+1
Also encountering this issue when using AWS ElastiCache.

@dbousamra

We are also encountering this issue.

@ilgianlu

ilgianlu commented Aug 23, 2021

Hi, I'm working in a Kubernetes cluster and having the same issue. My workaround is to let the queue-processor container die whenever it cannot connect to the Redis container; Kubernetes takes care of restarting the processor pod.
To make the processor exit abruptly, I'm passing this Redis config:

const redisOptions = {
  ...config.get('redis.connectParams'),
  // Don't retry: log and exit so the orchestrator restarts the pod.
  retryStrategy: (times) => {
    console.log('could not connect to redis!');
    process.exit(1);
  },
};

@xairoo

xairoo commented Sep 12, 2021

Currently in dev and I have the same problem. It appears after waking my system from hibernation. I am using a remote Redis server, so the connection gets dropped after a while when the system sleeps :-)

ioredis will reconnect automatically, but Bull won't pull new jobs.

@francois2metz

Hi there. I do have some logs on our open source application https://github.com/1024pix/pix-editor

  • At 2021-09-10T06:02:16.000Z, the connection to Redis dropped, and the client kept trying to reconnect until 2021-09-10T06:02:29.000Z
  • The connection was still not stable, but commands resumed, except for brpoplpush, which means the queues were not processed
  • The connection keeps restarting every few minutes, for days, unless I restart the application

Hope this helps. I'm available to debug anything.

extract_ioredis_logs.csv

@xairoo

xairoo commented Sep 17, 2021

My MongoDB driver fires the disconnect event when I wake my system from sleep, and it reconnects right away.
ioredis doesn't trigger anything after sleep (but jobs still get pushed to the queue).

Hm, I switched from Node Redis to ioredis a while ago. I think Node Redis didn't have this problem.

Anyway, this doesn't help:

maxRetriesPerRequest: null,
enableReadyCheck: false,

@m-gagnon

I am running into this issue as well when facing any kind of redis reconnection. This is very painful since the service appears healthy but in fact never processes any jobs. Any workaround available? Thanks.

@xairoo

xairoo commented Oct 25, 2021

Looks like I am missing some Bull connections after sleep.
It happens only with a remote Redis server; locally it works.

Screenshots from RedisInsight (the cmd client is the RedisInsight client):
After starting it:
[screenshot: RedisInsight client list]

After sleep/reconnect:
[screenshot: RedisInsight client list]

I split your code example into a sender/receiver and a worker to test whether it happens only in such cases.

@xairoo

xairoo commented Oct 25, 2021

@manast the same happens when I close the queue and reopen it.
It results in missing clients. But maybe I am doing it wrong.
You can reproduce it locally.

After start:
[screenshot: RedisInsight client list]

After close/reconnect:
[screenshot: RedisInsight client list]

New jobs get added, but they don't get processed; no wonder, without a Redis connection.
But again, maybe I am doing it wrong.

const Queue = require("bull");

var testQueue = new Queue("test", {
  redis: {
    maxRetriesPerRequest: null,
    enableReadyCheck: false,
  },
});

testQueue.process(async (job) => {
  console.log(`Process started for Job ${job.id}`);
});

// testQueue.add(
//   { foo: "bar" },
//   {
//     repeat: {
//       every: 1500,
//     },
//   }
// );

// Instead of `repeat`
setInterval(() => {
  console.log(testQueue.client.status);
  if (testQueue.client.status === "ready") { // Check for ready, otherwise it will crash
    testQueue.add({ foo: "bar" });
  }
}, 2500);

testQueue.on("error", (err) => console.log(err));

testQueue.on("global:completed", (jobId) => {
  console.log(`Job ${jobId} completed`);
});

setTimeout(() => {
  console.log("closing");
  testQueue.close();
  setTimeout(() => {
    console.log("reconnect");
    // NOTE: process() and the listeners above are not re-registered on
    // this new instance, so jobs get added but are never processed.
    testQueue = new Queue("test", {
      redis: {
        maxRetriesPerRequest: null,
        enableReadyCheck: false,
      },
    });
  }, 2500);
}, 2500);

@anhnhatuit

Hello @xairoo,
Looks like the issue occurs when the brpoplpush command is blocked. We are using GCP's Redis. The connection is not lost, but processing still gets stuck after 2 or 3 days. We have about 8M jobs that should be processed per day. Another queue, which processes only about 2M jobs, is not facing any issue.

@manast
Member

manast commented Oct 25, 2021

@anhnhatuit my test code above forces a blocking brpoplpush

@manast
Member

manast commented Oct 25, 2021

@xairoo I do not understand what you mean by "the same happens when I close the queue and reopen it".

What do you mean by closing and reopening? And can you or can't you reproduce the issue with the code I provided? If so, would you mind explaining step by step how to reproduce it?

@anhnhatuit

Hello @manast,
So setting the Redis options as below may resolve the issue, right?

maxRetriesPerRequest: null,
enableReadyCheck: false,

Thanks

@manast
Member

manast commented Oct 25, 2021

@anhnhatuit as far as I know, yes.

@anhnhatuit

Many thanks @manast ,
Let me try. I will update the result soon.

Thanks,
Nhat

@xairoo

xairoo commented Oct 25, 2021

@manast yes, I can reproduce it with your code and a remote Redis server.

I only changed this:

// testQueue.add(
//   { foo: "bar" },
//   {
//     repeat: {
//       every: 1500,
//     },
//   }
// );

// Instead of `repeat`
setInterval(() => {
  console.log(testQueue.client.status);
  if (testQueue.client.status === "ready") { // Check for ready, otherwise it will crash
    testQueue.add({ foo: "bar" });
  }
}, 2500);

Anyway, this shouldn't matter.

What I mean by closing and reconnecting:

  • When you start your example, there are 3 bull connections.
  • After a disconnect for 5+ minutes (redis TCP keepalive is set by default to 300 seconds) you can see only 1 or 2 bull connections.

I played around a bit and ended up at testQueue.close().
After closing the queue, the 3 connections are gone (logically).
So I had to reopen/start the queue again.
Please have a look at my example #1873 (comment): shouldn't the queue process the added jobs again after the "reconnect", where testQueue = new Queue("test", { is called a second time?

The listeners are dead (disconnected), I think that's the problem.

testQueue.process(async (job) => {
  console.log(`Process started for Job ${job.id}`);
});

testQueue.on("global:completed", (jobId) => {
  console.log(`Job ${jobId} completed`);
});

^^ These don't work because the two Redis connections they require are missing.
New jobs still get added (that client is present), but jobs are not processed.

@flavianh

I'm going to attempt to sum up the conversation above.

Bull gets stuck after the Redis server disconnects and then reconnects. This affects many Bull users, as such disconnects happen from time to time in cloud environments.
The leading hypothesis now is that ioredis used to be configured by default with the options

maxRetriesPerRequest: null,
enableReadyCheck: false,

but this is no longer the case as of version 4 of ioredis.

Workaround for now: add these options explicitly (tested and confirmed by @manast, who is a maintainer of this repo).
Proposal for a durable fix: add these options explicitly in Bull. (My 2 cents: maybe also add a warning if the configuration is overridden by a user without these values?)
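
For anyone landing here, a minimal sketch of the workaround applied to a Bull queue (the queue name and handler are placeholders; the two ioredis options are the ones discussed above):

const Queue = require('bull')

// Workaround: pass the ioredis options explicitly so that commands,
// including the blocking brpoplpush, keep being retried after a reconnect.
const queue = new Queue('test', {
  redis: {
    host: 'localhost',
    port: 6379,
    maxRetriesPerRequest: null, // never fail a command due to a retry limit
    enableReadyCheck: false, // don't gate commands on the ready check
  },
})

queue.process(async job => {
  console.log('running task', job.id)
})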

@manast
Member

manast commented Oct 27, 2021

🎉 This issue has been resolved in version 4.0.0 🎉

The release is available on GitHub releases and npm.

Your semantic-release bot 📦🚀

@xairoo

xairoo commented Oct 27, 2021

Doesn't work.

If the connection gets dropped by the Redis TCP keepalive, the listeners queue.process() and queue.on() don't reconnect automatically. New jobs still get put into the queue.

I think the listeners should check the connection on an interval and reconnect if the connection was dropped. That would solve the problem.

@EPecherkin

This solution doesn't work for me. The issue is still there.

@manast
Member

manast commented Oct 28, 2021

Doesn't work.

If the connection gets dropped by the Redis TCP keepalive, the listeners queue.process() and queue.on() don't reconnect automatically. New jobs still get put into the queue.

I think the listeners should check the connection on an interval and reconnect if the connection was dropped. That would solve the problem.

According to the ioredis documentation, this should not happen with the settings provided, and I am unable to reproduce it.

@manast
Member

manast commented Oct 28, 2021

@xairoo and @EPecherkin since it seems like you can reproduce the issue easily, can you open a new issue for your specific use case, with steps on how to reproduce it, so that I can look into it?

@anhnhatuit

Hello @manast,
It looks like the issue is not resolved. Unfortunately, I could not reproduce it locally, but here is one point that may help you pinpoint the problem:
I'm using Google Cloud's Redis service. It gets stuck after 2-3 days of processing. We have about 9M jobs that should be processed per day, with about 100k jobs added to the queue every 2 minutes.
Another queue, which processes only about 2M jobs per day, works well without any issue.

Thanks,
Nhat

@xairoo

xairoo commented Oct 28, 2021

Give this a try:

// Ping every Redis connection held by the queue every 10 seconds so
// that idle blocking connections are not dropped by keepalive timeouts.
setInterval(function () {
  yourQueue.clients.forEach((client) => {
    client.ping();
  });
}, 10000);

@royalpinto
Author

🎉 This issue has been resolved in version 4.0.0 🎉

The release is available on GitHub releases and npm.

Your semantic-release bot 📦🚀

@manast I can confirm it works for me in the environment described in redis/ioredis#1285 (comment) once ioredis reconnects.

@Yuliang-Lee

After dozens of tests, I found the problem was caused by the way our Redis server is deployed. LOL. Our Redis server runs on k8s with the stock Redis config, so obviously there is no persistent data when the Redis server restarts. I did not have time to dig into why losing Redis's data causes the problem, but no problem occurs once I move the Redis data dir to a persistent volume.

@Shahtaj-Khalid

Shahtaj-Khalid commented Jun 28, 2023

I'm still facing the same issue: the Bull process stops consuming new messages upon a connection-reset error from Redis.
I created a separate issue with more details here: #2612.

Setting these flags explicitly didn't help:

maxRetriesPerRequest: null,
enableReadyCheck: false,

I'm using Bull 4.10.2 and ioredis 5.2.4.

It seems this issue happens only when Redis throws a "Connection reset by peer" error, and the error is not caught by the error event right away; it takes some hours, and once the error event is received, Bull starts processing the queued jobs again.

Has anyone found a workaround for this? Or do I need to update to another version of Bull? Kindly let me know, as I'm really stuck. Thanks.
