
Jobs get stuck after Redis reconnect #1873

Closed
royalpinto opened this issue Sep 28, 2020 · 64 comments · Fixed by #2185

@royalpinto

Description

When Redis gets disconnected and then reconnected, the queue doesn't pick up any jobs. (Easy to reproduce.)

Minimal, working test code to reproduce the issue:

const Queue = require('bull')
const Redis = require('ioredis')

const client = new Redis({
  host: 'localhost',
  port: 6379
})

const queue = new Queue('test', client)

client.on('ready', () => {
  console.log('client ready')
})

client.on('connect', () => {
  console.log('client connect')
})

client.on('close', () => {
  console.log('client close')
})

client.on('end', () => {
  console.log('client end')
})

client.on('reconnecting', (delayMs) => {
  console.log('client reconnecting', delayMs)
})

client.on('error', error => {
  console.log('client error', error.message)
})

queue
  .on('completed', (job, results) => {
    console.log('queue completed a job')
  })
  .on('error', async (error) => {
    console.error('queue error', error.message)
  })
  .on('failed', async (job, error) => {
    console.error('queue failed', error.message)
  })

queue.process(async (job) => {
  console.log('running task')
})

setInterval(function () {
  queue.add({}).then(() => console.log('pushed task')).catch((e) => console.log('failed to push task'))
}, 10000)

Bull version

Bull 3.18.0
Node 12 and 14
Redis 3.2.8

Steps to reproduce:

  • Run the above script
  • Bull starts accepting and processing jobs
  • Stop Redis
  • Bull will keep retrying to connect
  • Start Redis
  • Bull will reconnect within a few milliseconds
  • At this stage, the queue can still push jobs, but it won't process any. :-(

Things we tried

@manast
Member

manast commented Sep 28, 2020

I tested the code above (I just changed the timeout to 1 second for faster testing) and it works perfectly on macOS. I tried stopping and restarting the Redis server many times, and processing always continues as soon as the Redis server is back online. Is there anything else you can share to help reproduce the issue?

@royalpinto
Author

@manast It's strange: I could reproduce it on only one instance; everywhere else I could not. I will try to investigate this more.

However, we found a similar issue that is reproducible everywhere. The scenario is below:

  • Stop Redis Server
  • Start above snippet
  • The snippet will keep trying to connect
  • Start the Redis server
  • The snippet will eventually connect in a few seconds.
    => Now it cannot push or process any jobs. Instead it throws "UnhandledPromiseRejectionWarning: Error: connect ECONNREFUSED 127.0.0.1:6379". (This is for the client, but something similar happens for the subscriber too. Even if we push from another process, the subscriber won't pick up the jobs.)

I tried analyzing the code and found that inside the Queue class, the this._initializing promise is created only once, in the constructor, and it gets rejected because Redis is not running during startup. When the connection eventually succeeds, this._initializing still holds the old rejected state, so it throws. It should instead be re-initialized.

https://github.com/OptimalBits/bull/blob/develop/lib/queue.js#L148
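
A minimal sketch of that re-initialization idea (hypothetical and heavily simplified; QueueSketch, _arm, and _init are illustrative stand-ins, not Bull's actual internals):

const Redis = require('ioredis')

// Hypothetical sketch: if the first _initializing promise rejects because
// Redis was down at startup, later callers should not keep awaiting that
// stale rejection; re-arm it whenever ioredis reports 'ready'.
class QueueSketch {
  constructor(client) {
    this.client = client
    this._arm() // first attempt; may reject if Redis is down at startup
    this.client.on('ready', () => this._arm()) // re-initialize on every (re)connect
  }

  _arm() {
    this._initializing = this._init()
    this._initializing.catch(() => {}) // avoid unhandled rejection warnings
  }

  _init() {
    // Stand-in for the real setup work (loading Lua scripts, keys, etc.).
    return this.client.ping()
  }

  isReady() {
    // Callers always await the latest attempt, not a stale rejected one.
    return this._initializing.then(() => this)
  }
}

const sketch = new QueueSketch(new Redis({ host: 'localhost', port: 6379 }))
sketch.isReady().then(() => console.log('initialized'), (err) => console.error(err.message))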

@manast
Member

manast commented Sep 29, 2020

OK, so you mean the case where Redis is not running when Bull starts. I will look into it.

@ruscon

ruscon commented Oct 13, 2020

@royalpinto @manast yep, I have the same issue.

@mbieser

mbieser commented Oct 17, 2020

+1

@manast
Member

manast commented Oct 20, 2020

I marked this as an enhancement because I consider it a new requirement, i.e. being able to start without an available Redis connection and begin working once that connection becomes available later. It may require some non-trivial refactoring.

@elghazal-a

elghazal-a commented Nov 18, 2020

Actually, I can reproduce this scenario:

  • Stop Redis
  • Wait a moment and start Redis again
  • Bull doesn't catch up on the tasks that should have run while Redis was down
  • Restart Bull, and it catches up on all tasks

I think this is a bug. Do you have any workaround for that, @manast?

@koiszzz

koiszzz commented Mar 10, 2021

I have hit the same issue. When the Redis server is down for a long time, longer than the Redis reconnect window, jobs just get added to the queue and stay waiting; they never start.

@manast
Member

manast commented Mar 10, 2021

@koiszzz I cannot find how many times ioredis reconnects before it gives up; maybe it retries forever, maybe not. This should be investigated. However, I tested reconnects extensively in the latest version of Bull and could not reproduce any issue. If anybody can, please post instructions on how to reproduce it, ideally a piece of code I can run, so I can take a deeper look.

@koiszzz

koiszzz commented Mar 10, 2021

I posted a Bull test project, bull-test.
Start the Redis server and then the Express server, then browse to localhost:3000/add; the console will print "job done".
Then stop the Redis server and wait for a moment.
Restart the Redis server and browse to localhost:3000/add again: you will see the job added to the queue, but its state stays stuck in 'waiting'. @manast

@manast
Member

manast commented Mar 10, 2021

@koiszzz I tested your code. I guess you are testing with a Redis instance inside Docker? I experience a bug with ioredis, reported here: redis/ioredis#1285, where reconnect does not work if Redis is running inside a Docker container, at least on macOS. Your repo works well for me with a local Redis instance, but it does not work with a Redis Docker instance. So if this is your case, I suggest you ping Luin to see if he can reproduce it; he has failed to reproduce it so far, but if more people than me can reproduce it, maybe it will get more attention. The alternative is that I debug ioredis myself, but I am not sure when I will find time for that.

@koiszzz

koiszzz commented Mar 10, 2021

Thanks for replying, @manast. The Redis server is installed in Docker. Maybe I should install a Redis server without Docker to test.

@givorra-ls

givorra-ls commented Mar 29, 2021

Lol... I had the same issue. While reading your comments and doing some tests, I realised that when running Redis via Docker, the Redis data has no persistence, so Bull jobs are deleted when the container is stopped and restarted... The "solution" is to start Redis via docker-compose using a volume :)

@miwnwski

miwnwski commented Jun 2, 2021

I encountered this error and have been investigating; I'll share our findings here in case they help. As far as I can tell, when the client reconnects, Queue#processJobs is not run. Queue.prototype.process calls Queue#start, which calls Queue#run, which calls Queue#processJobs, which puts the queue into the forever-loop state by chaining promises.

The problem (as I understand it) is that Queue#processJobs can resolve if the queue is closing or if an error is thrown. This creates a state where the connection exists but the forever promise loop is no longer running.

Our patch for this (at least locally; not tested in production yet) is basically to call queue.run(queue.concurrency) on 'ready'. Using the example above, which I could use to reproduce the problem, this would be the "fix":

const Queue = require('bull')
const Redis = require('ioredis')

const client = new Redis({
  host: 'localhost',
  port: 6379
})

const queue = new Queue('test', client)
let reconnectingEvent = false 

client.on('ready', () => {
  console.log('client ready')
  if (reconnectingEvent) { // this is actually the fix, we just set a flag to not kick run unless we are restarting
    console.log('rerun queue run')
    reconnectingEvent = false
    queue.run(queue.concurrency)
  }
})

client.on('connect', () => {
  console.log('client connect')
})

client.on('close', () => {
  console.log('client close')
})

client.on('end', () => {
  console.log('client end')
})

client.on('reconnecting', (delayMs) => {
  console.log('client reconnecting', delayMs)
  reconnectingEvent = true
})

client.on('error', error => {
  console.log('client error', error.message)
})

queue
  .on('completed', (job, results) => {
    console.log('queue completed a job')
  })
  .on('error', async (error) => {
    console.error('queue error', error.message)
  })
  .on('failed', async (job, error) => {
    console.error('queue failed', error.message)
  })

queue.process(async (job) => {
  console.log('running task')
})

setInterval(function () {
  queue.add({}).then(() => console.log('pushed task')).catch((e) => console.log('failed to push task'))
}, 10000)

I'm not sure if this is a "good" solution, but it works for now. Maybe the maintainers have a better idea of how to fix this cleanly and properly, or can describe an approach for which I can submit a PR.

@tarunbatra do you have anything to add or do you feel this sufficiently describes our findings?

@manast
Member

manast commented Jun 3, 2021

This actually sounds like a good solution. We just need to make sure that the old "loop" is not still running when the "ready" event is fired by ioredis; basically, we should proactively "cancel" the previous loop when starting the new one.

@miwnwski

miwnwski commented Jun 8, 2021

@manast How would you approach this? Is there something we can check today (without monkey patching the lib) for an existing running loop?

@gabrielrtomio

I have the same issue using Heroku: when it restarts, all queues get stuck in waiting.

@manast
Member

manast commented Jun 9, 2021

@miwnwski yeah, it is a bit tricky; you need to somehow refactor the "run" method so that it is cancellable: https://github.com/OptimalBits/bull/blob/develop/lib/queue.js#L864
I could do it myself, but I have a lot on my plate right now, so it will take some time...
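
For reference, a minimal sketch of one way to make such a loop cancellable (hypothetical; LoopRunner is an illustration, not Bull's actual code): each new loop bumps a generation counter, and the old loop exits once it notices it has been superseded.

// Hypothetical sketch of a cancellable processing loop. Calling startLoop()
// again (e.g. from the ioredis 'ready' handler) invalidates the previous
// loop; the old loop notices on its next iteration and stops quietly.
class LoopRunner {
  constructor() {
    this.generation = 0
  }

  startLoop(processOne) {
    const myGeneration = ++this.generation // supersede any earlier loop
    const tick = () => {
      if (myGeneration !== this.generation) {
        return Promise.resolve() // a newer loop has taken over; stop
      }
      return processOne().then(tick)
    }
    return tick()
  }
}

// Usage: the second call cancels the loop started by the first one.
const runner = new LoopRunner()
runner.startLoop(() => new Promise(resolve => setTimeout(resolve, 1000)))
runner.startLoop(() => new Promise(resolve => setTimeout(resolve, 1000)))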

@anhnhatuit

I have the same issue using GCP. Is there any quick solution for this?
Thanks a lot for the help.

@batrudinych
Contributor

Faced the same issue using AWS ElastiCache. It looks like there was a connection interruption or some kind of failover. There were no error logs or failed jobs; we discovered the issue through missing data in the database.

@nirgilboa

+1
Also encountering this issue when using AWS ElastiCache.

@dbousamra

We are also encountering this issue.

@ilgianlu

ilgianlu commented Aug 23, 2021

Hi, I'm working in a Kubernetes cluster and having the same issue. My workaround is to let the queue-processor container die whenever it cannot connect to the Redis container; Kubernetes takes care of restarting the processor pod.
To make the processor exit abruptly, I'm passing this Redis config:

const redisOptions = {
  ...config.get('redis.connectParams'),
  // Don't retry: log and exit so the orchestrator restarts the pod.
  retryStrategy: (times) => {
    console.log('could not connect to redis!');
    process.exit(1);
  },
};

@xairoo

xairoo commented Sep 12, 2021

Currently in dev and I have the same problem. It appears after waking my system from hibernation. I am using a remote Redis server, so the connection gets dropped after a while when the system sleeps :-)

ioredis will reconnect automatically, but Bull won't pull new jobs.

@francois2metz

Hi there. I do have some logs on our open source application https://github.com/1024pix/pix-editor

  • At 2021-09-10T06:02:16.000Z, the connection to Redis dropped, and the client kept trying to reconnect until 2021-09-10T06:02:29.000Z
  • The connection was still not stable, but commands resumed, except for brpoplpush, which means the queues were not processed
  • The connection keeps restarting every few minutes, for days, unless I restart the application

Hope this helps. I'm available to debug anything.

extract_ioredis_logs.csv

@xairoo

xairoo commented Sep 17, 2021

My MongoDB driver fires the disconnect event when I wake my system from sleep, and it reconnects right away.
ioredis doesn't trigger anything after sleep (but jobs still get pushed to the queue).

Hm, I switched from Node Redis to ioredis a while ago. I think Node Redis didn't have this problem.

Anyway, this doesn't help:

maxRetriesPerRequest: null,
enableReadyCheck: false,

@m-gagnon

I am running into this issue as well when facing any kind of redis reconnection. This is very painful since the service appears healthy but in fact never processes any jobs. Any workaround available? Thanks.

@xairoo

xairoo commented Oct 25, 2021

Looks like I am missing some Bull connections after sleep.
It happens only with a remote Redis server; locally it works.

Screenshots from RedisInsight (the cmd client is the RedisInsight client):
After starting it:
[screenshot: RedisInsight client list]

After sleep/reconnect:
[screenshot: RedisInsight client list]

I split your code example into a sender/receiver and a worker to test whether it happens only in such cases.

@xairoo

xairoo commented Oct 25, 2021

@manast the same happens when I close the queue and reopen it.
It results in missing clients. But maybe I am doing it wrong.
You can reproduce it locally.

After start:
[screenshot: RedisInsight client list]

After close/reconnect:
[screenshot: RedisInsight client list]

New jobs get added, but they don't get processed; no wonder, without a Redis connection.
But again, maybe I am doing it wrong.

const Queue = require("bull");

var testQueue = new Queue("test", {
  redis: {
    maxRetriesPerRequest: null,
    enableReadyCheck: false,
  },
});

testQueue.process(async (job) => {
  console.log(`Process started for Job ${job.id}`);
});

// testQueue.add(
//   { foo: "bar" },
//   {
//     repeat: {
//       every: 1500,
//     },
//   }
// );

// Instead of `repeat`
setInterval(() => {
  console.log(testQueue.client.status);
  if (testQueue.client.status === "ready") { // Check for ready, otherwise it will crash
    testQueue.add({ foo: "bar" });
  }
}, 2500);

testQueue.on("error", (err) => console.log(err));

testQueue.on("global:completed", (jobId) => {
  console.log(`Job ${jobId} completed`);
});

setTimeout(() => {
  console.log("closing");
  testQueue.close();
  setTimeout(() => {
    console.log("reconnect");
    // NOTE: process() and the listeners above are not re-registered on
    // this new instance, so jobs get added but are never processed.
    testQueue = new Queue("test", {
      redis: {
        maxRetriesPerRequest: null,
        enableReadyCheck: false,
      },
    });
  }, 2500);
}, 2500);

@anhnhatuit

Hello @xairoo,
Looks like the issue occurs when the brpoplpush command is blocked. We are using GCP's Redis. The connection is not lost, but processing still gets stuck after 2 or 3 days. We have about 8M jobs that should be processed per day. Another queue, which processes only about 2M jobs, is not facing any issue.

@manast
Member

manast commented Oct 25, 2021

@anhnhatuit my test code above forces a blocking brpoplpush

@manast
Member

manast commented Oct 25, 2021

@xairoo I do not understand what you mean by "the same happens when I close the queue and reopen it".

What do you mean by closing and reopening? And can you or can't you reproduce the issue with the code I provided? If so, would you mind explaining step by step how to reproduce it?

@anhnhatuit

Hello @manast,
So setting the Redis options as below may resolve the issue, right?

maxRetriesPerRequest: null,
enableReadyCheck: false,

Thanks

@manast
Member

manast commented Oct 25, 2021

@anhnhatuit as far as I know, yes.

@anhnhatuit

Many thanks @manast ,
Let me try. I will update the result soon.

Thanks,
Nhat

@xairoo

xairoo commented Oct 25, 2021

@manast yes, I can reproduce it with your code and a remote Redis server.

I only changed this:

// testQueue.add(
//   { foo: "bar" },
//   {
//     repeat: {
//       every: 1500,
//     },
//   }
// );

// Instead of `repeat`
setInterval(() => {
  console.log(testQueue.client.status);
  if (testQueue.client.status === "ready") { // Check for ready, otherwise it will crash
    testQueue.add({ foo: "bar" });
  }
}, 2500);

Anyway, this shouldn't matter.

What I mean by closing and reconnecting:

  • When you start your example, there are 3 bull connections.
  • After a disconnect for 5+ minutes (redis TCP keepalive is set by default to 300 seconds) you can see only 1 or 2 bull connections.

I played around a bit and ended up at testQueue.close().
After closing the queue, the 3 connections are gone (logically).
So I had to reopen/start the queue again.
Please have a look at my example #1873 (comment): shouldn't the queue process the added jobs again after the "reconnect", where testQueue = new Queue("test", { is called a second time?

The listeners are dead (disconnected), I think that's the problem.

testQueue.process(async (job) => {
  console.log(`Process started for Job ${job.id}`);
});

testQueue.on("global:completed", (jobId) => {
  console.log(`Job ${jobId} completed`);
});

^^ These don't work because the two Redis connections they require are missing.
New jobs still get added (that client is present), but jobs are not processed.

@flavianh

I'm going to attempt to sum up the conversation above.

Bull gets stuck after the Redis server disconnects and then reconnects. This affects many Bull users, as such disconnects happen from time to time in cloud environments.
The leading hypothesis now is that ioredis used to be configured by default with the options

maxRetriesPerRequest: null,
enableReadyCheck: false,

but this is no longer the case as of version 4 of ioredis.

Workaround for now: add these options explicitly (tested and confirmed by @manast, who is a maintainer of this repo).
Proposal for a durable fix: add these options explicitly in Bull. (My 2 cents: maybe also add a warning if the configuration is overridden by a user without these values?)
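
For anyone landing here, a minimal sketch of the workaround applied to a Bull queue (the queue name and handler are placeholders; the two ioredis options are the ones discussed above):

const Queue = require('bull')

// Workaround: pass the ioredis options explicitly so that commands,
// including the blocking brpoplpush, keep being retried after a reconnect.
const queue = new Queue('test', {
  redis: {
    host: 'localhost',
    port: 6379,
    maxRetriesPerRequest: null, // never fail a command due to a retry limit
    enableReadyCheck: false, // don't gate commands on the ready check
  },
})

queue.process(async job => {
  console.log('running task', job.id)
})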

@manast
Member

manast commented Oct 27, 2021

🎉 This issue has been resolved in version 4.0.0 🎉

The release is available on GitHub releases and npm.

Your semantic-release bot 📦🚀

@xairoo

xairoo commented Oct 27, 2021

Doesn't work.

If the connection gets dropped by the Redis TCP keepalive, the listeners queue.process() and queue.on() don't reconnect automatically. New jobs still get put into the queue.

I think the listeners should check the connection on an interval and reconnect if the connection was dropped. That would solve the problem.

@EPecherkin

This solution doesn't work for me. The issue is still there.

@manast
Member

manast commented Oct 28, 2021

Doesn't work.

If the connection gets dropped by the Redis TCP keepalive, the listeners queue.process() and queue.on() don't reconnect automatically. New jobs still get put into the queue.

I think the listeners should check the connection on an interval and reconnect if the connection was dropped. That would solve the problem.

According to the ioredis documentation, this should not happen with the settings provided, and I am unable to reproduce it.

@manast
Member

manast commented Oct 28, 2021

@xairoo and @EPecherkin since it seems like you can reproduce the issue easily, can you open a new issue for your specific use case, with steps on how to reproduce it, so that I can look into it?

@anhnhatuit

Hello @manast,
It looks like the issue is not resolved. Unfortunately, I could not reproduce it locally, but here is one point that may help you pinpoint the problem:
I'm using Google Cloud's Redis service. It gets stuck after 2-3 days of processing. We have about 9M jobs that should be processed per day, with about 100k jobs added to the queue every 2 minutes.
Another queue, which processes only about 2M jobs per day, works well without any issue.

Thanks,
Nhat

@xairoo

xairoo commented Oct 28, 2021

Give this a try:

// Ping every Redis connection held by the queue every 10 seconds so
// that idle blocking connections are not dropped by keepalive timeouts.
setInterval(function () {
  yourQueue.clients.forEach((client) => {
    client.ping();
  });
}, 10000);

@royalpinto
Author

🎉 This issue has been resolved in version 4.0.0 🎉

The release is available on GitHub releases and npm.

Your semantic-release bot 📦🚀

@manast I can confirm it works for me in the environment described in redis/ioredis#1285 (comment) once ioredis reconnects.

@Yuliang-Lee

After dozens of tests, I found the problem was caused by the way our Redis server is deployed. LOL. Our Redis server runs on k8s with the stock Redis config, so obviously there is no persistent data when the Redis server restarts. I did not have time to dig into why losing Redis's data causes the problem, but no problem occurs once I move the Redis data dir to a persistent volume.

@Shahtaj-Khalid

Shahtaj-Khalid commented Jun 28, 2023

I'm still facing the same issue: the Bull process stops consuming new messages upon a connection-reset error from Redis.
I created a separate issue with more details here: #2612.

Setting these flags explicitly didn't help:

maxRetriesPerRequest: null,
enableReadyCheck: false,

I'm using Bull 4.10.2 and ioredis 5.2.4.

It seems this issue happens only when Redis throws a "Connection reset by peer" error, and the error is not caught by the error event right away; it takes some hours, and once the error event is received, Bull starts processing the queued jobs again.

Has anyone found a workaround for this? Or do I need to update to another version of Bull? Kindly let me know, as I'm really stuck. Thanks.
