[Bug]: Worker stopped processing jobs, and mostly delayed Jobs #2466

Open
wernermorgenstern opened this issue Mar 8, 2024 · 116 comments · Fixed by #2515
Labels
bug Something isn't working

Comments

@wernermorgenstern

Version

v5.4.2

Platform

NodeJS

What happened?

We have a service, where a worker runs, and processes jobs. After the processing is done, it will create another job, which is delayed (around 64 minutes).
Today, I noticed that the service and worker stopped processing jobs. There were no error messages in the logs. When I used BullBoard (I use it as a UI to see jobs), I saw the jobs were still in the delayed state, and like 24 hours overdue.

When I restarted the service and the worker started, it immediately began processing those delayed jobs.
This is not the first time it has happened. Today, though, I first checked the delayed jobs.

In today's incident, the service has been running for 4 days.

We run in EKS on AWS (a NodeJS service, using TypeScript). I use BullMQ Pro, and we are using Groups, with each Group's concurrency set to 1.

How to reproduce.

I don't have any test code for this

Relevant log output

No Logs or error logs were produced

Code of Conduct

  • I agree to follow this project's Code of Conduct
@wernermorgenstern wernermorgenstern added the bug Something isn't working label Mar 8, 2024
@manast
Contributor

manast commented Mar 8, 2024

Do you know if the worker is still connected to Redis when this happens? For instance, does the worker appear in the list of workers that are online for the queue?

@wernermorgenstern
Author

That I need to check, that is a good idea. Is there a method which shows that? Or something I can look at directly in Redis?

@manast
Contributor

manast commented Mar 8, 2024

You can use https://api.docs.bullmq.io/classes/v5.Queue.html#getWorkers or Taskforce.sh, I don't know if BullBoard can also show this information.
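For reference, a minimal sketch of how such a check could be scripted (the queue name and local connection below are placeholders, not taken from this thread):

const { Queue } = require("bullmq");

async function checkWorkers() {
  // Placeholder queue name and connection; adjust to your environment.
  const queue = new Queue("my-queue", {
    connection: { host: "localhost", port: 6379 },
  });

  // getWorkers() returns one entry per connected worker, based on Redis CLIENT LIST.
  const workers = await queue.getWorkers();
  console.log(`Workers online for "${queue.name}": ${workers.length}`);
  for (const worker of workers) {
    console.log(worker.name, worker.addr);
  }

  await queue.close();
}

checkWorkers();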

@manast
Contributor

manast commented Mar 8, 2024

Also important to read this if you haven't already: https://docs.bullmq.io/guide/going-to-production#maxretriesperrequest

@wernermorgenstern
Author

Also important to read this if you haven't already: https://docs.bullmq.io/guide/going-to-production#maxretriesperrequest

I have set that one as null

    const redis = new Redis(config.redisBullUrl as string, {
      keepAlive: 1,
      autoResubscribe: true,
      autoResendUnfulfilledCommands: true,
      connectionName: `${config.serviceName}-${config.agentId}-${redisConnection}-bullmq`,
      connectTimeout: helpers.period('10s'),
      enableOfflineQueue: true,
      enableReadyCheck: true,
      enableAutoPipelining: true,
      maxRetriesPerRequest: null,
      retryStrategy: function (times) {
        return Math.max(Math.min(Math.exp(times), 20000), 1000)
      },
      offlineQueue: true
    })

@wernermorgenstern
Author

getWorkers

I don't think Bullboard shows it. I haven't seen it.

For Taskforce.sh, is there a trial version? For how long?
And what is the cost of it? I want to see if it might be better than Bullboard.

We use Redis Cloud (we don't self-host our Redis Instances)

@wernermorgenstern
Author

So, regarding checking if the worker is connected:
I actually already have code in place that lets me make API calls to check that, so that is good.
So next time we have a similar issue, and I notice the worker is not connected, what would the next steps be in troubleshooting, and resolving this issue?

@wernermorgenstern
Author

I also was reading the link you gave. I need to set enableOfflineQueue to false for the Redis instance used by the Queue (for adding jobs), but leave it as true for the Workers. So that is one good suggestion.
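For anyone else reading along, a rough sketch of that split, assuming a placeholder Redis URL and queue name: the producer connection disables the offline queue so adds fail fast when Redis is unreachable, while the worker connection keeps it enabled so the worker rides out reconnections.

const { Queue, Worker } = require("bullmq");
const Redis = require("ioredis");

const REDIS_URL = process.env.REDIS_URL || "redis://localhost:6379"; // placeholder

// Producer side: fail fast instead of buffering adds while Redis is unreachable.
const queueConnection = new Redis(REDIS_URL, {
  maxRetriesPerRequest: null,
  enableOfflineQueue: false,
});

// Worker side: keep the offline queue so the worker resumes once Redis is back.
const workerConnection = new Redis(REDIS_URL, {
  maxRetriesPerRequest: null,
  enableOfflineQueue: true,
});

const queue = new Queue("my-queue", { connection: queueConnection });
const worker = new Worker(
  "my-queue",
  async (job) => {
    // ... process the job ...
  },
  { connection: workerConnection }
);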

@manast
Contributor

manast commented Mar 10, 2024

For Taskforce.sh, is there a trial version? For how long?
And what is the cost of it? I want to see if it might be better than Bullboard.

Yes, there is a trial. You can find the pricing on the webpage: https://taskforce.sh/

@manast
Contributor

manast commented Mar 10, 2024

So next time we have a similar issue, and I notice the worker is not connected, what would the next steps be in troubleshooting, and resolving this issue?

If you use the suggested settings, the workers should automatically reconnect as soon as they can, so you should not get this issue anymore.
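One way to make such disconnects visible the next time this happens is to log the worker's connection lifecycle events; a minimal sketch, with a placeholder queue name and connection:

const { Worker } = require("bullmq");

const worker = new Worker(
  "my-queue",
  async (job) => {
    // ... process the job ...
  },
  { connection: { host: "localhost", port: 6379, maxRetriesPerRequest: null } }
);

// These events help correlate "stopped processing" incidents with Redis disconnections.
worker.on("ready", () => console.log("worker connected to Redis"));
worker.on("closed", () => console.warn("worker connection closed"));
worker.on("error", (err) => console.error("worker error", err));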

@wernermorgenstern
Author

With Taskforce.sh, I am a bit confused about what the difference is between the plans, especially between 1, 5, and 15 connections.

@wernermorgenstern
Author

Also, we use Redis Cloud. Three instances: one for development, one for QA testing, and currently one for production. So does that mean we need 3 connections?

Also, can we use a direct connection or the Taskforce.sh connector?
What about the Pro connector? Do we need that one?

I will also need to make sure we enable TLS on the Redis Cloud instances.

@manast
Contributor

manast commented Mar 11, 2024

You need a connection for every host that you want to debug/monitor, so in your example it would be 3, as you suspected. The direct connector can be used if you have access to your Redis host from the internet. If not, you can use the connector; for example, the connector is always used for local connections.

@mistersingh179

@manast I am noticing the same issue with a bunch of delayed jobs not getting picked up by the worker.
I notice that they stay in the delayed state forever until a new job comes into the active state, and then the delayed jobs all get picked up.


Here you can see that the job was created 5 minutes ago with a delay of 1-2 seconds, and the job is just sitting there.


Here you can see that the worker is connected.

@manast
Contributor

manast commented Mar 15, 2024

What are your worker settings?

@manast
Contributor

manast commented Mar 15, 2024

Another thing, do you know why these jobs have such high "started" numbers?

@manast
Contributor

manast commented Mar 15, 2024

Also, how are those delayed jobs produced?

@wernermorgenstern
Author

We just had a similar issue this morning.

Here is how we create the delayed jobs.

const createNextJob = async (
  jobRotateQueue: QueuePro,
  tid: string,
  devCreated: any,
  devConfig: any,
  delay: string
) => {
  const jobData = {
    tid,
    devCreated,
    devConfig
  }

  const jobOptions: JobsProOptions = {
    removeOnComplete: {
      count: 1000
    },
    removeOnFail: {
      count: 10
    },
    attempts: 6,
    group: {
      id: tid
    },
    delay: helpers.period(delay) // This converts e.g. '2m' to milliseconds
  }

  try {
    return await jobRotateQueue.add('jobName', jobData, jobOptions)
  } catch (err: any) {
    winstonLogger.error('Unable to add Job', {
      groupId: tid,
      error: { message: err.message, stack: err.stack, code: err.code }
    })
  }
}

@mistersingh179

What are your worker settings?

my worker options are:

{
    connection: redisClient,
    concurrency: Number(10),
    limiter: {
      max: 100,
      duration: 1000,
    },
    autorun: false,
    metrics: {
      maxDataPoints: MetricsTime.TWO_WEEKS,
    },
  },

with redisClient being

const redisClient = new Redis(REDIS_URL, {
  maxRetriesPerRequest: null,
  enableReadyCheck: false,
  retryStrategy: function (times: number) {
    return Math.max(Math.min(Math.exp(times), 20000), 1000);
  },
});

// this prevents node warning which comes from many workers adding listeners to the same redis connection
redisClient.setMaxListeners(100);

@mistersingh179

Also, how are those delayed jobs produced?

When I add a job, in its processor function I try to get a Redis lock on a custom key; if it fails to get a lock, I move the job to delayed like this:

await job.moveToDelayed(Date.now() + randomInt(1000, 2000), token);
throw new DelayedError();
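For context, a rough sketch of what that processor pattern could look like end to end; the lock key, TTL, and queue name here are hypothetical, not the actual implementation:

const { Worker, DelayedError } = require("bullmq");
const Redis = require("ioredis");
const { randomInt } = require("crypto");

const redisClient = new Redis(process.env.REDIS_URL || "redis://localhost:6379", {
  maxRetriesPerRequest: null,
});

const worker = new Worker(
  "my-queue",
  async (job, token) => {
    // Hypothetical lock: SET NX with a short TTL on a key derived from the job data.
    const gotLock = await redisClient.set(`lock:${job.data.key}`, token, "PX", 5000, "NX");
    if (!gotLock) {
      // Lock not acquired: push the job back to the delayed set and signal BullMQ
      // that this is a deliberate delay, not a failure.
      await job.moveToDelayed(Date.now() + randomInt(1000, 2000), token);
      throw new DelayedError();
    }
    // ... do the actual work while holding the lock ...
  },
  { connection: redisClient }
);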

@mistersingh179

Another thing, do you know why these jobs have such high "started" numbers?

They have a high started number because they have been tried many times, and every time they were unable to get a lock they were moved to delayed.

@roggervalf
Collaborator

hi @wernermorgenstern, I can see that you are using groups, but you are pointing to bullmq version 5.4.2, not the actual bullmq-pro version. v7.3.0, which was released today, uses bullmq 5.4.2. Just out of curiosity, are you using bullmq alongside bullmq-pro?

@wernermorgenstern
Author

I am actually using Pro. So do I need to remove bullmq from the package.json? I use the Pro versions of the functions and constructors.

@roggervalf
Collaborator

hey, if you mean that you have both bullmq-pro and bullmq in your package.json, yes, you should only have the pro version, as we use fixed versions of bullmq in the pro version

@manast
Contributor

manast commented Mar 16, 2024

another question, which version of Redis are you using?

@wernermorgenstern
Author

hey, if you mean that you have both bullmq-pro and bullmq in your package.json, yes, you should only have the pro version, as we use fixed versions of bullmq in the pro version

I will remove and try that

@wernermorgenstern
Author

another question, which version of Redis are you using?

Hi, we are using Redis Cloud Enterprise, which is on 7.2

@roggervalf
Collaborator

roggervalf commented Mar 16, 2024

hi @wernermorgenstern, could you please try upgrading to pro version 7.3.0? That's our latest release https://github.com/taskforcesh/bullmq/blob/master/docs/gitbook/bullmq-pro/changelog.md#730-2024-03-16 and it contains a fix that affects delayed jobs

@wernermorgenstern
Author

hi @wernermorgenstern, could you please try upgrading to pro version 7.3.0? That's our latest release https://github.com/taskforcesh/bullmq/blob/master/docs/gitbook/bullmq-pro/changelog.md#730-2024-03-16 and it contains a fix that affects delayed jobs

I will do that, and deploy it to our production environment next week. What we saw yesterday was that we had one job stuck in the active state. In Taskforce.sh (trial version) - I love it so far - I saw it had workers, and the idle time was between 1s and 10s, so the worker was still processing other jobs.
But the active job was stuck in active. I could not even delete it with Taskforce.sh.
The only way I could get it unstuck was to restart the service, which created a new worker, and then create a job with the same Group ID; the stuck job then went out of the active state.

@ChrisLahaye

5.7.6 fixes it for us

@matthewgonzalez

Going to cut a release with 5.7.7 today. Will let you know if we experience it again. Thanks so much, BullMQ team.

@matthewgonzalez

We've actually seen this occur again twice today after having deployed our workers with 5.7.7 last night.

Here's a screenshot of the job stuck in delayed.

@manast
Contributor

manast commented May 2, 2024

@matthewgonzalez is this the only job in the queue (a repeatable job) or do you use the same queue with other types of jobs?

@manast
Contributor

manast commented May 2, 2024

Also, what are your settings for the repeat job? It would be golden if we could reproduce it ourselves. And which concurrency settings are you using, how many workers, etc.?

@matthewgonzalez

matthewgonzalez commented May 2, 2024

@manast

@matthewgonzalez is this the only job in the queue (a repeatable job) or do you use the same queue with other types of jobs?

Yes, the only job in the queue

Also, what are your settings for the repeat job? It would be golden if we could reproduce it ourselves. And which concurrency settings are you using, how many workers, etc.?

Here's how we enqueue the job when the server starts:

queue.add('__default__', null, {
    jobId: queueName + '-cron-worker-job',
    repeat: {
      every: 15000, // every 15 seconds
    },
  })

No concurrency set in BullMQ, but we do autoscale our pods up to a maximum of 5 depending on resource utilization.
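For reference, a minimal sketch of how per-worker concurrency could be set explicitly (queue name and connection are placeholders): with no concurrency option each worker processes one job at a time, and each autoscaled pod simply adds one more worker on the same queue.

const { Worker } = require("bullmq");

const worker = new Worker(
  "my-queue",
  async (job) => {
    // ... process the repeatable job ...
  },
  {
    connection: { host: "localhost", port: 6379, maxRetriesPerRequest: null },
    // Number of jobs this single worker instance runs in parallel (default is 1).
    concurrency: 1,
  }
);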

@manast
Contributor

manast commented May 2, 2024

@matthewgonzalez thank you for the info, we will try to reproduce it based on those settings.

@roggervalf
Collaborator

hi @matthewgonzalez, are you passing an ioredis instance? If yes, are you also setting maxRetriesPerRequest: null?

@matthewgonzalez

matthewgonzalez commented May 6, 2024

Yes we are passing in an ioredis instance. Here is how that instance is configured:

export const connection = new IORedis(REDIS_URL, {
  // a warning is thrown on redis startup if these aren't added
  enableReadyCheck: false,
  maxRetriesPerRequest: null,
}).setMaxListeners(4 * queuesMQ.length + 10)

@matthewgonzalez

We rolled back to 5.4.0 two weeks ago and have not experienced this issue since. Still curious as to the resolution here so we know when it may be safe to attempt to keep BullMQ up-to-date again. Thanks for all your support thus far.

@manast manast reopened this May 17, 2024
@manast
Contributor

manast commented May 17, 2024

I am reopening this since it seems it is still not working; it will remain open until we can reproduce it.

@manast
Contributor

manast commented May 24, 2024

For the record, this is the test code I am using, and I am not able to reproduce the issue with it. I wonder if there is not something more to it, like disconnections or something like that.

const { Queue, Worker } = require("bullmq");

const queueName = "test";

async function start() {
  const queue = new Queue(queueName, {
    connection: { host: "localhost", port: 6379 },
    // a warning is thrown on redis startup if these aren't added
    enableReadyCheck: false,
    maxRetriesPerRequest: null,
  });

  const job = await queue.add("__default__", null, {
    jobId: queueName + "-cron-worker-job",
    repeat: {
      every: 15000, // every 15 seconds
    },
    data: {
      foo: "bar",
    },
  });

  const processFn = async (job) => {
    console.log(`Processing job ${job.id} with data ${job.data}`)
    console.log(`-> ${job.id}`);
    await new Promise((res) => setTimeout(res, 1000));
    console.log(`\t<- ${job.id}`);
  };

  const worker = new Worker(queueName, processFn, {
    connection: {
      host: "localhost",
      port: 6379, // a warning is thrown on redis startup if these aren't added
      enableReadyCheck: false,
      maxRetriesPerRequest: null,
    },
  });

  worker.on("error", (err) => {
    console.error(err);
  });

  worker.on("completed", (job) => {
    console.log(`Job ${job.id} completed`);
  });

  worker.on("failed", (job, err) => {
    console.error(`Job ${job.id} failed with ${err.message}`);
  });
}

start();

@lukas-becker0

I think we encountered this issue today.

We are using bullmq 5.7.8 and ioredis 5.4.1, and today we noticed that for some reason no job was being processed.

We first thought that it might be a Redis issue. We have not done a thorough investigation yet, but after restarting the service we noticed that the jobs were processed as expected, which means that adding the jobs worked but the worker did not process them.

This is the first time we have encountered this issue.

Aside from maxRetriesPerRequest: null we are using lazyConnect: true, and we are currently not setting enableReadyCheck at all (should we set it to false?).

@lukas-becker0

lukas-becker0 commented May 27, 2024

So next time we have a similar issue, and I notice the worker is not connected, what would the next steps be in troubleshooting, and resolving this issue?

If you use the suggested settings, the workers should automatically reconnect as soon as they can, so you should not get this issue anymore.

Is it really necessary to explicitly set enableOfflineQueue: true for the workers in order for them to reconnect?
So far this has not been necessary.

You can use https://api.docs.bullmq.io/classes/v5.Queue.html#getWorkers or Taskforce.sh, I don't know if BullBoard can also show this information.

I connected to our Q Redis instance and checked the workers and worker count of the deployed queue locally (since the queues of the Q stage also stopped processing jobs), and indeed the workers of our Q deployment are apparently no longer connected; the worker count for the queue is 0.

@lukas-becker0

I connected to our Q Redis instance and checked the workers and worker count of the deployed queue locally (since the queues of the Q stage also stopped processing jobs), and indeed the workers of our Q deployment are apparently no longer connected; the worker count for the queue is 0.

Apparently the managed Redis instances were restarted prior to the issue.
About two hours later the queue did accept new jobs again; the worker, however, stopped processing jobs.
(We do not know yet why the worker stopped; we use the same options (e.g. infinite retries) for both the queue and the worker.)

@lukas-becker0

lukas-becker0 commented May 27, 2024

Hi, @manast
I did some tests with the code you provided.

If I force close the redis connections with e.g.
redis-cli CLIENT KILL TYPE normal && redis-cli CLIENT KILL TYPE slave && redis-cli CLIENT KILL TYPE pubsub
the worker reconnects and continues to process jobs as expected.

If, however, I kill/shut down the entire Redis server (docker-compose down), then wait a few seconds (~5 seconds) and start Redis again (docker-compose up -d),
the worker reconnects but does not continue to process jobs from the queue (although isRunning returns true).
The queue then no longer sees any connected worker.

(Simulating a server restart and/or crash)

Is that the expected behaviour ?

docker-compose.yaml

version: '3.8'

services:
    redis:
        container_name: 'redis_test_server'
        image: redis:6.2.14-alpine
        restart: always
        ports:
            - '6379:6379'
        volumes:
            -   /tmp/redis-test-server:/data

(Setting redis-cli config set maxmemory-policy noeviction after the first start, of course)

I modified your code slightly:

const { Queue, Worker } = require('bullmq');

const queueName = 'test';

async function start() {
	const queue = new Queue(queueName, {
		connection: { host: 'localhost', port: 6379 },
		// a warning is thrown on redis startup if these aren't added
		enableReadyCheck: false,
		maxRetriesPerRequest: null,
		enableOfflineQueue: false,
	});

	setInterval(() => {
		queue.getWorkersCount().then((numberOfWorkers) => {
			console.warn(`Number of workers: ${numberOfWorkers}`);
		});

		queue.getJobCounts().then((numberOfJobs) => {
			console.warn(`Number of jobs: ${JSON.stringify(numberOfJobs)}`);
		});
	}, 10_000);

	const job = await queue.add('__default__', null, {
		jobId: queueName + '-cron-worker-job',
		repeat: {
			every: 3000, // every 3 seconds
		},
		data: {
			foo: 'bar',
		},
	});

	const processFn = async (job) => {
		console.log(`Processing job ${job.id} with data ${job.data}`);
		console.log(`-> ${job.id}`);
		await new Promise((res) => setTimeout(res, 1000));
		console.log(`\t<- ${job.id}`);
	};

	const worker = new Worker(queueName, processFn, {
		connection: {
			host: 'localhost',
			port: 6379, // a warning is thrown on redis startup if these aren't added
			enableReadyCheck: false,
			maxRetriesPerRequest: null,
			enableOfflineQueue: true,
		},
	});

	worker.on('error', (err) => {
		console.error(err);
	});

	worker.on('closed', () => {
		console.warn('Worker closed');
	});

	worker.on('ready', () => {
		console.warn('Worker is ready!');
	});

	worker.on('completed', (job) => {
		console.log(`Job ${job.id} completed`);
	});

	worker.on('failed', (job, err) => {
		console.error(`Job ${job.id} failed with ${err.message}`);
	});
}

start();

@lukas-becker0

lukas-becker0 commented May 27, 2024

I can reproduce the above with BullMQ starting from version 5.0.0;
BullMQ 4.17.0 works as expected.

@manast
Contributor

manast commented May 28, 2024

@lukas-becker0 seems like I am able to reproduce it following your instructions. I will keep you updated...

@manast
Contributor

manast commented May 29, 2024

I hope this small fix finally resolves this issue for everybody.

@lukas-becker0

Hi @manast,

I'm sorry, but I'm still able to reproduce it with the fix and bullmq 5.7.13.

I am assuming it is enough to add the line from the fix PR to worker.js in bullmq/dist/cjs/classes/worker.js?


Sometimes the worker can connect again, but when I then restart Redis for a second or third time, it eventually results in the same issue as before.


I also tried with a custom redis.conf with the recommended AOF option being set from the start but it made no difference.

@lukas-becker0

FYI, @manast
I created a repo with my current test setup here, just in case it might be helpful.

@manast
Contributor

manast commented May 29, 2024

It is not enough to add the line with the disconnect; you must remove the other two as well.

@lukas-becker0

lukas-becker0 commented May 29, 2024

@manast

It is not enough adding the line with the disconnect, you must remove the other 2 as well.

Sorry, you are right, I somehow missed that 🙈 (due to the Dark Reader Firefox extension...)

I just ran the tests again and it does indeed work now as expected.
Thank you very much and sorry for the confusion.
🎆 😃

@croconut

croconut commented May 29, 2024

Really glad this got fixed, literally just ran into this issue today during POC testing and upgrading to 5.7.14 did the trick. Thanks @manast

@wernermorgenstern
Author

wernermorgenstern commented May 29, 2024 via email

@manast
Contributor

manast commented May 30, 2024

@wernermorgenstern it's coming very soon.

@roggervalf
Collaborator

hi @wernermorgenstern, it's available since v7.8.2 in the pro version

@matthewgonzalez

matthewgonzalez commented May 31, 2024

We will attempt the update next week and report back.
