Properly debugging `job stalled more than allowable limit` #412

emhagman · 2016-12-09T16:58:03Z

Hi!

I have a job that simply spins up an Amazon Lambda function and awaits its response. I thought jobs only stalled when there was too much CPU work occurring on the main thread, so I am confused as to why my job would be stalling.

Would you mind explaining the different ways a job could be stalled? I think I am missing something as far as how the job stalling works.

Version: 1.1.3
Redis Version: 3.2.1

Error: job stalled more than allowable limit 
at /app/node_modules/bull/lib/queue.js:569:50


bradvogel · 2016-12-09T17:38:46Z

Does it happen for all jobs? Or only some? A job can only get stalled if bull isn't able to renew the lock (which it does internally). This only happens if the entire node event loop gets behind due to high CPU (and setInterval isn't run).
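To make the mechanism above concrete, here is a minimal sketch of how a blocked event loop delays a `setInterval` callback long enough for a lock to lapse. It is not Bull's implementation: `LOCK_DURATION_MS`, `RENEW_INTERVAL_MS`, `lockExpired`, `blockEventLoop`, and `demo` are hypothetical names and values chosen for illustration, and no Redis lock is involved.

```javascript
// Hypothetical values for illustration only; these are not Bull's defaults.
const LOCK_DURATION_MS = 200;  // assumed lifetime of a job lock
const RENEW_INTERVAL_MS = 100; // assumed period of the renewal timer

// A lock counts as expired if it was not renewed within its lifetime.
function lockExpired(lastRenewedAt, now, lockDurationMs = LOCK_DURATION_MS) {
  return now - lastRenewedAt > lockDurationMs;
}

// Busy-wait to simulate CPU-bound work that blocks the event loop.
function blockEventLoop(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin
  }
}

// Runs a renewal timer, then blocks the loop for longer than the lock lifetime.
function demo() {
  let lastRenewedAt = Date.now();

  const timer = setInterval(() => {
    const now = Date.now();
    if (lockExpired(lastRenewedAt, now)) {
      console.log(
        `renewal ran ${now - lastRenewedAt} ms after the previous one; ` +
          'the lock would already have expired'
      );
    }
    lastRenewedAt = now;
  }, RENEW_INTERVAL_MS);

  // While this runs, no timer callback can fire.
  setTimeout(() => blockEventLoop(500), 150);
  setTimeout(() => clearInterval(timer), 1000);
}
```

Calling `demo()` should log one late renewal, since the interval callback cannot run until the busy-wait returns. In this simplified model, a job whose lock lapses that way is the one that would be treated as stalled.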

emhagman · 2016-12-09T20:14:39Z

Only for this one job type at the moment. This job used to do a lot more work, but all of that has now been moved off the server to Amazon Lambda, so I find it odd that it stalls now that it does no work on the server with bull.

I do have multiple workers running bull if that matters at all.

I use Trace and did detect that there was event loop lag during that time. I will look into it further on my end, thanks for the explanation!

bradvogel · 2016-12-16T02:28:43Z

Any update on this?

sschizas · 2016-12-19T15:25:52Z

Started getting the same error too.

bradvogel · 2016-12-19T20:21:40Z

@n3trino does it happen for all job types, or only some? Are you seeing high CPU when the job is running (that might cause it to fail to renew the timer)?

carcinocron · 2017-01-18T16:20:05Z

Is it possible to increase the allowable limit for specific queues?

jf · 2017-01-31T04:32:23Z

Is it possible for somebody to explain what this limit is? Is it a limit on time for a worker/job? If a worker/job takes too long, will this be triggered?

zhaohanweng · 2017-03-27T22:41:30Z

Are the jobs properly resolved? E.g., did you call jobDone() or Promise.resolve() at the end of process?

jamesearl · 2017-04-06T01:29:05Z

I've started to experience this as well; however, the job doesn't truly fail. What I mean by that is, the work that I wanted to get done is finished in its entirety. But because of the error (I think because of the error, anyway), the job is automatically retried, and the second run is guaranteed to error with the following:

Error: job stalled more than allowable limit
    at node_modules/bull/lib/queue.js:616:39
    at tryCatcher (node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (node_modules/bluebird/js/release/promise.js:512:31)
    at Promise._settlePromise (node_modules/bluebird/js/release/promise.js:569:18)
    at Promise._settlePromise0 (node_modules/bluebird/js/release/promise.js:614:10)
    at Promise._settlePromises (node_modules/bluebird/js/release/promise.js:693:18)
    at Async._drainQueue (node_modules/bluebird/js/release/async.js:133:16)
    at Async._drainQueues (node_modules/bluebird/js/release/async.js:143:10)
    at Immediate.Async.drainQueues (node_modules/bluebird/js/release/async.js:17:14)
    at runCallback (timers.js:649:20)
    at tryOnImmediate (timers.js:622:5)
    at processImmediate [as _immediateCallback] (timers.js:594:5)

@bradvogel my jobs do typically peg my machine's CPU. But I don't see failures until I run jobs that take over approximately 45s to complete.

@zhaohanweng I'm returning a promise from the process function, so I'm assuming I do not need to call Promise.resolve() in that case, but please correct me if I'm wrong?

bradvogel · 2017-04-06T01:52:23Z

Can you remove parts of your job processing function until you can get it to run successfully? I bet some part (probably near the end of the processing function) is stalling the JavaScript event loop and causing Bull's setInterval() call to renew the timer to lag.
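If the stalling part turns out to be a long synchronous loop, one possible mitigation is to process the work in chunks and yield to the event loop between chunks, so that pending timers (such as a renewal `setInterval`) get a chance to fire. The sketch below assumes the work can be split per item; `chunk`, `processInChunks`, and `handle` are hypothetical names, not part of Bull.

```javascript
// Split an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
  const parts = [];
  for (let i = 0; i < items.length; i += size) {
    parts.push(items.slice(i, i + size));
  }
  return parts;
}

// Handle items chunk by chunk, yielding to the event loop after each chunk
// so that timers scheduled elsewhere are not starved.
async function processInChunks(items, size, handle) {
  const results = [];
  for (const part of chunk(items, size)) {
    for (const item of part) {
      results.push(handle(item));
    }
    // setImmediate resumes on a later loop iteration, after due timers have run.
    await new Promise((resolve) => setImmediate(resolve));
  }
  return results;
}
```

A processing function could return `processInChunks(rows, 100, transformRow)` as its promise, where `rows` and `transformRow` stand in for the job's own data and per-item work. Smaller chunks keep timer latency lower at the cost of more scheduling overhead.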

jamesearl · 2017-04-06T04:39:53Z

@bradvogel yep, I reorganized the work into two separate jobs that run sequentially and things are cranking along smoothly now. The CPU is much less taxed, so it seems you were exactly right about the timer latency.

Thanks!

manast · 2017-04-06T09:54:46Z

Since this issue seems to be very recurrent, I have added a new feature request that hopefully will solve this problem once and for all: #488

cleivson · 2022-12-07T13:33:16Z

What's the feature request that replaced this bug?

bobber205 · 2022-12-08T18:03:12Z

What's the feature request that replaced this bug?

I'd love to know as well

SirPhemmiey · 2023-04-23T16:48:42Z

any update on this?

manast closed this as completed Jun 29, 2017
