Properly debugging job stalled more than allowable limit #412

Closed
emhagman opened this issue Dec 9, 2016 · 15 comments

Comments

@emhagman

emhagman commented Dec 9, 2016

Hi!

I have a job that simply spins up an Amazon Lambda function and awaits the response. I thought that jobs only stalled when too much CPU work was occurring on the main thread, so I am confused as to why my job would be stalling.

Would you mind explaining the different ways a job could be stalled? I think I am missing something as far as how the job stalling works.

Version: 1.1.3
Redis Version: 3.2.1

Error: job stalled more than allowable limit 
at /app/node_modules/bull/lib/queue.js:569:50 
@bradvogel
Contributor

bradvogel commented Dec 9, 2016 via email

@emhagman
Author

emhagman commented Dec 9, 2016

Only for this one job type at the moment. This job used to do a lot more work, but all of that has been moved off the server to Amazon Lambda, so I find it odd that it stalls now that it does no work on the server with bull.

I do have multiple workers running bull if that matters at all.

I use Trace and did detect that there was event loop lag during that time. I will look into it further on my end, thanks for the explanation!

@bradvogel
Contributor

Any update on this?

@sschizas

Started getting the same error too.

@bradvogel
Contributor

@n3trino does it happen for all job types, or only some? Are you seeing high CPU when the job is running (that might cause it to fail to renew the timer)?

@carcinocron

carcinocron commented Jan 18, 2017

Is it possible to increase the allowable limit for specific queues?

@jf

jf commented Jan 31, 2017

Is it possible for somebody to explain what this limit is? Is it a limit on time for a worker/job? If a worker/job takes too long, will this be triggered?
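
On the two questions above: as the other replies in this thread describe it, a worker has to keep renewing a timer on the job it is processing, and a job whose renewal is missed is treated as stalled; stalling too many times produces this error. To my knowledge, later Bull releases (3.x and up) expose the relevant knobs per queue through a `settings` object, which version 1.1.3 from the original report may not have. A minimal sketch, assuming such a release; the values and the queue name are illustrative only:

```javascript
// Illustrative values only; tune them for your own workload.
const queueOptions = {
  settings: {
    lockDuration: 60000,    // ms a worker's lock on a job stays valid without renewal
    lockRenewTime: 15000,   // ms between lock renewals
    stalledInterval: 30000, // ms between checks for stalled jobs
    maxStalledCount: 2,     // times a job may be found stalled before it is failed
  },
};

// With Bull installed and Redis reachable, the options are passed per queue:
// const Queue = require('bull');
// const lambdaQueue = new Queue('lambda-jobs', queueOptions); // hypothetical queue name
console.log(queueOptions.settings);
```

Since the options go to the queue constructor, each queue can carry its own limits. The time limit concerns how long the lock goes without renewal, not how long the job itself runs.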

@zhaohanweng

Are the jobs properly resolved? E.g., did you call jobDone(), or Promise.resolve() at the end of process?

@jamesearl

I've started to experience this as well; however, the job doesn't truly fail. What I mean by that is that the work I wanted to get done is finished in its entirety. But because of the error (I think because of the error, anyway), the job is automatically retried, and the second run is guaranteed to error with the following:

Error: job stalled more than allowable limit
    at node_modules/bull/lib/queue.js:616:39
    at tryCatcher (node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (node_modules/bluebird/js/release/promise.js:512:31)
    at Promise._settlePromise (node_modules/bluebird/js/release/promise.js:569:18)
    at Promise._settlePromise0 (node_modules/bluebird/js/release/promise.js:614:10)
    at Promise._settlePromises (node_modules/bluebird/js/release/promise.js:693:18)
    at Async._drainQueue (node_modules/bluebird/js/release/async.js:133:16)
    at Async._drainQueues (node_modules/bluebird/js/release/async.js:143:10)
    at Immediate.Async.drainQueues (node_modules/bluebird/js/release/async.js:17:14)
    at runCallback (timers.js:649:20)
    at tryOnImmediate (timers.js:622:5)
    at processImmediate [as _immediateCallback] (timers.js:594:5)

@bradvogel my jobs do typically peg my machine's CPU. But I don't see failures until I run jobs that take over approximately 45s to complete.

@zhaohanweng I'm returning a promise from the process function, so I'm assuming I do not need to call Promise.resolve() in that case, but please correct me if I'm wrong?
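
For reference, Bull accepts either style of process function: one that returns a promise, or one that takes a `done` callback as a second parameter. The sketch below shows both shapes side by side without requiring Bull or Redis; `doWork`, `promiseHandler`, `callbackHandler`, and `fakeJob` are hypothetical names used only for illustration.

```javascript
// Hypothetical stand-in for the real work (for example, a Lambda invocation).
function doWork(data) {
  return new Promise((resolve) => setTimeout(() => resolve(data.value * 2), 10));
}

// Promise style: one parameter; the job completes when the returned promise settles.
function promiseHandler(job) {
  return doWork(job.data);
}

// Callback style: two parameters; the job completes only when done() is called.
function callbackHandler(job, done) {
  doWork(job.data).then(
    (result) => done(null, result),
    (err) => done(err)
  );
}

// With Bull, either would be registered as:
//   queue.process(promiseHandler);   or   queue.process(callbackHandler);

// Exercise both handlers with a fake job object.
const fakeJob = { data: { value: 21 } };
promiseHandler(fakeJob).then((result) => console.log('promise style:', result));
callbackHandler(fakeJob, (err, result) => console.log('callback style:', err || result));
```

With the promise style, no extra `Promise.resolve()` call is needed at the end, provided the returned promise actually settles.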

@bradvogel
Contributor

Can you remove parts of your job processing function until you can get it to run successfully? I bet some part (probably near the end of the processing function) is stalling the JavaScript event loop and causing Bull's setInterval() call that renews the timer to lag.
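
The effect described here can be reproduced without Bull. The sketch below is an illustration only, not Bull's code: it starts a repeating timer, blocks the event loop with a busy loop, and reports how late the first tick fired. `measureTimerLag` and `busyWait` are made-up names.

```javascript
// Block the event loop synchronously for roughly `ms` milliseconds.
function busyWait(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin: nothing else can run on this thread meanwhile
  }
}

// Start an interval, block the loop, and resolve with the first tick's lateness in ms.
function measureTimerLag(intervalMs, blockMs) {
  return new Promise((resolve) => {
    const expected = Date.now() + intervalMs;
    const timer = setInterval(() => {
      clearInterval(timer);
      resolve(Date.now() - expected);
    }, intervalMs);

    // Simulated CPU-bound work; the timer callback cannot fire until this returns.
    busyWait(blockMs);
  });
}

// The timer is due after 50 ms, but the loop is blocked for 500 ms.
measureTimerLag(50, 500).then((lagMs) => {
  console.log(`timer fired about ${lagMs} ms late`);
});
```

A renewal timer delayed this way is one plausible route to a job being reported as stalled even though its work eventually completes.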

@jamesearl

@bradvogel yep, I reorganized the work into two separate jobs that run sequentially and things are cranking along smoothly now. The CPU is much less taxed, so it seems you were exactly right about the timer latency.
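
Splitting the work into separate jobs is what fixed it here. A related technique, shown only as a sketch and not as what was done in this thread, is to break CPU-heavy work inside a single job into small batches and yield to the event loop between them so that pending timers get a chance to run. `processInChunks` and its parameters are hypothetical names.

```javascript
// Process items in batches, yielding to the event loop after each batch.
async function processInChunks(items, chunkSize, handleItem) {
  const results = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    for (const item of items.slice(i, i + chunkSize)) {
      results.push(handleItem(item));
    }
    // setImmediate defers the next batch to a later loop iteration,
    // so timers and I/O callbacks that are due can run in between.
    await new Promise((resolve) => setImmediate(resolve));
  }
  return results;
}

// Usage: square a list of numbers two at a time.
processInChunks([1, 2, 3, 4, 5], 2, (n) => n * n).then((squares) => {
  console.log(squares);
});
```

This only helps when the work can be divided; a single long synchronous call still blocks the loop for its full duration.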

Thanks!

@manast
Member

manast commented Apr 6, 2017

Since this issue seems to be very recurrent, I have added a new feature request that hopefully will solve this problem once and for all: #488

manast closed this as completed Jun 29, 2017

@cleivson

cleivson commented Dec 7, 2022

What's the feature request that replaced this bug?

@bobber205

> What's the feature request that replaced this bug?

I'd love to know as well

@SirPhemmiey

Any update on this?
