Add worker id to on_thread_park() and on_thread_unpark() callbacks (for stuck worker watchdog)
#6353
Comments
PR with the proposed change: #6354
Sorry if I missed something in your situation, but if you just want to track the number of times workers park, you may be able to use some of the existing metrics.
The existing PR was closed due to a breaking change. Is there some different API we can add instead that would help with your issue?
The change is not meant to track the number of parks. I am trying to find out which workers are parked and which ones are active at a given point in time. Is there a way to accomplish this today? What I'm really trying to do (maybe I didn't describe it well) is determine whether there is a worker that is indefinitely blocked and not yielding back to tokio. When a worker does that, other tasks pending for that worker might not get scheduled and can languish indefinitely. Any worker that is not polling (i.e. It would be great if there's an existing way, so that I could build a watchdog without needing to maintain a private fork of tokio (which is what I've done currently). If you agree that adding the worker index to the
The |
Is your feature request related to a problem? Please describe.
A stuck task can cause other tasks to not run (see #6315 and #4730). Adding the worker id to on_thread_park() would permit a lightweight watchdog to detect when a task is stuck.
Describe the solution you'd like
A watchdog thread can use worker_poll_count() to detect when a worker is not making progress. However, as far as I could tell, there's no 100% reliable way to tell whether that worker is parked. If one assumes that the worker threads start running in the same order that workers are created, the on_thread_park() and on_thread_unpark() callbacks (together with thread-local variables for the worker ids) could be used to track which workers are idle/parked vs. running. That's not a safe assumption though, and it would be much nicer if on_thread_park() just directly included the worker id (i.e. the same usize value that is passed to worker_poll_count()).
With that extra information, the parked state of each worker can be efficiently tracked in an AtomicBool. Any workers that are not parked and not polling for new work must be stuck. After some amount of time the watchdog can choose to alert or even kill the process to get the task unstuck.
This doesn't solve the general problem of a task that runs for, say, 100 ms and slows down the scheduling of other tasks. However, we had an incident where some buggy code went into a loop without .await, and it starved only some of the other tasks. The process continued to appear healthy externally, so it ran in a degraded state for a while. It would be better to detect this and crash/restart the process, rather than run in some weird half-zombie state.
Describe alternatives you've considered
I'd be happy if there were an alternative existing way to achieve this.
Additional context
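To make the requested behaviour concrete, here is a minimal, std-only sketch of the watchdog bookkeeping described above. It does not call tokio at all: `Watchdog`, `WorkerState`, `set_parked`, `record_poll`, and `check` are hypothetical names invented for illustration. The assumption is that `set_parked` would be driven by the park/unpark callbacks if they carried the worker id (which they do not today), and that `record_poll` stands in for reading worker_poll_count() for that worker.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Hypothetical per-worker state; not part of tokio.
pub struct WorkerState {
    /// Assumed to be set/cleared from park/unpark callbacks that know the worker id.
    parked: AtomicBool,
    /// Stand-in for the counter that worker_poll_count(worker) would report.
    poll_count: AtomicU64,
}

/// Hypothetical watchdog bookkeeping, indexed by worker id.
pub struct Watchdog {
    workers: Vec<WorkerState>,
    /// Poll count observed for each worker on the previous tick.
    last_seen: Vec<u64>,
}

impl Watchdog {
    pub fn new(num_workers: usize) -> Self {
        Watchdog {
            workers: (0..num_workers)
                .map(|_| WorkerState {
                    parked: AtomicBool::new(false),
                    poll_count: AtomicU64::new(0),
                })
                .collect(),
            last_seen: vec![0; num_workers],
        }
    }

    /// Called when a worker parks (true) or unparks (false).
    pub fn set_parked(&self, worker: usize, parked: bool) {
        self.workers[worker].parked.store(parked, Ordering::Release);
    }

    /// Simulates a worker polling a task.
    pub fn record_poll(&self, worker: usize) {
        self.workers[worker].poll_count.fetch_add(1, Ordering::Relaxed);
    }

    /// One watchdog tick: returns the ids of workers that are not parked
    /// and whose poll count has not advanced since the previous tick.
    pub fn check(&mut self) -> Vec<usize> {
        let mut stuck = Vec::new();
        for (id, w) in self.workers.iter().enumerate() {
            let polls = w.poll_count.load(Ordering::Relaxed);
            let parked = w.parked.load(Ordering::Acquire);
            if !parked && polls == self.last_seen[id] {
                stuck.push(id);
            }
            self.last_seen[id] = polls;
        }
        stuck
    }
}
```

A single tick with no progress is only a hint, since a worker may be flagged just after it unparks or before its first poll. A real watchdog would presumably require several consecutive flagged ticks before alerting or killing the process.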