Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add worker id to on_thread_park() and on_thread_unpark() callbacks (for stuck worker watchdog) #6353

Open
theron-eg opened this issue Feb 16, 2024 · 5 comments · May be fixed by #6370
Open
Labels
A-tokio Area: The main tokio crate C-feature-request Category: A feature request. M-runtime Module: tokio/runtime

Comments

@theron-eg
Copy link

Is your feature request related to a problem? Please describe.
A stuck task can cause other tasks to not run (see #6315 and #4730). Adding the worker id to on_thread_park() would permit a light-weight watchdog to detect when a task is stuck.

Describe the solution you'd like
A watchdog thread can use worker_poll_count() to detect when a worker is not making progress. However as far as I could tell there's no 100% reliable way to tell if that worker is parked. If one assumes that the worker threads start running in the same order that workers are created, the on_thread_park() and on_thread_unpark() callbacks (together with thread local variables for the worker ids) could be used to track which workers are idle/parked vs. running. That's not a safe assumption though, and it would be much nicer if the on_thread_park() just directly included the worker id (i.e. the same usize value that is passed to worker_poll_count()).

With that extra information, the parked state of each worker can be efficiently tracked in an AtomicBool. Any workers that are not parked and not polling for new work must be stuck. After some amount of time the watchdog can choose to alert or even kill the process to get the task unstuck.

This doesn't solve the general problem of a task that runs for say 100msec and slows down the scheduling of other tasks. However we had an incident where some code with a bug went into a loop without .await, and it caused only some other tasks to get starved. The process continued to appear healthy externally though, so the process ran in a degraded state for a while. It would be better to detect this and crash/restart the process, rather than run in some weird half-zombie state.

Describe alternatives you've considered
I'd be happy if there were an alternative existing way to achieve this.

Additional context

@theron-eg
Copy link
Author

PR with proposed: #6354

@mox692
Copy link
Member

mox692 commented Feb 17, 2024

Sorry if I missed something in your situation, but if you just want to track the number of park of workers, you may be able to use some metrics around here.
https://docs.rs/tokio/latest/tokio/runtime/struct.RuntimeMetrics.html#method.worker_park_count

@Darksonn Darksonn added the M-runtime Module: tokio/runtime label Feb 19, 2024
@Darksonn
Copy link
Contributor

The existing PR was closed due to a breaking change. Is there some different API we can add instead that would help with your issue?

@theron-eg
Copy link
Author

The change is not to track the number of parks. I am trying to find out at a given point in time which workers are parked and which ones are active. Is there a way to accomplish this today?

What I'm really trying to do (maybe I didn't describe it well) is determine if there is a worker that is indefinitely blocked, and not yielding back to tokio. When a worker does that, other tasks pending for that worker might not get scheduled and languish indefinitely.

Any worker that is not polling (i.e. worker_poll_count() is not increasing) might be stuck. Or it might just be parked. I couldn't find a way to distinguish between the two. Adding the worker to on_thread_park() makes it possible. Otherwise I don't see how to match the worker: usize parameter of worker_poll_count() against which specific worker threads are parked.

It would be great if there's an existing way, and I could build a watch dog without needing to create a private fork of tokio (which is what i've done currently).

If you agree that adding the worker index to the on_thread_park() callback is a good idea, could you provide some guidance on how to go about that? The PR makes a change to an api that is available only with --cfg tokio_unstable, which I assumed meant it was fair game to make incremental changes. If that's not the case, would you prefer a new on_thread_park2() method and deprecate the old one since the new version subsumes all of the functionality of the original?

@Darksonn
Copy link
Contributor

The on_thread_park method is stable and does not require --cfg tokio_unstable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate C-feature-request Category: A feature request. M-runtime Module: tokio/runtime
Projects
None yet
3 participants