Add worker id to `on_thread_park()` and `on_thread_unpark()` callbacks (for stuck worker watchdog) #6353

theron-eg · 2024-02-16T20:24:55Z

Is your feature request related to a problem? Please describe.
A stuck task can cause other tasks to not run (see #6315 and #4730). Adding the worker id to on_thread_park() would permit a light-weight watchdog to detect when a task is stuck.

Describe the solution you'd like
A watchdog thread can use worker_poll_count() to detect when a worker is not making progress. However as far as I could tell there's no 100% reliable way to tell if that worker is parked. If one assumes that the worker threads start running in the same order that workers are created, the on_thread_park() and on_thread_unpark() callbacks (together with thread local variables for the worker ids) could be used to track which workers are idle/parked vs. running. That's not a safe assumption though, and it would be much nicer if the on_thread_park() just directly included the worker id (i.e. the same usize value that is passed to worker_poll_count()).

With that extra information, the parked state of each worker can be efficiently tracked in an AtomicBool. Any workers that are not parked and not polling for new work must be stuck. After some amount of time the watchdog can choose to alert or even kill the process to get the task unstuck.

This doesn't solve the general problem of a task that runs for say 100msec and slows down the scheduling of other tasks. However we had an incident where some code with a bug went into a loop without .await, and it caused only some other tasks to get starved. The process continued to appear healthy externally though, so the process ran in a degraded state for a while. It would be better to detect this and crash/restart the process, rather than run in some weird half-zombie state.

Describe alternatives you've considered
I'd be happy if there were an alternative existing way to achieve this.

Additional context

The text was updated successfully, but these errors were encountered:

theron-eg · 2024-02-16T20:28:06Z

PR with proposed: #6354

mox692 · 2024-02-17T04:14:23Z

Sorry if I missed something in your situation, but if you just want to track the number of park of workers, you may be able to use some metrics around here.
https://docs.rs/tokio/latest/tokio/runtime/struct.RuntimeMetrics.html#method.worker_park_count

Darksonn · 2024-02-19T09:24:01Z

The existing PR was closed due to a breaking change. Is there some different API we can add instead that would help with your issue?

theron-eg · 2024-02-20T07:09:09Z

The change is not to track the number of parks. I am trying to find out at a given point in time which workers are parked and which ones are active. Is there a way to accomplish this today?

What I'm really trying to do (maybe I didn't describe it well) is determine if there is a worker that is indefinitely blocked, and not yielding back to tokio. When a worker does that, other tasks pending for that worker might not get scheduled and languish indefinitely.

Any worker that is not polling (i.e. worker_poll_count() is not increasing) might be stuck. Or it might just be parked. I couldn't find a way to distinguish between the two. Adding the worker to on_thread_park() makes it possible. Otherwise I don't see how to match the worker: usize parameter of worker_poll_count() against which specific worker threads are parked.

It would be great if there's an existing way, and I could build a watch dog without needing to create a private fork of tokio (which is what i've done currently).

If you agree that adding the worker index to the on_thread_park() callback is a good idea, could you provide some guidance on how to go about that? The PR makes a change to an api that is available only with --cfg tokio_unstable, which I assumed meant it was fair game to make incremental changes. If that's not the case, would you prefer a new on_thread_park2() method and deprecate the old one since the new version subsumes all of the functionality of the original?

Darksonn · 2024-02-20T09:49:35Z

The on_thread_park method is stable and does not require --cfg tokio_unstable.

theron-eg added A-tokio Area: The main tokio crate C-feature-request Category: A feature request. labels Feb 16, 2024

theron-eg mentioned this issue Feb 16, 2024

Add worker id to on_thread_park() callback (for stuck worker watchdog) #6354

Closed

Darksonn added the M-runtime Module: tokio/runtime label Feb 19, 2024

theron-eg linked a pull request Feb 29, 2024 that will close this issue

Add unstable on_thread_park_id() to runtime Builder (for stuck task watchdog) #6370

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add worker id to `on_thread_park()` and `on_thread_unpark()` callbacks (for stuck worker watchdog) #6353

Add worker id to `on_thread_park()` and `on_thread_unpark()` callbacks (for stuck worker watchdog) #6353

theron-eg commented Feb 16, 2024

theron-eg commented Feb 16, 2024

mox692 commented Feb 17, 2024

Darksonn commented Feb 19, 2024

theron-eg commented Feb 20, 2024

Darksonn commented Feb 20, 2024

Add worker id to on_thread_park() and on_thread_unpark() callbacks (for stuck worker watchdog) #6353

Add worker id to on_thread_park() and on_thread_unpark() callbacks (for stuck worker watchdog) #6353

Comments

theron-eg commented Feb 16, 2024

theron-eg commented Feb 16, 2024

mox692 commented Feb 17, 2024

Darksonn commented Feb 19, 2024

theron-eg commented Feb 20, 2024

Darksonn commented Feb 20, 2024

Add worker id to `on_thread_park()` and `on_thread_unpark()` callbacks (for stuck worker watchdog) #6353

Add worker id to `on_thread_park()` and `on_thread_unpark()` callbacks (for stuck worker watchdog) #6353