Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unstable on_thread_park_id() to runtime Builder (for stuck task watchdog) #6370

Draft
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

theron-eg
Copy link

Motivation

Fixes #6353. I'm looking for a way to detect when a task is not yielding back to tokio properly. Currently such a task can cause other random tasks to not get run. We had an incident where a buggy task went into a busy loop and put a service into a zombie state: it still was responding to most requests, but some background tasks were not running as expected.

If there is an existing way to detect a stuck task, I'd be delighted to be enlightened.

Solution

Add a new method on_thread_park_id() that includes the worker id being parked. This allows a watchdog process to determine which workers are not parked and also not polling (available already from RuntimeMetrics::worker_poll_count()). The watchdog code would look something like this:

fn main() {
    const WORKERS: usize = 4;
    const UNPARKED: AtomicBool = AtomicBool::new(false);
    static PARKED: [AtomicBool; WORKERS] = [UNPARKED; WORKERS];

    let runtime = runtime::Builder::new_multi_thread()
        .worker_threads(WORKERS)
        .on_thread_park_id(|id| PARKED[id].store(true, Ordering::Release))
        .on_thread_unpark_id(|id| PARKED[id].store(false, Ordering::Release))
        .build()
        .unwrap();

    let metrics = runtime.handle().metrics();
    thread::spawn(move || {
        let mut stuck_secs = [0; WORKERS];
        let mut prev_poll_counts = [None; WORKERS];
        loop {
            thread::sleep(time::Duration::from_secs(1));
            for ii in 0..WORKERS {
                if PARKED[ii].load(Ordering::Acquire) {
                    prev_poll_counts[ii] = None;
                } else {
                    let poll_count = metrics.worker_poll_count(ii);
                    if Some(poll_count) == prev_poll_counts[ii] {
                        stuck_secs[ii] += 1;
                        println!("*** worker {} is stuck >= {} secs ***", ii, stuck_secs[ii]);
                    } else {
                        prev_poll_counts[ii] = Some(poll_count);
                        stuck_secs[ii] = 0;
                    }
                }
            }
        }
    });

    runtime.spawn(async { loop {} });
    runtime.block_on(async {
        loop {
            tokio::task::yield_now().await
        }
    });
}

Rather than printing, a real watchdog implementation could call std::process::abort() if the task stays stuck past some duration (e.g. 10 seconds).

It's possible to find out from WorkerMetrics which workers are not
actively polling for tasks.  However it's not possible to tell if
non-polling workers are merely idle (parked) or if they are stuck.
By adding the worker id (same usize as used in WorkerMetrics calls)
to the on_thread_park() and on_thread_unpark() callbacks it is
possible to track which specific workers are parked.  Then any
worker that is not polling tasks and is not parked is a worker
that is stuck.

With this information it's possible to create a watchdog that alerts
or kills the process if a worker is stuck for too long.
@github-actions github-actions bot added R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-multi-thread-alt Run loom multi-thread alt tests on this PR labels Feb 29, 2024
@Darksonn Darksonn added A-tokio Area: The main tokio crate M-runtime Module: tokio/runtime M-metrics Module: tokio/runtime/metrics labels Feb 29, 2024
@theron-eg theron-eg marked this pull request as draft February 29, 2024 22:36
@Darksonn
Copy link
Contributor

Darksonn commented Mar 5, 2024

Maybe a better option would be add a new metric for this. It could count the number of parks/unparks. When the number is odd, the worker is parked. When it is even, the worker is active.

@Darksonn
Copy link
Contributor

Darksonn commented Mar 5, 2024

Note that you can push an empty commit when you want to trigger a new build.

You will probably need to merge in master to get the build to work. There were some changes to the CI setup.

@theron-eg
Copy link
Author

You will probably need to merge in master to get the build to work. There were some changes to the CI setup.

Ah thanks, I was puzzled what was going wrong. I hope to get back to making more progress on this in the next couple of days.

Maybe a better option would be add a new metric for this. It could count the number of parks/unparks. When the number is odd, the worker is parked. When it is even, the worker is active.

It did occur to me to add a metric for "is currently parked" or something similar but it seemed more useful to have the generic callback mechanism (with the proper worker id). With that it's possible to for anyone to collect their own set of metrics, instead of relying on WorkerMetrics having exactly the right set of info (and incurring the overhead to accurately / atomically count things that 99.9% of people don't care about). But if you'd prefer a change to WorkerMetrics instead I'm happy to go down that route. Let me at least get this ready for official review and then I'm happy to take feedback.

@Darksonn
Copy link
Contributor

I think a new worker metric makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate M-metrics Module: tokio/runtime/metrics M-runtime Module: tokio/runtime R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-multi-thread-alt Run loom multi-thread alt tests on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add worker id to on_thread_park() and on_thread_unpark() callbacks (for stuck worker watchdog)
2 participants