
rt: reduce the impact of CPU bound tasks on the overall runtime scheduler #6251

Closed

Conversation

wathenjiang
Contributor

We discussed in #4730 that CPU-bound tasks cause increased scheduling delays across the entire Tokio runtime. We recommend wrapping such tasks in tokio::task::block_in_place to prevent latency spikes for other tasks. However, even with a multi-threaded runtime and multiple worker threads, a single CPU-bound task can still cause significant latency.
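
For context, here is a minimal sketch of the block_in_place pattern referenced above. It is illustrative only and not part of this PR; the workload and timings are arbitrary:

use std::time::Duration;

#[tokio::main(flavor = "multi_thread", worker_threads = 4)]
async fn main() {
    let handle = tokio::spawn(async {
        // Hand the CPU-heavy work to block_in_place instead of running it
        // directly inside the async task; the current worker is temporarily
        // taken out of the scheduler so other workers keep making progress.
        let sum = tokio::task::block_in_place(|| (0u64..10_000_000).sum::<u64>());
        println!("sum = {sum}");
    });

    // Other tasks on the runtime remain responsive while the closure runs.
    tokio::time::sleep(Duration::from_millis(10)).await;
    handle.await.unwrap();
}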

Some runtimes, such as Go's, have a dedicated thread for polling drivers (Go's sysmon thread). Tokio, by contrast, makes each worker thread responsible for polling the driver under certain conditions, using a Condvar to ensure that only one worker thread polls at a time. Go's strategy may not be suitable for Tokio, as its sysmon thread also handles GC-related work and goroutine preemption.

This PR aims to fully harness the CPU potential of the multi_thread Tokio runtime and reduce the impact of CPU-bound tasks on the latency of I/O event processing.

Note: this does not solve the underlying problem that asynchronous tasks in Rust cannot currently be preempted. If the number of CPU-bound tasks exceeds the number of worker threads, the entire Tokio runtime will still be blocked.

Tokio's scheduling mechanism is excellent. It minimizes worker thread wake-ups and significantly improved performance, as can be seen in #4383. However, the thread wake-up mechanism seems a bit too conservative; it might be worth making it slightly more aggressive to ensure quicker I/O event response times. This PR includes the following change:

When the worker responsible for polling the driver is unparked, no thread is left polling the driver at that moment, because that worker is about to run tasks. So we immediately try to wake up another worker, in the hope that it will take over polling the driver.
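
To make the intent concrete, here is a self-contained, purely illustrative sketch of the control flow. The Worker and Shared types and the notify_one_worker method are hypothetical stand-ins, not Tokio's actual worker.rs internals:

// Illustrative sketch only: these types and names are hypothetical and do not
// correspond to Tokio's real scheduler internals.
struct Shared;

impl Shared {
    // Stand-in for "wake one parked peer worker".
    fn notify_one_worker(&self) {
        println!("waking a peer worker so it can take over polling the driver");
    }
}

struct Worker {
    // True if this worker was the one parked on the I/O driver.
    was_driver_poller: bool,
}

impl Worker {
    // Called when this worker is unparked to run tasks.
    fn on_unpark(&mut self, shared: &Shared) {
        if self.was_driver_poller {
            // Proposed behavior: this worker is about to run tasks and will
            // stop polling the driver, so eagerly wake a peer to take over
            // driver polling instead of waiting for the usual wake-up path.
            shared.notify_one_worker();
            self.was_driver_poller = false;
        }
        // ...continue with the normal "run tasks" path...
    }
}

fn main() {
    let shared = Shared;
    let mut worker = Worker { was_driver_poller: true };
    worker.on_unpark(&shared);
}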

Here is the test case for the current PR:

use std::time::Duration;
use std::time::Instant;

// Simulates an I/O-bound request: it should complete in roughly 500ms
// unless the scheduler is being starved by CPU-bound work.
async fn handle_request(start_time: Instant) {
    tokio::time::sleep(Duration::from_millis(500)).await;
    println!("request took {}ms", start_time.elapsed().as_millis());
}

// Simulates a periodic CPU-bound job that hogs a worker thread.
async fn background_job() {
    loop {
        tokio::time::sleep(Duration::from_secs(3)).await;
        // Adjust as needed
        for _ in 0..1_000 {
            vec![1; 1_000_000].sort();
        }
    }
}

#[tokio::main(flavor = "multi_thread", worker_threads = 8)]
async fn main() {
    // Occupy 7 of the 8 worker threads with CPU-bound background jobs.
    for _ in 0..7 {
        tokio::spawn(async { background_job().await });
    }

    // Repeatedly measure the latency of an otherwise cheap request.
    loop {
        let start = Instant::now();
        tokio::spawn(async move {
            handle_request(start).await
        }).await.unwrap();
    }
}

The test result on master:

request took 502ms
request took 500ms
request took 501ms
request took 501ms
request took 502ms
request took 501ms
request took 501ms
request took 501ms
request took 502ms
request took 502ms
request took 501ms
request took 502ms
request took 500ms
request took 501ms
request took 899ms    <==== high-latency request
request took 502ms
request took 502ms
request took 502ms
request took 502ms
request took 501ms
request took 1072ms   <==== high-latency request
request took 1007ms   <==== high-latency request
request took 501ms
request took 502ms
request took 501ms
request took 1093ms   <==== high-latency request
request took 1067ms   <==== high-latency request
request took 856ms    <==== high-latency request
...

The test result on this PR:

request took 502ms
request took 502ms
request took 503ms
request took 502ms
request took 502ms
request took 502ms
request took 506ms
request took 502ms
request took 502ms
request took 502ms
request took 500ms
request took 502ms
request took 501ms
request took 502ms
request took 504ms

Referring to #4383, I ran a performance test of Hyper's "hello" server, benchmarked with wrk -t1 -c400 -d10s http://127.0.0.1:3000/

Master

Requests/sec: 162342.60
Transfer/sec:     13.62MB

This PR

Requests/sec: 161073.47
Transfer/sec:     13.52MB

There is almost no performance difference in this test case.

This PR does not completely eliminate the negative impact of CPU-bound (or blocking) tasks, but it can reduce the processing delay of I/O events when the scheduler is not saturated.

@github-actions github-actions bot added the R-loom-multi-thread Run loom multi-thread tests on this PR label Dec 27, 2023
@wathenjiang wathenjiang changed the title feat: reduce the impact of CPU bound tasks on the overall runtime shceduler rt: reduce the impact of CPU bound tasks on the overall runtime shceduler Dec 27, 2023
@Darksonn Darksonn added A-tokio Area: The main tokio crate M-runtime Module: tokio/runtime labels Dec 28, 2023
@carllerche
Member

Hey, thanks for taking a stab at this. I appreciate you jumping into complex code and trying to resolve a real issue that causes users pain.

Unfortunately, I cannot accept this PR as it is, and I will try to break down why.

First, a point that isn't critical for accepting the PR, but since you took some time to dig into this, you might want to learn more about the code. This is where a worker notifies a peer. You would probably want to update should_notify_others to consolidate the wakeups; otherwise, there will be two workers woken up per tick.

That said, there is tension in the scheduler when deciding if a peer thread should be woken up. Synchronization overhead is a big issue on larger systems (over 50 cores). When I first implemented the PR to reduce no-op wakeups, it was because I was observing significant CPU overhead at scale due to thread synchronization as part of the wakeup logic. One option could be a configuration option to tune how aggressive the scheduler is at waking up peer threads.
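
For illustration only, such a knob might be surfaced through the runtime builder. The option name below is purely hypothetical and does not exist in Tokio's API:

// Hypothetical sketch: `aggressive_peer_wakeup` is an invented name for the
// kind of tunable described above; it is not a real Tokio builder method.
use tokio::runtime::{Builder, Runtime};

fn build_runtime() -> std::io::Result<Runtime> {
    Builder::new_multi_thread()
        .worker_threads(8)
        .enable_all()
        // .aggressive_peer_wakeup(true) // proposed tunable (does not exist today)
        .build()
}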

Also, looking through worker.rs, I wonder if there are some changes that I experimented with in the alt scheduler that we may want to pull in, e.g. skipping the LIFO slot and batching events from the I/O driver.

Anyway, all of this work relates to #6315, so we should try to figure out steps to move forward there.

@carllerche carllerche closed this Jan 30, 2024