
time: lazy init TimerShared in TimerEntry #6512

Merged
merged 25 commits into tokio-rs:master on May 3, 2024

Conversation

wathenjiang
Contributor

@wathenjiang wathenjiang commented Apr 23, 2024

This commit is part of reducing timeout performance overhead.

See #6504

Below are relevant benchmark results of this PR on m1:

single_thread_timeout   time:   [21.869 ns 21.987 ns 22.135 ns]
                        change: [-3.4429% -2.0709% -0.8759%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

multi_thread_timeout-8  time:   [4.4835 ns 4.6138 ns 4.7614 ns]
                        change: [-4.3554% +0.1643% +4.5114%] (p = 0.95 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  8 (8.00%) high mild
  1 (1.00%) high severe

Below are relevant benchmark results of the current version on m1:

single_thread_timeout   time:   [40.227 ns 40.416 ns 40.691 ns]
                        change: [+81.321% +82.817% +84.121%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe

multi_thread_timeout-8  time:   [183.16 ns 186.02 ns 188.21 ns]
                        change: [+3765.0% +3880.4% +3987.4%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low severe
  6 (6.00%) low mild

@wathenjiang wathenjiang changed the title from "Timeout lazy init sleep" to "time: lazy init sleep in timeout" on Apr 23, 2024
@wathenjiang
Contributor Author

Below are relevant benchmark results of this PR on Linux AMD64:

single_thread_timeout   time:   [59.230 ns 59.308 ns 59.389 ns]
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

multi_thread_timeout-8  time:   [94.476 ns 95.247 ns 95.877 ns]
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

Below are relevant benchmark results of the current version on Linux AMD64:

single_thread_timeout   time:   [106.43 ns 106.69 ns 107.01 ns]
                        change: [+80.061% +81.176% +82.266%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe

multi_thread_timeout-8  time:   [181.75 ns 189.82 ns 199.37 ns]
                        change: [+89.155% +96.135% +103.05%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 19 outliers among 100 measurements (19.00%)
  1 (1.00%) high mild
  18 (18.00%) high severe

@wathenjiang
Contributor Author

Should we lazily initialize the TimerEntry within the Sleep instead of Sleep within Timeout?
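
For context, a minimal sketch of the two placements being discussed, using simplified hypothetical types (TimeoutLike, SleepLike, TimerEntryLike) rather than tokio's real definitions:

// (a) Lazily build the whole `Sleep` inside `Timeout`: the sleep (and with it
//     the timer entry) is only constructed if the wrapped future does not
//     complete right away.
struct TimeoutLike<T> {
    value: T,
    delay: Option<SleepLike>, // created on the first poll that needs a timer
}

// (b) Construct `Sleep` eagerly, but lazily build the timer entry inside it,
//     so the driver-facing state only exists once the timer is really needed.
struct SleepLike {
    entry: Option<TimerEntryLike>, // registered with the driver on first poll
}

struct TimerEntryLike;

fn main() {
    // Neither shape touches the timer driver until a timer is actually required.
    let t: TimeoutLike<u32> = TimeoutLike { value: 1, delay: None };
    assert!(t.delay.is_none());
}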

@mox692 mox692 added A-tokio Area: The main tokio crate M-time Module: tokio/time labels Apr 24, 2024
@Darksonn
Contributor

Lazily initializing the TimerEntry instead sounds like a good idea.

benches/Cargo.toml: outdated review thread (resolved)
benches/time_timeout.rs: outdated review thread (resolved)
@wathenjiang
Contributor Author

Lazily initializing the TimerEntry instead sounds like a good idea.

I think so. After looking around for a while, I found that there are several custom timeouts in tokio, such as Timeout<S> in tokio-stream/src/stream_ext/timeout.rs.

@github-actions github-actions bot added the R-loom-time-driver Run loom time driver tests on this PR label Apr 24, 2024
Comment on lines 228 to 235
 pub struct Sleep {
     inner: Inner,

+    deadline: Instant,
+    handle: scheduler::Handle,
     // The link between the `Sleep` instance and the timer that drives it.
     #[pin]
-    entry: TimerEntry,
+    entry: Option<TimerEntry>,
 }
Contributor

How does this change the size of Sleep? Could TimerEntry be changed to reduce the change?

Contributor Author

I think this should be possible. Your idea looks better, I'll give it a try.

Contributor Author

I put deadline and handle in Sleep here so that we do not need to init the TimerEntry; I had to store them somewhere temporarily.

Saving handle here ensures that the timeout will still panic if there is no runtime handle in TLS or we did not call enable_time or enable_all.

The size of Sleep is an issue worth considering. Is there any way to solve this problem?

Contributor

What happens if you just don't call clear_entry in this method when it hasn't yet been registered?

Based on @conradludgate's comment here, it sounds like this triggers a loom failure, but that's most likely due to some path where the timer is dropped concurrently with firing or something like that. If we've never been registered with the driver, then not calling clear_entry should be okay.
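
A minimal sketch of that idea, assuming an illustrative `registered` flag and a simplified `clear_entry` rather than tokio's actual driver internals:

struct TimerEntrySketch {
    registered: bool,
}

impl TimerEntrySketch {
    fn clear_entry(&mut self) {
        // In the real driver this would take the wheel lock and unlink the entry.
        println!("unlinking entry from the timer wheel");
    }
}

impl Drop for TimerEntrySketch {
    fn drop(&mut self) {
        // Only touch the driver (and its lock) if we were ever registered.
        if self.registered {
            self.clear_entry();
        }
    }
}

fn main() {
    let _cheap = TimerEntrySketch { registered: false }; // drop takes no lock
    let _unlinks = TimerEntrySketch { registered: true }; // drop unlinks as before
}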

Contributor Author

In my test, it does improve performance by reducing lock contention when the timeout always registers with the driver. On the other hand, if the timeout never registers with the driver, it does not bring a significant performance improvement.

Contributor Author

That said, maybe we can use inner: StdUnsafeCell<Option<TimerShared>> for this purpose; the size of Sleep should barely change.

Avoiding clear_entry for unregistered timeouts is not the essential part. In this case, what we should avoid is constructing the intrusive linked-list item.
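
A minimal sketch of that layout, using std::cell::UnsafeCell and a stand-in TimerShared instead of tokio's loom-wrapped StdUnsafeCell and real intrusive state:

use std::cell::UnsafeCell;

// Stand-in for the intrusive timer state; constructing the real one sets up
// linked-list pointers and atomics, which is the work we want to defer.
struct TimerShared {
    deadline_ms: u64,
}

impl TimerShared {
    fn new() -> Self {
        TimerShared { deadline_ms: 0 }
    }
}

struct TimerEntry {
    inner: UnsafeCell<Option<TimerShared>>,
}

impl TimerEntry {
    fn new() -> Self {
        TimerEntry { inner: UnsafeCell::new(None) }
    }

    // The shared state is only constructed on first access.
    fn inner(&self) -> &TimerShared {
        // SAFETY: this single-threaded sketch holds no other reference into
        // the cell while it is being initialized.
        let slot = unsafe { &mut *self.inner.get() };
        slot.get_or_insert_with(TimerShared::new)
    }
}

fn main() {
    let entry = TimerEntry::new(); // no TimerShared built yet
    let shared = entry.inner();    // built here, on first use
    println!("deadline = {}", shared.deadline_ms);
}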

@wathenjiang
Contributor Author

wathenjiang commented Apr 24, 2024

Before blaming the performance issue on lock contention, it is best to explore what actually makes timeout slow under multi-threaded concurrency.

The flame graphs below were captured on Linux AMD64 (16 cores) running the following code.

use std::time::Duration;
use tokio::time::timeout;

// `quick_job` is not shown in the original snippet; it is assumed here to be an
// async job that completes immediately and returns 1 (matching the assert below).
async fn quick_job() -> u64 {
    1
}

fn main() {
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .worker_threads(8)
        .build()
        .unwrap();

    let _r = runtime.block_on(async {
        let mut handles = Vec::with_capacity(8);
        for _ in 0..8 {
            handles.push(tokio::spawn(async move {
                loop {
                    let h = timeout(Duration::from_secs(1), quick_job());
                    assert_eq!(h.await.unwrap(), 1);
                }
            }));
        }
        for handle in handles {
            handle.await.unwrap();
        }
    });
}

Figure 1 shows the result of executing on the current PR code.
[flame graph omitted]
Figure 2 shows the result of executing on the current master branch.
[flame graph omitted]

It turns out that we were over-attributing the problem to lock contention. On the master branch, lock contention only takes up 7.72% of the execution time.

So lazy initialization is the main reason why this PR wins on performance.

I did an interesting test: do not call cancel in the Drop of TimerEntry.

Using that no-cancel version as the baseline, the benchmark results of the master branch version (which does call cancel) are as follows:

single_thread_timeout   time:   [40.916 ns 41.337 ns 41.783 ns]
                        change: [+9.1588% +10.095% +11.148%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  12 (12.00%) high mild
  1 (1.00%) high severe

multi_thread_timeout-8  time:   [180.39 ns 184.15 ns 186.61 ns]
                        change: [+11.290% +13.768% +16.199%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) low severe
  5 (5.00%) low mild

single_thread_sleep     time:   [41.389 ns 41.648 ns 42.020 ns]
                        change: [+10.927% +12.228% +14.122%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 32 outliers among 100 measurements (32.00%)
  10 (10.00%) low severe
  4 (4.00%) low mild
  9 (9.00%) high mild
  9 (9.00%) high severe

multi_thread_sleep-8    time:   [175.35 ns 178.70 ns 181.46 ns]
                        change: [+11.991% +14.868% +17.832%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) low severe
  3 (3.00%) low mild

This test result also shows that, for the performance of timeout and sleep, lazy initialization has a greater effect than reducing lock contention.

@Darksonn
Contributor

Thank you for doing that benchmark. It sounds like this particular change is still a good idea, but we should reconsider the other changes mentioned on the original issue.

That said, I wonder what we would get on a benchmark where the timer always gets registered with the driver.

@wathenjiang
Contributor Author

wathenjiang commented Apr 24, 2024

Forgive me, I misread it :)

@wathenjiang
Contributor Author

Of course, if all timers end up registered with the driver and the concurrency is very high, then lock contention becomes the most important concern.

The following flame graph shows that the global lock operations take 7.15% of the time in register and 8.75% in clear_entry.

[flame graph omitted]
The benchmark test code:

use std::future::pending;
use std::time::Duration;
use tokio::time::timeout;

fn main() {
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .worker_threads(8)
        .build()
        .unwrap();

    let _r = runtime.block_on(async {
        let mut handles = Vec::with_capacity(1024);
        for _ in 0..1024 {
            handles.push(tokio::spawn(async move {
                loop {
                    for i in 1..10 {
                        for _ in 0..100 {
                            let h = timeout(Duration::from_millis(i), never_ready_job2(i));
                            let _r = h.await;
                        }
                    }
                }
            }));
        }
        for handle in handles {
            handle.await.unwrap();
        }
    });
}

pub fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

// A job that does a little work and then never becomes ready, so every
// timeout actually registers with the driver and fires.
async fn never_ready_job2(n: u64) -> u64 {
    let res = fibonacci(n + 5); // do some logic
    pending::<()>().await;
    res
}

A global lock is usually not good enough for performance under multi-threaded load.

So, going back to the issue itself, both approaches may be needed to reduce the performance overhead of timeout (see the sketch after this list):

  • In the cases where most timeouts are never registered with the driver, rely on the lazy-initialization optimization.
  • In the cases where most timeouts do need to be registered with the driver, use a mechanism similar to sharded locks.
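
A minimal sketch of the sharded-lock idea mentioned above (illustrative only; tokio's timer wheel is considerably more involved):

use std::collections::BTreeMap;
use std::sync::Mutex;

// Instead of one global mutex protecting every timer registration, spread the
// entries over several shards, each behind its own lock, and pick a shard by
// worker id so threads mostly contend on different locks.
struct ShardedTimers {
    shards: Vec<Mutex<BTreeMap<u64, &'static str>>>, // deadline -> label
}

impl ShardedTimers {
    fn new(num_shards: usize) -> Self {
        ShardedTimers {
            shards: (0..num_shards).map(|_| Mutex::new(BTreeMap::new())).collect(),
        }
    }

    fn register(&self, worker_id: usize, deadline: u64, label: &'static str) {
        // Only this shard's lock is taken, not a global one.
        let shard = &self.shards[worker_id % self.shards.len()];
        shard.lock().unwrap().insert(deadline, label);
    }
}

fn main() {
    let timers = ShardedTimers::new(8);
    timers.register(0, 1_000, "timeout-a");
    timers.register(3, 2_000, "timeout-b");
    println!("registered timers across {} shards", timers.shards.len());
}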

@wathenjiang wathenjiang changed the title from "time: lazy init sleep in timeout" to "time: lazy init TimerShared in TimerEntry" on Apr 25, 2024
@wathenjiang
Contributor Author

By lazily initializing the TimerShared in the TimerEntry, we get a performance improvement. Below are the benchmark results (on Linux AMD64 with 16 cores):

The results of this PR:

single_thread_timeout   time:   [72.079 ns 72.146 ns 72.220 ns]
multi_thread_timeout-8  time:   [104.14 ns 104.90 ns 105.49 ns]
single_thread_sleep     time:   [69.147 ns 69.344 ns 69.592 ns]
multi_thread_sleep-8    time:   [106.08 ns 106.60 ns 106.98 ns]

The results of master:

single_thread_timeout   time:   [107.02 ns 107.24 ns 107.52 ns]
multi_thread_timeout-8  time:   [195.11 ns 201.26 ns 208.99 ns]
single_thread_sleep     time:   [116.30 ns 116.44 ns 116.61 ns]
multi_thread_sleep-8    time:   [210.79 ns 223.30 ns 236.26 ns]

To achieve all of this, the price we pay is that the size of Sleep increases from 112 bytes to 120 bytes.
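
For anyone who wants to reproduce the size numbers locally, a quick check (the exact values depend on platform, pointer width, and tokio version):

use std::mem::size_of;

fn main() {
    // tokio::time::Sleep is a public type, so its size can be inspected directly.
    println!("size_of::<tokio::time::Sleep>() = {}", size_of::<tokio::time::Sleep>());
}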

benches/time_timeout.rs: outdated review thread (resolved)
benches/time_timeout.rs: outdated review thread (resolved)
Contributor

@Darksonn Darksonn left a comment

Apparently I didn't actually submit this comment. It's been "pending" for 5 days. Sorry about that.

Comment on lines 498 to 500
 fn inner(&self) -> &TimerShared {
-    unsafe { &*self.inner.get() }
+    unsafe { &mut *self.inner.get() }.get_or_insert(TimerShared::new())
 }
Contributor

Can you change this to only create a mutable reference if it is None?

Contributor Author

I have used an additional is_done check to achieve this goal.
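
A minimal sketch of that refinement, with an illustrative initialization check standing in for the real code: a read-only check decides whether initialization is needed, and a mutable access happens only when the slot is still empty:

use std::cell::UnsafeCell;

struct TimerShared;

struct TimerEntry {
    inner: UnsafeCell<Option<TimerShared>>,
}

impl TimerEntry {
    fn inner(&self) -> &TimerShared {
        // Read-only check first: no mutable access if already initialized.
        let initialized = unsafe { (*self.inner.get()).is_some() };
        if !initialized {
            // Only now do we write into the cell to create the value.
            unsafe { *self.inner.get() = Some(TimerShared) };
        }
        // SAFETY: the slot was initialized above and is never cleared while a
        // shared reference is live in this sketch.
        unsafe { (*self.inner.get()).as_ref().unwrap() }
    }
}

fn main() {
    let entry = TimerEntry { inner: UnsafeCell::new(None) };
    let _shared = entry.inner();
    println!("TimerShared initialized lazily, without an unconditional &mut");
}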

benches/time_timeout.rs: outdated review thread (resolved)
Contributor

@Darksonn Darksonn left a comment

Looks good to me.

@Darksonn Darksonn merged commit f6eb1ee into tokio-rs:master May 3, 2024
77 checks passed
Labels
A-tokio Area: The main tokio crate M-time Module: tokio/time R-loom-time-driver Run loom time driver tests on this PR
5 participants