`AtomicWaker::take` performance issue #2760

wyfo · 2023-07-09T10:27:20Z

Currently, AtomicWaker::take does a lot of work for nothing when there is no registered waker. It causes performance issue in a contended environment where the function is called a lot.

I've designed a new algorithm to optimize this function, and compared the performance. A simple micro-benchmarks give a huge x100 improvement:

futures::task::AtomicWaker 6.585227ms
crate::AtomicWaker 65.72µs

Benchmark code:

let waker = AtomicWaker::new(); // no registered waker
let start = std::Instant::now();
(0..100000).into_par_iter().for_each(|_| waker.wake()); // use rayon for contention
println!("{:?}", start.elapsed());

These results can easily be explained: no waker registered means a single AtomicU8::load with the new algorithm. The function is less than 20 assembly instructions (on my computer), so I've added #[inline], but even without inlining, it's still around 100µs, way faster than the current implementation.

For the context, I'm developping an MPSC queue, where each enqueuing operation notify the consumer. AtomicWaker seems to fit for the consumer, but, as stated above, the performance impact is huge.
In fact, I can't even use AtomicWaker because my code also support synchronous blocking operations, see my proposal https://internals.rust-lang.org/t/thread-park-waker-a-waker-calling-thread-unpark/19114. But, as I've implemented this new algorithm (generic in my case, to handle Thread::unpark), I thought it would be a good idea to propose it for AtomicWaker optimization.

Disclaimer: This algorithm may be bugged, memory order may be improvable, I'm not an expert, and I have not take the time for now to test it with loom. However, I've already tested in "real" situation, in my queue benchmarks, and it seems to work well.

EDIT: The original implementation is here

The text was updated successfully, but these errors were encountered:

taiki-e · 2023-07-17T12:52:45Z

Thanks for the proposal.

Is the performance improvement from algorithm changes or the addition of fast-path that doesn't call CAS/RMW or both? Analyzing them would be helpful.

In particular, if the second, it can be adopted into our code without the risk of affecting the correctness of the code.

notgull mentioned this issue Jul 9, 2023

Add benchmarks smol-rs/atomic-waker#14

Merged

taiki-e added the A-task Area: futures::task label Jul 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`AtomicWaker::take` performance issue #2760

`AtomicWaker::take` performance issue #2760

wyfo commented Jul 9, 2023 •

edited

taiki-e commented Jul 17, 2023 •

edited

AtomicWaker::take performance issue #2760

AtomicWaker::take performance issue #2760

Comments

wyfo commented Jul 9, 2023 • edited

taiki-e commented Jul 17, 2023 • edited

`AtomicWaker::take` performance issue #2760

`AtomicWaker::take` performance issue #2760

wyfo commented Jul 9, 2023 •

edited

taiki-e commented Jul 17, 2023 •

edited