You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently, AtomicWaker::take does a lot of work for nothing when there is no registered waker. It causes performance issue in a contended environment where the function is called a lot.
I've designed a new algorithm to optimize this function, and compared the performance. A simple micro-benchmarks give a huge x100 improvement:
futures::task::AtomicWaker 6.585227ms
crate::AtomicWaker 65.72µs
Benchmark code:
let waker = AtomicWaker::new();// no registered wakerlet start = std::Instant::now();(0..100000).into_par_iter().for_each(|_| waker.wake());// use rayon for contentionprintln!("{:?}", start.elapsed());
These results can easily be explained: no waker registered means a single AtomicU8::load with the new algorithm. The function is less than 20 assembly instructions (on my computer), so I've added #[inline], but even without inlining, it's still around 100µs, way faster than the current implementation.
For the context, I'm developping an MPSC queue, where each enqueuing operation notify the consumer. AtomicWaker seems to fit for the consumer, but, as stated above, the performance impact is huge.
In fact, I can't even use AtomicWaker because my code also support synchronous blocking operations, see my proposal https://internals.rust-lang.org/t/thread-park-waker-a-waker-calling-thread-unpark/19114. But, as I've implemented this new algorithm (generic in my case, to handle Thread::unpark), I thought it would be a good idea to propose it for AtomicWaker optimization.
Disclaimer: This algorithm may be bugged, memory order may be improvable, I'm not an expert, and I have not take the time for now to test it with loom. However, I've already tested in "real" situation, in my queue benchmarks, and it seems to work well.
Currently,
AtomicWaker::take
does a lot of work for nothing when there is no registered waker. It causes performance issue in a contended environment where the function is called a lot.I've designed a new algorithm to optimize this function, and compared the performance. A simple micro-benchmarks give a huge x100 improvement:
futures::task::AtomicWaker
6.585227mscrate::AtomicWaker
65.72µsBenchmark code:
These results can easily be explained: no waker registered means a single
AtomicU8::load
with the new algorithm. The function is less than 20 assembly instructions (on my computer), so I've added#[inline]
, but even without inlining, it's still around 100µs, way faster than the current implementation.For the context, I'm developping an MPSC queue, where each enqueuing operation notify the consumer.
AtomicWaker
seems to fit for the consumer, but, as stated above, the performance impact is huge.In fact, I can't even use
AtomicWaker
because my code also support synchronous blocking operations, see my proposal https://internals.rust-lang.org/t/thread-park-waker-a-waker-calling-thread-unpark/19114. But, as I've implemented this new algorithm (generic in my case, to handleThread::unpark
), I thought it would be a good idea to propose it forAtomicWaker
optimization.Disclaimer: This algorithm may be bugged, memory order may be improvable, I'm not an expert, and I have not take the time for now to test it with loom. However, I've already tested in "real" situation, in my queue benchmarks, and it seems to work well.
EDIT: The original implementation is here
The text was updated successfully, but these errors were encountered: