Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AtomicWaker::take performance issue #2760

Open
wyfo opened this issue Jul 9, 2023 · 1 comment
Open

AtomicWaker::take performance issue #2760

wyfo opened this issue Jul 9, 2023 · 1 comment
Labels
A-task Area: futures::task

Comments

@wyfo
Copy link

wyfo commented Jul 9, 2023

Currently, AtomicWaker::take does a lot of work for nothing when there is no registered waker. It causes performance issue in a contended environment where the function is called a lot.

I've designed a new algorithm to optimize this function, and compared the performance. A simple micro-benchmarks give a huge x100 improvement:

  • futures::task::AtomicWaker 6.585227ms
  • crate::AtomicWaker 65.72µs

Benchmark code:

let waker = AtomicWaker::new(); // no registered waker
let start = std::Instant::now();
(0..100000).into_par_iter().for_each(|_| waker.wake()); // use rayon for contention
println!("{:?}", start.elapsed());

These results can easily be explained: no waker registered means a single AtomicU8::load with the new algorithm. The function is less than 20 assembly instructions (on my computer), so I've added #[inline], but even without inlining, it's still around 100µs, way faster than the current implementation.

For the context, I'm developping an MPSC queue, where each enqueuing operation notify the consumer. AtomicWaker seems to fit for the consumer, but, as stated above, the performance impact is huge.
In fact, I can't even use AtomicWaker because my code also support synchronous blocking operations, see my proposal https://internals.rust-lang.org/t/thread-park-waker-a-waker-calling-thread-unpark/19114. But, as I've implemented this new algorithm (generic in my case, to handle Thread::unpark), I thought it would be a good idea to propose it for AtomicWaker optimization.

Disclaimer: This algorithm may be bugged, memory order may be improvable, I'm not an expert, and I have not take the time for now to test it with loom. However, I've already tested in "real" situation, in my queue benchmarks, and it seems to work well.

EDIT: The original implementation is here

@taiki-e
Copy link
Member

taiki-e commented Jul 17, 2023

Thanks for the proposal.

Is the performance improvement from algorithm changes or the addition of fast-path that doesn't call CAS/RMW or both? Analyzing them would be helpful.

In particular, if the second, it can be adopted into our code without the risk of affecting the correctness of the code.

@taiki-e taiki-e added the A-task Area: futures::task label Jul 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-task Area: futures::task
Projects
None yet
Development

No branches or pull requests

2 participants