
sync: fix notify_waiters notifying sequential awaits #5404

Closed
wants to merge 21 commits

Conversation

@satakuma (Member)

Motivation

Closes: #5396

tokio::sync::Notify has an internal waiters queue protected by a mutex. Notify::notify_waiters notifies all waiters in this queue: it acquires the lock, then removes waiters from the queue and wakes them up. However, to avoid deadlocks, the whole process is done in batches of 32 waiters. The function has to release the lock before waking up a batch and re-acquire it before proceeding to the next one. In this short window another thread can acquire the lock and modify the queue and the internal state. For example, it is possible to insert a new waiter into the queue, which will then be included in one of the later batches. The current implementation of notify_waiters does not account for that, which leads to the unexpected behavior described in #5396.

Solution

Notify internally tracks the number of calls to notify_waiters, and each time a new Notified future is created, the current value of this counter is stored inside the future. The solution is to use this counter: inside notify_waiters, compare each waiter's stored value with the number of the current call. This way, futures created during the execution of notify_waiters can be filtered out and kept in the queue.
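As a rough illustration of the counting scheme, here is a simplified, self-contained model; it is not tokio's actual implementation, and the names only loosely follow the PR:

struct Waiter {
    // Value of the notify_waiters counter when this waiter was created.
    created_at_call: usize,
}

struct Notify {
    notify_waiters_calls: usize,
    waiters: Vec<Waiter>,
}

impl Notify {
    fn add_waiter(&mut self) {
        // A new waiter snapshots the current counter value.
        self.waiters.push(Waiter { created_at_call: self.notify_waiters_calls });
    }

    // Returns how many waiters were "woken".
    fn notify_waiters(&mut self) -> usize {
        // Increment the counter first (done under the lock in the real code).
        self.notify_waiters_calls += 1;
        let current_call = self.notify_waiters_calls;
        let before = self.waiters.len();
        // Wake and remove only waiters created before this call; waiters
        // whose snapshot equals the current call number stay in the queue.
        self.waiters.retain(|w| w.created_at_call >= current_call);
        before - self.waiters.len()
    }
}

fn main() {
    let mut notify = Notify { notify_waiters_calls: 0, waiters: Vec::new() };
    notify.add_waiter();                    // created before the call
    assert_eq!(notify.notify_waiters(), 1); // woken and removed
    notify.add_waiter();                    // created "during"/after the call
    assert_eq!(notify.waiters.len(), 1);    // still queued for a later call
}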

The github-actions bot added the R-loom (Run loom tests on this PR) label on Jan 28, 2023
Comment on lines 779 to 788
// Safety: the waiter is still not inserted
let initial_notify_waiters_calls =
unsafe { (*waiter.get()).notify_waiters_calls };

// Optimistically check if notify_waiters has been called
// after the future was created.
if get_num_notify_waiters_calls(curr) != initial_notify_waiters_calls {
*state = Done;
return Poll::Ready(());
}
@satakuma (Member, Author)

This additional check is basically free and should help to avoid waiting for the lock.

tokio/tests/sync_notify.rs: outdated review thread (resolved)
@@ -222,8 +222,11 @@ struct Waiter {
/// Waiting task's waker.
waker: Option<Waker>,

/// `true` if the notification has been assigned to this waiter.
notified: Option<NotificationType>,
@satakuma (Member, Author)

I renamed this field because it was easy to confuse with the similarly named Notified future.

@Darksonn (Contributor)

It seems like there is a compilation failure with some choice of feature flags:

 info: running `cargo build -Z avoid-dev-deps --no-default-features --features signal,time` on tokio (104/107)
   Compiling tokio v1.24.2 (/home/runner/work/tokio/tokio/tokio)
error[E0599]: no method named `drain_filter` found for struct `MutexGuard<'_, util::linked_list::LinkedList<Waiter, Waiter>>` in the current scope
   --> tokio/src/sync/notify.rs:526:36
    |
526 |             let mut iter = waiters.drain_filter(|w| w.notify_waiters_calls <= call_num_bound);
    |                                    ^^^^^^^^^^^^ method not found in `MutexGuard<'_, util::linked_list::LinkedList<Waiter, Waiter>>`

@satakuma (Member, Author)

Yeah, thanks, I fixed it.

There is something complicated going on with atomic ordering causing a regression. I will look into this more later.

@satakuma (Member, Author)

It turned out to be a logic bug unrelated to atomics. This is ready for review, @Darksonn. Sorry for the many scattered changes, but I found some comments inaccurate and decided to update them as well.

@Darksonn added the A-tokio (Area: The main tokio crate) and M-sync (Module: tokio/sync) labels on Jan 31, 2023
@Darksonn (Contributor)

Let's say we have two pending notified() futures fut1 and fut2 initially, and then we call notify_waiters in parallel with the following:

let res1 = fut1.poll();
let res2 = fut2.poll();
assert!(res1.is_pending() || res2.is_ready()); // if res1 is ready, then res2 must also be ready

Can the assert fail under this change?

@satakuma (Member, Author)

If we have only two futures, then this assert cannot fail.

To see why, note that notify_waiters increments the call number and sets notifications while holding the lock. All code paths for pending futures have to acquire the lock and either check the call number (State::Init) or check whether a notification is set (State::Waiting). Therefore the call to notify_waiters and both polls can be ordered on a timeline. If fut1 observes the effects of notify_waiters, then fut2 also has to, because its poll happens after fut1's poll. I will add a loom test to confirm it.

However, this reasoning falls apart if we consider more pending futures. If we have an array of 33 not-yet-polled notify.notified() futures and we enable them in a specific order (enabling pushes the future onto the waiters queue), then your assertion can fail:

futs[1].enable();
for i in 2..33 {
    futs[i].enable();
}
futs[0].enable();
// Waiters queue: head -> [futs[0], ..., futs[2] | futs[1]]
//                          0 ^          31 ^      32 ^

tokio::spawn(async move {
    notify.notify_waiters();
});

let res1 = futs[0].poll();
let res2 = futs[1].poll();

// this can fail
assert!(res1.is_pending() || res2.is_ready()); // if res1 is ready, then res2 must also be ready

It can fail because futs[0] will be in an earlier batch than futs[1]. While the lock is released between the batches, we can poll both futures and fail the assertion. Note that for this to happen we have to enable them in the reverse order of the order in which we poll them, which in my opinion is quite unusual.

This change uses drain_filter, which traverses the queue starting from the newest entries. This differs from the current behavior of notify_waiters, which pops futures from the queue in the order they were added. It means that the above example can also fail under the current implementation, but we would have to enable the futures in the same order we poll them, which in my opinion is more likely.

futs[0].enable();
for i in 2..33 {
    futs[i].enable();
}
futs[1].enable();
// ...

@Darksonn (Contributor)

Oh, I was thinking of the case where there are also other waiters.

@satakuma (Member, Author)

Then it is possible to fail this assertion both under this change and on the master branch. Please let me know if this should also be fixed in this PR; I think it will require more work then.

@Darksonn (Contributor)

It seems like an important property to me. I don't care whether it is fixed in this PR, or in a separate one.

@satakuma (Member, Author) commented on Feb 5, 2023

It should be fixed now. Please take a look @Darksonn.

@Darksonn (Contributor) commented on Feb 5, 2023

Did you double-check that notify_waiters_poll_consistency_wake_up_batches fails on current master?

@satakuma (Member, Author) commented on Feb 5, 2023

It does not fail because the code on current master traverses the queue in reverse order, but it fails under 572db8d where this test was added without the actual fix.

@satakuma (Member, Author) commented on Feb 5, 2023

I've added a test variant which fails under the current master; I double-checked that.

@satakuma (Member, Author) commented on Feb 5, 2023

As a side note: the whole reason behind the issue this PR is trying to fix is that we have to wake up waiters in batches to avoid deadlocks. As far as I can tell from the previous discussion, a deadlock can happen if a waker has a Drop impl which tries to acquire the mutex again, for example by calling notify_waiters. I've noticed that many poll functions (including Notified::poll) can replace a waker with a new one, effectively dropping the old one, all while holding the lock, which can lead to a similar deadlock. It's hard for me to imagine a proper use case where this would happen, so I decided not to touch it, but it may be a thing to consider. A simple fix is obviously to drop wakers after releasing the lock.
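For illustration only, a hypothetical waker with such a re-entrant destructor could be built as in the sketch below; the ReentrantWaker type and this code are mine, not part of tokio or this PR:

use std::sync::Arc;
use std::task::{Wake, Waker};
use tokio::sync::Notify;

struct ReentrantWaker {
    notify: Arc<Notify>,
}

impl Wake for ReentrantWaker {
    fn wake(self: Arc<Self>) {}
}

impl Drop for ReentrantWaker {
    fn drop(&mut self) {
        // Runs when the last clone of the Waker is dropped. If that happens
        // while Notified::poll still holds the internal waiters mutex, this
        // call would try to take the same mutex again and deadlock.
        self.notify.notify_waiters();
    }
}

fn make_waker(notify: Arc<Notify>) -> Waker {
    Waker::from(Arc::new(ReentrantWaker { notify }))
}

fn main() {
    let notify = Arc::new(Notify::new());
    let _waker = make_waker(notify);
    // Dropping `_waker` here, outside of any Notify-internal lock, is fine;
    // the dangerous case is a drop that happens while the lock is held.
}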

@Darksonn (Contributor) commented on Feb 5, 2023

Ok, that sounds good. I don't have time to do a full review right now, but I'm happy with the tests and the claimed behavior.

Please file bugs for wakers that are dropped/woken while a mutex is held.

@Darksonn (Contributor) commented on Feb 9, 2023

Hmm, something complicated is going on. For example, it seems like one can still observe it not being atomic in the following manner:

  1. Have one thread call notify_waiters.
  2. On another thread, notice that notify_waiters has triggered by having a Notified future complete.
  3. Create a new Notified future and poll it.
  4. Call notify_one to notify the new Notified future.

Then, the notify_one call might just consume one of the old Notified futures that the notify_waiters call still hasn't notified, rather than the new one. This means that the notify_one is lost.

I'm not sure what to do about this. I will need to think about it.
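A rough sketch of that sequence in code (function name and structure are mine, not from the PR; it assumes `notify` also has enough other registered waiters that the concurrent notify_waiters call is still working through its batches):

use std::sync::Arc;
use tokio::sync::Notify;

async fn lost_permit(notify: Arc<Notify>) {
    // An "old" waiter, registered in the queue before notify_waiters starts.
    let old = notify.notified();
    tokio::pin!(old);
    old.as_mut().enable();

    // Step 1: another thread calls notify_waiters.
    let n = notify.clone();
    std::thread::spawn(move || n.notify_waiters());

    // Step 2: observe that notify_waiters has triggered by having the old
    // Notified future complete.
    old.await;

    // Step 3: create a new Notified future and start polling it.
    let new_waiter = notify.notified();
    tokio::pin!(new_waiter);
    new_waiter.as_mut().enable();

    // Step 4: this permit may be consumed by one of the old waiters that
    // notify_waiters has not reached yet, in which case it is lost and
    // `new_waiter` may never complete.
    notify.notify_one();
    new_waiter.await;
}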

@satakuma (Member, Author) commented on Feb 9, 2023

That's a good call. I think one way to fix it would be to make every call to notify_waiters take ownership of the waiters list by replacing it with a new, empty one. Looks doable to me, but it will be a bit tricky to implement, because a waiter will have to be agnostic about the list it belongs to and still be able to somehow remove itself from it. I think this might be a better way to go than adding more ifs to the notify_one code accounting for this weird "between batches" state.

@Darksonn (Contributor) commented on Feb 9, 2023

Yeah, adding even more special cases to handle the thing I mentioned seems too error prone. Moving all of the to-be-notified waiters to another list seems reasonable. We might be able to do it by giving each waiter a raw pointer to the list it is stored in? (It has to be able to remove itself from the list in case the waiter is destroyed during the notify_waiters call.)

@satakuma (Member, Author) commented on Feb 9, 2023

Hmm, but to keep those raw pointers valid, we would have to either box the list or set each pointer at the start of the notify_waiters call by traversing the whole list even before starting to wake up waiters.

I think it can be avoided by creating a pinned version of a linked list in every notify_waiters call, let's say Pin<PinnedLinkedList>. Such a list would be created from a normal LinkedList by adding a special head entry which would allow removing any entry simply by swapping adjacent pointers, without knowing the head's address (similarly to how lists work in the Linux kernel). Note that we have to pin the head entry to make it work.

A waiter can then determine whether the list it belongs to is a pinned one by comparing its notify_waiters call number with the current number. If the numbers are different, then the list is pinned, and a waiter can use something like PinnedLinkedList::remove(entry) to remove itself from the list, instead of using waiters.remove, which requires a reference to the list.

@Darksonn (Contributor) commented on Feb 9, 2023

I imagined that we would traverse the entire list, yes.

Some sort of solution where the waiters can remove themselves without knowing which list they are in is an interesting idea. I don't think we have support for that in our current linked list implementation, though.

@satakuma (Member, Author)

I went ahead with the idea of giving each waiter a raw pointer. The idea of a pinned linked list is definitely a possibility for future improvement, but the amount of unsafe code it would require is a bit too much for me right now.

I think there is still some work to do to make the aliasing correct. I will get back to it later and let you know when it's ready for another review.

@@ -222,8 +228,9 @@ struct Waiter {
/// Waiting task's waker.
waker: Option<Waker>,

/// `true` if the notification has been assigned to this waiter.
notified: Option<NotificationType>,
notify_waiters_queue: Option<NonNull<WaitList>>,
@Darksonn (Contributor)

As far as I can tell, the invariant of this field is that it is None when the waiter is in the Notify's own list, and Some when it is in some other list. Please document that.
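For example, a doc comment along those lines (my wording, not the PR's actual comment) could read:

/// The separate wait list that currently owns this waiter, if any.
///
/// `None` means the waiter is stored in the `Notify`'s own waiter list;
/// `Some` points to the list that an in-progress `notify_waiters` call has
/// taken ownership of.
notify_waiters_queue: Option<NonNull<WaitList>>,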

@satakuma (Member, Author)

True, I've added a note about this.

Comment on lines +992 to 1006
// See if the node was notified but not received. In this case, if
// the notification was triggered via `notify_one`, it must be sent
// to the next waiter.
//
// Safety: with the entry removed from the linked list, there can be
// no concurrent access to the entry
if matches!(
unsafe { (*waiter.get()).notification },
Some(NotificationType::OneWaiter)
) {
if let Some(waker) = notify_locked(&mut waiters, &notify.state, notify_state) {
drop(waiters);
waker.wake();
}
}
@Darksonn (Contributor)

This can also cause weird things ... I know this was already here, and maybe the answer is to just not change it.

Imagine this sequence of actions:

  1. Create two Notified futures and poll them.
  2. Call notify_one twice.
  3. Create two new Notified futures.
  4. Drop the original two Notified futures.
  5. Both of the futures from step 3 complete.

This is weird since the futures that completed were created after the notify_one calls, and the fact that there are two means that we can't just pretend that this is due to the permit being stored.

@satakuma (Member, Author)

I thought about this for a while and I think that's a tough problem. The issue is that we would like to think about cancelling a Notified future as simply removing it from the queue, as the docs describe it. In reality, the future can be removed from the queue much earlier, and in Drop we only simulate what would have happened if the future had been removed before receiving the permit. In this code, that is done by passing the permit to another waiting future.

However, we can pass the permit from a future A to a waiting future B if and only if the state of Notify did not reach NOTIFIED at any point between A receiving the permit and B being pushed to the queue. We could check for that by enumerating all events of calling notify_one and enabling Notified futures, and keeping track of the last event at which the state was NOTIFIED. In Drop it would then suffice to check that the future being dropped isn't older than this event.

Unfortunately, the problem is deeper than that, and your example demonstrates this. We can drop multiple futures, and simulating that one of them was removed from the queue before receiving the permit will affect how we simulate the same for others.

I like to think about it the following way: let $e_t$ represent the $t$-th event. $e_t = 1$ if the $t$-th event is enabling a Notified future and $e_t = -1$ if it is a notify_one call. Then we can define the number of waiters after the $t$-th event $w_t = \max(-1, w_{t-1} + e_t)$. There is a special case $w_t = -1$ if there are no waiters and the state is NOTIFIED. To check whether we can transfer the permit from a dropped future, we should track the last number when the state was NOTIFIED which is the greatest $T$ which fulfills $w_T = -1$.

The issue is that dropping a future enabled in the $t$-th event means we change $e_t$ from $1$ to $0$, which affects all newer values $w_t, w_{t+1}, \ldots$. In particular, it can change one of them to $-1$. Such a change can increase $T$, and we have to somehow account for that inside the Drop impl.

To sum up, we would need a complicated data structure to effectively track whether we should transfer the permit from a dropped future, for example a deque or some sort of segment tree, which probably is not feasible here. I think the best we can do is the heuristic I've mentioned earlier, but it doesn't even fix your case, so in my opinion we should probably leave this code as it is.
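To make that concrete, here is a worked example of my own, following the definitions above and the scenario from the earlier comment: take $e_1 = e_2 = 1$ (the two original futures are enabled), $e_3 = e_4 = -1$ (two notify_one calls), $e_5 = e_6 = 1$ (the two new futures are enabled). Starting from $w_0 = 0$, the recurrence gives $w_1, \ldots, w_6 = 1, 2, 1, 0, 1, 2$, so no $w_t$ equals $-1$. Dropping the first original future changes $e_1$ to $0$, and recomputing gives $w_1, \ldots, w_6 = 0, 1, 0, -1, 0, 1$: now $w_4 = -1$, so $T$ retroactively becomes $4$, and the Drop impl of the other original future would somehow have to notice that change.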

@Darksonn (Contributor)

I believe that we ran into a similar problem during a discussion about adding a condition variable to Tokio. To me, it seems that the best option is to just document that dropping a future that has consumed a permit from a notify_one call will pass on its permit, as if notify_one were called in the destructor.

@satakuma (Member, Author)

Sure, I will be happy to open a PR improving the docs.

Comment on lines 531 to 532
let waiters_to_notify = UnsafeCell::new(std::mem::take(&mut *waiters));
pin!(waiters_to_notify);
@Darksonn (Contributor)

The LinkedList type is Unpin, so pinning it is a no-op.

@satakuma (Member, Author) commented on Feb 11, 2023

Yeah, it felt awkward to use Pin here...
My intention was only to shadow the original variable binding to prevent the value from being accidentally moved. I am not sure what the best way is to express that this variable won't change its address. Anyway, I replaced the pin macro with manual shadowing.
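In isolation, the shadowing trick looks roughly like this (a toy example of mine, not the PR's code):

use std::cell::UnsafeCell;

fn main() {
    let list = UnsafeCell::new(vec![1, 2, 3]);
    // Shadow the owned binding with a shared reference. The owned value can
    // no longer be named, so it cannot be accidentally moved, and raw
    // pointers obtained from `list.get()` stay valid for the rest of the scope.
    let list = &list;
    let ptr: *mut Vec<i32> = list.get();
    println!("{:p}", ptr);
}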

Comment on lines 539 to 546
for mut waiter in unsafe { &*waiters_to_notify.get() } {
// Safety: we hold the `waiters` lock.
let waiter = unsafe { waiter.as_mut() };

// Safety: address of `waiters_to_notify` is not null.
waiter.notify_waiters_queue =
unsafe { Some(NonNull::new_unchecked(waiters_to_notify.get())) };
}
@Darksonn (Contributor)

I would prefer to create a raw pointer once and reuse it. This has clearer semantics aliasing-wise.
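For what it's worth, based on the lines quoted above, the suggestion might look something like this sketch (not the final code, and not compilable on its own):

let queue_ptr = NonNull::new(waiters_to_notify.get()).expect("UnsafeCell pointers are never null");
for mut waiter in unsafe { &*waiters_to_notify.get() } {
    // Safety: we hold the `waiters` lock.
    let waiter = unsafe { waiter.as_mut() };
    // Reuse the single raw pointer created above for every waiter.
    waiter.notify_waiters_queue = Some(queue_ptr);
}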

@satakuma (Member, Author)

Refactored this part.

@satakuma (Member, Author)

I've noticed a slight degradation in performance when all waiters fit into one batch. In that case there is no need to traverse the list and set all the pointers, as the waiters will be removed immediately. For that reason, notify_waiters now only decouples the list if there is a need to do so, i.e. when there are two or more batches. I have also addressed your recent review comments; please take another look, @Darksonn.

Just for the record, since there are many new loom tests: almost all of them complete instantaneously; only the notify_waiters_sequential test takes around 1-2 minutes on my machine.

@satakuma (Member, Author)

Closing in favor of #5458.

@satakuma closed this on Feb 15, 2023
Labels
A-tokio Area: The main tokio crate M-sync Module: tokio/sync R-loom Run loom tests on this PR
Development

Successfully merging this pull request may close these issues.

one call tokio::sync::Notify::notify_waiters() can wake two sequential tokio::sync::Notify::notified().await
2 participants