
Possible deadlock when a waker is replaced #5429

Closed
satakuma opened this issue Feb 6, 2023 · 11 comments · Fixed by #5471
Labels
A-tokio (Area: The main tokio crate)
C-bug (Category: This is a bug.)
E-help-wanted (Call for participation: Help is requested to fix this issue.)
E-medium (Call for participation: Experience needed to fix: Medium / intermediate)
M-sync (Module: tokio/sync)

Comments

satakuma (Member) commented Feb 6, 2023

Version
1.25.0

Description

There are several types in tokio::sync that keep shared state behind a mutex and expose methods to modify it through a shared reference, for example Notify::notify_waiters(&self) and Semaphore::add_permits(&self, n: usize).
It is possible to trigger a deadlock with custom wakers backed by ArcWake that call those methods in their destructors. This can happen because Tokio drops a waker while holding the lock on the shared state, and the waker then tries to re-enter the same lock in its Drop impl. Examples of futures that drop wakers with the lock held: Notify::notified, Semaphore::acquire.

To trigger a deadlock, poll such a future twice with two different wakers. On the second poll, the old waker is replaced and immediately dropped. Examples of deadlocking code:

Notify::notified

Adapted from this test

fn notify_in_drop() {
    use futures::task::ArcWake;
    use std::future::Future;
    use std::sync::Arc;
    use tokio::sync::Notify;

    let notify = Arc::new(Notify::new());

    struct NotifyOnDrop(Arc<Notify>);

    impl ArcWake for NotifyOnDrop {
        fn wake_by_ref(_arc_self: &Arc<Self>) {}
    }

    impl Drop for NotifyOnDrop {
        fn drop(&mut self) {
            self.0.notify_waiters();
        }
    }

    let mut fut = Box::pin(async {
        notify.notified().await;
    });

    // Second iteration deadlocks.
    for _ in 0..2 {
        let waker = futures::task::waker(Arc::new(NotifyOnDrop(notify.clone())));
        let mut cx = std::task::Context::from_waker(&waker);
        assert!(fut.as_mut().poll(&mut cx).is_pending());
    }
}
Semaphore::acquire

fn release_permits_at_drop() {
    use futures::task::ArcWake;
    use std::future::Future;
    use std::sync::Arc;
    use tokio::sync::{OwnedSemaphorePermit, Semaphore};

    let sem = Arc::new(Semaphore::new(1));

    struct ReleaseOnDrop(Option<OwnedSemaphorePermit>);

    impl ArcWake for ReleaseOnDrop {
        fn wake_by_ref(_arc_self: &Arc<Self>) {}
    }

    let mut fut = Box::pin(async {
        let _permit = sem.acquire().await.unwrap();
    });

    // Second iteration deadlocks.
    for _ in 0..2 {
        let waker = futures::task::waker(Arc::new(ReleaseOnDrop(
            sem.clone().try_acquire_owned().ok(),
        )));
        let mut cx = std::task::Context::from_waker(&waker);
        assert!(fut.as_mut().poll(&mut cx).is_pending());
    }
}

This issue has been partially addressed in the past (#401, PR discussion), but only for calling waker.wake() while holding a lock, not for dropping a waker in general.

Simple solution
When an old waker has to be replaced, it can be moved out, stored on the side, and dropped only after the critical section ends.
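As a hedged sketch of that fix (the types and names below are illustrative, not Tokio's actual internals): take the old value out of the slot while the lock is held, let the guard go out of scope, and only then drop the old value. ReentrantOnDrop stands in for a waker whose destructor re-enters the shared lock; it uses try_lock so the buggy path reports failure instead of hanging.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

// Stand-in for a waker whose destructor re-enters the shared lock,
// like NotifyOnDrop in the examples above.
struct ReentrantOnDrop {
    shared: Arc<Mutex<State>>,
    reentered: Arc<AtomicBool>,
}

struct State {
    slot: Option<ReentrantOnDrop>,
}

impl Drop for ReentrantOnDrop {
    fn drop(&mut self) {
        // try_lock instead of lock, so the deadlock shows up as a
        // failed acquisition rather than a hang.
        let ok = self.shared.try_lock().is_ok();
        self.reentered.store(ok, Ordering::SeqCst);
    }
}

// Deadlock-prone: the old occupant is dropped while the lock is held.
fn replace_bad(shared: &Arc<Mutex<State>>, new: ReentrantOnDrop) {
    let mut guard = shared.lock().unwrap();
    guard.slot = Some(new); // old value dropped here, inside the critical section
}

// Fix: move the old occupant out, release the lock, then drop it.
fn replace_good(shared: &Arc<Mutex<State>>, new: ReentrantOnDrop) {
    let old = {
        let mut guard = shared.lock().unwrap();
        guard.slot.replace(new)
    }; // guard released here
    drop(old); // destructor runs outside the critical section
}

fn main() {
    let shared = Arc::new(Mutex::new(State { slot: None }));
    let reentered = Arc::new(AtomicBool::new(false));
    let make = || ReentrantOnDrop {
        shared: shared.clone(),
        reentered: reentered.clone(),
    };

    replace_good(&shared, make()); // seed the slot; there is no old value yet
    replace_bad(&shared, make()); // seeded value dropped under the lock
    assert!(!reentered.load(Ordering::SeqCst)); // destructor could not re-take the lock
    replace_good(&shared, make()); // old value dropped after the lock is released
    assert!(reentered.load(Ordering::SeqCst)); // destructor re-took the lock fine
    println!("ok");
}
```

With real wakers the bad path does not fail gracefully like this; lock() simply blocks forever, which is the deadlock described above.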

satakuma added the A-tokio and C-bug labels Feb 6, 2023
Darksonn added the M-sync, E-help-wanted, and E-medium labels Feb 6, 2023
amab8901 (Contributor) commented:

I'll give it a try

amab8901 (Contributor) commented:

I'm thinking a good solution might be the following:

Trigger:
Process P is holding resource X and requesting another resource Y. One second after the request, it still doesn't have resource Y.

Response:
Process P starts a timer and releases the held resource X. When the timer reaches 5 seconds, it requests both resource X and resource Y and tries to run to completion. If it is still missing a needed resource after 1 second, it repeats the above process.

What are your thoughts? Would this be a good approach?

satakuma (Member, Author) commented Feb 14, 2023

What you've proposed seems like a way to modify the examples of deadlocking code from the issue description. Those examples are only meant to demonstrate the problem, which lies in Tokio's internal code. I would like to address this issue by changing Tokio's code so that such deadlocks are impossible to trigger from a program using Tokio as a library.

To resolve this issue, I suggest the following approach:

  1. Browse through Tokio's code, starting with the places I've linked in the description (Notify, BatchSemaphore), and look for statements which replace a waker while holding a lock.
  2. Modify them so that the old waker is kept in a separate variable, which goes out of scope after the mutex is released.
  3. Ensure that it is impossible to produce deadlocking examples such as those from the description.
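A minimal sketch of step 2 applied to a notify_waiters-style method (the names here are illustrative, not Tokio's actual implementation): drain the waiters into a local list while holding the lock, release the guard, and only then wake, and thereby eventually drop, the wakers.

```rust
use std::mem;
use std::sync::Mutex;
use std::task::{RawWaker, RawWakerVTable, Waker};

// Illustrative stand-in for a Notify-like shared state.
struct Shared {
    waiters: Mutex<Vec<Waker>>,
}

impl Shared {
    fn notify_waiters(&self) {
        // Move the wakers out while holding the lock...
        let wakers: Vec<Waker> = {
            let mut list = self.waiters.lock().unwrap();
            mem::take(&mut *list)
        }; // ...the guard is released here.

        // Waking (and dropping) the wakers outside the critical section
        // means a waker whose wake or drop re-enters this lock cannot
        // deadlock against it.
        for waker in wakers {
            waker.wake();
        }
    }
}

// A no-op waker built from a raw vtable, just to exercise the method.
fn noop_waker() -> Waker {
    fn clone(p: *const ()) -> RawWaker {
        RawWaker::new(p, &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn main() {
    let shared = Shared {
        waiters: Mutex::new(vec![noop_waker(), noop_waker()]),
    };
    shared.notify_waiters();
    assert!(shared.waiters.lock().unwrap().is_empty());
    println!("ok");
}
```

The same move-out-then-drop structure applies to single-waker slots: replace the waker under the lock, bind the old one to a local, and let it fall out of scope after the guard.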

amab8901 (Contributor) commented:

Why are we asserting that the future has to be pending in the Notify test?

assert!(fut.as_mut().poll(&mut cx).is_pending());

satakuma (Member, Author) commented Feb 15, 2023

That assert is quite arbitrary; I'm not sure it is even correct if the code does not deadlock. It was just quicker to write the examples that way. You can certainly ignore the result of poll in the second iteration.

Darksonn (Contributor) commented:

Closing a semaphore is another instance of this.

satakuma (Member, Author) commented Feb 19, 2023

Reopening, as it seems this issue was closed by accident (there is still the problem with closing a BatchSemaphore).

satakuma reopened this Feb 19, 2023
amab8901 (Contributor) commented Feb 27, 2023

Reopening, as it seems this issue was closed by accident (there is still the problem with closing a BatchSemaphore).

What do you mean by BatchSemaphore? I couldn't find any such item in this project.

satakuma (Member, Author) commented:

I mean the Semaphore type from src/sync/batch_semaphore.rs. Sorry for the confusion.

bbmena commented Jan 4, 2024

Should this issue be closed? It looks like BatchSemaphore has been fixed as well.

Darksonn (Contributor) commented Jan 4, 2024

Yes, thank you for pointing that out.

Darksonn closed this as completed Jan 4, 2024