console freezes in CONNECTING state when there are too many tasks #133

Closed
hatoo opened this issue Sep 11, 2021 · 4 comments
Labels
C-subscriber Crate: console-subscriber. S-bug Severity: bug

Comments

@hatoo
Contributor

hatoo commented Sep 11, 2021

The console freezes in the CONNECTING state when connected to this application.

use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    console_subscriber::init();

    let futures = (0..10/* A smaller number like 5 was Ok */)
        .map(|_| {
            tokio::spawn(async {
                loop {
                    tokio::time::sleep(Duration::from_millis(
                        10, /* A larger number like 1000 was Ok */
                    ))
                    .await;
                }
            })
        })
        .collect::<Vec<_>>();

    for f in futures {
        f.await?;
    }

    Ok(())
}

[Screenshot: the console stuck in the CONNECTING state]

As the comments in the code above note, it's OK if the number of spawned tasks is small or the sleep duration is large.

@Freax13
Contributor

Freax13 commented Sep 18, 2021

This is caused by a deadlock:

  1. Spawning many tasks and Sleeps causes a lot of events.
  2. The capacity of TasksLayers::tx drops under TasksLayers::flush_under_capacity.
  3. Flush::trigger is called, the triggered flag is set and the aggregator is notified.
  4. The aggregator picks up the notification and clears the triggered flag.
  5. The aggregator processes some but not all events, because tokio's mpsc implementation yields after the task's budget has been used up, even if there are still items left in the channel. This is usually fine, but here .now_or_never() is called on the future, so only the future is canceled and the task itself never yields (see the sketch after this list).
  6. The aggregator waits for the next notification and polls Notified, which has no notification pending, so it locks a mutex and tries to clone the task's waker to put it on its list of wakers.
  7. Cloning the waker causes another event.
  8. This event is put on the channel.
  9. Because the channel is still mostly full and the triggered flag is not set, Flush::trigger is called once again.
  10. Notifier::notify_one tries to lock the mutex locked in step 6, and a deadlock occurs.
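
Side note on step 5: below is a minimal, self-contained sketch (assuming the tokio and futures crates; drain_available and the u32 "events" are made-up illustrations, not console-subscriber's actual aggregator code) of how a .now_or_never() drain loop stops early once tokio's cooperative budget is spent, leaving items queued in the channel:

use futures::FutureExt; // for now_or_never()
use tokio::sync::mpsc;

// Drain whatever the cooperative budget allows in one pass. Once the budget
// is used up, recv() starts returning Pending; now_or_never() turns that into
// None and the loop exits even though items are still queued.
fn drain_available(rx: &mut mpsc::UnboundedReceiver<u32>) -> usize {
    let mut drained = 0;
    while let Some(Some(_event)) = rx.recv().now_or_never() {
        drained += 1;
    }
    drained
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel();
    for i in 0..200u32 {
        tx.send(i).unwrap();
    }
    // With tokio's default budget (currently 128 operations per task poll),
    // this prints a number smaller than 200.
    println!("drained {} of 200 queued events", drain_available(&mut rx));
}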

This happens so quickly because the channel might not be fully drained; but even if it were, the event emitted when Notified clones the waker could still be the one that pushes the channel back over the limit (although that is much less likely).

One way to deal with this might be to suppress sending the notification by setting a flag when the Notified future is polled.
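
For illustration, here is a minimal, runnable sketch of the re-entrancy hazard in steps 6-10 and of the lock-ordering change later made in tokio (tokio-rs/tokio#4129, referenced below). WaiterList, register_bad, and register_fixed are made-up names for this example, not tokio's or console-subscriber's actual code:

use std::sync::Mutex;
use std::task::{RawWaker, RawWakerVTable, Waker};

// A no-op waker, just so the example has something to register.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

struct WaiterList {
    wakers: Mutex<Vec<Waker>>,
}

impl WaiterList {
    // Deadlock-prone ordering (steps 6-10 above): the waker is cloned while
    // the waiter list is locked, so any code run by clone() that touches the
    // same lock deadlocks. It only works here because this waker's clone()
    // does nothing interesting.
    fn register_bad(&self, waker: &Waker) {
        let mut list = self.wakers.lock().unwrap();
        list.push(waker.clone());
    }

    // Ordering adopted by tokio-rs/tokio#4129: clone optimistically *before*
    // taking the lock, so arbitrary clone() code never runs inside the
    // critical section.
    fn register_fixed(&self, waker: &Waker) {
        let cloned = waker.clone();
        self.wakers.lock().unwrap().push(cloned);
    }
}

fn main() {
    let list = WaiterList { wakers: Mutex::new(Vec::new()) };
    let waker = noop_waker();
    list.register_bad(&waker);
    list.register_fixed(&waker);
    println!("registered {} wakers", list.wakers.lock().unwrap().len());
}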

@hawkw added the C-subscriber (Crate: console-subscriber) and S-bug (Severity: bug) labels on Sep 18, 2021
@seanmonstar
Member

It seems this is another case where using instrumented synchronization primitives inside the layer that reacts to the instrumentation causes unhappy times.

seanmonstar added a commit to tokio-rs/tokio that referenced this issue Sep 22, 2021
Since a waker can trigger arbitrary code, such as with a custom waker,
and even more so now that it can emit a tracing event that code could
respond to, we must be careful about the internal state when that code is
triggered. The clone method of a waker is one of those instances.

This changes the internals of `Notify` so that the waker is cloned
*before* locking the waiter list. While this does mean that in some
contended cases, we'll have made an optimistic clone, it makes `Notify`
more robust and correct.

Note that the included test case is built from an instance that did
happen naturally in another project, see
tokio-rs/console#133.
Darksonn pushed a commit to tokio-rs/tokio that referenced this issue Sep 23, 2021
…4129)

suikammd pushed a commit to suikammd/tokio that referenced this issue Oct 7, 2021
…okio-rs#4129)

@seanmonstar changed the title from "console freezes in CONNECTING state when there are a little many tasks" to "console freezes in CONNECTING state when there are too many tasks" on Oct 12, 2021
@hatoo
Contributor Author

hatoo commented Dec 21, 2021

It seems that the console no longer freezes on this example.

@pnkfelix
Contributor

pnkfelix commented Jan 6, 2022

thanks @hatoo !

closing as fixed then. (I'm not sure if it was fixed by PR tokio-rs/tokio#4129, or by something else that happened in the meantime, but I don't think we are going to attempt to track down the exact cause of the fix.)

@pnkfelix closed this as completed on Jan 6, 2022