Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Livelock in poll_proceed when using a futures 0.1 FuturesUnordered on Tokio 0.2.16 #2390

Closed
krallin opened this issue Apr 9, 2020 · 9 comments
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. I-hang Program never terminates, resulting from infinite loops, deadlock, livelock, etc. M-coop Module: tokio/coop

Comments

@krallin
Copy link
Contributor

krallin commented Apr 9, 2020

Version

  • Tokio 0.2.16
  • Futures 0.1 for FuturesUnordered

Platform

Linux, OSX.

Subcrates

sync, coop

Description

We are seeing a livelock in poll_proceed when executing a futures 0.1 stream built using poll_fn and FuturesUnordered where the futures driven by FuturesUnordered calls Semaphore::acquire.

I have implemented a repro here: https://github.com/krallin/repro-tokio-livelock. To repro, just cargo run. It'll hang forever.

The code is:

use futures::{
    compat::Stream01CompatExt,
    future::{FutureExt, TryFutureExt},
    stream::TryStreamExt,
};
use futures_old::stream::FuturesUnordered; // <-- Futures 0.1
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    let sem = Semaphore::new(1);
    let sem = &sem;

    let mut futs = FuturesUnordered::new();
    for _ in 0..200 { // If you do e.g. 10 it works
        futs.push(async move {
            let _ = sem.acquire().await;
            Result::<_, ()>::Ok(())
        }.boxed().compat());
    }

    let n = futs.compat().try_collect::<Vec<_>>().await.unwrap().len();
    println!("done: {}", n);
}

The livelock is here:

pub fn poll_proceed(cx: &mut Context<'_>) -> Poll<()> {
    HITS.with(|hits| {
        let n = hits.get();
        if n == UNCONSTRAINED {
            // opted out of budgeting
            Poll::Ready(())
        } else if n == 0 {
           # BUG HERE >> We constantly re-enter this code path.
            cx.waker().wake_by_ref();
            Poll::Pending
        } else {
            hits.set(n.saturating_sub(1));
            Poll::Ready(())
        }
    })
}

We'd love some help understanding troubleshooting this issue. Thanks! Let us know if there is anything we can do to help root-cause further.

@krallin krallin changed the title Livelock in poll_proceed when using a futures FuturesUnordered on Tokio 0.2.16 Livelock in poll_proceed when using a futures 0.1 FuturesUnordered on Tokio 0.2.16 Apr 9, 2020
@Darksonn
Copy link
Contributor

Darksonn commented Apr 9, 2020

Are you using old futures because it has FuturesUnordered? Just to confirm, are you aware that futures v0.3 also has an implementation of FuturesUnordered? Additionally, can you link to the poll_proceed function source, so I can see context?

@krallin
Copy link
Contributor Author

krallin commented Apr 9, 2020

@Darksonn,

Thanks for replying!

Are you using old futures because it has FuturesUnordered? Just to confirm, are you aware that futures v0.3 also has an implementation of FuturesUnordered?

Yep, we're indeed aware that FuturesUnordered in also in Futures 0.3. We simply still have some older code that still uses Futures 0.1 that is being driven by a Tokio 0.2 executor.

The repro I shared is definitely a contrived, reduced example (the actual callsite for this is here: https://github.com/facebookexperimental/eden/blob/master/eden/mononoke/pushrebase/src/lib.rs#L281, and goes through layers before finally getting to FuturesUnordered).

Additionally, can you link to the poll_proceed function source, so I can see context?

Here you go:

pub fn poll_proceed(cx: &mut Context<'_>) -> Poll<()> {

@Darksonn
Copy link
Contributor

Darksonn commented Apr 9, 2020

The issue is almost certainly the new preemption feature (blog post), which makes IO resources return Poll::Pending once a per-task-poll budget has been used up. As you can see in poll_proceed, that IO resource will in this case immediately emit a wakeup and return pending.

This means that since the wakeup is emitted to the FuturesUnordered subscheduler, that sub scheduler decides that since it got a wakeup from that IO resource, it should be polled again immediately. Unfortunately this means that the poll of the Tokio task never completes, as the FuturesUnordered repeatedly polls that task without yielding to Tokio, and this means the budget is not reset.

@jonhoo
Copy link
Sponsor Contributor

jonhoo commented Apr 9, 2020

Yeah, this is basically rust-lang/futures-rs#2047, which was fixed in futures-util 0.3 (rust-lang/futures-rs#2049), but the fix wasn't backported to 0.1 :/

@krallin
Copy link
Contributor Author

krallin commented Apr 10, 2020

@jonhoo thanks — sounds like I should submit this backport then. That said, just to make sure I'm following this properly: this means my FuturesUnordered will poll the underlying future 32 times "unsuccessfully" before actually yielding and polling it successfully, right?

krallin added a commit to krallin/futures-rs that referenced this issue Apr 10, 2020
This backports rust-lang#2049 to the
0.1 branch. Without this change, polling > 200 futures trough a
FuturesUnordered on a Tokio 0.2 executor results in a busy loop in
Tokio's cooperative scheduling module.

See for a repro of where this breaks:
tokio-rs/tokio#2390

Tested by running the reproducer I submitted there. Without this change,
it hangs forever (spinning on CPU). With the change, it doesn't.
@krallin
Copy link
Contributor Author

krallin commented Apr 10, 2020

I submitted rust-lang/futures-rs#2122 to backport the fix.

Out of curiosity, is this new cooperative scheduling mechanism configurable? (i.e. can it be turned on or off on a per-runtime basis?) Looking at the code, it looks like it isn't, but I'd love to make sure: we were somewhat lucky that we had a test that happened to catch this particular bug, but it'd be nice to be able to canary this individual mechanism in order to make sure we don't have further instances of it lying around in the codebase 😄

@jonhoo
Copy link
Sponsor Contributor

jonhoo commented Apr 10, 2020

@krallin Yes, your observation is (sadly) correct — FuturesUnordered will end up polling the perpetually-pending future 32 times unsuccessfully before yielding. What I really want to happen is for sub-executors (like FuturesUnordered) to have a way to check if the budget has been expended, and if so, to stop polling more pending futures. It'd be a sort of heuristic like:

If two futures in a row that were marked as ready return Pending when you poll them, check the budget.

The tricky part of this is how we expose that functionality without requiring that the implementor having to depend on tokio. For many crates, that dependency wouldn't be too onerous, but for something like futures-util, it probably shouldn't. Arguably what's needed is for the coop "building blocks" to be provided by some low-level, separate library that everyone can then hook into. Though that comes with the cost of adding external dependencies, which has its own set of problems.

There isn't currently a way to turn off coop, though in theory it should be pretty straightforward. Essentially you'd want to make these calls not wrap the inner closure with coop::budget if budgeting is disabled. Or we could do it with a feature I suppose? @carllerche may have thoughts.

facebook-github-bot pushed a commit to facebookexperimental/rust-shed that referenced this issue Apr 15, 2020
Summary:
This is needed because the tonic crate (see the diff stack) relies on tokio ^0.2.13

We can't go to a newer version because a bug that affects mononoke was introduced on 0.2.14 (discussion started on T65261126). The issue was reported upstream tokio-rs/tokio#2390

This diff simply changed the version number on `fbsource/third-party/rust/Cargo.toml` and ran `fbsource/third-party/rust/reindeer/vendor`.

Also ran `buck run //common/rust/cargo_from_buck:cargo_from_buck` to fix the tokio version on generated cargo files

Reviewed By: krallin

Differential Revision: D21043344

fbshipit-source-id: e61797317a581aa87a8a54e9e2ae22655f22fb97
facebook-github-bot pushed a commit to facebookincubator/below that referenced this issue Apr 15, 2020
Summary:
This is needed because the tonic crate (see the diff stack) relies on tokio ^0.2.13

We can't go to a newer version because a bug that affects mononoke was introduced on 0.2.14 (discussion started on T65261126). The issue was reported upstream tokio-rs/tokio#2390

This diff simply changed the version number on `fbsource/third-party/rust/Cargo.toml` and ran `fbsource/third-party/rust/reindeer/vendor`.

Also ran `buck run //common/rust/cargo_from_buck:cargo_from_buck` to fix the tokio version on generated cargo files

Reviewed By: krallin

Differential Revision: D21043344

fbshipit-source-id: e61797317a581aa87a8a54e9e2ae22655f22fb97
facebook-github-bot pushed a commit to facebook/sapling that referenced this issue Apr 15, 2020
Summary:
This is needed because the tonic crate (see the diff stack) relies on tokio ^0.2.13

We can't go to a newer version because a bug that affects mononoke was introduced on 0.2.14 (discussion started on T65261126). The issue was reported upstream tokio-rs/tokio#2390

This diff simply changed the version number on `fbsource/third-party/rust/Cargo.toml` and ran `fbsource/third-party/rust/reindeer/vendor`.

Also ran `buck run //common/rust/cargo_from_buck:cargo_from_buck` to fix the tokio version on generated cargo files

Reviewed By: krallin

Differential Revision: D21043344

fbshipit-source-id: e61797317a581aa87a8a54e9e2ae22655f22fb97
cramertj pushed a commit to rust-lang/futures-rs that referenced this issue Apr 22, 2020
This backports #2049 to the
0.1 branch. Without this change, polling > 200 futures trough a
FuturesUnordered on a Tokio 0.2 executor results in a busy loop in
Tokio's cooperative scheduling module.

See for a repro of where this breaks:
tokio-rs/tokio#2390

Tested by running the reproducer I submitted there. Without this change,
it hangs forever (spinning on CPU). With the change, it doesn't.
@Darksonn Darksonn added A-tokio Area: The main tokio crate C-bug Category: This is a bug. I-hang Program never terminates, resulting from infinite loops, deadlock, livelock, etc. M-coop Module: tokio/coop labels Apr 29, 2020
@aclysma
Copy link

aclysma commented Jul 8, 2020

+1 for being able to turn off coop. I know some people are going to have different opinions on this, but I explicitly don't want fairness in tasks. (For context, my usage is loading assets for games, and possibly in the future executing jobs that may wait on locks, GPU fences, or each other)

  • If I have something long-lived and want fairness, I would use threads. In that case the unit of work dominates the cost of context switching threads and the OS provides stronger guarantees of fairness.
  • If I have something short-lived, then it's not going to starve other tasks. So if it can make progress, I want it to continue to make progress, finish, and then pick up the next task.
  • If I really did want fairness in short lived tasks, then I would prefer to inject my own yields based on heuristics that I choose that can be informed by application-specific logic.

I would probably still be of this opinion, even if I was writing something more "vanilla" like a web service, but I do understand not everyone is going to share my opinion on this! And long term, maybe my usage isn't precisely what tokio is designed for. But maybe it's not too difficult to allow disabling it (or allow modifying the heuristics to an impossibly high value to effectively disable it)

My current workaround is to use tokio 0.2.13. Otherwise my application hangs consistently with even a relatively small number of in-flight asset loads.

@taiki-e
Copy link
Member

taiki-e commented Jan 12, 2021

Fixed in futures 0.1.30

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. I-hang Program never terminates, resulting from infinite loops, deadlock, livelock, etc. M-coop Module: tokio/coop
Projects
None yet
Development

No branches or pull requests

5 participants