Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tokio::process::Command::spawn can fail with WouldBlock #2841

Closed
Yoric opened this issue Sep 17, 2020 · 11 comments
Closed

tokio::process::Command::spawn can fail with WouldBlock #2841

Yoric opened this issue Sep 17, 2020 · 11 comments
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-process Module: tokio/process

Comments

@Yoric
Copy link

Yoric commented Sep 17, 2020

Version

  • tokio v0.2.22
  • tokio-macros v0.2.5
  • tokio-test v0.2.1
  • rustc 1.48.0-nightly (9b4154193 2020-09-14)

Platform

Darwin * 19.5.0 Darwin Kernel Version 19.5.0: Tue May 26 20:41:44 PDT 2020; root:xnu-6153.121.2~2/RELEASE_X86_64 x86_64

Description

While stress-testing a process library I'm writing, I encountered somewhat reproducible instances of

let child = tokio::process::Command::new("...")
   .stdout(std::process::Stdio::piped())
   .spawn()
   .unwrap();

failing with a WouldBlock error.

I expected that the underlying error would simply cause a future to be blocked, rather than returning a Result::Err.

@Yoric Yoric added A-tokio Area: The main tokio crate C-bug Category: This is a bug. labels Sep 17, 2020
@Yoric
Copy link
Author

Yoric commented Sep 17, 2020

By the way, if this is confirmed to be an unwanted behavior, I'm interested in working on a patch.

@ipetkov
Copy link
Member

ipetkov commented Sep 17, 2020

Hi @Yoric thanks for the report!

Can't think of what would raise a WouldBlock error here, other than a potential issue with registering a signal handler under the hood (which internally uses a pipe for receiving wake events).

Could you try to reproduce this while running under strace and post the log here?

@Darksonn Darksonn added the M-process Module: tokio/process label Sep 17, 2020
@Yoric
Copy link
Author

Yoric commented Sep 17, 2020

I'll try and do that.
The test lasted 50 minutes, though, so... I'm a bit afraid of the size of the log.

@ipetkov
Copy link
Member

ipetkov commented Sep 17, 2020

@Yoric if you're interested in checking out the tokio source and annotating the code with extra debug statements or context of where the error comes out from that might also help narrow down the cause

@Yoric
Copy link
Author

Yoric commented Sep 18, 2020

Here's a minimal test. It runs tokio::process::Command::spawn then immediately kills the process, but proceeds without waiting for the process to actually terminate.

// Attempt to stress tokio::process::Command::spawn
// to reproduce a `WouldBlock` error.
#[test]
fn main() {
    env_logger::init();
    tokio_test::block_on(async {
        // A little time to launch dtruss.
        // The error shows up even without this wait.
        tokio::time::delay_for(std::time::Duration::new(3, 0))
            .await;

        let mut cmd = tokio::process::Command::new("ls");
        cmd.kill_on_drop(true) // Make sure we always cleanup
            .stdout(std::process::Stdio::piped());
        for i in 0..10000 {
            eprintln!("Process {}", i);
            cmd.spawn()
               .expect("Could not spawn process")
               .kill()
               .expect("Could not kill process");
        }
    });
}

If fails around 2400 processes launched (and killed).

As far as I can tell from dtruss, the EAGAIN appears here:

22654/0x1111246:  posix_spawn(0x7000040ECF58, 0x7FF7BD704080, 0x7000040EC8A0)		 = -1 Err#35

Attaching

@Yoric
Copy link
Author

Yoric commented Sep 18, 2020

I fear that any solution would require a rewrite of spawn() with a different signature – the current definition is sync, so cannot be await-ed upon and any launch error can happen after the Child has been returned.

By the way, shouldn't kill() be async and/or return a Future?

@Yoric
Copy link
Author

Yoric commented Sep 18, 2020

@Yoric if you're interested in checking out the tokio source and annotating the code with extra debug statements or context of where the error comes out from that might also help narrow down the cause

Actually, I think the reason is simple: the Unix implementation of spawn_child immediately calls cmd.spawn()?. This can error with a WouldBlock if posix_spawn or exec errors with EAGAIN.

I'm not entirely sure how we can wait for posix_spawn or exec to be available once more without polling.

@ipetkov
Copy link
Member

ipetkov commented Sep 18, 2020

Thanks for pulling some logs @Yoric, I'll take a look at them.

Actually I also came across this comment which points towards EAGAIN being raised when the process limit is exhausted which I suspect may be the case in your stress test. I also noticed that the test calls kill but does not await the child, which leaves the possibility of the process not getting reaped (tokio will perform a best effort for reaping orphaned children in the background, so long as new process are being spawned/polled, but I suppose it is possible for that to get stuck).

The master branch has a change which makes kill() and async operation which also internally reaps the child to minimize the possibility of leaving a zombie process. Could you try rerunning your test with the latest master commit and await the kill call to see if the issue persists (or at least takes even longer to hit)?

@Yoric
Copy link
Author

Yoric commented Sep 18, 2020

I also noticed that the test calls kill but does not await the child, which leaves the possibility of the process not getting reaped (tokio will perform a best effort for reaping orphaned children in the background, so long as new process are being spawned/polled, but I suppose it is possible for that to get stuck).

Yes, that's also my assumption. I have added a wait_with_output call and that solves the issue.

So, the question is which behavior is desirable:

  • somehow making spawn (or a variant thereof) wait until a process can be spawned; or
  • keeping it a failure as it is at the moment, perhaps with more documentation.

@ipetkov
Copy link
Member

ipetkov commented Sep 18, 2020

We should definitely update the documentation to mention that a WouldBlock error can be raised during spawn if the pid space is full!

As for waiting until the spawn succeeds I'm not sure what we can do to be woken up when the process space clears out. It is entirely possible that the system has too many other processes running even beyond tokio's control. Happy to explore any suggestions if anyone has them! But until then I think the best course of action is to surface the error to the caller and allow them to retry if necessary...

@ipetkov
Copy link
Member

ipetkov commented Oct 13, 2020

The documentation around returning WouldBlock has been updated as part of #2952. Unfortunately there doesn't seem like we can do anything about the system pid exhaustion, or how to be alerted/woken if more space opens up. Going to close this out for now, feel free to reopen if there are other questions!

@ipetkov ipetkov closed this as completed Oct 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-process Module: tokio/process
Projects
None yet
Development

No branches or pull requests

3 participants