Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

recover when OS fails to spawn a new thread #4485

Merged
merged 5 commits into from Feb 25, 2022

Conversation

gwik
Copy link
Contributor

@gwik gwik commented Feb 9, 2022

Motivation

As described in #2309 the runtime panics when trying to spawn a new thread and the OS have reaches the limit of threads / processes.

This can be observed on a Linux machine decreasing the limit with ulimit -u and running the program in #2309.

As noted by @Darksonn there is a pool of threads, and failing to spawn a new one is not mandatory for the program to make progress.

Solution

Ignore the temporary errors when failing to spawn a new thread from the OS.
The task will be naturally scheduled by threads already in the pool.

Fixes: #2309

Avoid panicking when the OS reaches the limit of the number of
threads / processes and the error is temporary.

Spawning a new thread is not mandatory to make progress as
long as there is a least one thread in the pool already processing
the task queue.

Fixes: tokio-rs#2309
@github-actions github-actions bot added the R-loom Run loom tests on this PR label Feb 9, 2022
@gwik
Copy link
Contributor Author

gwik commented Feb 9, 2022

I looked at how I could test this, but it let me puzzled...
Guidance appreciated.

@Darksonn Darksonn added A-tokio Area: The main tokio crate M-blocking Module: tokio/task/blocking labels Feb 9, 2022
tokio/src/runtime/blocking/pool.rs Outdated Show resolved Hide resolved
tokio/src/runtime/blocking/pool.rs Outdated Show resolved Hide resolved
gwik and others added 2 commits February 9, 2022 23:26
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me!

tokio/src/runtime/blocking/pool.rs Outdated Show resolved Hide resolved
Co-authored-by: Eliza Weisman <eliza@buoyant.io>
@Darksonn
Copy link
Contributor

There are some CI failures that you will have to investigate.

@gwik
Copy link
Contributor Author

gwik commented Feb 10, 2022

error[E0599]: no variant or associated item named `OutOfMemory` found for enum `ErrorKind` in the current scope
   --> /home/gwik/src/playground/tokio/tokio/src/runtime/blocking/pool.rs:311:62
    |
311 |         std::io::ErrorKind::WouldBlock | std::io::ErrorKind::OutOfMemory
    |                                                              ^^^^^^^^^^^ variant or associated item not found in `ErrorKind`

Looks like OutOfMemory was added in 1.54 and therefore it fails on 1.49.
Anyway, I tried to simulate memory exhaustion with cgroup and the process gets OOM killed and I can't make pthread_create to fail with because of no memory available. Even when configuring with huge thread stack. So I guess it's ok to remove the OutOfMemory which wasn't part the issue in the first place anyway.

I confirmed that the fix is working on MacOS and this is a non-issue on Windows since there doesn't seem to be a limit of the number of threads (available memory is the limit). I tried on a VM and I couldn't trigger the bug.

About the test, I wonder if something like this would work on the CI:

  • spawn a task A on the runtime that waits
  • separately spawn (std::thread::spawn) as many threads as possible until the OS refuse to create one.
  • spawn a task B on the runtime
  • kill a thread spawn separately
  • make task A finish
  • verify that task B ran

I wonder how accurate and stable a test like this would be. I doubt it is worth. Ideas ?

OutOfMemory is not available in 1.49 and I can't find
a reproducible scenario anyway.

The fix works on Windows and mac OS so removing the comment.
@hawkw
Copy link
Member

hawkw commented Feb 15, 2022

Looks like OutOfMemory was added in 1.54 and therefore it fails on 1.49.
Anyway, I tried to simulate memory exhaustion with cgroup and the process gets OOM killed and I can't make pthread_create to fail with because of no memory available. Even when configuring with huge thread stack. So I guess it's ok to remove the OutOfMemory which wasn't part the issue in the first place anyway.

It could be worth using autocfg or something to include the OutOfMemory variant in this code only on Rust >= 1.54.0, but this may not be particularly important if we can't easily reproduce a scenario where pthread_create fails with that error. If anyone does hit it in the wild, we can always come back and add Rust version detection in order to handle the OutOfMemory variant...

About the test, I wonder if something like this would work on the CI:

  • spawn a task A on the runtime that waits
  • separately spawn (std::thread::spawn) as many threads as possible until the OS refuse to create one.
  • spawn a task B on the runtime
  • kill a thread spawn separately
  • make task A finish
  • verify that task B ran
    I wonder how accurate and stable a test like this would be. I doubt it is worth. Ideas ?

I'm a bit skeptical about running a test like this on CI, it seems potentially flaky. This code seems simple enough that I'm not sure if it's really worth adding such a complex test or not. If we are going to do this, though, I think it's better to lower the thread limit using ulimit, rather than just spawning a huge pile of threads until we run out...

@gwik
Copy link
Contributor Author

gwik commented Feb 15, 2022

It could be worth using autocfg or something to include the OutOfMemory variant in this code only on Rust >= 1.54.0, but this may not be particularly important if we can't easily reproduce a scenario where pthread_create fails with that error. If anyone does hit it in the wild, we can always come back and add Rust version detection in order to handle the OutOfMemory variant...

I agree it's fine.

I'm a bit skeptical about running a test like this on CI, it seems potentially flaky. This code seems simple enough that I'm not sure if it's really worth adding such a complex test or not. If we are going to do this, though, I think it's better to lower the thread limit using ulimit, rather than just spawning a huge pile of threads until we run out...

I'm very skeptical too, using ulimit on my machine I had to play with many different settings to find the value that reproduce the crash. It's probably going to be more of a burden than an asset to have this test.

If it's fine for you like this, it's fine for me. Thanks for the review.

@gwik
Copy link
Contributor Author

gwik commented Feb 24, 2022

@hawkw is this ok to merge ?

@hawkw
Copy link
Member

hawkw commented Feb 24, 2022

@hawkw is this ok to merge ?

I was hoping to get a second approval from another maintainer, just to make sure I haven't missed anything, but I think it seems good.

Copy link
Contributor

@Darksonn Darksonn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No concerns from me.

@Darksonn Darksonn merged commit 70c10ba into tokio-rs:master Feb 25, 2022
hawkw added a commit that referenced this pull request Apr 27, 2022
# 1.18.0 (April 27, 2022)

This release adds a number of new APIs in `tokio::net`, `tokio::signal`, and
`tokio::sync`. In addition, it adds new unstable APIs to `tokio::task` (`Id`s
for uniquely identifying a task, and `AbortHandle` for remotely cancelling a
task), as well as a number of bugfixes.

### Fixed

- blocking: add missing `#[track_caller]` for `spawn_blocking` ([#4616])
- macros: fix `select` macro to process 64 branches ([#4519])
- net: fix `try_io` methods not calling Mio's `try_io` internally ([#4582])
- runtime: recover when OS fails to spawn a new thread ([#4485])

### Added

- macros: support setting a custom crate name for `#[tokio::main]` and
  `#[tokio::test]` ([#4613])
- net: add `UdpSocket::peer_addr` ([#4611])
- net: add `try_read_buf` method for named pipes ([#4626])
- signal: add `SignalKind` `Hash`/`Eq` impls and `c_int` conversion ([#4540])
- signal: add support for signals up to `SIGRTMAX` ([#4555])
- sync: add `watch::Sender::send_modify` method ([#4310])
- sync: add `broadcast::Receiver::len` method ([#4542])
- sync: add `watch::Receiver::same_channel` method ([#4581])
- sync: implement `Clone` for `RecvError` types ([#4560])

### Changed

- update `nix` to 0.24, limit features ([#4631])
- update `mio` to 0.8.1 ([#4582])
- macros: rename `tokio::select!`'s internal `util` module ([#4543])
- runtime: use `Vec::with_capacity` when building runtime ([#4553])

### Documented

- improve docs for `tokio_unstable` ([#4524])
- runtime: include more documentation for thread_pool/worker ([#4511])
- runtime: update `Handle::current`'s docs to mention `EnterGuard` ([#4567])
- time: clarify platform specific timer resolution ([#4474])
- signal: document that `Signal::recv` is cancel-safe ([#4634])
- sync: `UnboundedReceiver` close docs ([#4548])

### Unstable

The following changes only apply when building with `--cfg tokio_unstable`:

- task: add `task::Id` type ([#4630])
- task: add `AbortHandle` type for cancelling tasks in a `JoinSet` ([#4530],
  [#4640])
- task: fix missing `doc(cfg(...))` attributes for `JoinSet` ([#4531])
- task: fix broken link in `AbortHandle` RustDoc ([#4545])
- metrics: add initial IO driver metrics ([#4507])

[#4616]: #4616
[#4519]: #4519
[#4582]: #4582
[#4485]: #4485
[#4613]: #4613
[#4611]: #4611
[#4626]: #4626
[#4540]: #4540
[#4555]: #4555
[#4310]: #4310
[#4542]: #4542
[#4581]: #4581
[#4560]: #4560
[#4631]: #4631
[#4582]: #4582
[#4543]: #4543
[#4553]: #4553
[#4524]: #4524
[#4511]: #4511
[#4567]: #4567
[#4474]: #4474
[#4634]: #4634
[#4548]: #4548
[#4630]: #4630
[#4530]: #4530
[#4640]: #4640
[#4531]: #4531
[#4545]: #4545
[#4507]: #4507
hawkw added a commit that referenced this pull request Apr 27, 2022
# 1.18.0 (April 27, 2022)

This release adds a number of new APIs in `tokio::net`, `tokio::signal`, and
`tokio::sync`. In addition, it adds new unstable APIs to `tokio::task` (`Id`s
for uniquely identifying a task, and `AbortHandle` for remotely cancelling a
task), as well as a number of bugfixes.

### Fixed

- blocking: add missing `#[track_caller]` for `spawn_blocking` ([#4616])
- macros: fix `select` macro to process 64 branches ([#4519])
- net: fix `try_io` methods not calling Mio's `try_io` internally ([#4582])
- runtime: recover when OS fails to spawn a new thread ([#4485])

### Added

- macros: support setting a custom crate name for `#[tokio::main]` and
  `#[tokio::test]` ([#4613])
- net: add `UdpSocket::peer_addr` ([#4611])
- net: add `try_read_buf` method for named pipes ([#4626])
- signal: add `SignalKind` `Hash`/`Eq` impls and `c_int` conversion ([#4540])
- signal: add support for signals up to `SIGRTMAX` ([#4555])
- sync: add `watch::Sender::send_modify` method ([#4310])
- sync: add `broadcast::Receiver::len` method ([#4542])
- sync: add `watch::Receiver::same_channel` method ([#4581])
- sync: implement `Clone` for `RecvError` types ([#4560])

### Changed

- update `nix` to 0.24, limit features ([#4631])
- update `mio` to 0.8.1 ([#4582])
- macros: rename `tokio::select!`'s internal `util` module ([#4543])
- runtime: use `Vec::with_capacity` when building runtime ([#4553])

### Documented

- improve docs for `tokio_unstable` ([#4524])
- runtime: include more documentation for thread_pool/worker ([#4511])
- runtime: update `Handle::current`'s docs to mention `EnterGuard` ([#4567])
- time: clarify platform specific timer resolution ([#4474])
- signal: document that `Signal::recv` is cancel-safe ([#4634])
- sync: `UnboundedReceiver` close docs ([#4548])

### Unstable

The following changes only apply when building with `--cfg tokio_unstable`:

- task: add `task::Id` type ([#4630])
- task: add `AbortHandle` type for cancelling tasks in a `JoinSet` ([#4530],
  [#4640])
- task: fix missing `doc(cfg(...))` attributes for `JoinSet` ([#4531])
- task: fix broken link in `AbortHandle` RustDoc ([#4545])
- metrics: add initial IO driver metrics ([#4507])

[#4616]: #4616
[#4519]: #4519
[#4582]: #4582
[#4485]: #4485
[#4613]: #4613
[#4611]: #4611
[#4626]: #4626
[#4540]: #4540
[#4555]: #4555
[#4310]: #4310
[#4542]: #4542
[#4581]: #4581
[#4560]: #4560
[#4631]: #4631
[#4582]: #4582
[#4543]: #4543
[#4553]: #4553
[#4524]: #4524
[#4511]: #4511
[#4567]: #4567
[#4474]: #4474
[#4634]: #4634
[#4548]: #4548
[#4630]: #4630
[#4530]: #4530
[#4640]: #4640
[#4531]: #4531
[#4545]: #4545
[#4507]: #4507
hawkw added a commit that referenced this pull request Apr 27, 2022
# 1.18.0 (April 27, 2022)

This release adds a number of new APIs in `tokio::net`, `tokio::signal`, and
`tokio::sync`. In addition, it adds new unstable APIs to `tokio::task` (`Id`s
for uniquely identifying a task, and `AbortHandle` for remotely cancelling a
task), as well as a number of bugfixes.

### Fixed

- blocking: add missing `#[track_caller]` for `spawn_blocking` ([#4616])
- macros: fix `select` macro to process 64 branches ([#4519])
- net: fix `try_io` methods not calling Mio's `try_io` internally ([#4582])
- runtime: recover when OS fails to spawn a new thread ([#4485])

### Added

- macros: support setting a custom crate name for `#[tokio::main]` and
  `#[tokio::test]` ([#4613])
- net: add `UdpSocket::peer_addr` ([#4611])
- net: add `try_read_buf` method for named pipes ([#4626])
- signal: add `SignalKind` `Hash`/`Eq` impls and `c_int` conversion ([#4540])
- signal: add support for signals up to `SIGRTMAX` ([#4555])
- sync: add `watch::Sender::send_modify` method ([#4310])
- sync: add `broadcast::Receiver::len` method ([#4542])
- sync: add `watch::Receiver::same_channel` method ([#4581])
- sync: implement `Clone` for `RecvError` types ([#4560])

### Changed

- update `nix` to 0.24, limit features ([#4631])
- update `mio` to 0.8.1 ([#4582])
- macros: rename `tokio::select!`'s internal `util` module ([#4543])
- runtime: use `Vec::with_capacity` when building runtime ([#4553])

### Documented

- improve docs for `tokio_unstable` ([#4524])
- runtime: include more documentation for thread_pool/worker ([#4511])
- runtime: update `Handle::current`'s docs to mention `EnterGuard` ([#4567])
- time: clarify platform specific timer resolution ([#4474])
- signal: document that `Signal::recv` is cancel-safe ([#4634])
- sync: `UnboundedReceiver` close docs ([#4548])

### Unstable

The following changes only apply when building with `--cfg tokio_unstable`:

- task: add `task::Id` type ([#4630])
- task: add `AbortHandle` type for cancelling tasks in a `JoinSet` ([#4530],
  [#4640])
- task: fix missing `doc(cfg(...))` attributes for `JoinSet` ([#4531])
- task: fix broken link in `AbortHandle` RustDoc ([#4545])
- metrics: add initial IO driver metrics ([#4507])

[#4616]: #4616
[#4519]: #4519
[#4582]: #4582
[#4485]: #4485
[#4613]: #4613
[#4611]: #4611
[#4626]: #4626
[#4540]: #4540
[#4555]: #4555
[#4310]: #4310
[#4542]: #4542
[#4581]: #4581
[#4560]: #4560
[#4631]: #4631
[#4582]: #4582
[#4543]: #4543
[#4553]: #4553
[#4524]: #4524
[#4511]: #4511
[#4567]: #4567
[#4474]: #4474
[#4634]: #4634
[#4548]: #4548
[#4630]: #4630
[#4530]: #4530
[#4640]: #4640
[#4531]: #4531
[#4545]: #4545
[#4507]: #4507

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
crapStone pushed a commit to Calciumdibromid/CaBr2 that referenced this pull request May 1, 2022
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [tokio](https://tokio.rs) ([source](https://github.com/tokio-rs/tokio)) | dependencies | minor | `1.17.0` -> `1.18.0` |
| [tokio](https://tokio.rs) ([source](https://github.com/tokio-rs/tokio)) | dev-dependencies | minor | `1.17.0` -> `1.18.0` |

---

### Release Notes

<details>
<summary>tokio-rs/tokio</summary>

### [`v1.18.0`](https://github.com/tokio-rs/tokio/releases/tokio-1.18.0)

[Compare Source](tokio-rs/tokio@tokio-1.17.0...tokio-1.18.0)

##### 1.18.0 (April 27, 2022)

This release adds a number of new APIs in `tokio::net`, `tokio::signal`, and
`tokio::sync`. In addition, it adds new unstable APIs to `tokio::task` (`Id`s
for uniquely identifying a task, and `AbortHandle` for remotely cancelling a
task), as well as a number of bugfixes.

##### Fixed

-   blocking: add missing `#[track_caller]` for `spawn_blocking` ([#&#8203;4616](tokio-rs/tokio#4616))
-   macros: fix `select` macro to process 64 branches ([#&#8203;4519](tokio-rs/tokio#4519))
-   net: fix `try_io` methods not calling Mio's `try_io` internally ([#&#8203;4582](tokio-rs/tokio#4582))
-   runtime: recover when OS fails to spawn a new thread ([#&#8203;4485](tokio-rs/tokio#4485))

##### Added

-   macros: support setting a custom crate name for `#[tokio::main]` and
    `#[tokio::test]` ([#&#8203;4613](tokio-rs/tokio#4613))
-   net: add `UdpSocket::peer_addr` ([#&#8203;4611](tokio-rs/tokio#4611))
-   net: add `try_read_buf` method for named pipes ([#&#8203;4626](tokio-rs/tokio#4626))
-   signal: add `SignalKind` `Hash`/`Eq` impls and `c_int` conversion ([#&#8203;4540](tokio-rs/tokio#4540))
-   signal: add support for signals up to `SIGRTMAX` ([#&#8203;4555](tokio-rs/tokio#4555))
-   sync: add `watch::Sender::send_modify` method ([#&#8203;4310](tokio-rs/tokio#4310))
-   sync: add `broadcast::Receiver::len` method ([#&#8203;4542](tokio-rs/tokio#4542))
-   sync: add `watch::Receiver::same_channel` method ([#&#8203;4581](tokio-rs/tokio#4581))
-   sync: implement `Clone` for `RecvError` types ([#&#8203;4560](tokio-rs/tokio#4560))

##### Changed

-   update `mio` to 0.8.1 ([#&#8203;4582](tokio-rs/tokio#4582))
-   macros: rename `tokio::select!`'s internal `util` module ([#&#8203;4543](tokio-rs/tokio#4543))
-   runtime: use `Vec::with_capacity` when building runtime ([#&#8203;4553](tokio-rs/tokio#4553))

##### Documented

-   improve docs for `tokio_unstable` ([#&#8203;4524](tokio-rs/tokio#4524))
-   runtime: include more documentation for thread_pool/worker ([#&#8203;4511](tokio-rs/tokio#4511))
-   runtime: update `Handle::current`'s docs to mention `EnterGuard` ([#&#8203;4567](tokio-rs/tokio#4567))
-   time: clarify platform specific timer resolution ([#&#8203;4474](tokio-rs/tokio#4474))
-   signal: document that `Signal::recv` is cancel-safe ([#&#8203;4634](tokio-rs/tokio#4634))
-   sync: `UnboundedReceiver` close docs ([#&#8203;4548](tokio-rs/tokio#4548))

##### Unstable

The following changes only apply when building with `--cfg tokio_unstable`:

-   task: add `task::Id` type ([#&#8203;4630](tokio-rs/tokio#4630))
-   task: add `AbortHandle` type for cancelling tasks in a `JoinSet` ([#&#8203;4530](tokio-rs/tokio#4530)],
    \[[#&#8203;4640](tokio-rs/tokio#4640))
-   task: fix missing `doc(cfg(...))` attributes for `JoinSet` ([#&#8203;4531](tokio-rs/tokio#4531))
-   task: fix broken link in `AbortHandle` RustDoc ([#&#8203;4545](tokio-rs/tokio#4545))
-   metrics: add initial IO driver metrics ([#&#8203;4507](tokio-rs/tokio#4507))

</details>

---

### Configuration

📅 **Schedule**: At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about these updates again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, click this checkbox.

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1327
Reviewed-by: crapStone <crapstone@noreply.codeberg.org>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate M-blocking Module: tokio/task/blocking R-loom Run loom tests on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tokio::task::spawn_blocking panics when exceeding the thread limit
3 participants