Improve resiliency to process corruption #79

casperisfine · 2023-12-08T08:10:50Z

Context

A fundamental problem Pitchfork has to deal with is that neither POSIX nor Linux quite supports running anything but async-signal-safe functions after a fork().

In practice, as long as you never spawned any background thread, you are fine. But many Ruby applications and gems do spawn threads, and in the presence of such background threads, if we happen to fork at the wrong time, it can result in a subprocess that is in an unrecoverable state.

The typical case is forking while a background thread holds a lock. Only the forking thread exists in the child, so this lock remains locked there with nothing left to release it, and trying to acquire it will deadlock.

For instance this can happen with OpenSSL 3:

    [/usr/lib/x86_64-linux-gnu/libc.so.6] pthread_rwlock_wrlock
    [/usr/lib/x86_64-linux-gnu/libcrypto.so.3] CRYPTO_THREAD_write_lock
    [/usr/lib/x86_64-linux-gnu/libcrypto.so.3] CRYPTO_alloc_ex_data
    [/usr/lib/x86_64-linux-gnu/libcrypto.so.3] OPENSSL_thread_stop
    [/usr/lib/x86_64-linux-gnu/libcrypto.so.3] OPENSSL_cleanup
    [/usr/lib/x86_64-linux-gnu/libc.so.6] secure_getenv

So any background thread that uses an SSL connection may break reforking.

That's what Pitchfork.prevent_fork is for, but still, we should try to handle such scenarios as gracefully as possible.

Action Plan

- If we detect such a case, we should terminate the affected process.
  - Ideally we replace that process with a new one, but if for some reason we can't, we should gracefully terminate the whole server (last resort).
- We should consider "reverting" Spawn molds instead of promoting workers #42.
  - Spawning the new mold out of a worker has the nice property of not impacting capacity as much.
  - However, that fork is risky because workers are even more likely than molds to have background threads.
- We should probably warn for every thread in the mold (Puma does something similar)
- (optional) We could provide a way to run background threads in a dedicated process outside the mold.
- Provide a callback to validate post-fork processes
  - Maybe even validate the usual suspects by default (OpenSSL)
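As a rough illustration of the "warn for every thread in the mold" item above, here is a minimal sketch in plain Ruby. The helper names `background_threads` and `warn_about_background_threads` are hypothetical and are not part of Pitchfork's API; the sketch only assumes that any live thread other than the current one is a potential hazard at fork time.

```ruby
# Hypothetical helpers for illustration only; not part of Pitchfork.

# Returns every live thread other than the one calling this method.
def background_threads
  Thread.list.reject { |thread| thread == Thread.current || !thread.alive? }
end

# Prints one warning per background thread and returns how many were found.
# Intended to be called right before forking from a mold.
def warn_about_background_threads(io = $stderr)
  threads = background_threads
  threads.each do |thread|
    # Thread#backtrace may be nil or empty, so fall back to a placeholder.
    location = thread.backtrace&.first || "unknown location"
    io.puts("warning: background thread #{thread.inspect} is alive before fork (at #{location})")
  end
  threads.size
end
```

A warning like this cannot tell whether a given thread actually holds a lock at fork time; it only points at the threads worth auditing.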


Ref: #79

It can happen that the new mold was forked at an unsafe point, causing the middle process to crash. When we detect that this has happened, we should abandon this mold.

Currently, abandoning the mold causes a graceful shutdown; in the future we could try creating a new mold to replace it.

Co-Authored-By: Étienne Barrié <etienne.barrie@gmail.com>

casperisfine · 2023-12-08T13:12:34Z

We should consider "reverting" Spawn molds instead of promoting workers #42

Another possibility would be to "feature test" that the mold isn't corrupted before registering to the monitor.

e.g. most issues I've seen happen during exit, so Process.wait(fork {}) may be enough to confirm the process isn't hosed.
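A minimal sketch of such a feature test, assuming it runs inside the freshly forked mold before it registers with the monitor. The name `fork_healthy?` and the timeout value are hypothetical. The timeout is an addition to the bare `Process.wait(fork {})`, so that a child deadlocked during exit does not hang the mold as well.

```ruby
require "timeout"

# Hypothetical helper for illustration only; not part of Pitchfork.
# Forks a child that does nothing and exits normally, so that exit-time
# cleanup runs, then checks that it terminates cleanly in time.
def fork_healthy?(timeout: 5)
  pid = fork {}
  _, status = Timeout.timeout(timeout) { Process.wait2(pid) }
  status.success?
rescue Timeout::Error
  # The child is likely stuck (e.g. deadlocked during exit): kill and reap it.
  begin
    Process.kill(:KILL, pid)
    Process.wait(pid)
  rescue Errno::ESRCH, Errno::ECHILD
    # The child already went away on its own.
  end
  false
end
```

If `fork_healthy?` returns false, the mold could exit instead of registering. A passing check only shows that one trivial fork-and-exit cycle worked; it does not prove the process is free of corruption.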

casperisfine mentioned this issue Dec 8, 2023

Handle fork_sibling failing repeatedly #80

Merged
