gRPC crash in a forked process (python) #31240

romanek-adam-b2c2 · 2022-10-05T11:42:44Z

What version of gRPC and what language are you using?

gRPC v1.48.0
Python

What operating system (Linux, Windows,...) and version?

macOS 12.6

What runtime / compiler are you using (e.g. python version or version of gcc)

Python 3.10.5

What did you do?

Use a gRPC client in a forked process, for example in a Celery worker when Celery runs in "prefork" pool mode. A gRPC call must fail for the issue to occur.

Here's the minimal reproducible example:

#!/usr/bin/env python

import os
import sys
import threading
import time

import grpc
from grpc_health.v1.health_pb2 import HealthCheckRequest
from grpc_health.v1.health_pb2_grpc import HealthStub

# we intentionally choose a bad endpoint, we want the gRPC call to fail
ENDPOINT = "127.0.0.1:1"

# we need at least one non-main thread running, otherwise macOS SDK doesn't consider fork() to be unsafe
thread = threading.Thread(target=lambda: time.sleep(9999), daemon=True)
thread.start()

pid = os.fork()
if pid != 0:
    os.waitpid(pid, 0)
    sys.exit(0)
else:
    # running in a child...
    print("Starting gRPC client...")

    channel = grpc.insecure_channel(ENDPOINT)
    stub = HealthStub(channel)

    # boom!
    stub.Check(HealthCheckRequest())

What did you expect to see?

The child process should exit without crashing (the gRPC call itself is expected to fail, but that is beside the point).

What did you see instead?

The child process exits due to SIGABRT signal:

Starting gRPC client...
objc[29041]: +[__NSTimeZone initialize] may have been in progress in another thread when fork() was called.
objc[29041]: +[__NSTimeZone initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.

Process finished with exit code 6

Here's a full backtrace from the breakpoint mentioned in the above message:

  * frame #0: 0x000000019b9fc478 libobjc.A.dylib`objc_initializeAfterForkError
    frame #1: 0x000000019b9fc5fc libobjc.A.dylib`performForkChildInitialize(objc_class*, objc_class*) + 376
    frame #2: 0x000000019b9e37c0 libobjc.A.dylib`initializeNonMetaClass + 496
    frame #3: 0x000000019b9e3250 libobjc.A.dylib`initializeAndMaybeRelock(objc_class*, objc_object*, mutex_tt<false>&, bool) + 184
    frame #4: 0x000000019b9e2fe8 libobjc.A.dylib`lookUpImpOrForward + 1052
    frame #5: 0x000000019b9e28e4 libobjc.A.dylib`_objc_msgSend_uncached + 68
    frame #6: 0x000000019bbf644c CoreFoundation`__NSTimeZone_newWithCache + 108
    frame #7: 0x000000019bbf604c CoreFoundation`-[__NSPlaceholderTimeZone __initWithName:cache:] + 116
    frame #8: 0x000000019bbf5f2c CoreFoundation`+[NSTimeZone timeZoneWithName:] + 40
    frame #9: 0x000000019bbf5e58 CoreFoundation`+[NSTimeZone systemTimeZone] + 576
    frame #10: 0x000000019bbf5bbc CoreFoundation`+[NSTimeZone defaultTimeZone] + 80
    frame #11: 0x000000019bbf5b40 CoreFoundation`CFTimeZoneCopyDefault + 44
    frame #12: 0x000000016a8904f4 cygrpc.cpython-310-darwin.so`absl::lts_20220623::time_internal::cctz::local_time_zone() + 28
    frame #13: 0x000000016a88258c cygrpc.cpython-310-darwin.so`absl::lts_20220623::FormatTime(absl::lts_20220623::Time) + 32
    frame #14: 0x000000016a6b2e30 cygrpc.cpython-310-darwin.so`void absl::lts_20220623::functional_internal::InvokeObject<grpc_core::StatusToString(absl::lts_20220623::Status const&)::$_0, void, absl::lts_20220623::string_view, absl::lts_20220623::Cord const&>(absl::lts_20220623::functional_internal::VoidPtr, absl::lts_20220623::functional_internal::ForwardT<absl::lts_20220623::string_view>::type, absl::lts_20220623::functional_internal::ForwardT<absl::lts_20220623::Cord const&>::type) + 1432
    frame #15: 0x000000016a847e5c cygrpc.cpython-310-darwin.so`absl::lts_20220623::Status::ForEachPayload(absl::lts_20220623::FunctionRef<void (absl::lts_20220623::string_view, absl::lts_20220623::Cord const&)>) const + 296
    frame #16: 0x000000016a6b20b0 cygrpc.cpython-310-darwin.so`grpc_core::StatusToString(absl::lts_20220623::Status const&) + 364
    frame #17: 0x000000016a5ee358 cygrpc.cpython-310-darwin.so`grpc_core::Subchannel::OnConnectingFinishedLocked(absl::lts_20220623::Status) + 240
    frame #18: 0x000000016a5ed070 cygrpc.cpython-310-darwin.so`grpc_core::Subchannel::OnConnectingFinished(void*, absl::lts_20220623::Status) + 96
    frame #19: 0x000000016a6bfd88 cygrpc.cpython-310-darwin.so`grpc_core::ExecCtx::Flush() + 124
    frame #20: 0x000000016a6c04f4 cygrpc.cpython-310-darwin.so`grpc_core::Executor::RunClosures(char const*, grpc_closure_list) + 216
    frame #21: 0x000000016a6c06e4 cygrpc.cpython-310-darwin.so`grpc_core::Executor::ThreadMain(void*) + 296
    frame #22: 0x000000016a6b3658 cygrpc.cpython-310-darwin.so`grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::'lambda'(void*)::__invoke(void*) + 140
    frame #23: 0x000000019bb5c26c libsystem_pthread.dylib`_pthread_start + 148

Additional context

On macOS, certain types need to be "initialized" before forking (for references see links below) to avoid crashes in forked-off processes. Otherwise, there is a good chance of seeing messages like:

objc[82289]: +[__NSTimeZone initialize] may have been in progress in another thread when fork() was called.
objc[82289]: +[__NSTimeZone initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.

followed by SIGABRT which terminates the process.

This is a known but poorly handled issue on macOS 10.13+. It often affects Python, Ruby, and some other scripting languages, typically in combination with third-party libraries that rely on threads, when the call to fork() is not followed by exec*(). According to various sources, that pattern is not uncommon in scripting languages; Python examples include the "multiprocessing" package and Celery with its "prefork" worker pool.

Some sources recommend exporting OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES, but this only hides the problem, which still exists and can result in hard-to-diagnose deadlocks.

Related issues:

References:

Possible workarounds/solutions

One possible workaround/solution is to initialise certain types before forking. Below is a snippet of code which we use in our codebase:

import platform
from ctypes import c_void_p, cdll
from ctypes.util import find_library


# based on: https://github.com/mountainstorm/MobileDevice/blob/master/CoreFoundation.py
class CoreFoundationLib:
    def __init__(self):
        CoreFoundation = cdll.LoadLibrary(find_library("CoreFoundation"))

        CFTypeRef = c_void_p

        CFRelease = CoreFoundation.CFRelease
        CFRelease.restype = None
        CFRelease.argtypes = [CFTypeRef]
        self.CFRelease = CFRelease

        CFTimeZoneCopyDefault = CoreFoundation.CFTimeZoneCopyDefault
        CFTimeZoneCopyDefault.restype = CFTypeRef
        CFTimeZoneCopyDefault.argtypes = []
        self.CFTimeZoneCopyDefault = CFTimeZoneCopyDefault


def fix_grpc_client():
    if platform.system() != "Darwin":
        return

    cf_lib = CoreFoundationLib()

    cf_time_zone = cf_lib.CFTimeZoneCopyDefault()
    cf_lib.CFRelease(cf_time_zone)

If you add a call to fix_grpc_client() to the minimal reproducible example shown at the top of this issue, before calling fork(), the issue goes away.

However, this works for only one type, NSTimeZone in this case, so it is not an ideal solution: it cannot be turned into a generalised fix for every type that requires "initialisation" prior to forking (or at least I'm not aware of any way of doing so).

Additionally, in our codebase we simply call fix_grpc_client() early during process init. We could instead use pthread_atfork() (as suggested in https://www.wefearchange.org/2018/11/forkmacos.rst.html, although the author reports it didn't work for him) to do this right before forking, and only when a fork actually happens.


gpshead · 2022-11-13T07:46:13Z

This is not a gRPC bug.

The only reliable solution is to not use os.fork() on macOS. Period.

If a process has any threads running at the time the fork system call happens, on any platform, all bets are off. The Python runtime is not async-signal-safe (and never will be, that is impossible), so you cannot execute any Python code after fork from a process that had threads. Difficult-to-debug random deadlocks and crashes are normal in that situation. When you see an application working fine despite that, it is running on borrowed time and has gotten lucky. Its luck will run out at some unplanned point in the future.

Your fix_grpc_client() code might work for you today, but realize that is merely a tiny band-aid that happens to paper over an unsolvable problem. The reliable advice is to move away from fork.
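
One way to follow this advice is to run the client in a freshly started interpreter instead of a forked copy of the parent. Below is a minimal sketch using the standard subprocess module. The helper name run_in_fresh_interpreter is hypothetical, and the child code is passed as a string purely for illustration:

```python
import os
import subprocess
import sys


def run_in_fresh_interpreter(code, timeout=30.0):
    """Run `code` in a brand-new Python process, not a forked copy."""
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


# The child is a separate interpreter with its own PID and no inherited
# Python state from the (possibly threaded) parent.
result = run_in_fresh_interpreter("import os; print(os.getpid())")
child_pid = int(result.stdout)
print(child_pid != os.getpid())
```

The "spawn" start method of multiprocessing takes the same approach for worker processes, and it has been the default on macOS since Python 3.8.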

gnossen · 2022-11-15T22:29:39Z

@gpshead gave a great explanation. I don't think there's much we can do here.

achimnol · 2022-11-20T06:45:52Z

@gpshead @gnossen I think the correct direction is that grpcio should NOT create any implicit background threads until the user actually calls any resource-initialization API after forking.

I agree with @gpshead's explanation of the general danger of forking threaded processes, but if the user carefully controls the order of forking and threading, we should be able to use grpcio without any problem.

How could we track such implicit threads in grpcio, in a holistic view?
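
On the application side, controlling that order could look like the sketch below: a resource is built lazily, once per process, so a forked child never reuses an object created in the parent. This is only an illustration under stated assumptions. The class name PerProcess is hypothetical, it is not part of grpcio, it is not thread-safe, and it does nothing about threads that already exist in the parent before the fork:

```python
import os


class PerProcess:
    """Lazily build a resource once per process, keyed by PID."""

    def __init__(self, factory):
        self._factory = factory
        self._pid = None
        self._value = None

    def get(self):
        pid = os.getpid()
        if self._pid != pid:
            # First use in this process (or first use after a fork):
            # build a fresh resource instead of reusing the parent's.
            self._value = self._factory()
            self._pid = pid
        return self._value
```

A hypothetical use would be PerProcess(lambda: grpc.insecure_channel(ENDPOINT)), with get() called only inside the worker after the fork.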

gpshead · 2022-11-20T08:03:48Z

Examine your process and see what started the threads. From python/cpython#77906, it sounds like Apple system APIs themselves are starting background threads. Was grpc even involved at all?

While it is polite for libraries not to start threads without an explicit "okay, go" API call, this applies to all transitive dependencies of your application. If you understand and manage every API call your entire process makes up until the fork, ensuring that nothing that spawns threads is called, you can probably do it safely. This is not easy to do, and it becomes harder over time as things change out from underneath you.
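
A partial check along these lines can be sketched with the threading module. The helper names extra_python_threads and assert_safe_to_fork are hypothetical. Note the important limitation: threading.enumerate() only reports threads that Python knows about, so native threads started by C libraries or system frameworks remain invisible to it:

```python
import threading


def extra_python_threads():
    """Return live threads known to `threading`, excluding the main one."""
    main = threading.main_thread()
    return [t for t in threading.enumerate() if t is not main]


def assert_safe_to_fork():
    """Raise if Python-level threads besides the main thread are alive."""
    extra = extra_python_threads()
    if extra:
        names = ", ".join(t.name for t in extra)
        raise RuntimeError(f"refusing to fork; live threads: {names}")
```

Calling assert_safe_to_fork() right before os.fork() catches only the Python-visible part of the problem; a clean result is not proof that the process is single-threaded.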

stanhu · 2023-06-10T06:34:27Z

#33400 might help here.

romanek-adam-b2c2 added kind/bug lang/Python priority/P2 labels Oct 5, 2022

romanek-adam-b2c2 assigned gnossen Oct 5, 2022

gnossen closed this as completed Nov 15, 2022

belm0 mentioned this issue Feb 18, 2023

CI: macos target hangs during pytest python-trio/purerpc#39

Open

stanhu mentioned this issue May 31, 2023

ruby: grpc v1.48.0+ crashes on macOS in pre-forking app servers #33281

Open
