gRPC crash in a forked process (python) #31240

Closed
romanek-adam-b2c2 opened this issue Oct 5, 2022 · 5 comments

romanek-adam-b2c2 commented Oct 5, 2022

What version of gRPC and what language are you using?

gRPC v1.48.0
Python

What operating system (Linux, Windows,...) and version?

macOS 12.6

What runtime / compiler are you using (e.g. python version or version of gcc)

Python 3.10.5

What did you do?

Use a gRPC client in a forked process, as happens in a Celery worker when Celery runs in "prefork" pool mode. The issue only occurs when a gRPC call fails.

Here's the minimal reproducible example:

#!/usr/bin/env python

import os
import sys
import threading
import time

import grpc
from grpc_health.v1.health_pb2 import HealthCheckRequest
from grpc_health.v1.health_pb2_grpc import HealthStub

# we intentionally choose a bad endpoint, we want the gRPC call to fail
ENDPOINT = "127.0.0.1:1"

# we need at least one non-main thread running, otherwise macOS SDK doesn't consider fork() to be unsafe
thread = threading.Thread(target=lambda: time.sleep(9999), daemon=True)
thread.start()

pid = os.fork()
if pid != 0:
    os.waitpid(pid, 0)
    sys.exit(0)
else:
    # running in a child...
    print("Starting gRPC client...")

    channel = grpc.insecure_channel(ENDPOINT)
    stub = HealthStub(channel)

    # boom!
    stub.Check(HealthCheckRequest())

What did you expect to see?

The child process should exit cleanly. The gRPC call itself is expected to fail, but that should not matter.

What did you see instead?

The child process is terminated by SIGABRT:

Starting gRPC client...
objc[29041]: +[__NSTimeZone initialize] may have been in progress in another thread when fork() was called.
objc[29041]: +[__NSTimeZone initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.

Process finished with exit code 6

Here's a full backtrace from the breakpoint mentioned in the above message:

  * frame #0: 0x000000019b9fc478 libobjc.A.dylib`objc_initializeAfterForkError
    frame #1: 0x000000019b9fc5fc libobjc.A.dylib`performForkChildInitialize(objc_class*, objc_class*) + 376
    frame #2: 0x000000019b9e37c0 libobjc.A.dylib`initializeNonMetaClass + 496
    frame #3: 0x000000019b9e3250 libobjc.A.dylib`initializeAndMaybeRelock(objc_class*, objc_object*, mutex_tt<false>&, bool) + 184
    frame #4: 0x000000019b9e2fe8 libobjc.A.dylib`lookUpImpOrForward + 1052
    frame #5: 0x000000019b9e28e4 libobjc.A.dylib`_objc_msgSend_uncached + 68
    frame #6: 0x000000019bbf644c CoreFoundation`__NSTimeZone_newWithCache + 108
    frame #7: 0x000000019bbf604c CoreFoundation`-[__NSPlaceholderTimeZone __initWithName:cache:] + 116
    frame #8: 0x000000019bbf5f2c CoreFoundation`+[NSTimeZone timeZoneWithName:] + 40
    frame #9: 0x000000019bbf5e58 CoreFoundation`+[NSTimeZone systemTimeZone] + 576
    frame #10: 0x000000019bbf5bbc CoreFoundation`+[NSTimeZone defaultTimeZone] + 80
    frame #11: 0x000000019bbf5b40 CoreFoundation`CFTimeZoneCopyDefault + 44
    frame #12: 0x000000016a8904f4 cygrpc.cpython-310-darwin.so`absl::lts_20220623::time_internal::cctz::local_time_zone() + 28
    frame #13: 0x000000016a88258c cygrpc.cpython-310-darwin.so`absl::lts_20220623::FormatTime(absl::lts_20220623::Time) + 32
    frame #14: 0x000000016a6b2e30 cygrpc.cpython-310-darwin.so`void absl::lts_20220623::functional_internal::InvokeObject<grpc_core::StatusToString(absl::lts_20220623::Status const&)::$_0, void, absl::lts_20220623::string_view, absl::lts_20220623::Cord const&>(absl::lts_20220623::functional_internal::VoidPtr, absl::lts_20220623::functional_internal::ForwardT<absl::lts_20220623::string_view>::type, absl::lts_20220623::functional_internal::ForwardT<absl::lts_20220623::Cord const&>::type) + 1432
    frame #15: 0x000000016a847e5c cygrpc.cpython-310-darwin.so`absl::lts_20220623::Status::ForEachPayload(absl::lts_20220623::FunctionRef<void (absl::lts_20220623::string_view, absl::lts_20220623::Cord const&)>) const + 296
    frame #16: 0x000000016a6b20b0 cygrpc.cpython-310-darwin.so`grpc_core::StatusToString(absl::lts_20220623::Status const&) + 364
    frame #17: 0x000000016a5ee358 cygrpc.cpython-310-darwin.so`grpc_core::Subchannel::OnConnectingFinishedLocked(absl::lts_20220623::Status) + 240
    frame #18: 0x000000016a5ed070 cygrpc.cpython-310-darwin.so`grpc_core::Subchannel::OnConnectingFinished(void*, absl::lts_20220623::Status) + 96
    frame #19: 0x000000016a6bfd88 cygrpc.cpython-310-darwin.so`grpc_core::ExecCtx::Flush() + 124
    frame #20: 0x000000016a6c04f4 cygrpc.cpython-310-darwin.so`grpc_core::Executor::RunClosures(char const*, grpc_closure_list) + 216
    frame #21: 0x000000016a6c06e4 cygrpc.cpython-310-darwin.so`grpc_core::Executor::ThreadMain(void*) + 296
    frame #22: 0x000000016a6b3658 cygrpc.cpython-310-darwin.so`grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::'lambda'(void*)::__invoke(void*) + 140
    frame #23: 0x000000019bb5c26c libsystem_pthread.dylib`_pthread_start + 148

Additional context

On macOS, certain types need to be "initialized" before forking (for references see links below) to avoid crashes in forked-off processes. Otherwise, there is a good chance of seeing messages like:

objc[82289]: +[__NSTimeZone initialize] may have been in progress in another thread when fork() was called.
objc[82289]: +[__NSTimeZone initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.

followed by SIGABRT which terminates the process.

This is a known but poorly handled issue on macOS 10.13+. It often affects Python, Ruby, and some other scripting languages, typically in combination with 3rd party libraries which rely on threads, when the call to fork() is not followed by exec*(). According to various sources, forking without exec is not uncommon in scripting languages; examples in Python are the "multiprocessing" package and Celery with its "prefork" worker pool.

Some sources recommend exporting OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES, but this only hides the problem, which still exists and can potentially result in hard-to-diagnose deadlocks.

Related issues:

References:

Possible workarounds/solutions

One possible workaround/solution is to initialise certain types before forking. Below is a snippet of code which we use in our codebase:

import platform
from ctypes import c_void_p, cdll
from ctypes.util import find_library


# based on: https://github.com/mountainstorm/MobileDevice/blob/master/CoreFoundation.py
class CoreFoundationLib:
    def __init__(self):
        CoreFoundation = cdll.LoadLibrary(find_library("CoreFoundation"))

        CFTypeRef = c_void_p

        CFRelease = CoreFoundation.CFRelease
        CFRelease.restype = None
        CFRelease.argtypes = [CFTypeRef]
        self.CFRelease = CFRelease

        CFTimeZoneCopyDefault = CoreFoundation.CFTimeZoneCopyDefault
        CFTimeZoneCopyDefault.restype = CFTypeRef
        CFTimeZoneCopyDefault.argtypes = []
        self.CFTimeZoneCopyDefault = CFTimeZoneCopyDefault


def fix_grpc_client():
    if platform.system() != "Darwin":
        return

    cf_lib = CoreFoundationLib()

    cf_time_zone = cf_lib.CFTimeZoneCopyDefault()
    cf_lib.CFRelease(cf_time_zone)

If you add a call to fix_grpc_client() in the minimal reproducible example shown at the top of this issue, before calling fork(), the issue will be gone.

However, this works for only one type, NSTimeZone in this case, so it's not an ideal solution: it can't be turned into a generalised fix for all types which require "initialisation" prior to forking (or at least I'm not aware of any way of doing so).

Additionally, in our codebase we simply call fix_grpc_client() early during process init, whereas we could potentially use pthread_atfork() (as suggested in https://www.wefearchange.org/2018/11/forkmacos.rst.html, although the author claims it didn't work for him) to do this right before forking, and only if the process forks at all.
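
For forks made from Python code, os.register_at_fork() (Python 3.7+) plays a role similar to pthread_atfork(). The following is a minimal sketch, assuming a zero-argument warm-up callable such as the fix_grpc_client() function above. The helper name register_prefork_warmup is hypothetical, and whether this actually avoids the crash on macOS has not been verified here:

```python
import os


def register_prefork_warmup(warmup):
    """Run `warmup` in the parent immediately before every os.fork() call.

    `warmup` is assumed to be a zero-argument callable, for example the
    fix_grpc_client() function from this issue.
    """
    # Hooks passed as before= run in the parent process just before forking.
    # They only fire for forks where control returns to the Python
    # interpreter, so a typical subprocess launch does not trigger them.
    os.register_at_fork(before=warmup)


if __name__ == "__main__":
    register_prefork_warmup(lambda: print("warming up before fork"))
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately, skipping cleanup handlers
    os.waitpid(pid, 0)
```

Compared with calling the warm-up once at process start, the hook runs only when a fork actually happens, but it still covers only the types the warm-up function knows about.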

gpshead commented Nov 13, 2022

This is not a gRPC bug.

The only reliable solution is to not use os.fork() on macOS. Period.

If a process has any threads running at the time the fork system call happens, on any platform, all bets are off. The Python runtime is not async-signal-safe (and never will be - that is impossible), so you cannot execute any Python code after fork from a process that had threads. Difficult-to-debug random deadlocks and crashes are normal in that situation. When you see an application working fine despite that, it is running on borrowed time and has gotten lucky. Its luck will run out at some unplanned point in the future.

Your fix_grpc_client() code might work for you today, but realize that is merely a tiny band-aid that happens to paper over an unsolvable problem. The reliable advice is to move away from fork.
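
One way to follow this advice is to run the child work in a freshly started interpreter instead of a forked copy of the parent. Below is a minimal sketch using only the standard library; the helper name run_in_fresh_interpreter and the child snippet are illustrative and not part of gRPC or Celery:

```python
import subprocess
import sys


def run_in_fresh_interpreter(code):
    """Run `code` in a new Python process and return its stdout.

    The child is a newly exec'd interpreter, so it carries over none of
    the parent's threads or partially initialised runtime state.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child exits non-zero
    )
    return result.stdout


if __name__ == "__main__":
    print(run_in_fresh_interpreter("print('hello from a fresh interpreter')"))
```

For worker pools, multiprocessing.get_context("spawn") has the same property of starting each worker from a clean interpreter; "spawn" has been the default start method on macOS since Python 3.8.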

Contributor

gnossen commented Nov 15, 2022

@gpshead gave a great explanation. I don't think there's much we can do here.

gnossen closed this as completed Nov 15, 2022

achimnol commented Nov 20, 2022

@gpshead @gnossen I think the correct direction is that grpcio should NOT create any implicit background threads until the user actually calls any resource-initialization API after forking.

I agree with @gpshead's explanation of the general danger of forking threaded processes, but if the user carefully controls the order of forking and threading, we should be able to use grpcio without any problem.

How could we track such implicit threads in grpcio, in a holistic view?

gpshead commented Nov 20, 2022

Examine your process and see what started the threads. From python/cpython#77906, it sounds like Apple system APIs themselves are starting background threads. Was grpc even involved at all?

While it is polite for libraries to not start threads without an explicit "okay, go" API call, this applies to all transitive dependencies of your application. If you understand and manage all of the API calls made by your entire process up until you fork, to make sure that nothing that spawns threads is called, you can probably do it safely. This is not easy to do, and it becomes less easy as time goes on and things change out from underneath you.
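
A partial check along these lines can be sketched in Python. The helper names extra_python_threads and fork_if_single_threaded are hypothetical, and the check only sees threads known to the threading module; native threads started by C extensions or system libraries are invisible to it, so passing the check does not prove the fork is safe:

```python
import os
import threading


def extra_python_threads():
    """Return the Python-visible threads other than the calling thread."""
    current = threading.current_thread()
    return [t for t in threading.enumerate() if t is not current]


def fork_if_single_threaded():
    """Call os.fork() only when no other Python-visible thread is alive.

    Caveat: threads created natively (by C extensions or system APIs) do
    not appear in threading.enumerate(), so this is a necessary check,
    not a sufficient one.
    """
    extra = extra_python_threads()
    if extra:
        names = ", ".join(t.name for t in extra)
        raise RuntimeError(f"refusing to fork with live threads: {names}")
    return os.fork()
```

Raising instead of forking makes an unexpected thread visible at the fork site, rather than as a crash or deadlock later in the child.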

Contributor

stanhu commented Jun 10, 2023

#33400 might help here.
