Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ruby: grpc v1.48.0+ crashes on macOS in pre-forking app servers #33281

Open
stanhu opened this issue May 31, 2023 · 5 comments
Open

ruby: grpc v1.48.0+ crashes on macOS in pre-forking app servers #33281

stanhu opened this issue May 31, 2023 · 5 comments

Comments

@stanhu
Copy link
Contributor

stanhu commented May 31, 2023

What version of gRPC and what language are you using?

Tested in v1.48.0 - v1.55.0 with Ruby 3.1.4

What operating system (Linux, Windows,...) and version?

macOS Ventura 13.4

What runtime / compiler are you using (e.g. python version or version of gcc)

Ruby v3.1.4

What did you do?

After upgrading to grpc v1.48.0, the Puma app server crashes after issuing a call:

objc[60607]: +[__NSTimeZone initialize] may have been in progress in another thread when fork() was called.
objc[60607]: +[__NSTimeZone initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.

This is related to #31240 for Python. The statement to avoid os.fork() in macOS is not really practical. We have been running preforking app servers like Puma and Unicorn for years in production on Linux servers, and making our development environment completely different from production does not make sense.

Other related links:

  1. https://blog.phusion.nl/2017/10/13/why-ruby-app-servers-break-on-macos-high-sierra-and-what-can-be-done-about-it/
  2. Interoperability issues with dependencies using Objective-C in Cluster Mode on macOS 10.13 High Sierra puma/puma#1421
  3. https://bugs.ruby-lang.org/issues/14009
  4. grpc crash in a forked process (ruby) #26257

What did you expect to see?

No crash.

What did you see instead?

If I compile the v1.48.0 gem withGRPC_CONFIG=dbg and attach lldb to the worker process, I see this backtrace:

(lldb) bt
* thread #33, name = 'puma srv tp 003', stop reason = signal SIGABRT
  * frame #0: 0x0000000187124d4c libsystem_kernel.dylib`__abort_with_payload + 8
    frame #1: 0x0000000187149e50 libsystem_kernel.dylib`abort_with_payload_wrapper_internal + 104
    frame #2: 0x0000000187149de8 libsystem_kernel.dylib`abort_with_reason + 32
    frame #3: 0x0000000186de9504 libobjc.A.dylib`_objc_fatalv(unsigned long long, unsigned long long, char const*, char*) + 128
    frame #4: 0x0000000186de9484 libobjc.A.dylib`_objc_fatal(char const*, ...) + 44
    frame #5: 0x0000000186dbed54 libobjc.A.dylib`initializeNonMetaClass + 1088
    frame #6: 0x0000000186dd90e8 libobjc.A.dylib`initializeAndMaybeRelock(objc_class*, objc_object*, locker_mixin >&, bool) + 156
    frame #7: 0x0000000186dbe5c4 libobjc.A.dylib`lookUpImpOrForward + 884
    frame #8: 0x0000000186dbdf64 libobjc.A.dylib`_objc_msgSend_uncached + 68
    frame #9: 0x0000000187206768 CoreFoundation`__NSTimeZone_newWithCache + 100
    frame #10: 0x0000000187206384 CoreFoundation`-[__NSPlaceholderTimeZone __initWithName:cache:] + 116
    frame #11: 0x000000018720626c CoreFoundation`+[NSTimeZone timeZoneWithName:] + 32
    frame #12: 0x00000001872061a8 CoreFoundation`+[NSTimeZone systemTimeZone] + 528
    frame #13: 0x0000000187205f38 CoreFoundation`+[NSTimeZone defaultTimeZone] + 76
    frame #14: 0x0000000187205ec0 CoreFoundation`CFTimeZoneCopyDefault + 36
    frame #15: 0x000000015e668504 grpc_c.bundle`absl::lts_20220623::time_internal::cctz::local_time_zone() at time_zone_lookup.cc:139:30
    frame #16: 0x000000015e649884 grpc_c.bundle`absl::lts_20220623::LocalTimeZone() at time.h:1110:19
    frame #17: 0x000000015e649848 grpc_c.bundle`absl::lts_20220623::FormatTime(t=Time @ 0x00000001043ed868) at format.cc:88:44
    frame #18: 0x000000015e423380 grpc_c.bundle`grpc_core::StatusToString(this=0x00000001043ee5b8, type_url=(ptr_ = "created_time", length_ = 12), payload=0x00000002976eac08)::$_0::operator()(absl::lts_20220623::string_view, absl::lts_20220623::Cord const&) const at status_helper.cc:321:45
    frame #19: 0x000000015e422f2c grpc_c.bundle`decltype(f=0x00000001043ee5b8, args=0x00000001043ee2e8, args=0x00000002976eac08)::$_0 const&>()(std::declval(), std::declval())) absl::lts_20220623::base_internal::Callable::Invoke(grpc_core::StatusToString(absl::lts_20220623::Status const&)::$_0 const&, absl::lts_20220623::string_view&&, absl::lts_20220623::Cord const&) at invoke.h:185:12
    frame #20: 0x000000015e422ee8 grpc_c.bundle`decltype(f=0x00000001043ee5b8, args=0x00000001043ee2e8, args=0x00000002976eac08)::$_0 const&, absl::lts_20220623::string_view, absl::lts_20220623::Cord const&>::type::Invoke(std::declval(), std::declval(), std::declval())) absl::lts_20220623::base_internal::invoke(grpc_core::StatusToString(absl::lts_20220623::Status const&)::$_0 const&, absl::lts_20220623::string_view&&, absl::lts_20220623::Cord const&) at invoke.h:212:10
    frame #21: 0x000000015e422ea4 grpc_c.bundle`void absl::lts_20220623::functional_internal::InvokeObject(ptr=VoidPtr @ 0x00000001043ee2f8, args=(ptr_ = "type.googleapis.com/grpc.status.time.created_time", length_ = 49), args=0x00000002976eac08) at function_ref.h:74:7
    frame #22: 0x000000015e5d56c8 grpc_c.bundle`absl::lts_20220623::FunctionRef::operator(this=0x00000001043ee3e0, args=(ptr_ = "type.googleapis.com/grpc.status.time.created_time", length_ = 49), args=0x00000002976eac08)(absl::lts_20220623::string_view, absl::lts_20220623::Cord const&) const at function_ref.h:132:12
    frame #23: 0x000000015e5d55fc grpc_c.bundle`absl::lts_20220623::Status::ForEachPayload(this=0x00000001043ee980, visitor=FunctionRef @ 0x00000001043ee3e0)>) const at status.cc:184:7
    frame #24: 0x000000015e421348 grpc_c.bundle`grpc_core::StatusToString(status=0x00000001043ee980) at status_helper.cc:293:10
    frame #25: 0x000000015e436880 grpc_c.bundle`grpc_error_std_string(error=) at error.cc:68:10
    frame #26: 0x000000015e5777a0 grpc_c.bundle`grpc_error_get_status(error=(rep_ = 6417858225), deadline=(millis_ = 53301), code=0x00000002982e0a24, message="", http_error=0x0000000000000000, error_string=0x00000002982e0a48) at error_utils.cc:123:32
    frame #27: 0x000000015e51fbc4 grpc_c.bundle`grpc_core::FilterStackCall::SetFinalStatus(this=0x000000015a5c6270, error=(rep_ = 6417858225)) at call.cc:804:5
    frame #28: 0x000000015e520c7c grpc_c.bundle`grpc_core::FilterStackCall::RecvTrailingFilter(this=0x000000015a5c6270, b=0x000000015a5c6a38, batch_error=(rep_ = 0)) at call.cc:978:7
    frame #29: 0x000000015e522524 grpc_c.bundle`grpc_core::FilterStackCall::BatchControl::ReceivingTrailingMetadataReady(this=0x000000015a5ca140, error=(rep_ = 0)) at call.cc:1285:10
    frame #30: 0x000000015e52c368 grpc_c.bundle`grpc_core::FilterStackCall::StartBatch(this=0x000000015a5ca140, bctl=0x000000015a5ca140, error=)::$_8::operator()(void*, absl::lts_20220623::Status) const at call.cc:1644:9
    frame #31: 0x000000015e52c320 grpc_c.bundle`grpc_core::FilterStackCall::StartBatch(bctl=0x000000015a5ca140, error=)::$_8::__invoke(void*, absl::lts_20220623::Status) at call.cc:1644:9
    frame #32: 0x000000015e5218cc grpc_c.bundle`grpc_core::Closure::Run(location=0x00000001043eef90, closure=0x000000015a5c7018, error=(rep_ = 0)) at closure.h:265:5
    frame #33: 0x000000015e2a6188 grpc_c.bundle`recv_trailing_metadata_ready(arg=0x000000015a5c7180, error=(rep_ = 0)) at deadline_filter.cc:145:3
    frame #34: 0x000000015e5218cc grpc_c.bundle`grpc_core::Closure::Run(location=0x00000001043ef0b8, closure=0x000000015a5c71a0, error=(rep_ = 0)) at closure.h:265:5
    frame #35: 0x000000015e0bd9f4 grpc_c.bundle`grpc_core::ClientChannel::CallData::RecvTrailingMetadataReadyForConfigSelectorCommitCallback(arg=0x000000015a5c7160, error=(rep_ = 0)) at client_channel.cc:2312:3
    frame #36: 0x000000015e43d958 grpc_c.bundle`exec_ctx_run(closure=0x000000015a5c72b8) at exec_ctx.cc:44:3
    frame #37: 0x000000015e43d7c0 grpc_c.bundle`grpc_core::ExecCtx::Flush(this=0x00000001043ef280) at exec_ctx.cc:79:9
    frame #38: 0x000000015e4967bc grpc_c.bundle`grpc_core::ExecCtx::~ExecCtx(this=0x00000001043ef280) at exec_ctx.h:117:5
    frame #39: 0x000000015e495200 grpc_c.bundle`grpc_core::ExecCtx::~ExecCtx(this=0x00000001043ef280) at exec_ctx.h:115:22
    frame #40: 0x000000015e524170 grpc_c.bundle`::grpc_call_start_batch(call=0x000000015a5c6270, ops=0x00000002982e0730, nops=1, tag=0x00000001043ef3b8, reserved=0x0000000000000000) at call.cc:1844:3
    frame #41: 0x000000015e097a44 grpc_c.bundle`grpc_rb_call_run_batch + 296
    frame #42: 0x0000000100f224e4 libruby.3.1.dylib`vm_call_cfunc_with_frame + 232
    frame #43: 0x0000000100f24c08 libruby.3.1.dylib`vm_sendish + 1336
    frame #44: 0x0000000100f0753c libruby.3.1.dylib`vm_exec_core + 8128
    frame #45: 0x0000000100f19d2c libruby.3.1.dylib`rb_vm_exec + 2212
    frame #46: 0x0000000100f2811c libruby.3.1.dylib`rb_call0 + 772
    frame #47: 0x0000000100f14314 libruby.3.1.dylib`rb_iterate0 + 628
    frame #48: 0x0000000100f13f84 libruby.3.1.dylib`rb_block_call_kw + 116
    frame #49: 0x0000000100f11560 libruby.3.1.dylib`vm_call0_body + 988
    frame #50: 0x0000000100f2811c libruby.3.1.dylib`rb_call0 + 772
    frame #51: 0x0000000100f14314 libruby.3.1.dylib`rb_iterate0 + 628
    frame #52: 0x0000000100f13ee8 libruby.3.1.dylib`rb_block_call + 116
    frame #53: 0x0000000100d767e4 libruby.3.1.dylib`next_i + 112
    frame #54: 0x0000000100f23560 libruby.3.1.dylib`vm_yield_with_cfunc + 208
    frame #55: 0x0000000100d581b4 libruby.3.1.dylib`rb_fiber_start + 620
    frame #56: 0x0000000100d5a1f4 libruby.3.1.dylib`fiber_entry + 36

Workaround

The main problem appears to be that abseil-cpp lazily calls CFTimeZoneCopyDefault in https://github.com/abseil/abseil-cpp/blob/78be63686ba732b25052be15f8d6dee891c05749/absl/time/internal/cctz/src/time_zone_lookup.cc#L139. If CFTimeZoneCopyDefault is first called inside a forked child process, macOS kills the process.

I worked around this problem by calling CFTimeZoneCopyDefault in a pre-fork block:

# frozen_string_literal: true

if RUBY_PLATFORM.include?('darwin')
  require 'fiddle'
  require 'ffi'

  module CFTimeZone
    extend FFI::Library

    ffi_lib '/System/Library/Frameworks/CoreFoundation.framework/CoreFoundation'
    attach_function :CFTimeZoneCopyDefault, [], :pointer
    attach_function :CFRelease, [:pointer], :void
  end

  # Dynamically load Foundation.framework, ~implicitly~ initializing
  # the Objective-C runtime before any forking happens in webserver
  #
  # From https://bugs.ruby-lang.org/issues/14009
  Fiddle.dlopen '/System/Library/Frameworks/Foundation.framework/Foundation'

  # grpc uses abseil-cpp to retrieve the local time zone via macOS APIs:
  # https://github.com/abseil/abseil-cpp/blob/20230125.rc3/absl/time/internal/cctz/src/time_zone_lookup.cc#L139
  #
  # To ensure these APIs are not called in a forked process (https://github.com/grpc/grpc/issues/26257),
  # load the required framework, retrieve the default time zone, and then release the resource.
  default_time_zone = CFTimeZone.CFTimeZoneCopyDefault
  CFTimeZone.CFRelease(default_time_zone)
end

Note that the CoreFoundation library loading could likely be dropped now that https://bugs.ruby-lang.org/issues/14009 is closed.

Proposal

I suspect if absl::InitializeLog() were called in a pre-fork block, we'd be able to avoid this. I'd propose we add some initialize function for Ruby and Python that can be called to ensure any OS-dependent setup happens before gRPC operations begin.

stanhu added a commit to stanhu/grpc that referenced this issue Jun 10, 2023
This prevents a crash on macOS that occurs if Objective C calls are
made after a process is forked. This ensures that the Apple system
call `CFTimeZoneCopyDefault` is invoked first in the parent process,
which prevents macOS from killing the child process with the message:

```
+[__NSTimeZone initialize] may have been in progress in another
thread when fork() was called. We cannot safely call it or ignore it
in the fork() child process. Crashing instead. Set a breakpoint on
objc_initializeAfterForkError to debug.
```

Closes grpc#33281
stanhu added a commit to stanhu/grpc that referenced this issue Jun 10, 2023
This prevents a crash on macOS that occurs if Objective C calls are
made after a process is forked. This ensures that the Apple system
call `CFTimeZoneCopyDefault` is invoked first in the parent process,
which prevents macOS from killing the child process with the message:

```
+[__NSTimeZone initialize] may have been in progress in another
thread when fork() was called. We cannot safely call it or ignore it
in the fork() child process. Crashing instead. Set a breakpoint on
objc_initializeAfterForkError to debug.
```

Closes grpc#33281
stanhu added a commit to stanhu/grpc that referenced this issue Jun 10, 2023
This prevents a crash on macOS that occurs if Objective C calls are
made after a process is forked. This ensures that the Apple system
call `CFTimeZoneCopyDefault` is invoked first in the parent process,
which prevents macOS from killing the child process with the message:

```
+[__NSTimeZone initialize] may have been in progress in another
thread when fork() was called. We cannot safely call it or ignore it
in the fork() child process. Crashing instead. Set a breakpoint on
objc_initializeAfterForkError to debug.
```

Closes grpc#33281
stanhu added a commit to stanhu/grpc that referenced this issue Jun 10, 2023
This prevents a crash on macOS that occurs if Objective C calls are
made after a process is forked. This ensures that the Apple system
call `CFTimeZoneCopyDefault` is invoked first in the parent process,
which prevents macOS from killing the child process with the message:

```
+[__NSTimeZone initialize] may have been in progress in another
thread when fork() was called. We cannot safely call it or ignore it
in the fork() child process. Crashing instead. Set a breakpoint on
objc_initializeAfterForkError to debug.
```

Closes grpc#33281
stanhu added a commit to stanhu/grpc that referenced this issue Jun 12, 2023
This prevents a crash on macOS that occurs if Objective C calls are
made after a process is forked. This ensures that the Apple system
call `CFTimeZoneCopyDefault` is invoked first in the parent process,
which prevents macOS from killing the child process with the message:

```
+[__NSTimeZone initialize] may have been in progress in another
thread when fork() was called. We cannot safely call it or ignore it
in the fork() child process. Crashing instead. Set a breakpoint on
objc_initializeAfterForkError to debug.
```

Closes grpc#33281
stanhu added a commit to stanhu/grpc that referenced this issue Jun 12, 2023
This prevents a crash on macOS that occurs if Objective C calls are
made after a process is forked. This ensures that the Apple system
call `CFTimeZoneCopyDefault` is invoked first in the parent process,
which prevents macOS from killing the child process with the message:

```
+[__NSTimeZone initialize] may have been in progress in another
thread when fork() was called. We cannot safely call it or ignore it
in the fork() child process. Crashing instead. Set a breakpoint on
objc_initializeAfterForkError to debug.
```

Closes grpc#33281
stanhu added a commit to stanhu/grpc that referenced this issue Jun 12, 2023
This prevents a crash on macOS that occurs if Objective C calls are
made after a process is forked. This ensures that the Apple system
call `CFTimeZoneCopyDefault` is invoked first in the parent process,
which prevents macOS from killing the child process with the message:

```
+[__NSTimeZone initialize] may have been in progress in another
thread when fork() was called. We cannot safely call it or ignore it
in the fork() child process. Crashing instead. Set a breakpoint on
objc_initializeAfterForkError to debug.
```

Closes grpc#33281
stanhu added a commit to stanhu/grpc that referenced this issue Jun 12, 2023
This prevents a crash on macOS that occurs if Objective C calls are
made after a process is forked. This ensures that the Apple system
call `CFTimeZoneCopyDefault` is invoked first in the parent process,
which prevents macOS from killing the child process with the message:

```
+[__NSTimeZone initialize] may have been in progress in another
thread when fork() was called. We cannot safely call it or ignore it
in the fork() child process. Crashing instead. Set a breakpoint on
objc_initializeAfterForkError to debug.
```

Closes grpc#33281
stanhu added a commit to stanhu/grpc that referenced this issue Jun 12, 2023
This prevents a crash on macOS that occurs if Objective C calls are
made after a process is forked. This ensures that the Apple system
call `CFTimeZoneCopyDefault` is invoked first in the parent process,
which prevents macOS from killing the child process with the message:

```
+[__NSTimeZone initialize] may have been in progress in another
thread when fork() was called. We cannot safely call it or ignore it
in the fork() child process. Crashing instead. Set a breakpoint on
objc_initializeAfterForkError to debug.
```

Closes grpc#33281
@garybethea
Copy link

Hi, are there any plans to make a fix here? I can reproduce with the following versions:

ruby: 3.1.4
grpc: v1.48.0+ (tested up to current dev build)
OS: Ventura 13.5.2

This happens any time a grpc server returns an error response, which makes grpc completely unusable in development. The workaround shared above doesn't work for me, as running default_time_zone = CFTimeZone.CFTimeZoneCopyDefault actually crashes the process in the same way.

All I could do for now was pin to v1.47, which works but doesn't seem sustainable. Thanks!

@stanhu
Copy link
Contributor Author

stanhu commented Sep 26, 2023

The workaround shared above doesn't work for me, as running default_time_zone = CFTimeZone.CFTimeZoneCopyDefault actually crashes the process in the same way.

Are you sure you are calling this before the process forks?

I tried to fix this in #33400, but calling absl:InitializeLog() may be problematic. Calling absl::LocalTimeZone() might be an alternative solution. Generally I've had difficulty getting timely feedback on pull requests, so I've become less inclined to submit changes here.

@garybethea
Copy link

Are you sure you are calling this before the process forks?

Pretty sure! I'm actually opening a Rails console and running it immediately.

@stanhu
Copy link
Contributor Author

stanhu commented Sep 26, 2023

Can you show the exact error messages you're seeing? You might also want to attach the program to llvm to see where the seg fault is. It's possible there's a separate problem, or that you aren't calling this workaround soon enough.

@dentarg
Copy link

dentarg commented Jan 15, 2024

Did #33430 help this in any way?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants