gRPC poll segfault in subprocess calls #13998

Sakawa · 2018-01-12T01:51:05Z

Please answer these questions before submitting your issue.

Should this be an issue in the gRPC issue tracker?

Yes

What version of gRPC and what language are you using?

1.8.2, Python

What operating system (Linux, Windows, …) and version?

Linux

What runtime / compiler are you using (e.g. python version or version of gcc)

Python 2.7

What did you do?

We started seeing segfaults in our Dropbox code after upgrading from 1.6.0 to 1.8.2, when subprocess32 calls began intermittently segfaulting. A backtrace extracted from the core dump of a segfault is below:

#0 pollset_kick (exec_ctx=0x7f7c81ffae40, pollset=0x7f7d38001460, specific_worker=0x0) at src/core/lib/iomgr/ev_epoll1_linux.cc:1037
#1 0x00007f7d85c521f3 in cq_end_op_for_next (exec_ctx=0x7f7c81ffae40, cq=0x7f7d38001390, tag=<optimized out>, error=0x7f7d6c00d9d0,
done=0x7f7d85c95c10 <finished_completion(grpc_exec_ctx*, void*, grpc_cq_completion*)>, done_arg=<optimized out>, storage=0x7f7d3800f280)
at src/core/lib/surface/completion_queue.cc:692
#2 0x00007f7d85c95de1 in partly_done (error=0x0, due_to_completion=true, w=0x7f7d3800f1a0, exec_ctx=0x7f7c81ffae40)
at src/core/ext/filters/client_channel/channel_connectivity.cc:153
#3 watch_complete (exec_ctx=0x7f7c81ffae40, pw=0x7f7d3800f1a0, error=<optimized out>) at src/core/ext/filters/client_channel/channel_connectivity.cc:166
#4 0x00007f7d85c300d6 in exec_ctx_run (exec_ctx=<optimized out>, closure=<optimized out>, error=0x4) at src/core/lib/iomgr/exec_ctx.cc:70
#5 0x00007f7d85c1d011 in grpc_combiner_continue_exec_ctx (exec_ctx=0x7f7c81ffae40) at src/core/lib/iomgr/combiner.cc:260
#6 0x00007f7d85c302da in grpc_exec_ctx_flush (exec_ctx=0x7f7c81ffae40) at src/core/lib/iomgr/exec_ctx.cc:93
#7 0x00007f7d85c3d583 in run_some_timers (exec_ctx=0x7f7c81ffae40) at src/core/lib/iomgr/timer_manager.cc:129
#8 timer_main_loop (exec_ctx=0x7f7c81ffae40) at src/core/lib/iomgr/timer_manager.cc:231
#9 timer_thread (completed_thread_ptr=completed_thread_ptr@entry=0x7f7d600086d0) at src/core/lib/iomgr/timer_manager.cc:279
#10 0x00007f7d85c0f988 in thread_body (v=<optimized out>) at src/core/lib/support/thd_posix.cc:53
#11 0x00007f7d8fbb9e9a in start_thread () from .../x86_64-linux-gnu/libpthread.so.0
#12 0x00007f7d8f8e136d in clone () from .../x86_64-linux-gnu/libc.so.6
#13 0x0000000000000000 in ?? ()

We also occasionally see

E0105 19:43:12.164784957   50909 completion_queue.cc:697]    Kick failed: {"created":"@1515181392.164703224","description":"OS Error","errno":9,"file":"src/core/lib/iomgr/wakeup_fd_eventfd.cc","file_line":61,"os_error":"Bad file descriptor","syscall":"eventfd_write"}

in our logs, although this happens fairly rarely.

What did you expect to see?

No segfault -- expected subprocess command to run normally

What did you see instead?

Segfault

Make sure you include information that can help us debug (full error message, exception listing, stack trace, logs).

Anything else we should know about your project / environment?

The text was updated successfully, but these errors were encountered:

alexmv · 2018-01-16T20:44:53Z

This may be the same as #13327.

sreecha · 2018-01-16T22:53:11Z

@alexmv , #13327 has a different callstack and looks different from this issue.

@Sakawa,
Could you tell a bit more details on what your code is doing..

what subprocess32 call were you making
By any chance, is your code calling fork() at some point ? (we have seen very strange failures in the past with fork() since we do not support it)
are you sure the callstack you posted is accurate? ...the reason I ask is, as per the callstack, the failing line is: https://github.com/grpc/grpc/blob/v1.8.2/src/core/lib/iomgr/ev_epoll1_linux.cc#L1037

i.e

      grpc_pollset_worker* root_worker = pollset->root_worker;
      if (root_worker == nullptr) {
        ...
        goto done;
      }
      grpc_pollset_worker* next_worker = root_worker->next;
1037:      if (root_worker->state == KICKED) {  // <<<< THIS IS THE LINE FAILING AS PER CALL STACK
  ..

So it looks like root_worker is null at this point (thats the only way I can see that line segfaulting) but that is not possible because we handle that case just a few lines above. So I am not sure if this is the right callstack.

Sakawa · 2018-01-17T00:32:55Z

what subprocess32 call were you making?

The code calls subprocess32.Popen and Popen.communicate to run a git command. There's nothing terribly unique about this call -- it reproduces with the standard git binary.

By any chance, is your code calling fork() at some point?

Unfortunately not, we've seen this behavior on a single-process single-thread instance and our code doesn't fork anywhere explicitly.

are you sure the callstack you posted is accurate?

I just went and generated another core dump with grpcio 1.8.4 and got essentially the same stacktrace (line 1037 of ev_epoll1_linux.cc still the culprit). I don't know the code well enough to be able to explain the segfault. Here is a bt full from the core dump, which shows that root_worker isn't null:

#0  pollset_kick (exec_ctx=0x7f46ef7fde40, pollset=0x7f47a0001300, specific_worker=0x0) at src/core/lib/iomgr/ev_epoll1_linux.cc:1037
        root_worker = 0x7f47af7fd170
        next_worker = <optimized out>
        ret_err = 0x0
#1  0x00007f47fb0da1f3 in cq_end_op_for_next (exec_ctx=0x7f46ef7fde40, cq=0x7f47a0001230, tag=<optimized out>, error=0x7f47e8001c30,
    done=0x7f47fb11dc10 <finished_completion(grpc_exec_ctx*, void*, grpc_cq_completion*)>, done_arg=<optimized out>, storage=0x7f47a0008c80)
    at src/core/lib/surface/completion_queue.cc:692
        kick_error = <optimized out>
        will_definitely_shutdown = <optimized out>
        cqd = 0x7f47a0001280
        is_success = 0
#2  0x00007f47fb11dde1 in partly_done (error=0x0, due_to_completion=true, w=0x7f47a0008ba0, exec_ctx=0x7f46ef7fde40)
    at src/core/ext/filters/client_channel/channel_connectivity.cc:153
No locals.
#3  watch_complete (exec_ctx=0x7f46ef7fde40, pw=0x7f47a0008ba0, error=<optimized out>) at src/core/ext/filters/client_channel/channel_connectivity.cc:166
No locals.
#4  0x00007f47fb0b80d6 in exec_ctx_run (exec_ctx=<optimized out>, closure=<optimized out>, error=0x4) at src/core/lib/iomgr/exec_ctx.cc:70
No locals.
#5  0x00007f47fb0a5011 in grpc_combiner_continue_exec_ctx (exec_ctx=0x7f46ef7fde40) at src/core/lib/iomgr/combiner.cc:260
        n = 0x7f47a00014d0
        cl = 0x7f47a00014d0
        cl_err = 0x4
        contended = <optimized out>
        old_state = <optimized out>
#6  0x00007f47fb0b82da in grpc_exec_ctx_flush (exec_ctx=0x7f46ef7fde40) at src/core/lib/iomgr/exec_ctx.cc:93
        did_something = <optimized out>
#7  0x00007f47fb0c5583 in run_some_timers (exec_ctx=0x7f46ef7fde40) at src/core/lib/iomgr/timer_manager.cc:129
No locals.
#8  timer_main_loop (exec_ctx=0x7f46ef7fde40) at src/core/lib/iomgr/timer_manager.cc:231

sreecha · 2018-01-17T19:37:23Z

Thanks @Sakawa
Even though you aren't explicitly fork, I just realized that calling subprocess32.popen itself calls fork internally. So does the process that calls popen have grpc library already loaded? (I am assuming that is the case).

jboning · 2018-01-17T23:50:32Z

I spent some time digging into this, and here's what I believe is happening:

GRPC_ENABLE_FORK_SUPPORT is defined when compiling gRPC for Python on non-Windows systems: https://github.com/grpc/grpc/blob/v1.8.4/setup.py#L176
If GRPC_ENABLE_FORK_SUPPORT is defined at compile time, fork support is enabled by default: https://github.com/grpc/grpc/blob/v1.8.4/src/core/lib/support/fork.cc#L37
At initialization, gRPC uses pthread_atfork() to install some handlers: https://github.com/grpc/grpc/blob/v1.8.4/src/core/lib/iomgr/fork_posix.cc#L84
After forking, in the child process, grpc_postfork_child() calls grpc_timer_manager_set_threading(true), which calls start_threads(). This creates a new thread which is doomed to crash.

IMO an ideal outcome here would involve:

Some way to make it possible to use fork() followed immediately by exec(). (This is, in general, a thing which multithreaded programs can do safely; it would work fine here except for that atfork callback.) Some possible approaches:
- Don't set GRPC_ENABLE_FORK_SUPPORT (the compiler macro) when building the library for Python; leave it to user programs to opt in by setting GRPC_ENABLE_FORK_SUPPORT (the environment variable).
- Include grpc_enable_fork_support() in the Python bindings, so that programs can opt out of "fork support".
Documentation about gRPC's expectations for programs that use it. If programs using gRPC are not allowed to call fork(), I would expect this to be noted prominently somewhere. I couldn't find any mention of this condition in the general grpc.io docs or the docs for the Python bindings. If programs using gRPC are allowed to call fork() for some "very specific use cases", it would be great to document those cases.

sreecha · 2018-01-18T03:22:41Z

@jboning, Thanks for looking into this!

The only scenario where fork() is safe to use is if the parent process (that did the fork()) never did any actual grpc work like creating channels, calls or creating a server (or in other words, effectively nothing beyond simply initializing the grpc library).

I agree about documenting this clearly. I created an issue: #14055

Now coming back to this issue:

After forking, in the child process, grpc_postfork_child() calls grpc_timer_manager_set_threading(true), which calls start_threads(). This creates a new thread which is doomed to crash.

You are right about the code path. The new thread may behave unexpectedly if the parent process (that called the fork() already started doing some gRPC work like creating a call or starting a server etc). I suspect this is what is happening in your case.

Unfortunately at this point, I can only think of two ways to address this :

The obvious one: Build grpc from source by undefining GRPC_ENABLE_FORK_SUPPORT
We are going to create a patch that would allow setting GRPC_ENABLE_FORK_SUPPORT to false which would turn this off. I created the issue: Allow GRPC_ENABLE_FORK_SUPPORT environment variable to be set to 'false' #14056

jboning · 2018-01-18T18:02:28Z

Another possibility: Have channel/fork/server creation set some global grpc_started flag when it's no longer feasible to fully support fork(). Then, if that flag is set, the atfork callbacks (or at least the postfork child callback) could simply do nothing. This approach would maintain the ability to initialize gRPC during prefork setup, and also support fork-exec without requiring any special configuration by the program.

aamitdb · 2018-01-20T00:08:19Z

Just as a repro case of this:

from __future__ import print_function

import grpc
import subprocess32
import time

def call_subproc():
  o = subprocess32.check_output(["echo", "1"])

def make_channel_with_message():
  channel = grpc.insecure_channel('localhost:50051')

if __name__ == '__main__':
  subprocess32.check_output(["echo", "Hello World!"])
  for i in range(1000):
    call_subproc()
    make_channel_with_message()
    if i % 100 == 0:
      print(time.time())

Reliably segfaults for me when I run with GRPC_ENABLE_FORK_SUPPORT defined on linux, and does not segfault with GRPC_ENABLE_FORK_SUPPORT unset. It sounds like it's expected that it segfaults with GRPC_ENABLE_FORK_SUPPORT, but it's surprising to me that without GRPC_ENABLE_FORK_SUPPORT (aka the pre 1.7 behavior) it doesn't.

I'd like to go with one of the options mentioned like using grpc with GRPC_ENABLE_FORK_SUPPORT off or changing the behavior of the atfork callbacks, but I'd appreciate any advice about what issues you've seen using grpc with python processes that fork/exec a child (as subprocess calls do). The inability to use subprocess at all would be a large obstacle.

We haven't run other issues as severe as this in our code using 1.6, but since grpc python has never been thread safe, I'm not sure what issues we have just be avoiding by luck. The only issue I see mentioned is deadlocks related to multiprocessing usage.

srini100 · 2018-05-18T21:22:20Z

Users interested in fork support, please chime in on #15334

gnossen · 2018-11-15T18:13:36Z

@ericgribkoff Ping

ericgribkoff · 2018-11-19T16:32:28Z

Confirmed that this is now fixed with the latest grpcio release by running the repro snippet from #13998 (comment). See #15334 for the master issue tracking the new client-side fork support, although I believe this was largely fixed by an earlier change to core's fork handlers intended to detect this situation.

ericgribkoff · 2018-11-19T16:53:41Z

Please reply here if there remain any issues with the fork+exec cases discussed here and I can re-open this issue. This should always work, regardless of whether GRPC_FORK_SUPPORT_ENABLED is explicitly set, but without an explicit value core's own fork handlers will still run: however, we will now abort the fork handlers when any external threads (e.g., gRPC Python's internal threads) are calling into core, or when we are otherwise unable to pause all core threads.

mehrdada added kind/bug lang/Python area/core labels Jan 16, 2018

mehrdada assigned sreecha Jan 17, 2018

sreecha mentioned this issue Jan 18, 2018

Allow GRPC_ENABLE_FORK_SUPPORT environment variable to be set to 'false' #14056

Closed

jboning mentioned this issue Jan 18, 2018

gRPC Python 1.8.2 Library incompatibility with fork() #13873

Closed

srini100 mentioned this issue Mar 6, 2018

Logging: deadlock when using the subprocess module after cloud logging init googleapis/google-cloud-python#4992

Closed

nathanielmanistaatgoogle added area/concurrency and removed area/concurrency labels Mar 7, 2018

sreecha added the priority/P1 label May 15, 2018

sreecha assigned kpayson64 and unassigned sreecha May 15, 2018

srini100 assigned ericgribkoff and unassigned kpayson64 May 18, 2018

ericgribkoff closed this as completed Nov 19, 2018

lock bot locked as resolved and limited conversation to collaborators Feb 17, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gRPC poll segfault in subprocess calls #13998

gRPC poll segfault in subprocess calls #13998

Sakawa commented Jan 12, 2018

alexmv commented Jan 16, 2018

sreecha commented Jan 16, 2018

Sakawa commented Jan 17, 2018

sreecha commented Jan 17, 2018

jboning commented Jan 17, 2018 •

edited

sreecha commented Jan 18, 2018

jboning commented Jan 18, 2018

aamitdb commented Jan 20, 2018 •

edited

srini100 commented May 18, 2018

gnossen commented Nov 15, 2018

ericgribkoff commented Nov 19, 2018

ericgribkoff commented Nov 19, 2018

gRPC poll segfault in subprocess calls #13998

gRPC poll segfault in subprocess calls #13998

Comments

Sakawa commented Jan 12, 2018

Should this be an issue in the gRPC issue tracker?

What version of gRPC and what language are you using?

What operating system (Linux, Windows, …) and version?

What runtime / compiler are you using (e.g. python version or version of gcc)

What did you do?

What did you expect to see?

What did you see instead?

Anything else we should know about your project / environment?

alexmv commented Jan 16, 2018

sreecha commented Jan 16, 2018

Sakawa commented Jan 17, 2018

sreecha commented Jan 17, 2018

jboning commented Jan 17, 2018 • edited

sreecha commented Jan 18, 2018

jboning commented Jan 18, 2018

aamitdb commented Jan 20, 2018 • edited

srini100 commented May 18, 2018

gnossen commented Nov 15, 2018

ericgribkoff commented Nov 19, 2018

ericgribkoff commented Nov 19, 2018

jboning commented Jan 17, 2018 •

edited

aamitdb commented Jan 20, 2018 •

edited