
gRPC poll segfault in subprocess calls #13998

Closed
Sakawa opened this issue Jan 12, 2018 · 12 comments


Sakawa commented Jan 12, 2018


Should this be an issue in the gRPC issue tracker?

Yes

What version of gRPC and what language are you using?

1.8.2, Python

What operating system (Linux, Windows, …) and version?

Linux

What runtime / compiler are you using (e.g., Python version or version of gcc)?

Python 2.7

What did you do?

We started seeing segfaults in our Dropbox code after upgrading from 1.6.0 to 1.8.2: subprocess32 calls began intermittently segfaulting. A backtrace extracted from the core dump of one such segfault is below:

#0 pollset_kick (exec_ctx=0x7f7c81ffae40, pollset=0x7f7d38001460, specific_worker=0x0) at src/core/lib/iomgr/ev_epoll1_linux.cc:1037
#1 0x00007f7d85c521f3 in cq_end_op_for_next (exec_ctx=0x7f7c81ffae40, cq=0x7f7d38001390, tag=<optimized out>, error=0x7f7d6c00d9d0,
done=0x7f7d85c95c10 <finished_completion(grpc_exec_ctx*, void*, grpc_cq_completion*)>, done_arg=<optimized out>, storage=0x7f7d3800f280)
at src/core/lib/surface/completion_queue.cc:692
#2 0x00007f7d85c95de1 in partly_done (error=0x0, due_to_completion=true, w=0x7f7d3800f1a0, exec_ctx=0x7f7c81ffae40)
at src/core/ext/filters/client_channel/channel_connectivity.cc:153
#3 watch_complete (exec_ctx=0x7f7c81ffae40, pw=0x7f7d3800f1a0, error=<optimized out>) at src/core/ext/filters/client_channel/channel_connectivity.cc:166
#4 0x00007f7d85c300d6 in exec_ctx_run (exec_ctx=<optimized out>, closure=<optimized out>, error=0x4) at src/core/lib/iomgr/exec_ctx.cc:70
#5 0x00007f7d85c1d011 in grpc_combiner_continue_exec_ctx (exec_ctx=0x7f7c81ffae40) at src/core/lib/iomgr/combiner.cc:260
#6 0x00007f7d85c302da in grpc_exec_ctx_flush (exec_ctx=0x7f7c81ffae40) at src/core/lib/iomgr/exec_ctx.cc:93
#7 0x00007f7d85c3d583 in run_some_timers (exec_ctx=0x7f7c81ffae40) at src/core/lib/iomgr/timer_manager.cc:129
#8 timer_main_loop (exec_ctx=0x7f7c81ffae40) at src/core/lib/iomgr/timer_manager.cc:231
#9 timer_thread (completed_thread_ptr=completed_thread_ptr@entry=0x7f7d600086d0) at src/core/lib/iomgr/timer_manager.cc:279
#10 0x00007f7d85c0f988 in thread_body (v=<optimized out>) at src/core/lib/support/thd_posix.cc:53
#11 0x00007f7d8fbb9e9a in start_thread () from .../x86_64-linux-gnu/libpthread.so.0
#12 0x00007f7d8f8e136d in clone () from .../x86_64-linux-gnu/libc.so.6
#13 0x0000000000000000 in ?? ()

We also occasionally see

E0105 19:43:12.164784957   50909 completion_queue.cc:697]    Kick failed: {"created":"@1515181392.164703224","description":"OS Error","errno":9,"file":"src/core/lib/iomgr/wakeup_fd_eventfd.cc","file_line":61,"os_error":"Bad file descriptor","syscall":"eventfd_write"}

in our logs, although this happens fairly rarely.

What did you expect to see?

No segfault -- expected subprocess command to run normally

What did you see instead?

Segfault



alexmv commented Jan 16, 2018

This may be the same as #13327.


sreecha commented Jan 16, 2018

@alexmv , #13327 has a different callstack and looks different from this issue.

@Sakawa, could you give a bit more detail about what your code is doing? For reference, here is the code around line 1037 in pollset_kick:

      grpc_pollset_worker* root_worker = pollset->root_worker;
      if (root_worker == nullptr) {
        ...
        goto done;
      }
      grpc_pollset_worker* next_worker = root_worker->next;
1037:      if (root_worker->state == KICKED) {  // <<<< THIS IS THE LINE FAILING AS PER CALL STACK
  ..

So it looks like root_worker is null at this point (that's the only way I can see that line segfaulting), but that should not be possible because we handle the null case just a few lines above. So I am not sure this is the right callstack.


Sakawa commented Jan 17, 2018

> what subprocess32 call were you making?

The code calls subprocess32.Popen and Popen.communicate to run a git command. There's nothing terribly unique about this call -- it reproduces with the standard git binary.

> By any chance, is your code calling fork() at some point?

Unfortunately not -- we've seen this behavior on a single-process, single-thread instance, and our code doesn't fork anywhere explicitly.

> are you sure the callstack you posted is accurate?

I just went and generated another core dump with grpcio 1.8.4 and got essentially the same stacktrace (line 1037 of ev_epoll1_linux.cc is still the culprit). I don't know the code well enough to be able to explain the segfault. Here is a bt full from the core dump, which shows that root_worker isn't null:

#0  pollset_kick (exec_ctx=0x7f46ef7fde40, pollset=0x7f47a0001300, specific_worker=0x0) at src/core/lib/iomgr/ev_epoll1_linux.cc:1037
        root_worker = 0x7f47af7fd170
        next_worker = <optimized out>
        ret_err = 0x0
#1  0x00007f47fb0da1f3 in cq_end_op_for_next (exec_ctx=0x7f46ef7fde40, cq=0x7f47a0001230, tag=<optimized out>, error=0x7f47e8001c30,
    done=0x7f47fb11dc10 <finished_completion(grpc_exec_ctx*, void*, grpc_cq_completion*)>, done_arg=<optimized out>, storage=0x7f47a0008c80)
    at src/core/lib/surface/completion_queue.cc:692
        kick_error = <optimized out>
        will_definitely_shutdown = <optimized out>
        cqd = 0x7f47a0001280
        is_success = 0
#2  0x00007f47fb11dde1 in partly_done (error=0x0, due_to_completion=true, w=0x7f47a0008ba0, exec_ctx=0x7f46ef7fde40)
    at src/core/ext/filters/client_channel/channel_connectivity.cc:153
No locals.
#3  watch_complete (exec_ctx=0x7f46ef7fde40, pw=0x7f47a0008ba0, error=<optimized out>) at src/core/ext/filters/client_channel/channel_connectivity.cc:166
No locals.
#4  0x00007f47fb0b80d6 in exec_ctx_run (exec_ctx=<optimized out>, closure=<optimized out>, error=0x4) at src/core/lib/iomgr/exec_ctx.cc:70
No locals.
#5  0x00007f47fb0a5011 in grpc_combiner_continue_exec_ctx (exec_ctx=0x7f46ef7fde40) at src/core/lib/iomgr/combiner.cc:260
        n = 0x7f47a00014d0
        cl = 0x7f47a00014d0
        cl_err = 0x4
        contended = <optimized out>
        old_state = <optimized out>
#6  0x00007f47fb0b82da in grpc_exec_ctx_flush (exec_ctx=0x7f46ef7fde40) at src/core/lib/iomgr/exec_ctx.cc:93
        did_something = <optimized out>
#7  0x00007f47fb0c5583 in run_some_timers (exec_ctx=0x7f46ef7fde40) at src/core/lib/iomgr/timer_manager.cc:129
No locals.
#8  timer_main_loop (exec_ctx=0x7f46ef7fde40) at src/core/lib/iomgr/timer_manager.cc:231


sreecha commented Jan 17, 2018

Thanks @Sakawa.
Even though you aren't explicitly calling fork(), I just realized that subprocess32.Popen itself calls fork() internally. So does the process that calls Popen already have the grpc library loaded? (I am assuming that is the case.)


jboning commented Jan 17, 2018

I spent some time digging into this, and here's what I believe is happening:

  1. GRPC_ENABLE_FORK_SUPPORT is defined when compiling gRPC for Python on non-Windows systems: https://github.com/grpc/grpc/blob/v1.8.4/setup.py#L176
  2. If GRPC_ENABLE_FORK_SUPPORT is defined at compile time, fork support is enabled by default: https://github.com/grpc/grpc/blob/v1.8.4/src/core/lib/support/fork.cc#L37
  3. At initialization, gRPC uses pthread_atfork() to install some handlers: https://github.com/grpc/grpc/blob/v1.8.4/src/core/lib/iomgr/fork_posix.cc#L84
  4. After forking, in the child process, grpc_postfork_child() calls grpc_timer_manager_set_threading(true), which calls start_threads(). This creates a new thread which is doomed to crash.
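
To make steps 3 and 4 concrete, here is a minimal, runnable sketch of the mechanism (my own illustration, not gRPC code; it assumes Python 3.7+, where os.register_at_fork is the Python-level analogue of the pthread_atfork() registration above):

import os

def in_child():
  # gRPC's real post-fork child handler restarts the timer thread here
  # (grpc_timer_manager_set_threading(true)); this sketch just logs.
  # os.write is used because only async-signal-safe work is reliable
  # between fork() and exec() in a multithreaded process.
  os.write(2, b"post-fork child handler ran\n")

os.register_at_fork(after_in_child=in_child)

pid = os.fork()  # the handler fires in the child right here...
if pid == 0:
  os.execv("/bin/echo", ["echo", "child ran"])  # ...before this exec()
else:
  os.waitpid(pid, 0)

Because subprocess32 performs its fork() in C, core's pthread_atfork() handlers fire there as well, even though no Python-level fork ever happens -- which is why a plain subprocess call is enough to reach step 4.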

IMO an ideal outcome here would involve:

  • Some way to make it possible to use fork() followed immediately by exec(). (This is, in general, a thing which multithreaded programs can do safely; it would work fine here except for that atfork callback.) Some possible approaches:
    • Don't set GRPC_ENABLE_FORK_SUPPORT (the compiler macro) when building the library for Python; leave it to user programs to opt in by setting GRPC_ENABLE_FORK_SUPPORT (the environment variable).
    • Include grpc_enable_fork_support() in the Python bindings, so that programs can opt out of "fork support".
  • Documentation about gRPC's expectations for programs that use it. If programs using gRPC are not allowed to call fork(), I would expect this to be noted prominently somewhere. I couldn't find any mention of this condition in the general grpc.io docs or the docs for the Python bindings. If programs using gRPC are allowed to call fork() for some "very specific use cases", it would be great to document those cases.


sreecha commented Jan 18, 2018

@jboning, Thanks for looking into this!

The only scenario where fork() is safe is if the parent process (the one that called fork()) never did any actual gRPC work such as creating channels, calls, or a server -- in other words, effectively nothing beyond simply initializing the grpc library.
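
As a concrete illustration of that constraint, here is a hedged sketch (my own, not an official pattern; the address is a placeholder):

import os

import grpc

# Safe ordering: fork before any channel/call/server exists; the child
# then builds its own gRPC state from scratch.
pid = os.fork()
if pid == 0:
  channel = grpc.insecure_channel('localhost:50051')
  # ... the child's actual gRPC work would go here ...
  os._exit(0)
os.waitpid(pid, 0)

# Unsafe ordering (per this thread): once a channel exists, core's
# worker/timer threads are running, and any later fork() -- including
# the one inside a subprocess call -- leaves the child in a bad state.
channel = grpc.insecure_channel('localhost:50051')
# fork()ing at this point risks exactly the crash reported above.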

I agree about documenting this clearly. I created an issue: #14055

Now coming back to this issue:

> After forking, in the child process, grpc_postfork_child() calls grpc_timer_manager_set_threading(true), which calls start_threads(). This creates a new thread which is doomed to crash.

You are right about the code path. The new thread may behave unexpectedly if the parent process (the one that called fork()) had already started doing some gRPC work, like creating a call or starting a server. I suspect this is what is happening in your case.

Unfortunately, at this point I can only think of two ways to address this:


jboning commented Jan 18, 2018

Another possibility: have channel/call/server creation set some global grpc_started flag once it's no longer feasible to fully support fork(). Then, if that flag is set, the atfork callbacks (or at least the postfork child callback) could simply do nothing. This approach would maintain the ability to initialize gRPC during prefork setup, and would also support fork-exec without requiring any special configuration by the program.
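
A rough, runnable transliteration of that idea into Python (purely illustrative -- core's handlers are C, and grpc_started, note_grpc_work_started, and restart_timer_threads are names I made up):

_grpc_started = False

def note_grpc_work_started():
  # Would be called from channel/call/server creation inside core.
  global _grpc_started
  _grpc_started = True

def restart_timer_threads():
  print('restarting timer threads (safe: no gRPC work before fork)')

def postfork_child():
  if _grpc_started:
    # The fork was presumably fork+exec; spawning threads here would
    # reintroduce the crash, so do nothing.
    return
  restart_timer_threads()

note_grpc_work_started()
postfork_child()  # prints nothing: the handler correctly no-ops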


aamitdb commented Jan 20, 2018

Just as a repro case of this:

from __future__ import print_function

import grpc
import subprocess32
import time

def call_subproc():
  # Any subprocess call will do: subprocess32 forks (and then execs) internally.
  subprocess32.check_output(["echo", "1"])

def make_channel_with_message():
  # Creating a channel is enough "real" gRPC work to arm the crash.
  channel = grpc.insecure_channel('localhost:50051')

if __name__ == '__main__':
  subprocess32.check_output(["echo", "Hello World!"])
  for i in range(1000):
    call_subproc()
    make_channel_with_message()
    if i % 100 == 0:
      print(time.time())

This reliably segfaults for me when I run with GRPC_ENABLE_FORK_SUPPORT defined on Linux, and does not segfault with GRPC_ENABLE_FORK_SUPPORT unset. It sounds like it's expected to segfault with GRPC_ENABLE_FORK_SUPPORT, but it's surprising to me that without GRPC_ENABLE_FORK_SUPPORT (i.e., the pre-1.7 behavior) it doesn't.

I'd like to go with one of the options mentioned -- using grpc with GRPC_ENABLE_FORK_SUPPORT off, or changing the behavior of the atfork callbacks -- but I'd appreciate any advice about issues you've seen using grpc in Python processes that fork/exec a child (as subprocess calls do). The inability to use subprocess at all would be a large obstacle.
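
For what it's worth, here is a hedged sketch of the first option (this assumes the GRPC_ENABLE_FORK_SUPPORT environment variable is honored by your build and is read when the library initializes, so it has to be set before grpc is imported):

import os
os.environ['GRPC_ENABLE_FORK_SUPPORT'] = '0'  # must precede the grpc import

import grpc
import subprocess32

channel = grpc.insecure_channel('localhost:50051')
# With fork support off, the atfork handlers should not restart core
# threads in the child, so fork+exec via subprocess should be safe.
subprocess32.check_output(['echo', 'ok'])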

We haven't run into other issues as severe as this in our code using 1.6, but since grpc python has never been fork-safe, I'm not sure what issues we have just been avoiding by luck. The only issue I see mentioned is deadlocks related to multiprocessing usage.

@srini100

Users interested in fork support, please chime in on #15334


gnossen commented Nov 15, 2018

@ericgribkoff Ping

@ericgribkoff

Confirmed that this is now fixed with the latest grpcio release by running the repro snippet from #13998 (comment). See #15334 for the master issue tracking the new client-side fork support, although I believe this was largely fixed by an earlier change to core's fork handlers intended to detect this situation.

@ericgribkoff

Please reply here if there remain any issues with the fork+exec cases discussed here, and I can re-open this issue. This should always work regardless of whether GRPC_ENABLE_FORK_SUPPORT is explicitly set, but without an explicit value core's own fork handlers will still run; however, we will now abort the fork handlers when any external threads (e.g., gRPC Python's internal threads) are calling into core, or when we are otherwise unable to pause all core threads.

lock bot locked as resolved and limited conversation to collaborators Feb 17, 2019