
Segment fault at gpr_mpscq_pop_and_check_end #19923

Closed
BusyJay opened this issue Aug 13, 2019 · 7 comments
BusyJay commented Aug 13, 2019

What version of gRPC and what language are you using?

v1.17.2, via the Rust wrapper of the gRPC C core.

What operating system (Linux, Windows,...) and version?

Linux 3.10

What runtime / compiler are you using (e.g. python version or version of gcc)

grpcio, which wraps the gRPC C core in Rust.

What did you do?

Nothing special; it is just a long-running service.

What did you see instead?

stack trace:

#0  gpr_mpscq_pop_and_check_end (q=q@entry=0x7fe8339ece98, empty=empty@entry=0x7fe91d5f9f5f)
    at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.2/grpc/src/core/lib/gpr/mpscq.cc:51
#1  0x00007fe97c22b5ce in gpr_mpscq_pop (q=q@entry=0x7fe8339ece98) at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.2/grpc/src/core/lib/gpr/mpscq.cc:46
#2  0x00007fe97c23203e in grpc_combiner_continue_exec_ctx () at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.2/grpc/src/core/lib/iomgr/combiner.cc:244
#3  0x00007fe97c207d2f in grpc_core::ExecCtx::Flush (this=0x7fe91d5fa540) at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.2/grpc/src/core/lib/iomgr/exec_ctx.cc:137
#4  0x00007fe97c2ee3e0 in end_worker (worker_hdl=<optimized out>, worker=0x7fe91d5fa040, pollset=<optimized out>)
    at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.2/grpc/src/core/lib/iomgr/ev_epoll1_linux.cc:947
#5  pollset_work (ps=<optimized out>, worker_hdl=<optimized out>, deadline=<optimized out>)
    at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.2/grpc/src/core/lib/iomgr/ev_epoll1_linux.cc:1047
#6  0x00007fe97c21a74d in cq_next (cq=0x7fe9798a24c0, deadline=..., reserved=<optimized out>)
    at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.2/grpc/src/core/lib/surface/completion_queue.cc:1040
#7  0x00007fe97c21afbb in grpc_completion_queue_next (cq=<optimized out>, deadline=..., reserved=<optimized out>)
    at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.2/grpc/src/core/lib/surface/completion_queue.cc:1115
#8  0x00007fe97bd641ae in next (self=<optimized out>) at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-0.4.2/src/cq.rs:150
#9  poll_queue (cq=...) at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-0.4.2/src/env.rs:28
#10 {{closure}} () at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-0.4.2/src/env.rs:89
#11 std::sys_common::backtrace::__rust_begin_short_backtrace::hd079d5c938b01150 (f=...) at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/sys_common/backtrace.rs:77
#12 0x00007fe97bd6407d in {{closure}}<closure,()> () at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/thread/mod.rs:470
#13 call_once<(),closure> (self=..., _args=<optimized out>) at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/panic.rs:309
#14 do_call<std::panic::AssertUnwindSafe<closure>,()> (data=<optimized out>) at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/panicking.rs:294
#15 __rust_maybe_catch_panic () at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250//src/libpanic_abort/lib.rs:29
#16 try<(),std::panic::AssertUnwindSafe<closure>> (f=...) at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/panicking.rs:273
#17 catch_unwind<std::panic::AssertUnwindSafe<closure>,()> (f=...) at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/panic.rs:388
#18 {{closure}}<closure,()> () at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/thread/mod.rs:469
#19 core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h05971589d154f23d () at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libcore/ops/function.rs:231
#20 0x00007fe97beed4df in _$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::he71721d2d956d451 ()
    at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/liballoc/boxed.rs:746
#21 0x00007fe97beef80c in std::sys::unix::thread::Thread::new::thread_start::h18485805666ccd3c () at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/liballoc/boxed.rs:746
#22 0x00007fe97aaa5e25 in start_thread () from /lib64/libpthread.so.0
#23 0x00007fe97a1b134d in clone () from /lib64/libc.so.6

BusyJay commented Aug 14, 2019

Probably related to #14310


yashykt commented Aug 14, 2019

Since this is from Rust, it is probably going to be hard to track down here, especially without a way to reproduce it.


BusyJay commented Aug 15, 2019

I think it's related to #17786, which also provides a way to reproduce the core dump (using C++).


yashykt commented Aug 15, 2019

@BusyJay If it's the same bug, can you report your findings on that bug? Let's track this there.

gengliqi commented Sep 1, 2019

I have figured this bug out.

Pull request #19531 fixes this bug.

In short, the crash is caused by the recv_message_ready completion function being called prematurely in some rare cases, so scheduling the closure instead of running it immediately solves the problem.
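The run-vs-schedule distinction can be sketched with a toy exec_ctx (illustrative names only, not the real gRPC types): scheduling defers the callback to a drain loop that has already unlinked it from the queue, so its queue slot is free for reuse by the time it runs.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Toy version of an exec_ctx closure list (hypothetical, for illustration).
struct ToyExecCtx {
  std::queue<std::function<void()>> closure_list;

  // Schedule: just enqueue; the closure runs later, from flush().
  void schedule(std::function<void()> cb) { closure_list.push(std::move(cb)); }

  // Flush: pop each closure off the list *before* invoking it, so the
  // closure no longer occupies a queue slot when its body executes.
  void flush() {
    while (!closure_list.empty()) {
      auto cb = std::move(closure_list.front());
      closure_list.pop();
      cb();
    }
  }
};
```

Closures scheduled from inside a running closure are simply appended and drained in the same flush pass, which is the behavior the fix relies on.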

Let me try to explain in detail.

First, in thread T1, grpc_call_start_batch is called after receiving a message. It then does a lot of work, such as scheduling closures on the exec_ctx closure_list and running them from the flush function in the exec_ctx destructor.

After running con_start_transport_stream_op_batch -> grpc_transport_perform_stream_op -> perform_stream_op, op->handler_private.closure is added to the transport combiner's queue.
Meanwhile, the combiner is running closures from that queue on another thread, T2.

T1 sleeps and T2 goes to work.

Next, the combiner pops this closure and runs perform_stream_op_locked -> grpc_chttp2_maybe_complete_recv_message -> null_then_run_closure -> GRPC_CLOSURE_RUN -> run_in_call_combiner (connected_channel.cc) -> GRPC_CALL_COMBINER_START.
In certain cases, GRPC_CALL_COMBINER_START does not schedule the closure but instead adds it to the call_combiner's queue.

T2 sleeps and T1 goes to work.

T1 now runs GRPC_CALL_COMBINER_STOP (in con_start_transport_stream_op_batch), which schedules the queued closures on T1's exec_ctx.
The exec_ctx flush function then runs the closure added earlier, which calls post_batch_completion and pushes a completion event onto the completion queue.

Soon after, the next grpc_call_start_batch arrives and reuses the batch.

As before, op->handler_private.closure is added to the transport combiner's queue.
closure->next_data.atm_next is now in use, and its value is NULL because the closure is the last node in the combiner's queue.

T2 wakes up and calls grpc_chttp2_complete_closure_step, which does:

#define CLOSURE_BARRIER_FIRST_REF_BIT (1 << 16)
closure->next_data.scratch -= CLOSURE_BARRIER_FIRST_REF_BIT;

scratch and atm_next share the same memory, and scratch is currently zero, so after the subtraction both fields hold 0 - (1 << 16) = 0xffffffffffff0000 (two's complement).
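The effect of that subtraction on the shared storage can be reproduced in isolation. This is a minimal sketch: the union below only mimics the overlap inside grpc_closure's next_data, it is not the real definition.

```cpp
#include <cassert>
#include <cstdint>

// Mimics the overlap inside grpc_closure: the combiner's intrusive queue
// link (atm_next) and chttp2's refcount scratch share the same storage.
union ToyNextData {
  uintptr_t atm_next;  // next-node pointer while queued in the combiner
  uintptr_t scratch;   // CLOSURE_BARRIER_* refcount bits used by chttp2
};

constexpr uintptr_t kClosureBarrierFirstRefBit = uintptr_t(1) << 16;

// The closure is re-enqueued as the queue's last node (atm_next == 0),
// then chttp2 decrements scratch, which scribbles on the queue link.
uintptr_t corrupted_next_pointer() {
  ToyNextData d;
  d.atm_next = 0;                           // NULL: last node in the queue
  d.scratch -= kClosureBarrierFirstRefBit;  // 0 - (1 << 16), wraps around
  return d.atm_next;  // 0xffffffffffff0000 on a 64-bit platform
}
```

The returned value is exactly the bogus "next" pointer seen in the crash.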

After this node is popped off the queue, tail takes the value of atm_next, so the next pop dereferences the address 0xffffffffffff0000, which is almost certainly invalid.
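How the corrupted link propagates through the pop path can be seen in a single-threaded sketch of a Vyukov-style intrusive MPSC queue, loosely modeled on gpr_mpscq (names and simplifications are mine; the real pop also reports an "empty" flag and handles racing producers):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Intrusive node: the link lives inside the queued object, like
// closure->next_data.atm_next does in gRPC.
struct Node {
  std::atomic<Node*> next{nullptr};
};

struct Mpscq {
  std::atomic<Node*> head;  // producers push here
  Node* tail;               // consumer-side only
  Node stub;

  Mpscq() : head(&stub), tail(&stub) {}

  void push(Node* n) {
    n->next.store(nullptr, std::memory_order_relaxed);
    Node* prev = head.exchange(n, std::memory_order_acq_rel);
    prev->next.store(n, std::memory_order_release);
  }

  Node* pop() {
    Node* t = tail;
    Node* next = t->next.load(std::memory_order_acquire);
    if (t == &stub) {  // skip over the stub node
      if (next == nullptr) return nullptr;
      tail = next;
      t = next;
      next = t->next.load(std::memory_order_acquire);
    }
    if (next != nullptr) {
      tail = next;   // tail now holds t's next field: if that field was
      return t;      // scribbled on (e.g. 0xffffffffffff0000), the
    }                // following pop() dereferences it and crashes
    push(&stub);     // t is the last real node: park the stub behind it
    next = t->next.load(std::memory_order_acquire);
    if (next != nullptr) {
      tail = next;
      return t;
    }
    return nullptr;
  }
};
```

The crash site in the stack trace corresponds to the `t->next.load(...)` at the top of pop: once tail holds the garbage value, the very next pop reads through it.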

That's why the bug happens.
I think issues #17786 and #17043 are caused by this same bug.


BusyJay commented Sep 2, 2019

Cool, thanks @gengliqi !

@mhaidrygoog

Closing this issue since a PR has been made to address it.

lock bot locked as resolved and limited conversation to collaborators Dec 4, 2019