
Segment fault at gpr_mpscq_pop_and_check_end #19923

Closed
BusyJay opened this issue Aug 13, 2019 · 7 comments
BusyJay commented Aug 13, 2019

What version of gRPC and what language are you using?

v1.17.2, via the Rust wrapper of the gRPC C core.

What operating system (Linux, Windows,...) and version?

Linux 3.10

What runtime / compiler are you using (e.g. python version or version of gcc)

grpcio, which wraps the gRPC C core in Rust.

What did you do?

Nothing special; it is just a long-running service.

What did you see instead?

stack trace:

#0  gpr_mpscq_pop_and_check_end (q=q@entry=0x7fe8339ece98, empty=empty@entry=0x7fe91d5f9f5f)
    at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.2/grpc/src/core/lib/gpr/mpscq.cc:51
#1  0x00007fe97c22b5ce in gpr_mpscq_pop (q=q@entry=0x7fe8339ece98) at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.2/grpc/src/core/lib/gpr/mpscq.cc:46
#2  0x00007fe97c23203e in grpc_combiner_continue_exec_ctx () at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.2/grpc/src/core/lib/iomgr/combiner.cc:244
#3  0x00007fe97c207d2f in grpc_core::ExecCtx::Flush (this=0x7fe91d5fa540) at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.2/grpc/src/core/lib/iomgr/exec_ctx.cc:137
#4  0x00007fe97c2ee3e0 in end_worker (worker_hdl=<optimized out>, worker=0x7fe91d5fa040, pollset=<optimized out>)
    at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.2/grpc/src/core/lib/iomgr/ev_epoll1_linux.cc:947
#5  pollset_work (ps=<optimized out>, worker_hdl=<optimized out>, deadline=<optimized out>)
    at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.2/grpc/src/core/lib/iomgr/ev_epoll1_linux.cc:1047
#6  0x00007fe97c21a74d in cq_next (cq=0x7fe9798a24c0, deadline=..., reserved=<optimized out>)
    at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.2/grpc/src/core/lib/surface/completion_queue.cc:1040
#7  0x00007fe97c21afbb in grpc_completion_queue_next (cq=<optimized out>, deadline=..., reserved=<optimized out>)
    at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.2/grpc/src/core/lib/surface/completion_queue.cc:1115
#8  0x00007fe97bd641ae in next (self=<optimized out>) at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-0.4.2/src/cq.rs:150
#9  poll_queue (cq=...) at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-0.4.2/src/env.rs:28
#10 {{closure}} () at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-0.4.2/src/env.rs:89
#11 std::sys_common::backtrace::__rust_begin_short_backtrace::hd079d5c938b01150 (f=...) at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/sys_common/backtrace.rs:77
#12 0x00007fe97bd6407d in {{closure}}<closure,()> () at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/thread/mod.rs:470
#13 call_once<(),closure> (self=..., _args=<optimized out>) at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/panic.rs:309
#14 do_call<std::panic::AssertUnwindSafe<closure>,()> (data=<optimized out>) at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/panicking.rs:294
#15 __rust_maybe_catch_panic () at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250//src/libpanic_abort/lib.rs:29
#16 try<(),std::panic::AssertUnwindSafe<closure>> (f=...) at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/panicking.rs:273
#17 catch_unwind<std::panic::AssertUnwindSafe<closure>,()> (f=...) at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/panic.rs:388
#18 {{closure}}<closure,()> () at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/thread/mod.rs:469
#19 core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h05971589d154f23d () at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libcore/ops/function.rs:231
#20 0x00007fe97beed4df in _$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::he71721d2d956d451 ()
    at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/liballoc/boxed.rs:746
#21 0x00007fe97beef80c in std::sys::unix::thread::Thread::new::thread_start::h18485805666ccd3c () at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/liballoc/boxed.rs:746
#22 0x00007fe97aaa5e25 in start_thread () from /lib64/libpthread.so.0
#23 0x00007fe97a1b134d in clone () from /lib64/libc.so.6

BusyJay commented Aug 14, 2019

Probably related to #14310


yashykt commented Aug 14, 2019

Since this is from Rust, it is probably going to be hard to track down here, especially without a way to reproduce it.


BusyJay commented Aug 15, 2019

I think it's related to #17786, which also provides a way to reproduce the core dump (using C++).


yashykt commented Aug 15, 2019

@BusyJay If it's the same bug, can you report your findings on that bug? Let's track this there.

gengliqi commented Sep 1, 2019

I have figured this bug out.

Pull request #19531 fixes this bug.

In short, the crash is caused by the recv_message_ready completion function being called prematurely in some rare cases, so scheduling the closure instead of running it immediately solves the problem.
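The run-vs-schedule distinction can be sketched with a toy exec_ctx (illustrative names only, not the real gRPC types): scheduling defers the callback to a drain loop that has already unlinked it from the queue, so its queue slot is free for reuse by the time it runs.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Toy version of an exec_ctx closure list (hypothetical, for illustration).
struct ToyExecCtx {
  std::queue<std::function<void()>> closure_list;

  // Schedule: just enqueue; the closure runs later, from flush().
  void schedule(std::function<void()> cb) { closure_list.push(std::move(cb)); }

  // Flush: pop each closure off the list *before* invoking it, so the
  // closure no longer occupies a queue slot when its body executes.
  void flush() {
    while (!closure_list.empty()) {
      auto cb = std::move(closure_list.front());
      closure_list.pop();
      cb();
    }
  }
};
```

Closures scheduled from inside a running closure are simply appended and drained in the same flush pass, which is the behavior the fix relies on.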

Let me try to explain in detail.

First, in thread T1, grpc_call_start_batch is called after receiving a message. It then does a lot of work, such as scheduling closures on the exec_ctx closure_list and running them from the flush function in the exec_ctx destructor.

After running con_start_transport_stream_op_batch -> grpc_transport_perform_stream_op -> perform_stream_op, op->handler_private.closure is added to the transport combiner's queue.
Meanwhile, the combiner is running closures from that queue on another thread, T2.

T1 sleeps and T2 goes to work.

Next, the combiner pops this closure and runs perform_stream_op_locked -> grpc_chttp2_maybe_complete_recv_message -> null_then_run_closure -> GRPC_CLOSURE_RUN -> run_in_call_combiner (connected_channel.cc) -> GRPC_CALL_COMBINER_START.
In certain cases, GRPC_CALL_COMBINER_START does not schedule the closure but instead adds it to the call_combiner's queue.

T2 sleeps and T1 goes to work.

T1 now runs GRPC_CALL_COMBINER_STOP (in con_start_transport_stream_op_batch), which schedules the queued closures on T1's exec_ctx.
The exec_ctx flush function then runs the closure added earlier, which calls post_batch_completion and pushes a completion event onto the completion queue.

Soon after, the next grpc_call_start_batch arrives and reuses the batch.

As before, op->handler_private.closure is added to the transport combiner's queue.
closure->next_data.atm_next is now in use, and its value is NULL because the closure is the last node in the combiner's queue.

T2 wakes up and calls grpc_chttp2_complete_closure_step, which does:

#define CLOSURE_BARRIER_FIRST_REF_BIT (1 << 16)
closure->next_data.scratch -= CLOSURE_BARRIER_FIRST_REF_BIT;

scratch and atm_next share the same memory, and scratch is currently zero, so after the subtraction both fields hold 0 - (1 << 16) = 0xffffffffffff0000 (two's complement).
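The effect of that subtraction on the shared storage can be reproduced in isolation. This is a minimal sketch: the union below only mimics the overlap inside grpc_closure's next_data, it is not the real definition.

```cpp
#include <cassert>
#include <cstdint>

// Mimics the overlap inside grpc_closure: the combiner's intrusive queue
// link (atm_next) and chttp2's refcount scratch share the same storage.
union ToyNextData {
  uintptr_t atm_next;  // next-node pointer while queued in the combiner
  uintptr_t scratch;   // CLOSURE_BARRIER_* refcount bits used by chttp2
};

constexpr uintptr_t kClosureBarrierFirstRefBit = uintptr_t(1) << 16;

// The closure is re-enqueued as the queue's last node (atm_next == 0),
// then chttp2 decrements scratch, which scribbles on the queue link.
uintptr_t corrupted_next_pointer() {
  ToyNextData d;
  d.atm_next = 0;                           // NULL: last node in the queue
  d.scratch -= kClosureBarrierFirstRefBit;  // 0 - (1 << 16), wraps around
  return d.atm_next;  // 0xffffffffffff0000 on a 64-bit platform
}
```

The returned value is exactly the bogus "next" pointer seen in the crash.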

After this node is popped off the queue, tail takes the value of atm_next, so the next pop dereferences the address 0xffffffffffff0000, which is almost certainly invalid.
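How the corrupted link propagates through the pop path can be seen in a single-threaded sketch of a Vyukov-style intrusive MPSC queue, loosely modeled on gpr_mpscq (names and simplifications are mine; the real pop also reports an "empty" flag and handles racing producers):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Intrusive node: the link lives inside the queued object, like
// closure->next_data.atm_next does in gRPC.
struct Node {
  std::atomic<Node*> next{nullptr};
};

struct Mpscq {
  std::atomic<Node*> head;  // producers push here
  Node* tail;               // consumer-side only
  Node stub;

  Mpscq() : head(&stub), tail(&stub) {}

  void push(Node* n) {
    n->next.store(nullptr, std::memory_order_relaxed);
    Node* prev = head.exchange(n, std::memory_order_acq_rel);
    prev->next.store(n, std::memory_order_release);
  }

  Node* pop() {
    Node* t = tail;
    Node* next = t->next.load(std::memory_order_acquire);
    if (t == &stub) {  // skip over the stub node
      if (next == nullptr) return nullptr;
      tail = next;
      t = next;
      next = t->next.load(std::memory_order_acquire);
    }
    if (next != nullptr) {
      tail = next;   // tail now holds t's next field: if that field was
      return t;      // scribbled on (e.g. 0xffffffffffff0000), the
    }                // following pop() dereferences it and crashes
    push(&stub);     // t is the last real node: park the stub behind it
    next = t->next.load(std::memory_order_acquire);
    if (next != nullptr) {
      tail = next;
      return t;
    }
    return nullptr;
  }
};
```

The crash site in the stack trace corresponds to the `t->next.load(...)` at the top of pop: once tail holds the garbage value, the very next pop reads through it.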

That's why the bug happens.
I think issues #17786 and #17043 are caused by this same bug.


BusyJay commented Sep 2, 2019

Cool, thanks @gengliqi !

@mhaidrygoog

Closing this issue since a PR has been made to address it.

lock bot locked as resolved and limited conversation to collaborators Dec 4, 2019