Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes refcounting issue #14315

Merged
merged 1 commit into from
Feb 5, 2018
Merged

Fixes refcounting issue #14315

merged 1 commit into from
Feb 5, 2018

Conversation

kpayson64
Copy link
Contributor

@kpayson64 kpayson64 commented Feb 5, 2018

Fixes issue #13327.

De-referencing the stack in the cancellation callback is unsafe. There may be closures on the executor that need the call stack, and cancellation can happen before they are run.

@grpc-testing
Copy link

****************************************************************

libgrpc.so

     VM SIZE              FILE SIZE
 ++++++++++++++ GROWIN ++++++++++++++

 -------------- SHRINK --------------
  [ = ]       0 [None]     -16  -0.0%

  [ = ]       0 TOTAL      -16  -0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing
Copy link

[trickle] No significant performance differences

@grpc-testing
Copy link

[microbenchmarks] No significant performance differences

@grpc-testing
Copy link

****************************************************************

libgrpc.so

     VM SIZE                                                             FILE SIZE
 ++++++++++++++ GROWING                                               ++++++++++++++
  +0.1%     +16 src/core/ext/filters/client_channel/client_channel.cc     +16  +0.1%
       +13%     +11 pick_callback_done_locked                                 +11   +13%
      +1.8%      +5 [Unmapped]                                                 +5  +1.8%

 -------------- SHRINKING                                             --------------
  -0.0%     -32 [None]                                                   -176  -0.0%

  -0.0%     -16 TOTAL                                                    -160  -0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing
Copy link

[trickle] No significant performance differences

@kpayson64 kpayson64 changed the base branch from master to v1.9.x February 5, 2018 20:43
Copy link
Contributor

@dgquintas dgquintas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thanks!

@grpc-testing
Copy link

Corrupt JSON data (indicates timeout or crash): 
    bm_fullstack_streaming_pump.BM_PumpStreamServerToClient_SockPair__0.opt.old: 1


[microbenchmarks] Performance differences noted:
Benchmark                                                              cpu_time    real_time
---------------------------------------------------------------------  ----------  -----------
BM_StreamingPingPong<MinInProcess, NoOpMutator, NoOpMutator>/262144/2  -4%         -4%
BM_UnaryPingPong<InProcess, NoOpMutator, NoOpMutator>/0/262144         -4%         -4%

@kpayson64
Copy link
Contributor Author

#14291

@kpayson64 kpayson64 merged commit 9e620c3 into grpc:v1.9.x Feb 5, 2018
@markdroth
Copy link
Member

Why is this fix necessary? All of the callbacks that this adds refs for run as part of handling a batch sent down from the surface, and the surface maintains a ref to the call stack for each pending batch. The batch can't return until these callbacks return, so there should be no way for the call stack to be unreffed before the batches return. What am I missing here?

Also, can we add a test that triggers the problem that this solves, so that we won't have regressions in the future?

@kpayson64
Copy link
Contributor Author

@markdroth
Here is a ref count for a call stack when a segfault occurs:

D0205 09:10:30.684904707  110967 transport.cc:41]            CALL_STACK 0x62400000d000:0x62400000c7a0   REF 1->2 completion
D0205 09:10:30.685512302  110968 transport.cc:41]            CALL_STACK 0x62400000d000:0x62400000c7a0   REF 2->3 pick_callback
D0205 09:10:30.691657675  110968 transport.cc:41]            CALL_STACK 0x6290000151f0:0x6290000151e0   REF 1->2 chttp2
D0205 09:10:30.691742888  110968 transport.cc:41]            CALL_STACK 0x6290000151f0:0x6290000151e0   REF 2->3 perform_stream_op
D0205 09:10:30.691767368  110968 transport.cc:55]            CALL_STACK 0x62400000d000:0x62400000c7a0 UNREF 3->2 pick_callback
D0205 09:10:30.691886864  110968 transport.cc:41]            CALL_STACK 0x6290000151f0:0x6290000151e0   REF 3->4 chttp2_writing:become
D0205 09:10:30.691918524  110968 transport.cc:55]            CALL_STACK 0x6290000151f0:0x6290000151e0 UNREF 4->3 perform_stream_op
D0205 09:10:30.692147614  110968 transport.cc:55]            CALL_STACK 0x6290000151f0:0x6290000151e0 UNREF 3->2 chttp2_writing:end
D0205 09:10:30.692281862  110967 transport.cc:41]            CALL_STACK 0x629000010140:0x62900000f8e0   REF 1->2 chttp2
D0205 09:10:30.692363535  110967 transport.cc:41]            CALL_STACK 0x629000010140:0x62900000f8e0   REF 2->3 completion
D0205 09:10:30.692616117  110967 transport.cc:41]            CALL_STACK 0x629000010140:0x62900000f8e0   REF 3->4 perform_stream_op
D0205 09:10:30.692658063  110967 transport.cc:55]            CALL_STACK 0x629000010140:0x62900000f8e0 UNREF 4->3 perform_stream_op
D0205 09:10:30.692773627  110967 transport.cc:55]            CALL_STACK 0x629000010140:0x62900000f8e0 UNREF 3->2 completion
D0205 09:10:30.692982439  110967 transport.cc:41]            CALL_STACK 0x629000010140:0x62900000f8e0   REF 2->3 completion
D0205 09:10:30.693005248  110967 transport.cc:41]            CALL_STACK 0x629000010140:0x62900000f8e0   REF 3->4 perform_stream_op
D0205 09:10:30.693039083  110967 transport.cc:55]            CALL_STACK 0x629000010140:0x62900000f8e0 UNREF 4->3 perform_stream_op
D0205 09:10:30.693121078  110967 transport.cc:41]            CALL_STACK 0x629000010140:0x62900000f8e0   REF 3->4 termination
D0205 09:10:30.693200055  110967 transport.cc:41]            CALL_STACK 0x629000010140:0x62900000f8e0   REF 4->5 perform_stream_op
D0205 09:10:30.693555869  110967 transport.cc:55]            CALL_STACK 0x629000010140:0x62900000f8e0 UNREF 5->4 chttp2
D0205 09:10:30.693602032  110967 transport.cc:55]            CALL_STACK 0x629000010140:0x62900000f8e0 UNREF 4->3 perform_stream_op
D0205 09:10:30.693688864  110967 transport.cc:55]            CALL_STACK 0x629000010140:0x62900000f8e0 UNREF 3->2 termination
D0205 09:10:30.697189638  110967 transport.cc:55]            CALL_STACK 0x629000010140:0x62900000f8e0 UNREF 2->1 completion
D0205 09:10:30.697456418  110967 transport.cc:55]            CALL_STACK 0x6290000151f0:0x6290000151e0 UNREF 2->1 chttp2
D0205 09:10:30.697686494  110967 transport.cc:55]            CALL_STACK 0x62400000d000:0x62400000c7a0 UNREF 2->1 completion
D0205 09:10:30.697728854  110967 transport.cc:55]            CALL_STACK 0x62400000d000:0x62400000c7a0 UNREF 1->0 destroy
D0205 09:10:30.697783569  110967 transport.cc:55]            CALL_STACK 0x6290000151f0:0x6290000151e0 UNREF 1->0 client_channel_destroy_call
D0205 09:10:30.697914649  110967 transport.cc:55]            CALL_STACK 0x629000010140:0x62900000f8e0 UNREF 1->0 destroy
D0205 09:10:30.794493674  110967 transport.cc:41]            CALL_STACK 0x624000007000:0x6240000067a0   REF 1->2 completion
D0205 09:10:30.795295874  110968 transport.cc:41]            CALL_STACK 0x624000007000:0x6240000067a0   REF 2->3 pick_callback
D0205 09:10:30.802089706  110968 transport.cc:41]            CALL_STACK 0x62900001f1f0:0x62900001f1e0   REF 1->2 chttp2
D0205 09:10:30.802170067  110968 transport.cc:55]            CALL_STACK 0x624000007000:0x6240000067a0 UNREF 3->2 pick_callback
D0205 09:10:30.802244215  110968 transport.cc:41]            CALL_STACK 0x624000007000:0x6240000067a0   REF 2->3 termination
D0205 09:10:30.819658456  110967 transport.cc:55]            CALL_STACK 0x624000007000:0x6240000067a0 UNREF 3->2 completion
D0205 09:10:30.819682433  110968 transport.cc:41]            CALL_STACK 0x62900001f1f0:0x62900001f1e0   REF 2->3 perform_stream_op
D0205 09:10:30.819743264  110967 transport.cc:55]            CALL_STACK 0x624000007000:0x6240000067a0 UNREF 2->1 destroy
D0205 09:10:30.820131634  110968 transport.cc:55]            CALL_STACK 0x62900001f1f0:0x62900001f1e0 UNREF 3->2 chttp2
D0205 09:10:30.820159440  110968 transport.cc:55]            CALL_STACK 0x62900001f1f0:0x62900001f1e0 UNREF 2->1 perform_stream_op
D0205 09:10:30.832509457  110968 transport.cc:55]            CALL_STACK 0x624000007000:0x6240000067a0 UNREF 1->0 termination
D0205 09:10:30.832601481  110968 transport.cc:55]            CALL_STACK 0x62900001f1f0:0x62900001f1e0 UNREF 1->0 client_channel_destroy_call

In the second call, we can see the "completion" unref happens before the "perform_stream_op". I suspect batches are getting early-completed on cancellation.

@kpayson64
Copy link
Contributor Author

I suspect grpc_transport_stream_op_batch_finish_with_failure is at fault, it looks like it queues the batch completion while we could still possibly have closures queued up.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants