grpcio::Env can leak threads -- it detaches them instead of joining them #455

Open · wants to merge 2 commits into base: master
Conversation


@cbeck88 cbeck88 commented Apr 1, 2020

The grpcio::env impl of Drop requests that all the
completion queues shut down, but does not actually join the threads.

For many applications this works fine; often a webserver does not require a graceful shutdown strategy.

However, in my use case I want to validate that even if the server goes down and comes back
repeatedly, the users are able to recover their data from the database.

```
let users = ... mock user set

for phase_count in 0..NUM_PHASES {
    log::info!(logger, "Phase {}/{}", phase_count + 1, NUM_PHASES);

    // First make grpcio env
    let grpcio_env = mobile_acct_api::make_env();

    ... make server, make client,
    ... make requests for each mock user,
    ... validate results
}
```

Although grpcio_env is scoped to the loop body, the implementation of
Drop does not join the threads. When the test ends, it crashes consistently,
because my server contains an SGX enclave, and there is a static object in
the intel library SimEnclaveMgr which is torn down
before these threads get cleaned up. Then they try to tear down their enclaves
and SIGSEGV occurs.

I believe that with the current API, I cannot guarantee that
my grpcio threads are torn down before that object is. The only way
that I can do that is if there is some API on grpcio::Environment
that actually joins the threads.

In the grpc-rs Rust tests that validate grpcio::Environment, you
yourselves have written code that explicitly joins the join handles
instead of leaving them detached. I would
like to be able to do that in my tests at the end of my loop body.

I would like to expose this functionality as a new public function.
This commit creates a new function shutdown_and_join, which
issues the shutdown command, and then joins the join handles.
It also makes the rust unit test in grpc-rs use that API.
I would use this at the end of my loop body in my code example.

This is not a breaking change, since we don't change the implementation
of Drop or any other current public API.

@BusyJay
Member

BusyJay commented Apr 2, 2020

Environment is usually wrapped in an Arc, so it would be useless to call the shutdown method.

Then they try to tear down their enclaves and SIGSEGV occurs.

I don't get it. All servers and clients should be shut down before the environment, so even though the threads are not shut down, they should not try to touch "enclaves".

Is it possible to move the environment out of the loop scope?

@cbeck88
Author

cbeck88 commented Apr 2, 2020

The enclave destructor calls a C library. A handle to our enclave object is owned by the grpc service object that we register with the server. So that resource is still preserved until the grpc service thread is collected, as far as I understand.

If shutting down the server does not join the threads, then shutting down the server does not collect the enclaves, right? So those destructors are never called until the threads close of their own accord. The join handles appear to be owned by the grpcio environment so nothing can join them but the environment, and I have no way to make it do that.

Here's one of the shorter stack traces I see:

```
2020-04-01 20:17:11.980327757 UTC INFO API listening on 0.0.0.0:3224, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: grpc_util, mc.src: public/grpc_util/src/lib.rs:156
2020-04-01 20:17:11.980365297 UTC INFO Block 1/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
2020-04-01 20:17:11.994059707 UTC INFO Block 2/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
2020-04-01 20:17:12.009319252 UTC INFO Block 3/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
2020-04-01 20:17:12.021095644 UTC INFO Block 4/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
2020-04-01 20:17:12.034121656 UTC INFO Block 5/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
2020-04-01 20:17:12.050300344 UTC INFO Block 6/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
2020-04-01 20:17:12.062286141 UTC INFO Block 7/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
2020-04-01 20:17:12.074726050 UTC INFO Block 8/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
2020-04-01 20:17:12.092370612 UTC INFO Block 9/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
2020-04-01 20:17:12.111541067 UTC INFO Block 10/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
[Thread 0x7fabc4f72700 (LWP 952) exited]
[Thread 0x7fabc4d71700 (LWP 949) exited]
[Thread 0x7fabc4b70700 (LWP 947) exited]
[Thread 0x7fabc5173700 (LWP 946) exited]
[Thread 0x7fab4f7e0700 (LWP 945) exited]
[Thread 0x7fabc436c700 (LWP 944) exited]
[Thread 0x7fab4f3de700 (LWP 943) exited]
[Thread 0x7fab8ffff700 (LWP 942) exited]
[Thread 0x7fab4efdc700 (LWP 948) exited]
test test_ingest_mock_db ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

[Thread 0x7fabc496f700 (LWP 955) exited]
[Thread 0x7fab4f9e1700 (LWP 953) exited]
[Thread 0x7fab4f5df700 (LWP 950) exited]
[Thread 0x7fabc476e700 (LWP 956) exited]
double free or corruption (out)
[Thread 0x7fabc456d700 (LWP 957) exited]
[Thread 0x7fab4f1dd700 (LWP 951) exited]
[Thread 0x7fabc76a0700 (LWP 808) exited]

Thread 1 "tx_recovery-f46" received signal SIGSEGV, Segmentation fault.
0x00007fabc90d0200 in CEnclaveMngr::~CEnclaveMngr() () from /opt/intel/sgxsdk/sdk_libs/libsgx_urts_sim.so
(gdb) bt
#0  0x00007fabc90d0200 in CEnclaveMngr::~CEnclaveMngr() () from /opt/intel/sgxsdk/sdk_libs/libsgx_urts_sim.so
#1  0x00007fabc7f4d615 in __cxa_finalize (d=0x7fabc90e0000) at cxa_finalize.c:83
#2  0x00007fabc90c4863 in __do_global_dtors_aux () from /opt/intel/sgxsdk/sdk_libs/libsgx_urts_sim.so
#3  0x00007fffc9b8f2c0 in ?? ()
#4  0x00007fabc8ed7b73 in _dl_fini () at dl-fini.c:138
Backtrace stopped: frame did not save the PC
```

So the grpcio threads continue exiting even after the loop body is over, and even after the test has ended and been declared `test result: ok`. I think this is because they are detached threads. To the best of my understanding, in Rust, if a `JoinHandle` is dropped without being explicitly joined, the thread is detached instead: https://doc.rust-lang.org/std/thread/struct.JoinHandle.html

I have been able to make my test pass 100% of the time, even in release mode, by inserting a sleep(1000ms) after every loop iteration once everything is torn down, to give the threads time to actually terminate before the server comes back up and before the main thread exits. I think I would not have to do that if this patch were available in the grpcio lib. I can test this theory more rigorously if you like. I assume this is why you also have the join calls in the test in this same file.

@cbeck88
Author

cbeck88 commented Apr 2, 2020

Is it possible to make environment out of the loop scope?

It might be, but it would commingle the server resources across passes through the loop. Also, it doesn't help me ensure that the threads actually exit before the test function exits.

@cbeck88
Author

cbeck88 commented Apr 2, 2020

This is the version of the loop that I am working with for now, which seems to have fixed things for me:

```
let users = ... mock user set

for phase_count in 0..NUM_PHASES {
    {
        log::info!(logger, "Phase {}/{}", phase_count + 1, NUM_PHASES);

        // First make grpcio env
        let grpcio_env = mobile_acct_api::make_env();

        ... make server, make client,
        ... make requests for each mock user,
        ... validate results
    }
    std::thread::sleep(std::time::Duration::from_millis(1000));
}
```

The idea is that once the threads have been told to shut down, I don't know that they have actually stopped, but I hope they will stop soon, so 1000ms is maybe enough. If I could join them, I would not need the sleep.

@BusyJay
Member

BusyJay commented Apr 2, 2020

Thanks for the detailed explanation. I can see there are two things that can be done:

  1. Guard "SimEnclaveMgr" with a reference count, so that it won't be dropped unexpectedly;
  2. Join the grpcio environment threads. I think we can join them in the drop method.

@cbeck88
Author

cbeck88 commented Apr 2, 2020

I think I cannot guard "SimEnclaveMgr"; it is a static-lifetime variable in the Intel C library, and I think it only gets torn down after exit().

If you are okay to join the threads in the drop method that would be a great fix IMO. Would you like me to change this PR to be like that?

Thank you!

@BusyJay
Member

BusyJay commented Apr 2, 2020

Joining in drop seems good to me, although you may need to check whether the current thread equals the target thread to avoid deadlock.

@BusyJay
Member

BusyJay commented Apr 7, 2020

Please sign off all your commits and fix the CI.

@cbeck88
Author

cbeck88 commented Apr 10, 2020

hi sorry i got distracted, I will do it

The `grpcio::env` impl of `Drop` issues commands to request that all the
completion queues shut down, but does not actually join the threads.

For a lot of webservers this works fine, but for some tests, it
creates a problem.

In my use case, I have a server containing SGX enclaves and a database,
and I want to validate that even if the server goes down and comes back
repeatedly, the users are able to recover their data from the database.

```
    let users = ... mock user set

    for phase_count in 0..NUM_PHASES {
        log::info!(logger, "Phase {}/{}", phase_count + 1, NUM_PHASES);

        // First make grpcio env
        let grpcio_env = mobile_acct_api::make_env();

        ... make server, make client,
        ... make requests for each mock user,
        ... validate results
    }
```

Unfortunately for me, even though `grpcio_env` is scoped to the loop
body, the threads actually leak out because the implementation of
`Drop` does not join the threads.

Unfortunately, this consistently causes crashes in tests, because the
Intel SGX SDK contains a `SimEnclaveMgr` object which has a static lifetime
and is torn down at process destruction.

I believe that with the current API, I cannot guarantee that
my grpcio threads are torn down BEFORE that object is. The only way
that I can do that is if there is some API on `grpcio::Environment`
that actually joins the threads.

In the actual rust tests that validate `grpcio::Environment`, you
yourselves have written code that joins the join handles. I would
like to be able to do that in my tests at the end of my loop body.

This commit exposes an API on grpcio::Environment that both issues
the shutdown command, AND joins the join handles. It also makes
the rust unit test, in that same file, use this API.

This is not a breaking change, since we don't change the implementation
of `Drop` or any other public API.

Signed-off-by: Chris Beck <beck.ct@gmail.com>
Includes a test for whether any of them is the current thread
before joining.

Signed-off-by: Chris Beck <beck.ct@gmail.com>
@cbeck88
Author

cbeck88 commented Apr 11, 2020

it seems to fail like this:

```
     Running target/debug/deps/tests-3d5404393725a7c9
running 24 tests
test cancel_after_begin::test_secure ... test cancel_after_begin::test_secure has been running for over 60 seconds
test cancel_after_begin::test_insecure ... test cancel_after_begin::test_insecure has been running for over 60 seconds
```
It did this twice in CI; not sure why. Will investigate.

@cbeck88
Author

cbeck88 commented Apr 24, 2020

@BusyJay I'm sorry, I don't have the bandwidth to really figure this out right now; I think I'm just going to stick with the "sleep" in my tests for the foreseeable future. Thanks for your help!

@cbeck88 cbeck88 changed the title Expose an API in grpcio::env that shuts-down AND joins the threads grpcio::Env can leak threads -- it detaches them instead of joining them Apr 24, 2020