grpcio::Env can leak threads -- it detaches them instead of joining them #455

Open · wants to merge 2 commits into base: master
Conversation


@cbeck88 cbeck88 commented Apr 1, 2020

The grpcio::env impl of Drop requests that all the
completion queues shut down, but does not actually join the threads.

For many applications this works fine; often a webserver does not require a graceful shutdown strategy.

However, in my use case I want to validate that even if the server goes down and comes back
repeatedly, the users are able to recover their data from the database.

```
let users = ... mock user set

for phase_count in 0..NUM_PHASES {
    log::info!(logger, "Phase {}/{}", phase_count + 1, NUM_PHASES);

    // First make grpcio env
    let grpcio_env = mobile_acct_api::make_env();

    ... make server, make client,
    ... make requests for each mock user,
    ... validate results
}
```

Although grpcio_env is scoped to the loop body, the implementation of
Drop does not join the threads. When the test ends, it crashes consistently,
because my server contains an SGX enclave, and there is a static object in
the intel library SimEnclaveMgr which is torn down
before these threads get cleaned up. Then they try to tear down their enclaves
and SIGSEGV occurs.

I believe that with the current API, I cannot guarantee that
my grpcio threads are torn down before that object is. The only way
that I can do that is if there is some API on grpcio::Environment
that actually joins the threads.

In the grpc-rs Rust tests that validate grpcio::Environment, you
yourselves have written code that explicitly joins the join handles
instead of leaving them detached. I would
like to be able to do that in my tests at the end of my loop body.

I would like to expose this functionality as a new public function.
This commit creates a new function shutdown_and_join, which
issues the shutdown command, and then joins the join handles.
It also makes the rust unit test in grpc-rs use that API.
I would use this at the end of my loop body in my code example.

This is not a breaking change, since we don't change the implementation
of Drop or any other current public API.

@BusyJay
Member

BusyJay commented Apr 2, 2020

Environment is usually wrapped in an Arc, so it would be useless to call the shutdown method.

Then they try to tear down their enclaves and SIGSEGV occurs.

I don't get it. All servers and clients should be shut down before the environment, so even though the threads are not shut down, they should not try to touch "enclaves".

Is it possible to move the environment out of the loop scope?

@cbeck88
Author

cbeck88 commented Apr 2, 2020

The enclave destructor calls a C library. A handle to our enclave object is owned by the grpc service object that we register with the server. So that resource is still preserved until the grpc service thread is collected, as far as I understand.

If shutting down the server does not join the threads, then shutting down the server does not collect the enclaves, right? So those destructors are never called until the threads close of their own accord. The join handles appear to be owned by the grpcio environment so nothing can join them but the environment, and I have no way to make it do that.

Here's one of the shorter stack traces I see:

```
2020-04-01 20:17:11.980327757 UTC INFO API listening on 0.0.0.0:3224, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: grpc_util, mc.src: public/grpc_util/src/lib.rs:156
2020-04-01 20:17:11.980365297 UTC INFO Block 1/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
2020-04-01 20:17:11.994059707 UTC INFO Block 2/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
2020-04-01 20:17:12.009319252 UTC INFO Block 3/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
2020-04-01 20:17:12.021095644 UTC INFO Block 4/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
2020-04-01 20:17:12.034121656 UTC INFO Block 5/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
2020-04-01 20:17:12.050300344 UTC INFO Block 6/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
2020-04-01 20:17:12.062286141 UTC INFO Block 7/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
2020-04-01 20:17:12.074726050 UTC INFO Block 8/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
2020-04-01 20:17:12.092370612 UTC INFO Block 9/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
2020-04-01 20:17:12.111541067 UTC INFO Block 10/10, mc.test_name: tx_recovery::test_ingest_mock_db, mc.module: tx_recovery, mc.src: src/mobile_acct/ingest_server/tests/tx_recovery.rs:115
[Thread 0x7fabc4f72700 (LWP 952) exited]
[Thread 0x7fabc4d71700 (LWP 949) exited]
[Thread 0x7fabc4b70700 (LWP 947) exited]
[Thread 0x7fabc5173700 (LWP 946) exited]
[Thread 0x7fab4f7e0700 (LWP 945) exited]
[Thread 0x7fabc436c700 (LWP 944) exited]
[Thread 0x7fab4f3de700 (LWP 943) exited]
[Thread 0x7fab8ffff700 (LWP 942) exited]
[Thread 0x7fab4efdc700 (LWP 948) exited]
test test_ingest_mock_db ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

[Thread 0x7fabc496f700 (LWP 955) exited]
[Thread 0x7fab4f9e1700 (LWP 953) exited]
[Thread 0x7fab4f5df700 (LWP 950) exited]
[Thread 0x7fabc476e700 (LWP 956) exited]
double free or corruption (out)
[Thread 0x7fabc456d700 (LWP 957) exited]
[Thread 0x7fab4f1dd700 (LWP 951) exited]
[Thread 0x7fabc76a0700 (LWP 808) exited]

Thread 1 "tx_recovery-f46" received signal SIGSEGV, Segmentation fault.
0x00007fabc90d0200 in CEnclaveMngr::~CEnclaveMngr() () from /opt/intel/sgxsdk/sdk_libs/libsgx_urts_sim.so
(gdb) bt
#0  0x00007fabc90d0200 in CEnclaveMngr::~CEnclaveMngr() () from /opt/intel/sgxsdk/sdk_libs/libsgx_urts_sim.so
#1  0x00007fabc7f4d615 in __cxa_finalize (d=0x7fabc90e0000) at cxa_finalize.c:83
#2  0x00007fabc90c4863 in __do_global_dtors_aux () from /opt/intel/sgxsdk/sdk_libs/libsgx_urts_sim.so
#3  0x00007fffc9b8f2c0 in ?? ()
#4  0x00007fabc8ed7b73 in _dl_fini () at dl-fini.c:138
Backtrace stopped: frame did not save the PC
```

So the grpcio threads continue exiting even after the loop body is over, and even after the test has ended and been declared `test result: ok`. I think this is because they are detached threads. To the best of my understanding, in Rust, if a `JoinHandle` is dropped without being explicitly joined, the thread is detached instead: https://doc.rust-lang.org/std/thread/struct.JoinHandle.html

I have been able to make my test pass 100% of the time, even in release mode, by inserting a sleep(1000ms) after every loop iteration once everything is torn down, to give the threads time to actually terminate before the server comes back up and before the main thread exits. I think I would not have to do that if this patch were available in the grpcio lib. I can test this theory more rigorously if you like. I assume this is why you also have the join calls in the test in this same file.

@cbeck88
Author

cbeck88 commented Apr 2, 2020

Is it possible to make environment out of the loop scope?

It might be, but it would commingle the server resources across passes through the loop. Also, it doesn't help me ensure that the threads actually exit before the test function exits.

@cbeck88
Author

cbeck88 commented Apr 2, 2020

This is the version of the loop that I am working with for now, which seems to have fixed things for me:

```
let users = ... mock user set

for phase_count in 0..NUM_PHASES {
    {
        log::info!(logger, "Phase {}/{}", phase_count + 1, NUM_PHASES);

        // First make grpcio env
        let grpcio_env = mobile_acct_api::make_env();

        ... make server, make client,
        ... make requests for each mock user,
        ... validate results
    }
    std::thread::sleep(std::time::Duration::from_millis(1000));
}
```

The idea is that once the threads have been told to shut down, I don't know that they have actually stopped, but I hope they will stop soon, so 1000ms is maybe enough. If I could join them, I would not need the sleep.

@BusyJay
Member

BusyJay commented Apr 2, 2020

Thanks for the detailed explanation. I can see there are two things that can be done:

  1. Guard "SimEnclaveMgr" with a reference count, so that it won't be dropped unexpectedly;
  2. Join the grpcio environment threads. I think we can join them in the drop method.

@cbeck88
Author

cbeck88 commented Apr 2, 2020

I think I cannot guard "SimEnclaveMgr"; it is a static-lifetime variable in the Intel C library, and I think it only gets torn down after exit().

If you are okay to join the threads in the drop method that would be a great fix IMO. Would you like me to change this PR to be like that?

Thank you!

@BusyJay
Member

BusyJay commented Apr 2, 2020

Joining in drop seems good to me, although you may need to check whether the current thread equals the target thread to avoid deadlock.

@BusyJay
Member

BusyJay commented Apr 7, 2020

Please sign off all your commits and fix the CI.

@cbeck88
Author

cbeck88 commented Apr 10, 2020

hi sorry i got distracted, I will do it

The `grpcio::env` impl of `Drop` issues commands to request that all the
completion queues shut down, but does not actually join the threads.

For a lot of webservers this works fine, but for some tests, it
creates a problem.

In my use case, I have a server containing SGX enclaves and a database,
and I want to validate that even if the server goes down and comes back
repeatedly, the users are able to recover their data from the database.

```
    let users = ... mock user set

    for phase_count in 0..NUM_PHASES {
        log::info!(logger, "Phase {}/{}", phase_count + 1, NUM_PHASES);

        // First make grpcio env
        let grpcio_env = mobile_acct_api::make_env();

        ... make server, make client,
        ... make requests for each mock user,
        ... validate results
    }
```

Unfortunately for me, even though `grpcio_env` is scoped to the loop
body, the threads actually leak out because the implementation of
`Drop` does not join the threads.

Unfortunately, this consistently causes crashes in tests, because the
Intel SGX SDK contains a `SimEnclaveMgr` object which has a static lifetime
and is torn down at process destruction.

I believe that with the current API, I cannot guarantee that
my grpcio threads are torn down BEFORE that object is. The only way
that I can do that is if there is some API on `grpcio::Environment`
that actually joins the threads.

In the actual rust tests that validate `grpcio::Environment`, you
yourselves have written code that joins the join handles. I would
like to be able to do that in my tests at the end of my loop body.

This commit exposes an API on grpcio::Environment that both issues
the shutdown command, AND joins the join handles. It also makes
the rust unit test, in that same file, use this API.

This is not a breaking change, since we don't change the implementation
of `Drop` or any other public API.

Signed-off-by: Chris Beck <beck.ct@gmail.com>
Includes a test for whether any of them is the current thread
before joining.

Signed-off-by: Chris Beck <beck.ct@gmail.com>
@cbeck88
Author

cbeck88 commented Apr 11, 2020

it seems to fail like this:

```
     Running target/debug/deps/tests-3d5404393725a7c9
running 24 tests
test cancel_after_begin::test_secure ... test cancel_after_begin::test_secure has been running for over 60 seconds
test cancel_after_begin::test_insecure ... test cancel_after_begin::test_insecure has been running for over 60 seconds
```
It did this twice in CI; not sure why. Will investigate.

@cbeck88
Author

cbeck88 commented Apr 24, 2020

@BusyJay I'm sorry, I don't have the bandwidth to really figure this out right now; I think I'm just going to stick with the "sleep" in my tests for the foreseeable future. Thanks for your help!

@cbeck88 cbeck88 changed the title Expose an API in grpcio::env that shuts-down AND joins the threads grpcio::Env can leak threads -- it detaches them instead of joining them Apr 24, 2020