
Panicked at 'missing cancelation' in tower-ready-cache #415

Closed
jonhoo opened this issue Feb 13, 2020 · 31 comments · Fixed by #420 or #668


jonhoo commented Feb 13, 2020

I just had my code hit this .expect in tower-ready-cache. It's using tower-balance::pool (and thus also tower-balance::p2c). Not sure what caused it, but figured maybe @olix0r can tease it out from the backtrace?

```
thread 'tokio-runtime-worker' panicked at 'missing cancelation', /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-ready-cache-0.3.0/src/cache.rs:236:37
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:77
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:59
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1052
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1428
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:62
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:49
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:204
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:224
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:470
  11: rust_begin_unwind
             at src/libstd/panicking.rs:378
  12: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
  13: core::option::expect_failed
             at src/libcore/option.rs:1203
  14: core::option::Option<T>::expect
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libcore/option.rs:347
  15: tower_ready_cache::cache::ReadyCache<K,S,Req>::poll_pending
             at ./.cargo/registry/src/github.com-1ecc6299db9ec823/tower-ready-cache-0.3.0/src/cache.rs:236
  16: tower_balance::p2c::service::Balance<D,Req>::promote_pending_to_ready
             at ./.cargo/registry/src/github.com-1ecc6299db9ec823/tower-balance-0.3.0/src/p2c/service.rs:151
  17: <tower_balance::p2c::service::Balance<D,Req> as tower_service::Service<Req>>::poll_ready
             at ./.cargo/registry/src/github.com-1ecc6299db9ec823/tower-balance-0.3.0/src/p2c/service.rs:238
  18: <tower_balance::pool::Pool<MS,Target,Req> as tower_service::Service<Req>>::poll_ready
             at ./.cargo/registry/src/github.com-1ecc6299db9ec823/tower-balance-0.3.0/src/pool/mod.rs:373
  19: <tower_buffer::worker::Worker<T,Request> as core::future::future::Future>::poll
             at ./.cargo/registry/src/github.com-1ecc6299db9ec823/tower-buffer-0.3.0/src/worker.rs:169
  20: tokio::task::core::Core<T>::poll
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/core.rs:128
  21: tokio::task::harness::Harness<T,S>::poll::{{closure}}::{{closure}}
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/harness.rs:119
  22: core::ops::function::FnOnce::call_once
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libcore/ops/function.rs:232
  23: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/panic.rs:318
  24: std::panicking::try::do_call
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/panicking.rs:303
  25: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:86
  26: std::panicking::try
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/panicking.rs:281
  27: std::panic::catch_unwind
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/panic.rs:394
  28: tokio::task::harness::Harness<T,S>::poll::{{closure}}
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/harness.rs:100
  29: tokio::loom::std::causal_cell::CausalCell<T>::with_mut
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/loom/std/causal_cell.rs:41
  30: tokio::task::harness::Harness<T,S>::poll
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/harness.rs:99
  31: tokio::task::raw::RawTask::poll
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/raw.rs:113
  32: tokio::task::Task<S>::run
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/mod.rs:381
  33: tokio::runtime::thread_pool::worker::GenerationGuard::run_task
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/thread_pool/worker.rs:459
  34: tokio::runtime::thread_pool::worker::GenerationGuard::process_available_work
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/thread_pool/worker.rs:317
  35: tokio::runtime::thread_pool::worker::GenerationGuard::run
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/thread_pool/worker.rs:282
  36: tokio::runtime::thread_pool::worker::Worker::run::{{closure}}::{{closure}}
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/thread_pool/worker.rs:160
  37: std::thread::local::LocalKey<T>::try_with
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/thread/local.rs:262
  38: std::thread::local::LocalKey<T>::with
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/thread/local.rs:239
  39: tokio::runtime::thread_pool::worker::Worker::run::{{closure}}
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/thread_pool/worker.rs:136
  40: tokio::runtime::thread_pool::current::set::{{closure}}
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/thread_pool/current.rs:47
  41: std::thread::local::LocalKey<T>::try_with
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/thread/local.rs:262
  42: std::thread::local::LocalKey<T>::with
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/thread/local.rs:239
  43: tokio::runtime::thread_pool::current::set
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/thread_pool/current.rs:29
  44: tokio::runtime::thread_pool::worker::Worker::run
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/thread_pool/worker.rs:132
  45: tokio::runtime::thread_pool::Workers::spawn::{{closure}}::{{closure}}
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/thread_pool/mod.rs:113
  46: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/blocking/task.rs:30
  47: tokio::task::core::Core<T>::poll
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/core.rs:128
  48: tokio::task::harness::Harness<T,S>::poll::{{closure}}::{{closure}}
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/harness.rs:119
  49: core::ops::function::FnOnce::call_once
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libcore/ops/function.rs:232
  50: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/panic.rs:318
  51: std::panicking::try::do_call
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/panicking.rs:303
  52: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:86
  53: std::panicking::try
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/panicking.rs:281
  54: std::panic::catch_unwind
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/panic.rs:394
  55: tokio::task::harness::Harness<T,S>::poll::{{closure}}
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/harness.rs:100
  56: tokio::loom::std::causal_cell::CausalCell<T>::with_mut
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/loom/std/causal_cell.rs:41
  57: tokio::task::harness::Harness<T,S>::poll
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/harness.rs:99
  58: tokio::task::raw::RawTask::poll
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/raw.rs:113
  59: tokio::task::Task<S>::run
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/mod.rs:381
  60: tokio::runtime::blocking::pool::run_task
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/blocking/pool.rs:311
  61: tokio::runtime::blocking::pool::Inner::run
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/blocking/pool.rs:230
  62: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}::{{closure}}
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/blocking/pool.rs:210
  63: tokio::runtime::context::enter
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/context.rs:72
  64: tokio::runtime::handle::Handle::enter
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/handle.rs:34
  65: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
              at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/blocking/pool.rs:209
```

jonhoo commented Feb 13, 2020

I can replicate this pretty reliably at high load in noria. The backtrace seems to always be the same.


jonhoo commented Feb 13, 2020

@olix0r is there any particular state you'd like me to print out at the time of the panic that might aid in debugging this? I can only replicate the issue at fairly high load, so full logs will be tricky (and probably not very useful).


jonhoo commented Feb 13, 2020

As an aside, I cannot replicate this easily on a local machine, even when it is heavily loaded, but it happens in about 1/3 of runs on EC2, which suggests that it is indeed a race condition triggered by scheduling noise (which EC2 has a lot of).


jonhoo commented Feb 17, 2020

I'm quite stuck on this... @olix0r any suggestions for how to go about debugging this? Any idea what the root cause might be?

jonhoo added a commit that referenced this issue Feb 20, 2020
The hope for this was to reproduce #415 (which it does not sadly), but
at least it adds a test for p2c!

jonhoo commented Feb 20, 2020

I've been stepping through the code, and cannot understand how the panic can possibly be triggered. For that to happen, we need to have an entry in pending whose key is not in pending_cancel_txs. The only way for an entry to be added to pending is push_pending. Since push_pending ensures that the pushed service's key is in pending_cancel_txs, we know that this can only happen if a key is removed from pending_cancel_txs. And specifically, we need a key k to be removed from pending_cancel_txs while there is a service with key k in pending.
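
To make that bookkeeping concrete, here is a heavily simplified model of the invariant being argued for. This is a hypothetical sketch, not the actual tower-ready-cache code: `HashMap` stands in for the `IndexMap` used by the real cache, and the ready/pending plumbing is omitted.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use tokio::sync::oneshot;

struct CancelBookkeeping<K> {
    // One cancelation sender per key that currently has a pending service.
    pending_cancel_txs: HashMap<K, oneshot::Sender<()>>,
}

impl<K: Hash + Eq> CancelBookkeeping<K> {
    // push_pending: inserting a key cancels whatever service previously
    // occupied it, so at most one non-canceled service per key is pending.
    fn push_pending(&mut self, key: K) -> oneshot::Receiver<()> {
        let (tx, rx) = oneshot::channel();
        if let Some(prev_tx) = self.pending_cancel_txs.insert(key, tx) {
            let _ = prev_tx.send(()); // cancel the displaced service
        }
        rx // the new pending service polls this to observe cancelation
    }

    // poll_pending, ready case: a service that becomes ready without having
    // observed cancelation should still have its sender in the map.
    fn on_ready(&mut self, key: &K) {
        self.pending_cancel_txs
            .remove(key)
            .expect("missing cancelation"); // the panic in question
    }
}
```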

A key k is removed from pending_cancel_txs in one of three ways:

  1. A service with key k becomes ready.
  2. A service with key k fails.
  3. evict(k) is called.

Let's look at each of these in turn.

Case 1: A service with key k becomes ready. This removes k from pending_cancel_txs. For the panic to trigger, we need another service to be in pending with the same key k. There are two possible options:

  1. The remaining service was pushed before the ready service. In that case, when the ready service was pushed to pending, it must have found k in pending_cancel_txs and canceled it (by push_pending). The service canceled by that was either the remaining service, or some other service. If it was some other service (also with key k), that service must have canceled yet another service with key k when it was pushed. Ultimately, the remaining service must have been canceled. That means that when we poll the remaining service, we will hit the Canceled case, which does not encounter the panic.
  2. The ready service was pushed before the remaining service. This cannot happen, since when the remaining service was pushed, it must have cancelled an existing service with key k. That must (ultimately, as in 1.) have cancelled the service that just became ready. But it just became ready, so it wasn't cancelled. So we can't be in this case.

Case 2: A service with key k fails. Argument is the same as case 1.

Case 3: evict(k) is called. If a service with key k is in pending_cancel_txs, it will be cancelled. Therefore, a service with key k in pending will end up in the Canceled case, which does not trigger the panic. There cannot be any other services with key k in pending that aren't already going to return Canceled, because of the cancel in push_pending.

So, none of these cases allow the panic to trigger. Yet it does. So I must be missing something; I just don't know what. My one thought was that things would go wrong if a Pending that is polled immediately after being canceled somehow did not see the cancellation, but I don't think that's possible either, even with a work-stealing scheduler. And so I am stuck.


jonhoo commented Feb 20, 2020

To state ^ a bit more concisely, there doesn't seem to be a way to have multiple services in pending with the same key k where more than one of them will return something other than Canceled. And there doesn't seem to be a way to introduce a service into pending that doesn't have an entry in pending_cancel_txs. From my reading of the panic, one of those must be occurring, I just cannot come up with an execution in which that'd be the case.


olix0r commented Feb 20, 2020

This is, indeed, a very weird error. I have to admit ignorance of the newer std::future APIs. I would personally want to look at the translation from futures 0.1 to see if there's anything that changed semantically. Or perhaps there's a notification race that we haven't seen before. I'll refresh myself and see if I can come up with a better theory.


jonhoo commented Feb 20, 2020

The translation seems correct to me -- it doesn't really change the logic anywhere, and the move to pin is pretty much irrelevant for ReadyCache. The only difference I can see is that the old code had (in Pending::poll):

```rust
.expect("cancel sender lost")
.is_ready()
```

whereas the new one has

```rust
if let Poll::Ready(Ok(_)) = Pin::new(&mut fut).poll(cx) {
```

So if the oneshot sender is dropped, the old code would panic, whereas the new code would continue. Given that I wasn't seeing panics before though, I doubt this change is the culprit (although it probably should be changed to panic again).

It could be that the 0.2 scheduler has more weird interleavings or notification patterns, though given that all the methods on ReadyCache take &mut self, the only race I could really see is with the order of things returned by FuturesUnordered::poll_next. And I don't think a different order there could make this particular panic happen (for the swap_remove to happen twice for a key, there just need to be two elements with the same key; their order doesn't matter).


olix0r commented Feb 20, 2020

@jonhoo Your descriptions above match my understanding exactly. Thinking outside the box a bit... there are a few assumptions we could violate to produce this result, however:

  • If (somehow) the cancelation oneshot failed to signal on the reader, we could incorrectly return successfully more than once.
  • If (somehow) the FuturesUnordered eagerly polled and buffered results, a pending future could complete successfully before an evict was called, with poll_pending only processing the buffered result afterwards.

I've looked at the FuturesUnordered impl and it doesn't seem to do any sort of buffering of results, so that seems unlikely. Could it be possible that tokio::sync::oneshot does not reliably notify receivers under load?


jonhoo commented Feb 20, 2020

Yeah, I have looked at the FuturesUnordered impl myself before, and also do not believe it does any buffering. The receiver notification is the one I'm also currently suspicious of, though it would be crazy if a send on a oneshot would not be seen by a receiver polled by the same future. There's currently some discussion about this on Discord. Alice suggested:

> Initially key K contains a task A with associated cancel tx Atx in the map.
> A new task B with same key and cancel tx Btx is added.
>
>   1. self.pending_cancel_txs.insert inserts Btx and takes out Atx
>   2. self.pending.push(B) happens
>   3. self.poll_pending happens and decides to poll task A
>   4. Task A says it's done, and Btx is removed along with it.
>   5. c.send(()) is called on Atx
>   6. self.poll_pending happens again, this time B is polled and says it's done, but there's no cancel token in the map.
>
> The un-reordered version of my execution is that the poll_pending in 3. happened entirely before the push_pending, but 1. and 2. got reordered backwards into the preceding poll_pending.

Though that execution seems insane. It would mean that writing to a field through &mut self before yielding, and then reading that same field after yielding, could have the load not see the store.


jonhoo commented Feb 20, 2020

As an aside, this loom test passes, which gives me some confidence that operating on &mut self should have a happens-before:

```rust
#[test]
fn future_poll_happens_before() {
    use crate::sync::oneshot;

    loom::model(|| {
        let mut pool = mk_pool(2);

        let (tx, rx) = oneshot::channel();

        let jh = pool.spawn(async move {
            let mut x = 0;
            x = 1;
            rx.await.unwrap();
            assert_eq!(x, 1);
        });
        tx.send(()).unwrap();
        pool.block_on(jh).unwrap();
    });
}
```

Also, I filed #418 to restore the expect.


jonhoo commented Feb 20, 2020

Given that I can reproduce the error on EC2 under load, are there any assertions or other code snippets that would be useful if I tried to inject and then run where it fails?


olix0r commented Feb 20, 2020

Also, I don't love this option, but these expects are really intended to be more like debug_assert! -- there's no hard reason why we can't continue in this state (just create a new oneshot...). So we might consider being resilient in release code and failing with a debug_assert! instead?
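
A minimal sketch of that option, assuming a free-standing helper (`take_cancel_tx` is hypothetical; the real code operates on the cache's map field directly):

```rust
use std::collections::HashMap;
use std::fmt::Debug;
use std::hash::Hash;
use tokio::sync::oneshot;

// Look up and remove the cancelation sender for `key`. In debug builds a
// missing sender is treated as a bug; in release builds we just carry on.
fn take_cancel_tx<K: Hash + Eq + Debug>(
    txs: &mut HashMap<K, oneshot::Sender<()>>,
    key: &K,
) -> Option<oneshot::Sender<()>> {
    let tx = txs.remove(key);
    debug_assert!(tx.is_some(), "missing cancelation for key {:?}", key);
    tx
}
```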

But, we are somehow violating an invariant, so it would be good to get to the bottom of this. If I were debugging this, I'd probably add a K: fmt::Debug bound (if it's not otherwise available via tracing) and use trace! logging to track the exact state of things. It would also likely be useful to log when cancelling and when polling the cancellation in Pending, even if you can't easily wire through the key information.


olix0r commented Feb 20, 2020

Is it possible that we are somehow mismatching cancelation pairs? I.e. that the cancel obtained from pending_cancel_txs does not notify the cancel_rx?

```rust
Poll::Ready(Some(Ok((key, svc, cancel_rx)))) => {
    trace!("endpoint ready");
    let cancel_tx = self
        .pending_cancel_txs
        .swap_remove(&key)
        .expect("missing cancelation");
```


olix0r commented Feb 20, 2020

> Is it possible that we are somehow mismatching cancelation pairs?

I can't see how it would be possible... but it would be helpful to see the ordering of cancelation and poll_pending results immediately before a panic.


jonhoo commented Feb 20, 2020

Yeah, I also don't see how that could happen, because the service that is in pending and that won't already return Canceled must (I think) be the one whose tx is in pending_cancel_txs.

> these expects are really intended to be more like debug_assert!

I don't think those expects should ever fail, though, which is why I'm inclined to keep them as such. If they fail, it's because we are not upholding some invariant we were expecting to hold (as seems to be the case in this instance).

> If I were debugging this

If you have a patch in mind I'd be happy to run with it applied. It might generate some pretty large logs though, especially because there are multiple ReadyCaches in use at the same time. We may need to also include something like `"{:p}", self` in the output.


olix0r commented Feb 20, 2020

Here's a quick attempt to fill in some gaps: https://github.com/olix0r/tower/pull/new/ver/ready-cache-debug

Note that, if you are using tracing in your application, you can instrument the task with a debug_span!("mycache") and then set RUST_LOG=[mycache]=trace -- that should limit logs to only matching spans...
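
A sketch of that setup, assuming a hypothetical `drive_balancer` future standing in for whatever task drives the Buffer/Pool/Balance stack; the span name `mycache` matches the filter in the `RUST_LOG` example above (which uses tracing-subscriber's `EnvFilter` syntax):

```rust
use tracing::{debug_span, Instrument};

// Placeholder for the future that polls the balancer / ready-cache.
async fn drive_balancer() {
    // ...
}

// Must be called from within a Tokio runtime.
fn spawn_instrumented() {
    // All events emitted while this task runs are recorded inside the
    // "mycache" span, so RUST_LOG=[mycache]=trace shows only them.
    tokio::spawn(drive_balancer().instrument(debug_span!("mycache")));
}
```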


jonhoo commented Feb 20, 2020

I'll run with that once I get into the office tomorrow, thanks!

It's, err, tricky to set spans appropriately because the ReadyCache is relatively deep in the stack behind both Pool and Buffer. There are also multiple different stacks like that, and I do not know ahead of time which of them the panic occurs in, so I may just have to instrument all of them. I'll see what I can come up with.


jonhoo commented Feb 20, 2020

Interestingly, I found that I was able to reproduce the issue locally too if I just load the CPU enough (stress -c #CPUs-1 for example). With that, you may even be able to get one of the tests to fail directly, though I haven't had a chance to try that yet.


jonhoo commented Feb 21, 2020

Logs

I modified your log statements slightly to also include the failing key.


jonhoo commented Feb 21, 2020

Here's the log for just the failing key (key 1): https://jon.tsp.io/share/tower-ready-cache-panic-key.log


jonhoo commented Feb 21, 2020

I'm seeing a lot of

```
cache: poll_pending: endpoint ready key=4
cache: call_ready_index: service has been replaced in pending key=4
```

Which seems surprising.


jonhoo commented Feb 21, 2020

Ah, I think that one trace line is just placed incorrectly (it should be in an else block) -- will fix.


jonhoo commented Feb 21, 2020

Fixed (and updated the links above).


jonhoo commented Feb 21, 2020

Here are the last few events for the failing key:

```
  50.943531372s TRACE cache: poll_pending: endpoint ready key=2
  50.949534210s TRACE cache: poll_pending: endpoint ready key=2
  50.951870970s TRACE cache: poll_pending: endpoint ready key=2
  50.954457519s TRACE cache: poll_pending: endpoint ready key=2
  50.956442849s TRACE cache: poll_pending: endpoint ready key=2
  50.969499021s TRACE cache: poll_pending: endpoint ready key=2
  50.975538632s TRACE cache: poll_pending: endpoint ready key=2
  50.976581804s TRACE cache: poll_pending: endpoint ready key=2
  50.981410276s TRACE cache: poll_pending: endpoint ready key=2
  50.983052593s TRACE cache: poll_pending: endpoint ready key=2
  50.984634312s TRACE cache: poll_pending: endpoint ready key=2
  50.985693535s TRACE cache: poll_pending: endpoint ready key=2
  50.996814748s TRACE cache: poll_pending: endpoint ready key=2
  50.999420057s TRACE cache: poll_pending: endpoint ready key=2
  51.001615313s TRACE cache: evict: replacing service in pending key=2
  51.001726570s TRACE cache: poll_pending: endpoint ready key=2
  51.001734877s ERROR cache: cancel_tx.is_none key=2
```

Where `replacing service in pending` is the `if let Some` case in `evict()`.

All the other evictions look like we'd expect:

```
  50.419531385s TRACE cache: poll_pending: endpoint ready key=2
  50.432676441s TRACE cache: poll_pending: endpoint ready key=2
  50.433185867s TRACE cache: poll_pending: endpoint ready key=2
  50.437287077s TRACE cache: evict: replacing service in pending key=2
  50.437295539s DEBUG cache: poll_pending: endpoint canceled key=2
  50.467646284s TRACE cache: push: inserting a new service key=2
  50.468975639s TRACE cache: poll_pending: endpoint ready key=2
  50.473768557s TRACE cache: poll_pending: endpoint ready key=2
```


jonhoo commented Feb 21, 2020

As a sanity check, I also tried storing a copy of the service key inside CancelTx, and then asserting (with assert_eq!) that it matches the key we remove whenever we take a CancelTx out of pending_cancel_txs; those assertions never fail.

I also had a careful read through the code for oneshot, and it does have a synchronizing relationship between send and recv by virtue of an Acquire store and an AcqRel load on the inner State. So I don't think it should be possible for the recv to not see the send.

This in turn led me to wonder whether IndexMap::insert might sometimes return the CancelTx we insert rather than the one that was previously in the map. But even if that were the case, it probably wouldn't make any difference, since there would still only be one non-canceled service in pending, and that service's oneshot::Sender would still be in pending_cancel_txs. I also don't see any straight-up replacements for a key in the logs, only evicts followed by later pushes, so it's probably not even worth looking into.
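
For reference, a quick standalone check of the `insert` contract in question, shown here with `HashMap` (`IndexMap::insert` documents the same return value, namely the value previously stored under the key):

```rust
use std::collections::HashMap;

fn main() {
    let mut map = HashMap::new();
    assert_eq!(map.insert("k", "old"), None);
    // The *previous* value comes back, not the one we just inserted.
    assert_eq!(map.insert("k", "new"), Some("old"));
    assert_eq!(map.get("k"), Some(&"new"));
}
```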


jonhoo commented Feb 21, 2020

And here's another: even polling the rx a second time when a service returns ready does not fix the issue (it still returns Poll::Pending).


jonhoo commented Feb 21, 2020

Another data-point: adding a sync::atomic::fence(SeqCst) before oneshot::Receiver::poll in Pending::poll does not fix the issue.


jonhoo commented Feb 21, 2020

And here's another: if I replace oneshot with mpsc, the panic still occurs, so this is not a bug in oneshot (unless the same bug somehow exists in mpsc).

jonhoo added a commit that referenced this issue Feb 24, 2020
It's been observed that occasionally tower-ready-cache would panic
trying to find an already canceled service in `pending_cancel_txs`
(#415). The source of the race is not entirely clear, but extensive
debugging demonstrated that occasionally a call to `evict` would send on
the `CancelTx` for a service, yet that service would be yielded back
from `pending` in `poll_pending` in a non-`Canceled` state. This
is equivalent to saying that this code may panic:

```rust
async {
  let (tx, rx) = oneshot::channel();
  tx.send(42).unwrap();
  yield_once().await;
  rx.try_recv().unwrap(); // <- may occasionally panic
}
```

I have not been able to demonstrate a self-contained example failing in
this way, but it's the only explanation I have found for the observed
bug. Pinning the entire runtime to one core still produced the bug,
which indicates that it is not a memory ordering issue. Replacing
oneshot with `mpsc::channel(1)` still produced the bug, which indicates
that the bug is not with the implementation of `oneshot`. Logs also
indicate that the `ChannelTx` we send on in `evict()` truly is the same
one associated with the `ChannelRx` polled in `Pending::poll`, so we're
not getting our wires crossed somewhere. It truly is bizarre.

This patch resolves the issue by considering a failure to find a
ready/errored service's `CancelTx` as another signal that a service has
been removed. Specifically, if `poll_pending` finds a service that
returns `Ok` or `Err`, but does _not_ find its `CancelTx`, then it
assumes that it must be because the service _was_ canceled, but did not
observe that cancellation signal.

As an explanation, this isn't entirely satisfactory, since we do not
fully understand the underlying problem. It _may_ be that a canceled
service could remain in the pending state for a very long time if it
does not become ready _and_ does not see the cancellation signal (so it
returns `Poll::Pending` and is not removed). That, in turn, might cause
an issue if the driver of the `ReadyCache` then chooses to re-use a key
they believe they have evicted. However, any such case _must_ first hit
the panic that exists in the code today, so this is still an improvement
over the status quo.

Fixes #415.
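
The snippet above leaves `yield_once` undefined; one possible definition (not part of the commit) is a future that returns `Poll::Pending` exactly once and immediately wakes itself, forcing the task to yield back to the scheduler:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

fn yield_once() -> impl Future<Output = ()> {
    struct YieldOnce(bool);

    impl Future for YieldOnce {
        type Output = ();
        fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
            if self.0 {
                Poll::Ready(())
            } else {
                // Yield once: report Pending, but schedule ourselves to be
                // polled again right away.
                self.0 = true;
                cx.waker().wake_by_ref();
                Poll::Pending
            }
        }
    }

    YieldOnce(false)
}
```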
olix0r added a commit to olix0r/tower that referenced this issue Jun 16, 2021
linkerd/linkerd2#6086 describes an issue that sounds closely related to
tower-rs#415: There's some sort of consistency issue between the
ready-cache's pending stream and its set of cancelations. Where the
latter issue describes triggering a panic in the stream receiver, the
former describes triggering a panic in the stream implementation.

There's no logical reason why we can't continue to operate in this
scenario, though it does indicate a real correctness issue.

So, this change prevents panicking in this scenario when not building
with debugging. Instead, we now emit WARN-level logs so that we have a
clearer signal they're occurring.

Finally, this change also adds `Debug` constraints to the cache's key
types (and hence the balancer's key types) so that we can more
reasonably debug this behavior.
hawkw added a commit that referenced this issue Jun 16, 2022
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
olix0r added a commit that referenced this issue Jun 17, 2022
`tokio::task` enforces a cooperative scheduling regime that can cause
`oneshot::Receiver::poll` to return pending after the sender has sent an
update. `ReadyCache` uses a oneshot to notify pending services that they
should not become ready. When a cancelation is not observed, the ready
cache returns service instances that should have been canceled, which
breaks assumptions and causes an invalid state.

Fixes #415

Co-authored-by: Eliza Weisman <eliza@buoyant.io>
hawkw added a commit that referenced this issue Jun 17, 2022
`tokio::task` enforces a cooperative scheduling regime that can cause
`oneshot::Receiver::poll` to return pending after the sender has sent an
update. `ReadyCache` uses a oneshot to notify pending services that they
should not become ready. When a cancelation is not observed, the ready
cache returns service instances that should have been canceled, which
breaks assumptions and causes an invalid state.

This branch replaces the use of `tokio::sync::oneshot` for canceling 
pending futures with a custom cancelation handle using an `AtomicBool`
and `futures::task::AtomicWaker`. This ensures that canceled `Pending`
services are always woken even when the task's budget is exceeded.
Additionally, cancelation status is now always known to the `Pending`
future, by checking the `AtomicBool` immediately on polls, even in cases
where the canceled `Pending` future was woken by the inner `Service`
becoming ready, rather than by the cancelation.

Fixes #415

Signed-off-by: Oliver Gould <ver@buoyant.io>
Co-authored-by: Eliza Weisman <eliza@buoyant.io>
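
A minimal sketch of the cancelation handle described above, assuming an `AtomicBool` plus `futures::task::AtomicWaker`; this is illustrative only, not the actual tower implementation:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::Context;

use futures::task::AtomicWaker;

#[derive(Clone)]
struct Cancel {
    canceled: Arc<AtomicBool>,
    waker: Arc<AtomicWaker>,
}

impl Cancel {
    fn new() -> Self {
        Cancel {
            canceled: Arc::new(AtomicBool::new(false)),
            waker: Arc::new(AtomicWaker::new()),
        }
    }

    // Sender side (evict): mark canceled, then wake the pending task. The
    // wake is not subject to any cooperative-scheduling budget.
    fn cancel(&self) {
        self.canceled.store(true, Ordering::Release);
        self.waker.wake();
    }

    // Receiver side (Pending::poll): register the waker first, then check
    // the flag, so a concurrent cancel can never be missed.
    fn is_canceled(&self, cx: &Context<'_>) -> bool {
        self.waker.register(cx.waker());
        self.canceled.load(Ordering::Acquire)
    }
}
```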
hawkw added a commit that referenced this issue Jun 17, 2022
# 0.4.13 (June 17, 2022)

### Added

- **load_shed**: Public constructor for `Overloaded` error ([#661])

### Fixed

- **util**: Fix hang with `call_all` when the `Stream` of requests is
  pending ([#656])
- **ready_cache**: Ensure cancelation is observed by pending services
  ([#668], fixes [#415])
- **docs**: Fix a missing section header due to a typo ([#646])
- **docs**: Fix broken links to `Service` trait ([#659])

[#661]: #661
[#656]: #656
[#668]: #668
[#415]: #415
[#646]: #646
[#659]: #659
hawkw added a commit to linkerd/linkerd2-proxy that referenced this issue Jun 17, 2022
Tower [v0.4.13] includes a fix for a bug in the `tower::ready_cache`
module, tower-rs/tower#415. The `ready_cache` module is used internally
in Tower's load balancer. This bug resulted in panics in the proxy
(linkerd/linkerd2#8666, linkerd/linkerd2#6086) in cases where the
Destination service sends a very large number of service discovery
updates (see linkerd/linkerd2#8677).

This commit updates the proxy's dependency on `tower` to 0.4.13, to
ensure that this bugfix is picked up.

Fixes linkerd/linkerd2#8666
Fixes linkerd/linkerd2#6086

[v0.4.13]: https://github.com/tower-rs/tower/releases/tag/tower-0.4.13