
Panicked at 'missing cancelation' in tower-ready-cache #415

Closed
jonhoo opened this issue Feb 13, 2020 · 31 comments · Fixed by #420 or #668


jonhoo commented Feb 13, 2020

I just had my code hit this .expect in tower-ready-cache. It's using tower-balance::pool (and thus also tower-balance::p2c). Not sure what caused it, but figured maybe @olix0r can tease it out from the backtrace?

```
thread 'tokio-runtime-worker' panicked at 'missing cancelation', /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-ready-cache-0.3.0/src/cache.rs:236:37
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:77
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:59
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1052
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1428
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:62
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:49
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:204
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:224
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:470
  11: rust_begin_unwind
             at src/libstd/panicking.rs:378
  12: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
  13: core::option::expect_failed
             at src/libcore/option.rs:1203
  14: core::option::Option<T>::expect
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libcore/option.rs:347
  15: tower_ready_cache::cache::ReadyCache<K,S,Req>::poll_pending
             at ./.cargo/registry/src/github.com-1ecc6299db9ec823/tower-ready-cache-0.3.0/src/cache.rs:236
  16: tower_balance::p2c::service::Balance<D,Req>::promote_pending_to_ready
             at ./.cargo/registry/src/github.com-1ecc6299db9ec823/tower-balance-0.3.0/src/p2c/service.rs:151
  17: <tower_balance::p2c::service::Balance<D,Req> as tower_service::Service<Req>>::poll_ready
             at ./.cargo/registry/src/github.com-1ecc6299db9ec823/tower-balance-0.3.0/src/p2c/service.rs:238
  18: <tower_balance::pool::Pool<MS,Target,Req> as tower_service::Service<Req>>::poll_ready
             at ./.cargo/registry/src/github.com-1ecc6299db9ec823/tower-balance-0.3.0/src/pool/mod.rs:373
  19: <tower_buffer::worker::Worker<T,Request> as core::future::future::Future>::poll
             at ./.cargo/registry/src/github.com-1ecc6299db9ec823/tower-buffer-0.3.0/src/worker.rs:169
  20: tokio::task::core::Core<T>::poll
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/core.rs:128
  21: tokio::task::harness::Harness<T,S>::poll::{{closure}}::{{closure}}
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/harness.rs:119
  22: core::ops::function::FnOnce::call_once
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libcore/ops/function.rs:232
  23: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/panic.rs:318
  24: std::panicking::try::do_call
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/panicking.rs:303
  25: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:86
  26: std::panicking::try
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/panicking.rs:281
  27: std::panic::catch_unwind
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/panic.rs:394
  28: tokio::task::harness::Harness<T,S>::poll::{{closure}}
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/harness.rs:100
  29: tokio::loom::std::causal_cell::CausalCell<T>::with_mut
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/loom/std/causal_cell.rs:41
  30: tokio::task::harness::Harness<T,S>::poll
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/harness.rs:99
  31: tokio::task::raw::RawTask::poll
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/raw.rs:113
  32: tokio::task::Task<S>::run
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/mod.rs:381
  33: tokio::runtime::thread_pool::worker::GenerationGuard::run_task
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/thread_pool/worker.rs:459
  34: tokio::runtime::thread_pool::worker::GenerationGuard::process_available_work
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/thread_pool/worker.rs:317
  35: tokio::runtime::thread_pool::worker::GenerationGuard::run
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/thread_pool/worker.rs:282
  36: tokio::runtime::thread_pool::worker::Worker::run::{{closure}}::{{closure}}
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/thread_pool/worker.rs:160
  37: std::thread::local::LocalKey<T>::try_with
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/thread/local.rs:262
  38: std::thread::local::LocalKey<T>::with
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/thread/local.rs:239
  39: tokio::runtime::thread_pool::worker::Worker::run::{{closure}}
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/thread_pool/worker.rs:136
  40: tokio::runtime::thread_pool::current::set::{{closure}}
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/thread_pool/current.rs:47
  41: std::thread::local::LocalKey<T>::try_with
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/thread/local.rs:262
  42: std::thread::local::LocalKey<T>::with
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/thread/local.rs:239
  43: tokio::runtime::thread_pool::current::set
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/thread_pool/current.rs:29
  44: tokio::runtime::thread_pool::worker::Worker::run
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/thread_pool/worker.rs:132
  45: tokio::runtime::thread_pool::Workers::spawn::{{closure}}::{{closure}}
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/thread_pool/mod.rs:113
  46: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/blocking/task.rs:30
  47: tokio::task::core::Core<T>::poll
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/core.rs:128
  48: tokio::task::harness::Harness<T,S>::poll::{{closure}}::{{closure}}
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/harness.rs:119
  49: core::ops::function::FnOnce::call_once
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libcore/ops/function.rs:232
  50: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/panic.rs:318
  51: std::panicking::try::do_call
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/panicking.rs:303
  52: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:86
  53: std::panicking::try
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/panicking.rs:281
  54: std::panic::catch_unwind
             at /rustc/fc23a81831d5b41510d3261c20c34dd8d32f0f31/src/libstd/panic.rs:394
  55: tokio::task::harness::Harness<T,S>::poll::{{closure}}
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/harness.rs:100
  56: tokio::loom::std::causal_cell::CausalCell<T>::with_mut
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/loom/std/causal_cell.rs:41
  57: tokio::task::harness::Harness<T,S>::poll
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/harness.rs:99
  58: tokio::task::raw::RawTask::poll
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/raw.rs:113
  59: tokio::task::Task<S>::run
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/task/mod.rs:381
  60: tokio::runtime::blocking::pool::run_task
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/blocking/pool.rs:311
  61: tokio::runtime::blocking::pool::Inner::run
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/blocking/pool.rs:230
  62: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}::{{closure}}
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/blocking/pool.rs:210
  63: tokio::runtime::context::enter
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/context.rs:72
  64: tokio::runtime::handle::Handle::enter
             at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/handle.rs:34
  65: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
              at ./.cargo/git/checkouts/tokio-377c595163f99a10/3fb217f/tokio/src/runtime/blocking/pool.rs:209
```

jonhoo commented Feb 13, 2020

I can replicate this pretty reliably at high load in noria. The backtrace seems to always be the same.


jonhoo commented Feb 13, 2020

@olix0r is there any particular state you'd like me to print out at the time of the panic that might aid in debugging this? I can only replicate the issue at fairly high load, so full logs will be tricky (and probably not very useful).


jonhoo commented Feb 13, 2020

As an aside, I cannot replicate this easily on a local machine, even when it is heavily loaded, but it happens in about 1/3 of runs on EC2, which suggests that it is indeed a race condition triggered by scheduling noise (which EC2 has a lot of).


jonhoo commented Feb 17, 2020

I'm quite stuck on this... @olix0r any suggestions for how to go about debugging this? Any idea what the root cause might be?

jonhoo added a commit that referenced this issue Feb 20, 2020
The hope for this was to reproduce #415 (which it does not sadly), but
at least it adds a test for p2c!

jonhoo commented Feb 20, 2020

I've been stepping through the code, and cannot understand how the panic can possibly be triggered. For that to happen, we need to have an entry in pending whose key is not in pending_cancel_txs. The only way for an entry to be added to pending is push_pending. Since push_pending ensures that the pushed service's key is in pending_cancel_txs, we know that this can only happen if a key is removed from pending_cancel_txs. And specifically, we need a key k to be removed from pending_cancel_txs while there is a service with key k in pending.
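
To make that bookkeeping concrete, here is a heavily simplified model of the invariant being argued for. This is a hypothetical sketch, not the actual tower-ready-cache code: `HashMap` stands in for the `IndexMap` used by the real cache, and the ready/pending plumbing is omitted.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use tokio::sync::oneshot;

struct CancelBookkeeping<K> {
    // One cancelation sender per key that currently has a pending service.
    pending_cancel_txs: HashMap<K, oneshot::Sender<()>>,
}

impl<K: Hash + Eq> CancelBookkeeping<K> {
    // push_pending: inserting a key cancels whatever service previously
    // occupied it, so at most one non-canceled service per key is pending.
    fn push_pending(&mut self, key: K) -> oneshot::Receiver<()> {
        let (tx, rx) = oneshot::channel();
        if let Some(prev_tx) = self.pending_cancel_txs.insert(key, tx) {
            let _ = prev_tx.send(()); // cancel the displaced service
        }
        rx // the new pending service polls this to observe cancelation
    }

    // poll_pending, ready case: a service that becomes ready without having
    // observed cancelation should still have its sender in the map.
    fn on_ready(&mut self, key: &K) {
        self.pending_cancel_txs
            .remove(key)
            .expect("missing cancelation"); // the panic in question
    }
}
```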

A key k is removed from pending_cancel_txs in one of three ways:

  1. A service with key k becomes ready.
  2. A service with key k fails.
  3. evict(k) is called.

Let's look at each of these in turn.

Case 1: A service with key k becomes ready. This removes k from pending_cancel_txs. For the panic to trigger, we need another service to be in pending with the same key k. There are two possible options:

  1. The remaining service was pushed before the ready service. In that case, when the ready service was pushed to pending, it must have found k in pending_cancel_txs and canceled it (by push_pending). The service canceled by that was either the remaining service, or some other service. If it was some other service (also with key k), that service must have canceled yet another service with key k when it was pushed. Ultimately, the remaining service must have been canceled. That means that when we poll the remaining service, we will hit the Canceled case, which does not encounter the panic.
  2. The ready service was pushed before the remaining service. This cannot happen, since when the remaining service was pushed, it must have cancelled an existing service with key k. That must (ultimately, as in 1.) have cancelled the service that just became ready. But it just became ready, so it wasn't cancelled. So we can't be in this case.

Case 2: A service with key k fails. Argument is the same as case 1.

Case 3: evict(k) is called. If a service with key k is in pending_cancel_txs, it will be cancelled. Therefore, a service with key k in pending will end up in the Canceled case, which does not trigger the panic. There cannot be any other services with key k in pending that aren't already going to return Canceled, because of the cancel in push_pending.

So, none of these cases allow the panic to trigger. Yet it does. So I must be missing something; I just don't know what. My one thought was that things would go wrong if a Pending that is polled immediately after being canceled somehow did not see the cancellation, but I don't think that's possible either, even with a work-stealing scheduler. And so I am stuck.


jonhoo commented Feb 20, 2020

To state ^ a bit more concisely, there doesn't seem to be a way to have multiple services in pending with the same key k where more than one of them will return something other than Canceled. And there doesn't seem to be a way to introduce a service into pending that doesn't have an entry in pending_cancel_txs. From my reading of the panic, one of those must be occurring, I just cannot come up with an execution in which that'd be the case.


olix0r commented Feb 20, 2020

This is, indeed, a very weird error. I have to admit ignorance of the newer std::future APIs. I would personally want to look at the translation from futures 0.1 to see if there's anything that changed semantically. Or perhaps there's a notification race that we haven't seen before. I'll refresh myself and see if I can come up with a better theory.


jonhoo commented Feb 20, 2020

The translation seems correct to me -- it doesn't really change the logic anywhere, and the move to pin is pretty much irrelevant for ReadyCache. The only difference I can see is that the old code had (in Pending::poll):

```rust
.expect("cancel sender lost")
.is_ready()
```

whereas the new one has

```rust
if let Poll::Ready(Ok(_)) = Pin::new(&mut fut).poll(cx) {
```

So if the oneshot sender is dropped, the old code would panic, whereas the new code would continue. Given that I wasn't seeing panics before though, I doubt this change is the culprit (although it probably should be changed to panic again).

It could be that the 0.2 scheduler has more weird interleavings or notification patterns, though given that all the methods on ReadyCache take &mut self, the only race I could really see is with the order of things returned by FuturesUnordered::poll_next. And I don't think a different order there could make this particular panic happen (for the swap_remove to happen twice for a key, there just need to be two elements with the same key; their order doesn't matter).


olix0r commented Feb 20, 2020

@jonhoo Your descriptions above match my understanding exactly. Thinking outside the box a bit... there are a few assumptions we could violate to produce this result, however:

  • If (somehow) the cancelation oneshot failed to signal on the reader, we could incorrectly return successfully more than once.
  • If (somehow) the FuturesUnordered eagerly polled and buffered results, a pending future could complete successfully before an evict was called, with poll_pending only processing the buffered result afterwards.

I've looked at the FuturesUnordered impl and it doesn't seem to do any sort of buffering of results, so that seems unlikely. Could it be possible that tokio::sync::oneshot does not reliably notify receivers under load?


jonhoo commented Feb 20, 2020

Yeah, I have looked at the FuturesUnordered impl myself before, and also do not believe it does any buffering. The receiver notification is the one I'm also currently suspicious of, though it would be crazy if a send on a oneshot would not be seen by a receiver polled by the same future. There's currently some discussion about this on Discord. Alice suggested:

> Initially key K contains a task A with associated cancel tx Atx in the map.
> A new task B with same key and cancel tx Btx is added.
>
>   1. self.pending_cancel_txs.insert inserts Btx and takes out Atx
>   2. self.pending.push(B) happens
>   3. self.poll_pending happens and decides to poll task A
>   4. Task A says it's done, and Btx is removed along with it.
>   5. c.send(()) is called on Atx
>   6. self.poll_pending happens again, this time B is polled and says it's done, but there's no cancel token in the map.
>
> The un-reordered version of my execution is that the poll_pending in 3. happened entirely before the push_pending, but 1. and 2. got reordered backwards into the preceding poll_pending.

Though that execution seems insane. It would mean that writing to a field through &mut self before yielding, and then reading that same field after yielding, could have the load not see the store.


jonhoo commented Feb 20, 2020

As an aside, this loom test passes, which gives me some confidence that operating on &mut self should have a happens-before:

```rust
#[test]
fn future_poll_happens_before() {
    use crate::sync::oneshot;

    loom::model(|| {
        let mut pool = mk_pool(2);

        let (tx, rx) = oneshot::channel();

        let jh = pool.spawn(async move {
            let mut x = 0;
            x = 1;
            rx.await.unwrap();
            assert_eq!(x, 1);
        });
        tx.send(()).unwrap();
        pool.block_on(jh).unwrap();
    });
}
```

Also, I filed #418 to restore the expect.


jonhoo commented Feb 20, 2020

Given that I can reproduce the error on EC2 under load, are there any assertions or other code snippets that would be useful if I tried to inject and then run where it fails?


olix0r commented Feb 20, 2020

Also, I don't love this option, but these expects are really intended to be more like debug_assert! -- there's no hard reason why we can't continue in this state (just create a new oneshot...). So we might consider being resilient in release code and failing with a debug_assert! instead?
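
A minimal sketch of that option, assuming a free-standing helper (`take_cancel_tx` is hypothetical; the real code operates on the cache's map field directly):

```rust
use std::collections::HashMap;
use std::fmt::Debug;
use std::hash::Hash;
use tokio::sync::oneshot;

// Look up and remove the cancelation sender for `key`. In debug builds a
// missing sender is treated as a bug; in release builds we just carry on.
fn take_cancel_tx<K: Hash + Eq + Debug>(
    txs: &mut HashMap<K, oneshot::Sender<()>>,
    key: &K,
) -> Option<oneshot::Sender<()>> {
    let tx = txs.remove(key);
    debug_assert!(tx.is_some(), "missing cancelation for key {:?}", key);
    tx
}
```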

But, we are somehow violating an invariant, so it would be good to get to the bottom of this. If I were debugging this, I'd probably add a K: fmt::Debug bound (if it's not otherwise available via tracing) and use trace! logging to track the exact state of things. It would also likely be useful to log when cancelling and when polling the cancellation in Pending, even if you can't easily wire through the key information.


olix0r commented Feb 20, 2020

Is it possible that we are somehow mismatching cancelation pairs? I.e. that the cancel obtained from pending_cancel_txs does not notify the cancel_rx?

```rust
Poll::Ready(Some(Ok((key, svc, cancel_rx)))) => {
    trace!("endpoint ready");
    let cancel_tx = self
        .pending_cancel_txs
        .swap_remove(&key)
        .expect("missing cancelation");
```


olix0r commented Feb 20, 2020

> Is it possible that we are somehow mismatching cancelation pairs?

I can't see how it would be possible... but it would be helpful to see the ordering of cancelation and poll_pending results immediately before a panic.


jonhoo commented Feb 20, 2020

Yeah, I also don't see how that could happen, because the service that is in pending and that won't already return Canceled must (I think) be the one whose tx is in pending_cancel_txs.

> these expects are really intended to be more like debug_assert!

I don't think those expects should ever fail, though, which is why I'm inclined to keep them as such. If they fail, it's because we are not upholding some invariant we were expecting to hold (as seems to be the case in this instance).

> If I were debugging this

If you have a patch in mind I'd be happy to run with it applied. It might generate some pretty large logs though, especially because there are multiple ReadyCaches in use at the same time. We may need to also include something like `"{:p}", self` in the output.


olix0r commented Feb 20, 2020

Here's a quick attempt to fill in some gaps: https://github.com/olix0r/tower/pull/new/ver/ready-cache-debug

Note that, if you are using tracing in your application, you can instrument the task with a debug_span!("mycache") and then set RUST_LOG=[mycache]=trace -- that should limit logs to only matching spans...
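
A sketch of that setup, assuming a hypothetical `drive_balancer` future standing in for whatever task drives the Buffer/Pool/Balance stack; the span name `mycache` matches the filter in the `RUST_LOG` example above (which uses tracing-subscriber's `EnvFilter` syntax):

```rust
use tracing::{debug_span, Instrument};

// Placeholder for the future that polls the balancer / ready-cache.
async fn drive_balancer() {
    // ...
}

// Must be called from within a Tokio runtime.
fn spawn_instrumented() {
    // All events emitted while this task runs are recorded inside the
    // "mycache" span, so RUST_LOG=[mycache]=trace shows only them.
    tokio::spawn(drive_balancer().instrument(debug_span!("mycache")));
}
```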


jonhoo commented Feb 20, 2020

I'll run with that once I get into the office tomorrow, thanks!

It's, err, tricky to set spans appropriately because the ReadyCache is relatively deep in the stack behind both Pool and Buffer. There are also multiple different stacks like that, and I do not know ahead of time which of them the panic occurs in, so I may just have to instrument all of them. I'll see what I can come up with.


jonhoo commented Feb 20, 2020

Interestingly, I found that I was able to reproduce the issue locally too if I just load the CPU enough (stress -c #CPUs-1 for example). With that, you may even be able to get one of the tests to fail directly, though I haven't had a chance to try that yet.


jonhoo commented Feb 21, 2020

Logs

I modified your log statements slightly to also include the failing key.


jonhoo commented Feb 21, 2020

Here's the log for just the failing key (key 1): https://jon.tsp.io/share/tower-ready-cache-panic-key.log


jonhoo commented Feb 21, 2020

I'm seeing a lot of

```
cache: poll_pending: endpoint ready key=4
cache: call_ready_index: service has been replaced in pending key=4
```

Which seems surprising.


jonhoo commented Feb 21, 2020

Ah, I think that one trace line is just placed incorrectly (it should be in an else block) -- will fix.


jonhoo commented Feb 21, 2020

Fixed (and updated the links above).


jonhoo commented Feb 21, 2020

Here are the last few events for the failing key:

```
  50.943531372s TRACE cache: poll_pending: endpoint ready key=2
  50.949534210s TRACE cache: poll_pending: endpoint ready key=2
  50.951870970s TRACE cache: poll_pending: endpoint ready key=2
  50.954457519s TRACE cache: poll_pending: endpoint ready key=2
  50.956442849s TRACE cache: poll_pending: endpoint ready key=2
  50.969499021s TRACE cache: poll_pending: endpoint ready key=2
  50.975538632s TRACE cache: poll_pending: endpoint ready key=2
  50.976581804s TRACE cache: poll_pending: endpoint ready key=2
  50.981410276s TRACE cache: poll_pending: endpoint ready key=2
  50.983052593s TRACE cache: poll_pending: endpoint ready key=2
  50.984634312s TRACE cache: poll_pending: endpoint ready key=2
  50.985693535s TRACE cache: poll_pending: endpoint ready key=2
  50.996814748s TRACE cache: poll_pending: endpoint ready key=2
  50.999420057s TRACE cache: poll_pending: endpoint ready key=2
  51.001615313s TRACE cache: evict: replacing service in pending key=2
  51.001726570s TRACE cache: poll_pending: endpoint ready key=2
  51.001734877s ERROR cache: cancel_tx.is_none key=2
```

Where `replacing service in pending` is the `if let Some` case in `evict()`.

All the other evictions look like we'd expect:

```
  50.419531385s TRACE cache: poll_pending: endpoint ready key=2
  50.432676441s TRACE cache: poll_pending: endpoint ready key=2
  50.433185867s TRACE cache: poll_pending: endpoint ready key=2
  50.437287077s TRACE cache: evict: replacing service in pending key=2
  50.437295539s DEBUG cache: poll_pending: endpoint canceled key=2
  50.467646284s TRACE cache: push: inserting a new service key=2
  50.468975639s TRACE cache: poll_pending: endpoint ready key=2
  50.473768557s TRACE cache: poll_pending: endpoint ready key=2
```


jonhoo commented Feb 21, 2020

As a sanity check, I also tried storing a copy of the service key inside CancelTx, and then asserting (with assert_eq!) that it matches the key we remove whenever we take a CancelTx out of pending_cancel_txs; those assertions never fail.

I also had a careful read through the code for oneshot, and it does have a synchronizing relationship between send and recv by virtue of an Acquire store and an AcqRel load on the inner State. So I don't think it should be possible for the recv to not see the send.

This in turn led me to wonder whether IndexMap::insert might sometimes return the CancelTx we insert rather than the one that was previously in the map. But even if that were the case, it probably wouldn't make any difference, since there would still only be one non-canceled service in pending, and that service's oneshot::Sender would still be in pending_cancel_txs. I also don't see any straight-up replacements for a key in the logs, only evicts followed by later pushes, so it's probably not even worth looking into.
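
For reference, a quick standalone check of the `insert` contract in question, shown here with `HashMap` (`IndexMap::insert` documents the same return value, namely the value previously stored under the key):

```rust
use std::collections::HashMap;

fn main() {
    let mut map = HashMap::new();
    assert_eq!(map.insert("k", "old"), None);
    // The *previous* value comes back, not the one we just inserted.
    assert_eq!(map.insert("k", "new"), Some("old"));
    assert_eq!(map.get("k"), Some(&"new"));
}
```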


jonhoo commented Feb 21, 2020

And here's another: even polling the rx a second time when a service returns ready does not fix the issue (it still returns Poll::Pending).


jonhoo commented Feb 21, 2020

Another data-point: adding a sync::atomic::fence(SeqCst) before oneshot::Receiver::poll in Pending::poll does not fix the issue.


jonhoo commented Feb 21, 2020

And here's another: if I replace oneshot with mpsc, the panic still occurs, so this is not a bug in oneshot (unless the same bug somehow exists in mpsc).

jonhoo added a commit that referenced this issue Feb 24, 2020
It's been observed that occasionally tower-ready-cache would panic
trying to find an already canceled service in `pending_cancel_txs`
(#415). The source of the race is not entirely clear, but extensive
debugging demonstrated that occasionally a call to `evict` would send on
the `CancelTx` for a service, yet that service would be yielded back
from `pending` in `poll_pending` in a non-`Canceled` state. This
is equivalent to saying that this code may panic:

```rust
async {
  let (tx, rx) = oneshot::channel();
  tx.send(42).unwrap();
  yield_once().await;
  rx.try_recv().unwrap(); // <- may occasionally panic
}
```

I have not been able to demonstrate a self-contained example failing in
this way, but it's the only explanation I have found for the observed
bug. Pinning the entire runtime to one core still produced the bug,
which indicates that it is not a memory ordering issue. Replacing
oneshot with `mpsc::channel(1)` still produced the bug, which indicates
that the bug is not with the implementation of `oneshot`. Logs also
indicate that the `ChannelTx` we send on in `evict()` truly is the same
one associated with the `ChannelRx` polled in `Pending::poll`, so we're
not getting our wires crossed somewhere. It truly is bizarre.

This patch resolves the issue by considering a failure to find a
ready/errored service's `CancelTx` as another signal that a service has
been removed. Specifically, if `poll_pending` finds a service that
returns `Ok` or `Err`, but does _not_ find its `CancelTx`, then it
assumes that it must be because the service _was_ canceled, but did not
observe that cancellation signal.

As an explanation, this isn't entirely satisfactory, since we do not
fully understand the underlying problem. It _may_ be that a canceled
service could remain in the pending state for a very long time if it
does not become ready _and_ does not see the cancellation signal (so it
returns `Poll::Pending` and is not removed). That, in turn, might cause
an issue if the driver of the `ReadyCache` then chooses to re-use a key
they believe they have evicted. However, any such case _must_ first hit
the panic that exists in the code today, so this is still an improvement
over the status quo.

Fixes #415.
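
The snippet above leaves `yield_once` undefined; one possible definition (not part of the commit) is a future that returns `Poll::Pending` exactly once and immediately wakes itself, forcing the task to yield back to the scheduler:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

fn yield_once() -> impl Future<Output = ()> {
    struct YieldOnce(bool);

    impl Future for YieldOnce {
        type Output = ();
        fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
            if self.0 {
                Poll::Ready(())
            } else {
                // Yield once: report Pending, but schedule ourselves to be
                // polled again right away.
                self.0 = true;
                cx.waker().wake_by_ref();
                Poll::Pending
            }
        }
    }

    YieldOnce(false)
}
```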
olix0r added a commit to olix0r/tower that referenced this issue Jun 16, 2021
linkerd/linkerd2#6086 describes an issue that sounds closely related to
tower-rs#415: There's some sort of consistency issue between the
ready-cache's pending stream and its set of cancelations. Where the
latter issue describes triggering a panic in the stream receiver, the
former describes triggering a panic in the stream implementation.

There's no logical reason why we can't continue to operate in this
scenario, though it does indicate a real correctness issue.

So, this change prevents panicking in this scenario when not building
with debugging. Instead, we now emit WARN-level logs so that we have a
clearer signal they're occurring.

Finally, this change also adds `Debug` constraints to the cache's key
types (and hence the balancer's key types) so that we can more
reasonably debug this behavior.
hawkw added a commit that referenced this issue Jun 16, 2022
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
olix0r added a commit that referenced this issue Jun 17, 2022
`tokio::task` enforces a cooperative scheduling regime that can cause
`oneshot::Receiver::poll` to return pending after the sender has sent an
update. `ReadyCache` uses a oneshot to notify pending services that they
should not become ready. When a cancelation is not observed, the ready
cache returns service instances that should have been canceled, which
breaks assumptions and causes an invalid state.

Fixes #415

Co-authored-by: Eliza Weisman <eliza@buoyant.io>
hawkw added a commit that referenced this issue Jun 17, 2022
`tokio::task` enforces a cooperative scheduling regime that can cause
`oneshot::Receiver::poll` to return pending after the sender has sent an
update. `ReadyCache` uses a oneshot to notify pending services that they
should not become ready. When a cancelation is not observed, the ready
cache returns service instances that should have been canceled, which
breaks assumptions and causes an invalid state.

This branch replaces the use of `tokio::sync::oneshot` for canceling 
pending futures with a custom cancelation handle using an `AtomicBool`
and `futures::task::AtomicWaker`. This ensures that canceled `Pending`
services are always woken even when the task's budget is exceeded.
Additionally, cancelation status is now always known to the `Pending`
future, by checking the `AtomicBool` immediately on polls, even in cases
where the canceled `Pending` future was woken by the inner `Service`
becoming ready, rather than by the cancelation.

Fixes #415

Signed-off-by: Oliver Gould <ver@buoyant.io>
Co-authored-by: Eliza Weisman <eliza@buoyant.io>
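
A minimal sketch of the cancelation handle described above, assuming an `AtomicBool` plus `futures::task::AtomicWaker`; this is illustrative only, not the actual tower implementation:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::Context;

use futures::task::AtomicWaker;

#[derive(Clone)]
struct Cancel {
    canceled: Arc<AtomicBool>,
    waker: Arc<AtomicWaker>,
}

impl Cancel {
    fn new() -> Self {
        Cancel {
            canceled: Arc::new(AtomicBool::new(false)),
            waker: Arc::new(AtomicWaker::new()),
        }
    }

    // Sender side (evict): mark canceled, then wake the pending task. The
    // wake is not subject to any cooperative-scheduling budget.
    fn cancel(&self) {
        self.canceled.store(true, Ordering::Release);
        self.waker.wake();
    }

    // Receiver side (Pending::poll): register the waker first, then check
    // the flag, so a concurrent cancel can never be missed.
    fn is_canceled(&self, cx: &Context<'_>) -> bool {
        self.waker.register(cx.waker());
        self.canceled.load(Ordering::Acquire)
    }
}
```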
hawkw added a commit that referenced this issue Jun 17, 2022
# 0.4.13 (June 17, 2022)

### Added

- **load_shed**: Public constructor for `Overloaded` error ([#661])

### Fixed

- **util**: Fix hang with `call_all` when the `Stream` of requests is
  pending ([#656])
- **ready_cache**: Ensure cancelation is observed by pending services
  ([#668], fixes [#415])
- **docs**: Fix a missing section header due to a typo ([#646])
- **docs**: Fix broken links to `Service` trait ([#659])

[#661]: #661
[#656]: #656
[#668]: #668
[#415]: #415
[#646]: #646
[#659]: #659
hawkw added a commit to linkerd/linkerd2-proxy that referenced this issue Jun 17, 2022
Tower [v0.4.13] includes a fix for a bug in the `tower::ready_cache`
module, tower-rs/tower#415. The `ready_cache` module is used internally
in Tower's load balancer. This bug resulted in panics in the proxy
(linkerd/linkerd2#8666, linkerd/linkerd2#6086) in cases where the
Destination service sends a very large number of service discovery
updates (see linkerd/linkerd2#8677).

This commit updates the proxy's dependency on `tower` to 0.4.13, to
ensure that this bugfix is picked up.

Fixes linkerd/linkerd2#8666
Fixes linkerd/linkerd2#6086

[v0.4.13]: https://github.com/tower-rs/tower/releases/tag/tower-0.4.13