
Outbound requests fail intermittently after the proxy reported "panicked at 'cancel sender lost'" #6086

Closed
Wenliang-CHEN opened this issue Apr 30, 2021 · 24 comments · Fixed by linkerd/linkerd2-proxy#1758


@Wenliang-CHEN
Contributor

Wenliang-CHEN commented Apr 30, 2021

Bug Report

What is the issue?

The outbound requests of a meshed pod fail intermittently after its linkerd-proxy reported "panicked at 'cancel sender lost'".

We are not sure what triggers the issue. From the logs, we can tell that the linkerd-proxy first emits the following:

thread 'main' panicked at 'cancel sender lost', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.6/src/ready_cache/cache.rs:397:13

Then around 50% of the outbound requests start failing intermittently with the message:

[ 19948.510484s]  WARN ThreadId(01) server{orig_dst=172.20.207.194:80}: linkerd_app_core::errors: Failed to proxy request: buffer's worker closed unexpectedly client.addr=10.250.162.208:59692
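
Our rough understanding of how a single panic turns into ongoing failures (illustrated with a stand-alone sketch, not actual linkerd code): the proxy's load balancer appears to run behind a Tower buffer, and once the buffer's worker task panics, every later request dispatched through that buffer fails, which matches the errors above.

// Stand-alone sketch (not linkerd code): a panic inside a tower Buffer's
// worker task kills the worker, so all later calls through the buffer fail,
// analogous to the "buffer's worker closed unexpectedly" error above.
// Names and the simulated panic message are illustrative only.
use std::convert::Infallible;
use tower::{buffer::Buffer, service_fn, Service, ServiceExt};

#[tokio::main]
async fn main() {
    // Inner service that panics on the first request it sees.
    let inner = service_fn(|n: u32| {
        if n == 0 {
            panic!("cancel sender lost (simulated)");
        }
        async move { Ok::<u32, Infallible>(n) }
    });

    // The Buffer drives `inner` on a single spawned worker task.
    let mut svc = Buffer::new(inner, 16);

    // The first call panics inside the worker task; the caller only sees an error.
    let first = svc.ready().await.unwrap().call(0).await;
    println!("first call error: {}", first.unwrap_err());

    // The worker is gone now, so subsequent calls also fail.
    match svc.ready().await {
        Ok(s) => println!("second call: {:?}", s.call(1).await.map(|_| ())),
        Err(e) => println!("second call error: {e}"),
    }
}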

Additional context

The outbound destination is also a meshed service.

The linkerd-init container exited with "Completed" status in the pod.

Before and during the incident, there was no restart in either the application container or the proxy container.

Once we restarted the pod manually, the outbound traffic went back to 100% success.

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2021-06-06T08:57:23Z
    see https://linkerd.io/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
linkerd-api
-----------
√ control plane pods are ready
√ can initialize the client
√ can query the control plane API
linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 2.10.0 but the latest stable version is 2.10.1
    see https://linkerd.io/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.10.0 but the latest stable version is 2.10.1
    see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match
linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system
Status check results are √
Linkerd extensions checks
=========================
linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
‼ linkerd-viz pods are injected
    could not find proxy container for prometheus-7b5758b6ff-xlqv4 pod
    see https://linkerd.io/checks/#l5d-viz-pods-injection for hints
√ viz extension pods are running
√ prometheus is installed and configured correctly
√ can initialize the client
√ viz extension self-check
Status check results are √

Environment

  • Kubernetes Version: v1.18.9-eks-d1db3c
  • Cluster Environment: EKS
  • Linkerd version:
    control plane: v2.10.0
    linkerd-proxy: the issue occurred with both v2.139 and v2.142
    linkerd-init: cr.l5d.io/linkerd/proxy-init:v1.3.9
@olix0r
Member

olix0r commented Apr 30, 2021

Thanks for letting us know.

I've deleted the v2.142.0 proxy release -- it hit some other issues during integration tests and isn't yet ready for public consumption. In general, I'd only recommend using proxy versions that have been released on an edge release. Is there a specific reason you picked up v2.142.0? Are you using a patched version of the proxy?

For what it's worth, we plan on releasing a v2.10.2 release that uses proxy version v2.141.1.

@Wenliang-CHEN
Contributor Author

Hey @olix0r thanks for the reply. So...

Is there a specific reason you picked up v2.142.0? Are you using a patched version of the proxy?
Yes and yes: we need this commit, linkerd/linkerd2-proxy#965, to fix the ingress problem. But it seems the commit is already in v2.141.1, so it is all good.

Just wanted to point out that the issue existed before we upgraded the proxy to v2.142; we saw the first occurrence while still running v2.139. I will update the issue description.

@olix0r
Member

olix0r commented May 7, 2021

@Wenliang-CHEN We've been trying to reproduce this in library tests but haven't been able to get a solid lead on what's going on. Next week, we'll put together a branch that increases diagnostic logging in the ready-cache & balancer and ask you to test that out, if that works for you.

@Wenliang-CHEN
Contributor Author

Wenliang-CHEN commented May 10, 2021

Hey @olix0r thanks for the update. We will upgrade to it once it is ready.

Also an update from our side:

We have not been able to reproduce the issue on demand either. But when it happens, we observe a high request rate for outbound traffic.

During an incident, it seems to follow this pattern:

  • the problematic pod sends around 100 rps to dest A: all traffic succeeds
  • at the same time, the problematic pod sends requests to dest B: we saw failing requests in the proxy logs
  • once the high-rate traffic to dest A finishes, everything goes back to normal

It seemed to be a load-related issue. (This turned out not to be true.)

@hawkw
Member

hawkw commented May 11, 2021

Hi @Wenliang-CHEN, I've published a linkerd proxy image mycoliza/l2-proxy:ready-cache-debug which contains additional debug logging in the ready-cache code. If you can test out this proxy image and set the proxy log level to

linkerd=debug,tower::balance=trace,tower::ready_cache=trace

that would be extremely helpful.

Thanks!

@olix0r
Member

olix0r commented May 12, 2021

Specifically, you'll want to set these workload annotations:

config.linkerd.io/proxy-image: docker.io/mycoliza/l2-proxy
config.linkerd.io/proxy-version: ready-cache-debug
config.linkerd.io/proxy-log-level: linkerd=debug,tower::balance=trace,tower::ready_cache=trace,warn

@Wenliang-CHEN
Contributor Author

Hi @hawkw @olix0r thanks for the effort! We are going to try out the proxy for the service.

We will let you know once we find anything interesting.

@Wenliang-CHEN
Contributor Author

Hello again. We were able to reproduce the issue with the debug proxy. We observed the following pattern during the incident:

First, we saw a large number of logs like the following:

DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:35960}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}:logical{dst=service-name.prod.svc.cluster.local:80}:concrete{addr=service-name-primary.prod.svc.cluster.local:80}: tower::ready_cache::cache: endpoint canceled

Then we saw the panic log:

thread 'main' panicked at 'cancel sender lost', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.7/src/ready_cache/cache.rs:397:13

Then the following:

[ 25455.324175s]  WARN ThreadId(01) outbound:accept{client.addr=10.250.187.39:35960}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_app_core::errors: Failed to proxy request: buffered service failed: panic client.addr=10.250.187.39:35960

[ 25455.324282s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:35960}: linkerd_app_core::serve: Connection closed

[ 25455.324205s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:35960}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_app_core::errors: Handling error with HTTP response status=502 Bad Gateway version=HTTP/1.1

[ 25455.324199s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:35960}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_app_core::errors: Closing server-side connection

[ 25455.324251s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:35960}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_proxy_http::server: The stack is tearing down the connection

[ 25455.318532s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:35960}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}:logical{dst=service-name.prod.svc.cluster.local:80}:concrete{addr=service-name-primary.prod.svc.cluster.local:80}: tower::ready_cache::cache: endpoint canceled

Afterwards, the full connection lifecycle looks like this:

[ 25455.329021s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:36226}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_app_core::errors: Handling error with HTTP response status=502 Bad Gateway version=HTTP/1.1

[ 25455.329114s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:36226}: linkerd_app_core::serve: Connection closed

[ 25455.329072s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:36226}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_proxy_http::server: The stack is tearing down the connection

[ 25455.328793s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:36226}:server{orig_dst=172.20.207.194:80}:profile: linkerd_detect: DetectResult protocol=Some(Http1) elapsed=17.323µs

[ 25455.328895s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:36226}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_proxy_http::server: Handling as HTTP version=Http1

[ 25455.329003s]  WARN ThreadId(01) outbound:accept{client.addr=10.250.187.39:36226}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_app_core::errors: Failed to proxy request: buffered service failed: buffered service failed: panic client.addr=10.250.187.39:36226

[ 25455.329015s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:36226}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_app_core::errors: Closing server-side connection

[ 25455.328821s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:36226}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_proxy_http::server: Creating HTTP service version=Http1

[ 25455.328971s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:36226}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}:logical{dst=service-name.prod.svc.cluster.local:80}: linkerd_service_profiles::http::route_request: Updating HTTP routes routes=0

What's worth mentioning:

  • The destination also runs a Linkerd proxy + nginx + FPM setup
  • The target endpoint at the destination is slow; it hits its 60s execution timeout, which I think is where the "502 Bad Gateway" comes from
  • We did not paste the "keep-alive" logs here as we think those are irrelevant. Please let us know if you need them too.

Thanks!

@olix0r
Member

olix0r commented May 19, 2021 via email

@hawkw
Member

hawkw commented May 19, 2021

First we saw large number of logs like the following:

DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:35960}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}:logical{dst=service-name.prod.svc.cluster.local:80}:concrete{addr=service-name-primary.prod.svc.cluster.local:80}: tower::ready_cache::cache: endpoint canceled

Then we saw the panic log

thread 'main' panicked at 'cancel sender lost', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.7/src/ready_cache/cache.rs:397:13

Then the following

Hi @Wenliang-CHEN, is it possible to get more complete logs starting from before the first "endpoint canceled" message was logged? Thank you!

@Wenliang-CHEN
Contributor Author

Wenliang-CHEN commented May 20, 2021

proxy-logs.csv

Hey @hawkw sure, please find the logs attached.

You will find the first occurrence of "endpoint canceled" at line 12.

And the "cancel sender lost" panic at line 185.

hawkw added a commit to linkerd/linkerd2-proxy that referenced this issue Jun 1, 2021
This branch updates the `futures` crate to v0.3.15. This includes a fix
for task starvation with `FuturesUnordered` (added in 0.3.13). This may
or may not be related to issues that have been reported in the proxy
involving the load balancer (linkerd/linkerd2#6086), but we should
update to the fixed version regardless. This may also improve
performance in some cases, since we may now have to do fewer poll-wakeup
cycles when a load balancer has a large number of pending endpoints.
@olix0r
Member

olix0r commented Jun 2, 2021

We've been stress testing the tower ready-cache dependency and have been able to trigger some unexpected behavior, though not the exact problems you seem to have captured. There's at least one fix we picked up from futures (rust-lang/futures-rs#2333) that should eliminate some pathological behavior when there are many endpoints in a balancer.
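
For reference, the pattern that fix touches is a balancer driving all of its not-yet-ready endpoints as a single stream of pending futures; a minimal stand-alone illustration (not proxy or tower code) looks like this:

// Minimal illustration (not proxy or tower code) of driving many pending
// endpoint futures through one FuturesUnordered stream -- the pattern the
// futures 0.3.13 starvation fix affects when the set is large.
use futures::stream::{FuturesUnordered, StreamExt};
use std::time::Duration;

#[tokio::main]
async fn main() {
    let mut pending = FuturesUnordered::new();
    for endpoint in 0..200u64 {
        // Stand-in for "waiting for an endpoint to become ready".
        pending.push(async move {
            tokio::time::sleep(Duration::from_millis(endpoint % 10)).await;
            endpoint
        });
    }

    // Each completion corresponds to an endpoint moving from pending to ready.
    while let Some(endpoint) = pending.next().await {
        println!("endpoint {endpoint} became ready");
    }
}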

In your case roughly how many endpoints should exist in the target service? Are endpoints churning (being deleted/created) frequently? Or are they relatively static?

@Wenliang-CHEN
Contributor Author

Thanks for the update @olix0r. The target service is a monolith that has around 200 endpoints. I would say 90% of them are static. And the service that makes the outbound call uses 8 of them, all static.

@olix0r
Member

olix0r commented Jun 2, 2021

@Wenliang-CHEN Sorry, I should have been clearer: how many pods of the service are running? Are these panics at all correlated with deployments/restarts of the target service?

@Wenliang-CHEN
Contributor Author

@olix0r there are 6 pods running. We did observe the panics coinciding with deployments of the target service, but it does not always happen. When a panic happens, it does not affect all the pods; it is mostly 1 or 2 pods.

@olix0r
Member

olix0r commented Jun 2, 2021

@Wenliang-CHEN Thanks, this is helpful. I doubt that the futures change will help this issue. I suspect that there's a race condition around updating the balancer with new endpoints where we enter an illegal state. We'll focus more on stress testing the update path.

olix0r pushed a commit to linkerd/drain-rs that referenced this issue Jun 3, 2021
This branch updates the `futures` crate to v0.3.15. This includes a fix
for task starvation with `FuturesUnordered` (added in 0.3.13). This may
or may not be related to issues that have been reported in the proxy
involving the load balancer (linkerd/linkerd2#6086), but we should
update to the fixed version regardless. This may also improve
performance in some cases, since we may now have to do fewer poll-wakeup
cycles when a load balancer has a large number of pending endpoints.
olix0r added a commit to olix0r/tower that referenced this issue Jun 16, 2021
linkerd/linkerd2#6086 describes an issue that sounds closely related to
tower-rs#415: There's some sort of consistency issue between the
ready-cache's pending stream and its set of cancelations. Where the
latter issue describes triggering a panic in the stream receiver, the
former describes triggering a panic in the stream implementation.

There's no logical reason why we can't continue to operate in this
scenario, though it does indicate a real correctness issue.

So, this change prevents panicking in this scenario when not building
with debugging. Instead, we now emit WARN-level logs so that we have a
clearer signal they're occurring.

Finally, this change also adds `Debug` constraints to the cache's key
types (and hence the balancer's key types) so that we can more
reasonably debug this behavior.
@olix0r
Member

olix0r commented Jun 16, 2021

We've had a lot of trouble reproducing this in tests, but I think this is very likely a manifestation of the same problem described in tower-rs/tower#415. I'm especially suspicious of tokio::sync::oneshot, but we need to do a better job of eliminating the application logic before investigating such a low-level primitive.

I've put together a tower branch that makes a few changes:

  • We log in more situations and now include the cache key, so we can track how individual entries move through the cache;
  • We no longer panic in this situation; instead, we emit WARN-level logs;
  • We now emit WARN-level logs (rather than DEBUG) for the issue described in "Panicked at 'missing cancelation' in tower-ready-cache" (tower-rs/tower#415);
  • The above situation is now handled more gracefully by creating new cancelations rather than dropping the service.

I recommend setting the following annotations on your pod template:

config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
config.linkerd.io/proxy-version: tower-ready-debug.ab6c68ee
config.linkerd.io/proxy-log-level: linkerd=info,tower::ready_cache=debug,warn

Your application should no longer panic. If you see WARN-level logs, it would be great if we could capture the preceding logs to get a better sense of the access pattern that may be triggering this.
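
Roughly, the non-panicking behavior looks like the following (a simplified sketch of the idea, not the actual tower diff): when a pending endpoint's cancelation handle has gone missing, we warn and mint a new cancelation pair instead of asserting.

// Simplified sketch of the recovery path (illustrative only; the real change
// lives in tower's ready_cache internals and differs in detail).
use std::collections::HashMap;
use std::fmt::Debug;
use std::hash::Hash;
use tokio::sync::oneshot;
use tracing::warn;

fn recover_cancelation<K: Hash + Eq + Debug>(
    cancel_txs: &mut HashMap<K, oneshot::Sender<()>>,
    key: &K,
    rx: oneshot::Receiver<()>,
) -> (oneshot::Sender<()>, oneshot::Receiver<()>) {
    match cancel_txs.remove(key) {
        // Normal case: the sender half is still tracked for this endpoint.
        Some(tx) => (tx, rx),
        // Previously this branch was effectively a panic ("missing cancelation"
        // / "cancel sender lost"); now we log at WARN and rebuild the pair so
        // the endpoint keeps working.
        None => {
            warn!(?key, "cancelation lost; creating a new one");
            oneshot::channel()
        }
    }
}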

@Wenliang-CHEN
Contributor Author

Hey @olix0r thanks for the update. I will get the service running with the new proxy and will report here once I find anything interesting.

@Wenliang-CHEN
Contributor Author

Hello @olix0r, so... we let the service run for 2 days with the new proxy. The panic does not happen anymore, and we do notice some WARN messages in the logs.

Please find the samples attached. Please let me know if anything is missing or incomplete.
error_111.txt
error_113.txt
maybe_during_destination_deloyment.txt
no_route_to_host.txt

@olix0r
Member

olix0r commented Jun 18, 2021

Thanks! Interestingly, it looks like we haven't triggered the scenario we've seen previously: in those logs most of the warnings appear to come from the reconnect module (which is more-or-less expected if the target endpoint isn't available); but none of those warnings should really impact the balancer/ready_cache.

The logs I'm most interested in catching are Pending service lost its cancelation or Ready service had no associated cancelation.

If you want to filter out the reconnect log messages you could run with the log level linkerd=info,tower::ready_cache=debug,linkerd_reconnect=off,warn -- but it would be great if you could continue running with this proxy version so that we can hopefully hit one of these two cases.

@Wenliang-CHEN
Contributor Author

Cool, I will keep it running. Maybe we just need a bit more time until the scenario gets triggered.

@olix0r
Member

olix0r commented Jun 25, 2021

@Wenliang-CHEN have you seen any of these warnings over the past week?

@Wenliang-CHEN
Contributor Author

Hello @olix0r, we have been running the service with the new proxy and the same log level for a week. There are no logs matching "Pending service lost..." or "Ready service had no associated cancelation", and we have observed no connection issues from that service anymore.

Is it possible that the library upgrade somehow fixes or suppresses the issue?

@stale

stale bot commented Sep 26, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Sep 26, 2021
stale bot closed this as completed Oct 10, 2021
github-actions bot locked as resolved and limited conversation to collaborators Nov 10, 2021
hawkw added a commit to linkerd/linkerd2-proxy that referenced this issue Jun 17, 2022
Tower [v0.4.13] includes a fix for a bug in the `tower::ready_cache`
module, tower-rs/tower#415. The `ready_cache` module is used internally
in Tower's load balancer. This bug resulted in panics in the proxy
(linkerd/linkerd2#8666, linkerd/linkerd2#6086) in cases where the
Destination service sends a very large number of service discovery
updates (see linkerd/linkerd2#8677).

This commit updates the proxy's dependency on `tower` to 0.4.13, to
ensure that this bugfix is picked up.

Fixes linkerd/linkerd2#8666
Fixes linkerd/linkerd2#6086

[v0.4.13]: https://github.com/tower-rs/tower/releases/tag/tower-0.4.13