
Linkerd-proxy v2.148.0 emits "possible future leak" warning message #6556

Closed
Wenliang-CHEN opened this issue Jul 26, 2021 · 7 comments

@Wenliang-CHEN (Contributor)

Bug Report

What is the issue?

Hey guys, we just upgraded to linkerd-proxy 2.148.0 to try out the new features. Everything works fine, but we noticed that linkerd-proxy keeps emitting logs about a "possible future leak".

The logging frequency is around 20 logs/second from a service handling around 400 rps.

A full log line looks like this:

[  5343.955150s]  WARN ThreadId(01) trust_dns_proto::xfer: error notifying wait, possible future leak: TrySendError { kind: Disconnected }    

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2021-09-04T08:57:23Z
    see https://linkerd.io/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
linkerd-api
-----------
√ control plane pods are ready
√ can initialize the client
√ can query the control plane API
linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 2.10.0 but the latest stable version is 2.10.2
    see https://linkerd.io/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.10.0 but the latest stable version is 2.10.2
    see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match
linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system
Status check results are √
Linkerd extensions checks
=========================
linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
‼ linkerd-viz pods are injected
    could not find proxy container for prometheus-985945bc7-2rlcr pod
    see https://linkerd.io/checks/#https://linkerd.io/checks/#l5d-viz-pods-injection for hints
√ viz extension pods are running
√ prometheus is installed and configured correctly
√ can initialize the client
√ viz extension self-check
Status check results are √

Environment

  • Kubernetes Version: v1.18.9-eks-d1db3c
  • Cluster Environment: EKS
  • Linkerd version:
    control plane: v2.10.0
    linkerd-proxy: v2.148.0
    linkerd-init: cr.l5d.io/linkerd/proxy-init:v1.3.9
@olix0r (Member) commented Jul 27, 2021

@Wenliang-CHEN interesting! We've never seen this before, as far as I know.

The log is emitted here and is only used here -- and this error seems to indicate that when trying to notify a channel that an error occurred, the receiver has already been dropped.

This looks suspiciously similar to #6086 -- some sort of notification/visibility race.
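
For context, here is a minimal sketch (plain `futures` channel code, not trust-dns or proxy internals) of how that error kind shows up when the receiving side has already been dropped:

```rust
use futures::channel::mpsc;

fn main() {
    // Minimal sketch: try_send into a bounded channel whose receiver has
    // already been dropped fails with a "disconnected" error, which is
    // what the proxy's warning surfaces from trust-dns.
    let (mut tx, rx) = mpsc::channel::<u32>(1);
    drop(rx); // the future waiting on the response has gone away

    let err = tx.try_send(42).unwrap_err();
    assert!(err.is_disconnected());
    // Prints something like: TrySendError { kind: Disconnected }
    println!("error notifying wait, possible future leak: {:?}", err);
}
```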

If there are no functional problems other than the logging, I'd suggest running with a proxy-log-level like linkerd=info,trust_dns=off,warn to disable these log messages.
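
For reference, the proxy log level is a `tracing` EnvFilter-style directive list. Here's a rough sketch of how that directive behaves (assuming `tracing-subscriber` with the `env-filter` feature; this is not the proxy's actual wiring):

```rust
use tracing_subscriber::{fmt, EnvFilter};

fn main() {
    // Sketch: "linkerd=info" keeps linkerd targets at INFO, "trust_dns=off"
    // is intended to silence the trust-dns targets, and the bare "warn"
    // sets the default level for everything else.
    let filter = EnvFilter::new("linkerd=info,trust_dns=off,warn");
    fmt().with_env_filter(filter).init();

    tracing::info!(target: "linkerd", "kept at INFO by the first directive");
    tracing::warn!(target: "trust_dns", "silenced by the second directive");
    tracing::error!(target: "some_other_crate", "allowed by the default `warn`");
}
```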

Can you share details about the hardware/OS platform you're running on including at least the CPU arch, number of cores, and kernel version?

Also can you share the proxy's INFO logs during startup -- I'm mostly curious whether we're using a multi-threaded executor in the proxy in this case or whether this can be hit using a single-threaded executor.

@olix0r (Member) commented Jul 27, 2021

@bluejekyll Do you have any suggestions about what we could do to get more insight into this? complete_with_error is called in a few cases and I don't know which one we're hitting here.

Why does this error suggest a future leak? Are there any known bad access patterns that could trigger this in user code? Or would this log message indicate a bug in trust-dns?

@bluejekyll

It’s been a while since I’ve looked into that area. There’s probably some refactoring we could do to ensure that this case doesn’t happen.

I think the concern is that some of the Futures returned on a request are not executed, i.e. they are dropped before the response is received.

The concern was that a memory leak would happen in a case where the response isn’t polled from the request future.
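
A rough illustration of that pattern (hypothetical code, not trust-dns internals): a background task that completes a request and notifies the caller over a channel hits a disconnected error when the caller drops its response future without ever polling it.

```rust
use tokio::sync::oneshot;

#[tokio::main]
async fn main() {
    // Hypothetical illustration: a background task completes a "lookup"
    // and notifies the caller through a oneshot channel.
    let (tx, rx) = oneshot::channel::<&'static str>();

    // The caller drops its response future without ever polling it.
    drop(rx);

    let background = tokio::spawn(async move {
        // The receiving end is gone, so the notification fails -- this is
        // the situation the "possible future leak" warning points at.
        if tx.send("93.184.216.34").is_err() {
            eprintln!("error notifying wait, possible future leak");
        }
    });

    background.await.unwrap();
}
```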

@bluejekyll

@olix0r It's entirely possible that there is no issue here and we only need to remove the log line, but that would require some time to do the review.

@Wenliang-CHEN (Contributor, Author) commented Jul 28, 2021

@olix0r Cool, and yep, I will get back to you with more detailed OS information and the proxy startup logs.

Please find the info below:

About the OS: we use AWS c5a.4xlarge EC2 instances with 16 vCPUs, which "feature 2nd generation 3.3GHz AMD EPYC 7002 series processors built on a 7nm process node".

For the kernel version, I got this:

uname -a
Linux ip-10-160-146-253.eu-central-1.compute.internal 4.14.232-176.381.amzn2.x86_64 #1 SMP Wed May 19 00:31:54 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

A sample of the linkerd-proxy startup logs looks like this:



2021-07-28 10:47:13.255 | [     0.008904s]  INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity: payroll-datev-service.prod.serviceaccount.identity.linkerd.cluster.local

2021-07-28 10:47:13.247 | [     0.000864s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143

2021-07-28 10:47:13.247 | [     0.000873s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)

2021-07-28 10:47:13.247 | [     0.000866s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140

2021-07-28 10:47:13.247 | [     0.000875s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)

2021-07-28 10:47:13.247 | [     0.000867s]  INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190

2021-07-28 10:47:13.247 | [     0.000504s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime

2021-07-28 10:47:13.247 | [     0.000849s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191

2021-07-28 10:47:13.247 | [     0.000869s]  INFO ThreadId(01) linkerd2_proxy: Local identity is payroll-datev-service.prod.serviceaccount.identity.linkerd.cluster.local
2021-07-28 10:47:13.235 | time="2021-07-28T08:47:13Z" level=info msg="running version edge-21.4.5"

It looks like it is a single-threaded executor.

@olix0r (Member) commented Jul 28, 2021

@Wenliang-CHEN Thanks! We'll try to do some more investigation here. In the meantime, I'd suggest disabling these logs by configuration.

stale bot commented Oct 26, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Oct 26, 2021
stale bot closed this as completed on Nov 9, 2021
github-actions bot locked as resolved and limited conversation to collaborators on Dec 10, 2021