ZooKeeper server set namer io.l5d.serversets appears to leak ZooKeeper watches #2460

Open
brandonvin opened this issue Nov 13, 2023 · 1 comment

brandonvin commented Nov 13, 2023

Issue Type:

Bug report / question

Given that linkerd 1.x is in maintenance mode, I'm not sure how likely a bug fix is. At a minimum, this report can help confirm the behavior and help anyone else who runs into it on linkerd 1.x.

What happened:

Linkerd can lose its connection to ZooKeeper, either during a normal rolling restart/update of the ZooKeeper cluster or through spurious connectivity loss. When this happens, linkerd enters a tight loop, logging this repeatedly:

Reacquiring watch on com.twitter.finagle.serverset2.client.SessionState$SyncConnected$@742bd12c. Session: 703ca837df3e0a2

During this loop, linkerd can consume >99% of the host's CPU.

If linkerd has been running for a long time before this, then the loop goes on for a long time (thousands of messages). Linkerd then starts logging:

log queue overflow - record dropped

Ultimately the linkerd process OOMs:

VM error: Java heap space
java.lang.OutOfMemoryError: Java heap space
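
For context, ZooKeeper watches are one-shot: a client must re-register each watch after it fires or after the session reconnects, and the "Reacquiring watch ... SyncConnected" messages above appear to correspond to that step in Finagle's serverset2 client. Below is a minimal, hypothetical sketch of the general pattern using the plain ZooKeeper Java API; it is not linkerd's actual code, and the path and connect string are made up. If the set of registrations being replayed grows with uptime, every reconnect has to walk all of them, which would be consistent with the tight loop and CPU spike described above.

    import org.apache.zookeeper.*;

    // Hypothetical sketch (NOT linkerd's code): re-registering a one-shot
    // ZooKeeper watch after connection events.
    public class WatchReacquireSketch implements Watcher {
        private final ZooKeeper zk;
        private final String path = "/serversets/example/prod"; // made-up path

        public WatchReacquireSketch(String connectString) throws Exception {
            zk = new ZooKeeper(connectString, 30000, this);
        }

        @Override
        public void process(WatchedEvent event) {
            try {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    // Watches are one-shot, so they must be set again after
                    // every trigger or reconnect. A client holding many
                    // registrations replays this step for each of them.
                    zk.getChildren(path, this);
                }
            } catch (Exception e) {
                // Error handling/retries omitted in this sketch.
            }
        }
    }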

As an experiment, I ran a script to restart all instances of linkerd and let them come back up as normal. The total watch count on the ZooKeeper cluster dropped significantly:

[Screenshot (2023-11-12): graph of total watch count on the ZooKeeper cluster dropping sharply after the linkerd restarts]

What you expected to happen:

Linkerd would recover gracefully from a spurious ZooKeeper disconnect or rolling restart of ZooKeeper nodes.

Watch count on the ZooKeeper cluster would be stable. It wouldn't increase linearly with linkerd uptime, nor suddenly decrease when linkerd is restarted.

How to reproduce it (as minimally and precisely as possible):

Set up linkerd using the io.l5d.serversets namer and allow it to run for a long time. Then run a rolling restart of the ZooKeeper cluster, or a rolling restart of the linkerd processes.
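
While reproducing this, the watch count on each ZooKeeper server can be polled with the four-letter-word commands (e.g. wchs, or watch_count in mntr output), assuming they are enabled (newer ZooKeeper versions require 4lw.commands.whitelist). A minimal, hypothetical helper in Java; the host and port defaults are placeholders:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.Socket;

    // Hypothetical helper: send ZooKeeper's "wchs" four-letter command and
    // print the reply, which summarizes the server's watch count.
    public class WatchCount {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "zk1.example.com"; // placeholder
            int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;
            try (Socket s = new Socket(host, port)) {
                s.getOutputStream().write("wchs".getBytes("UTF-8"));
                s.getOutputStream().flush();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(s.getInputStream(), "UTF-8"));
                for (String line; (line = in.readLine()) != null; ) {
                    System.out.println(line);
                }
            }
        }
    }

Running something like this before and after a rolling restart of ZooKeeper (or of linkerd) should show whether the watch count keeps climbing or drops back down.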

Anything else we need to know?:

Linkerd is deployed one per host (as a daemon) on a cluster, so in practice each instance of linkerd needs to be up all the time to enable communication between services.

In my case, the uptime of the linkerd processes was well over 200 days. A workaround could be to avoid long uptimes by restarting the linkerd processes periodically, so that thousands of ZooKeeper watches never accumulate. This would make the system resilient to blips in ZooKeeper, but restarting linkerd in this deployment model (one per host) incurs errors in service communication (or, at best, latency spikes if clients have fine-tuned retry logic).

Environment:

I've seen this on linkerd versions 1.7.4 and 1.7.5.

Linkerd config snippet:

namers:
- kind: io.l5d.serversets
  zkAddrs:
    {%- for zk_addr in zks %}
    - host: {{ zk_addr }}
      port: {{ zk_port }}
    {%- endfor %}
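
For readers unfamiliar with the Jinja templating above, the rendered config would look something like the following (hostnames and port are hypothetical):

    namers:
    - kind: io.l5d.serversets
      zkAddrs:
        - host: zk1.example.com
          port: 2181
        - host: zk2.example.com
          port: 2181
        - host: zk3.example.com
          port: 2181
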
wmorgan (Member) commented Nov 13, 2023

Thank you for documenting this @brandonvin. At a minimum, this issue will help other 1.x users who may encounter it.
