Linkerd with the thrift interpreter gets stuck with stale addresses for a service and no longer picks up updates from Namerd.
Our setup
We run our services in a large Nomad cluster with hundreds of VMs. Each VM has one Linkerd deployed to it, and any service running there connects to Linkerd on localhost. Namerd is deployed on random VMs and Linkerd discovers it using a Consul DNS lookup (e.g. namerd.service.consul:4100). We used to have 3 Namerd instances; we tried upscaling to 6, but the problem remains.
Our services talk via Thrift. Linkerd and Namerd are configured with the io.l5d.namerd interpreter, so they also talk Thrift to each other.
When I say "restart Namerd", it is actually not a restart but a re-allocation: all of its instances move to random VMs and get new IPs.
The same goes for releasing services: it is a rolling update of service instances, so they move to random VMs and are assigned random ports when they start.
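A quick aside on the Consul DNS discovery mentioned above: the /$/inet/namerd.service.consul/4100 dst just resolves that hostname through DNS and uses whatever A records come back. A minimal sketch of that lookup (plain JVM DNS resolution, nothing Linkerd-specific; the hostname is the one from our config):

object ResolveNamerd {
  // Resolve all A records that Consul DNS currently returns for the Namerd service name.
  def main(args: Array[String]): Unit = {
    val addrs = java.net.InetAddress.getAllByName("namerd.service.consul")
    addrs.foreach(a => println(s"namerd instance: ${a.getHostAddress}:4100"))
  }
}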
Here is a part of the Linkerd config:
# ...
- protocol: thrift
  label: port-10001
  thriftProtocol: binary
  interpreter:
    kind: io.l5d.namerd
    dst: /$/inet/namerd.service.consul/4100
    namespace: 10001
    retry:
      baseSeconds: 1
      maxSeconds: 1 # we had to set this to 1 second because Linkerd had terrible recovery time once Namerd is down for even a few minutes
  servers:
  - port: 10001
    ip: 0.0.0.0
    thriftFramed: true
  client:
    thriftFramed: true
    loadBalancer:
      kind: ewma
      maxEffort: 24
      decayTimeMs: 10000
    failureAccrual:
      kind: io.l5d.successRate
      successRate: 0
      requests: 1000
      backoff:
        kind: constant
        ms: 0
    hostConnectionPool:
      minSize: 0
      idleTimeMs: 5000
# ...
By the way, we had to reduce the exponential backoff, capping it at 1 second, because of another nasty behavior: Linkerd took far too long to recover once it lost the connection to its Namerd counterpart. An example to reproduce it: stop Namerd and relocate the remote service while something keeps requesting that service through Linkerd. Linkerd holds on to the old address and returns only errors. When Namerd comes back up, it takes 2 to 10 minutes for Linkerd to recover, depending on how long Namerd was down. By reducing that maxSeconds setting we lowered the recovery time for Linkerd.
Originally we thought this would also fix the issue described here, but it didn't really help.
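To illustrate why capping maxSeconds helps: the interpreter backs off exponentially between reconnect attempts, so with a large cap the delay between attempts can itself grow to minutes, which matches the 2-10 minute recovery we saw. The sketch below only shows that shape; the default values (I believe they are around baseSeconds: 5 and maxSeconds: 600, but I have not verified this) and the exact doubling/jitter behavior are assumptions on my part:

object BackoffSketch {
  // Delay (in seconds) before each reconnect attempt: double the previous delay, capped at maxSeconds.
  def schedule(baseSeconds: Int, maxSeconds: Int, attempts: Int): Seq[Int] =
    Iterator.iterate(baseSeconds)(d => math.min(d * 2, maxSeconds)).take(attempts).toSeq

  def main(args: Array[String]): Unit = {
    // Assumed "default-like" settings: the delay quickly reaches minutes, so a Linkerd that
    // started backing off while Namerd was down may wait a long time before reconnecting.
    println(schedule(baseSeconds = 5, maxSeconds = 600, attempts = 10))
    // Our override (baseSeconds: 1, maxSeconds: 1): every retry happens within a second.
    println(schedule(baseSeconds = 1, maxSeconds = 1, attempts = 10))
  }
}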
And the config of Namerd:
admin:
  port: 9991
  ip: 0.0.0.0
telemetry:
- kind: io.l5d.prometheus
  path: /metrics
interfaces:
- kind: io.l5d.thriftNameInterpreter
  port: 4100
  ip: 0.0.0.0
- kind: io.l5d.httpController
  port: 4321
namers:
- kind: io.l5d.consul
  host: 127.0.0.1
  port: 8500
  includeTag: false
  useHealthCheck: true
  healthStatuses:
  - passing
  consistencyMode: stale
storage:
  kind: io.l5d.inMemory
  namespaces:
    1: /svc => /#/io.l5d.consul/dc1/service-a # names and namespaces are made up just for the example
    2: /svc => /#/io.l5d.consul/dc1/service-b
    3: /svc => /#/io.l5d.consul/dc1/service-c
The problem
Every time we restart Namerd, some random Linkerd instances get "stuck" with old addresses of remote services. Once we start releasing our services and they relocate to other VMs with new IPs and ports, some random Linkerd starts emitting service creation failures and the calling service gets the following error from Linkerd: com.twitter.finagle.naming.buoyant.RichConnectionFailedExceptionWithPath: Unable to establish connection to 10.x.x.x:25404. (the IPs and ports vary)
Simplified version of service communication: Service A calls Service B via Linkerd using Thrift protocol:
+-----------+ thrift +---------+ thrift +-----------+
| Service A | <------> | Linkerd | <------> | Service B |
+-----------+ +---------+ +-----------+
I decided to catch this bug in production, since it is impossible to reproduce in QA.
Log of events:
Nov 18th 16:40 - Service B was released in Nomad so all of its instances (it has dozens of them) moved to other VMs and got new IPs and ports.
Nov 18th 16:50 - Namerd was redeployed (6 instances), so it also moved to new VMs and got new IPs. The config.yaml of Namerd was not updated and we did not add any new services, so it was just a regular restart.
Nov 19th 09:39 - Service B was released in Nomad, and again all of its 81 instances got new IPs and ports.
Nov 19th 09:45 - Service A started to produce exceptions like com.twitter.finagle.naming.buoyant.RichConnectionFailedExceptionWithPath: Unable to establish connection to 10.x.x.x:25404., but only on one VM (and therefore only on one Linkerd, since we deploy one Linkerd per host).
I downloaded all error logs of Service A, extracted the ip:port pairs it was complaining about, and checked them against the ip:port pairs of Service B. They all matched the addresses Service B had on Nov 18th, i.e. before its release on Nov 19th. So it looks like this particular Linkerd instance did not get any updates about Service B.
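The cross-check itself was simple; here is a rough sketch of what it amounted to (the regex, file names and output format are illustrative, not our actual tooling):

import scala.io.Source

object CrossCheckAddrs {
  def main(args: Array[String]): Unit = {
    // Pull the "ip:port" out of each RichConnectionFailedExceptionWithPath error line.
    val addrPattern = """Unable to establish connection to (\d+\.\d+\.\d+\.\d+:\d+)""".r
    val errorAddrs = Source.fromFile("service-a-errors.log").getLines()
      .flatMap(line => addrPattern.findFirstMatchIn(line).map(_.group(1)))
      .toSet
    // One "ip:port" per line, dumped from the current Consul registration of Service B.
    val currentAddrs = Source.fromFile("service-b-current-addrs.txt").getLines().toSet
    println(s"stale addresses in the errors: ${errorAddrs -- currentAddrs}")
  }
}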
Nov 19th 10:00 - I started a tcpdump on port 4100 (the port Namerd uses for the thrift interpreter) on two VMs running Linkerd: a "good" one and a "broken" one (the one where we saw the service creation failures and the exceptions above).
Nov 19th 10:05 - Redeployed Service B to trigger yet another relocation of all of its instances.
When Service B finished its release, I stopped tcpdump on both VMs and looked at the captures.
tcpdump result of a "good" Linkerd
I was not able to fully decode the hex data in the TCP packets, but it was clear enough which thrift method and arguments Linkerd was sending to Namerd.
The "good" Linkerd instance was sending a lot of packets with the addr message to Namerd. Those packets contained the dtab path and namespace of Service B (I replaced the real service name with "Service B" here).
Namerd was replying to that message with the addresses (IPs and ports) of the newly released instances of Service B. Since I could not decode the thrift messages, I at least managed to find all those addresses using a Wireshark filter like this: data contains 0A:00:00:01 and data contains 6D:AD (this IP is fake; I was searching for every combination of Service B IP and port that I knew at the time).
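For anyone trying the same approach: the filter bytes are just the IP octets and the big-endian port. A small sketch of how an ip:port pair maps to those patterns (the 10.0.0.1:28077 example is made up, like the one above):

object AddrToHexFilter {
  // Turn "ip:port" into the hex byte patterns for a Wireshark "data contains" filter.
  def toFilter(addr: String): String = {
    val Array(ip, port) = addr.split(":")
    val ipHex = ip.split("\\.").map(o => f"${o.toInt}%02X").mkString(":")
    val p = port.toInt
    val portHex = f"${(p >> 8) & 0xFF}%02X:${p & 0xFF}%02X"
    s"data contains $ipHex and data contains $portHex"
  }

  def main(args: Array[String]): Unit =
    // Prints: data contains 0A:00:00:01 and data contains 6D:AD
    println(toFilter("10.0.0.1:28077"))
}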
tcpdump of a "bad" Linkerd
"bad" Linkerd instance for some reason did not send that to Namerd. At least not for the Service B. In all the packets that contained addr thrift method it was only requesting for a completely different service. However at the same time there were services trying to talk to Service B via Linkerd and they kept getting errors.
Both Linkerd instances, the "good" and the "bad" one, had the same last log message at that point, from Nov 18th:
E 1118 16:52:26.846 GMT THREAD529 TraceId:41b99eb3a22972c7: addr on /#/io.l5d.consul/dc1/service-b
Failure(syscall:getsockopt(..) failed: Connection refused: namerd.service.consul/10.x.x.x:4100 at remote address: namerd.service.consul/10.x.x.x:4100. Remote Info: Not Available, flags=0x18) with RemoteInfo -> Upstream Address: /127.0.0.1:42708, Upstream id: Not Available, Downstream Address: namerd.service.consul/10.x.x.x:4100, Downstream label: interpreter/io.buoyant.namerd.iface.NamerdInterpreterConfig, Trace Id: 554952f66844976c.554952f66844976c<:554952f66844976c with Service -> interpreter/io.buoyant.namerd.iface.NamerdInterpreterConfig
Caused by: com.twitter.finagle.ConnectionFailedException: syscall:getsockopt(..) failed: Connection refused: namerd.service.consul/10.x.x.x:4100 at remote address: namerd.service.consul/10.x.x.x:4100. Remote Info: Not Available
at com.twitter.finagle.netty4.ConnectionBuilder$$anon$1.operationComplete(ConnectionBuilder.scala:100)
at com.twitter.finagle.netty4.ConnectionBuilder$$anon$1.operationComplete(ConnectionBuilder.scala:79)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:629)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:648)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:522)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:423)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:330)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at com.twitter.finagle.util.BlockingTimeTrackingThreadFactory$$anon$1.run(BlockingTimeTrackingThreadFactory.scala:23)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:819)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: namerd.service.consul/10.x.x.x:4100
at io.netty.channel.unix.Socket.finishConnect(..)(Unknown Source)
Caused by: io.netty.channel.unix.Errors$NativeConnectException: syscall:getsockopt(..) failed: Connection refused
... 1 more
Summary
So it looks like the "bad" Linkerd instance got stuck with stale information about Service B and stopped requesting any updates for it, even though it did keep requesting updates for another service. To fix this we had to restart that Linkerd, and then it started working correctly.
We have never been able to reproduce this in QA. It seems to be a very rare condition, so you need a big cluster with lots of services moving around to catch it. We also noticed that we run into this issue much more often since we grew our cluster by 25%. Previously we only hit it occasionally; now it happens almost every time Namerd is restarted.
Next steps
We suspect the problem may be somewhere in the ThriftNamerClient (probably in the loop method). We are going to try switching to the mesh interpreter to see if it fixes the problem.
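To make the suspicion more concrete: the interpreter keeps a long-poll style watch per name, re-issuing an addr request whenever the previous one completes. The following is only an illustrative sketch of that pattern, not the actual ThriftNamerClient code, and the names and failure handling in it are my assumptions. The point is that if some failure path (e.g. a connection refused while Namerd is relocating) ever completes without re-scheduling the next poll, that one watch goes silent while every other watch keeps working, which is exactly what the tcpdump showed.

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

// Illustrative watch loop for a single name (NOT the real ThriftNamerClient).
// If the Failure branch forgot to re-schedule (or handled only some errors),
// the watch would silently stop and the cached addresses would stay stale forever.
class AddrWatch(pollAddr: Long => Future[(Long, Set[String])]) {
  @volatile var current: Set[String] = Set.empty

  def loop(version: Long): Unit =
    pollAddr(version).onComplete {
      case Success((newVersion, addrs)) =>
        current = addrs
        loop(newVersion) // keep watching for the next change
      case Failure(_) =>
        loop(version)    // correct behavior: retry; skipping this call is the suspected bug
    }
}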
@cpretzer Sorry, forgot to put it in. We use Linkerd 1.7.1. We are going to upgrade, but it will take some time.
Just to mention: this bug has existed in every version of Linkerd we have run over the last 2 years - at least, we have seen it ever since we started using Linkerd 2 years ago, and every time we assumed we should just upgrade. Back then our clusters were also smaller, so we had less chance of hitting it. Now, due to the size of our Nomad clusters, we run into it on every Namerd restart.