Linkerd with the thrift interpreter gets stuck with stale addresses for a service and no longer picks up updates from Namerd.
Our setup
We run our services in a large Nomad cluster with hundreds of VMs. Each VM has one Linkerd deployed to it, and any service running there connects to Linkerd on localhost. Namerd is deployed on random VMs and Linkerd discovers it using a Consul DNS lookup (e.g. namerd.service.consul:4100). We used to have 3 Namerd instances; we tried upscaling to 6, but the problem remains.
Our services talk via Thrift. Linkerd and Namerd are configured with the io.l5d.namerd interpreter, so they also talk Thrift to each other.
When I say "restart Namerd", it is actually not a restart but a re-allocation: all of its instances move to random VMs and get new IPs.
The same goes for releasing services: it is a rolling update of service instances, so they move to random VMs and are assigned random ports when they start.
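A quick aside on the Consul DNS discovery mentioned above: the /$/inet/namerd.service.consul/4100 dst just resolves that hostname through DNS and uses whatever A records come back. A minimal sketch of that lookup (plain JVM DNS resolution, nothing Linkerd-specific; the hostname is the one from our config):

object ResolveNamerd {
  // Resolve all A records that Consul DNS currently returns for the Namerd service name.
  def main(args: Array[String]): Unit = {
    val addrs = java.net.InetAddress.getAllByName("namerd.service.consul")
    addrs.foreach(a => println(s"namerd instance: ${a.getHostAddress}:4100"))
  }
}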
Here is a part of the Linkerd config:
# ...
- protocol: thrift
  label: port-10001
  thriftProtocol: binary
  interpreter:
    kind: io.l5d.namerd
    dst: /$/inet/namerd.service.consul/4100
    namespace: 10001
    retry:
      baseSeconds: 1
      maxSeconds: 1 # we had to set this to 1 second because Linkerd had terrible recovery time once Namerd is down for even a few minutes
  servers:
  - port: 10001
    ip: 0.0.0.0
    thriftFramed: true
  client:
    thriftFramed: true
    loadBalancer:
      kind: ewma
      maxEffort: 24
      decayTimeMs: 10000
    failureAccrual:
      kind: io.l5d.successRate
      successRate: 0
      requests: 1000
      backoff:
        kind: constant
        ms: 0
    hostConnectionPool:
      minSize: 0
      idleTimeMs: 5000
# ...
By the way, we had to reduce the exponential backoff, capping it at 1 second, because of another nasty behavior: Linkerd took far too long to recover once it lost the connection to its Namerd counterpart. An example to reproduce it: stop Namerd and relocate the remote service while something keeps requesting that service through Linkerd. Linkerd holds on to the old address and returns only errors. When Namerd comes back up, it takes 2 to 10 minutes for Linkerd to recover, depending on how long Namerd was down. By reducing that maxSeconds setting we lowered the recovery time for Linkerd.
Originally we thought this would also fix the issue described here, but it didn't really help.
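To illustrate why capping maxSeconds helps: the interpreter backs off exponentially between reconnect attempts, so with a large cap the delay between attempts can itself grow to minutes, which matches the 2-10 minute recovery we saw. The sketch below only shows that shape; the default values (I believe they are around baseSeconds: 5 and maxSeconds: 600, but I have not verified this) and the exact doubling/jitter behavior are assumptions on my part:

object BackoffSketch {
  // Delay (in seconds) before each reconnect attempt: double the previous delay, capped at maxSeconds.
  def schedule(baseSeconds: Int, maxSeconds: Int, attempts: Int): Seq[Int] =
    Iterator.iterate(baseSeconds)(d => math.min(d * 2, maxSeconds)).take(attempts).toSeq

  def main(args: Array[String]): Unit = {
    // Assumed "default-like" settings: the delay quickly reaches minutes, so a Linkerd that
    // started backing off while Namerd was down may wait a long time before reconnecting.
    println(schedule(baseSeconds = 5, maxSeconds = 600, attempts = 10))
    // Our override (baseSeconds: 1, maxSeconds: 1): every retry happens within a second.
    println(schedule(baseSeconds = 1, maxSeconds = 1, attempts = 10))
  }
}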
And the config of Namerd:
admin:
  port: 9991
  ip: 0.0.0.0
telemetry:
- kind: io.l5d.prometheus
  path: /metrics
interfaces:
- kind: io.l5d.thriftNameInterpreter
  port: 4100
  ip: 0.0.0.0
- kind: io.l5d.httpController
  port: 4321
namers:
- kind: io.l5d.consul
  host: 127.0.0.1
  port: 8500
  includeTag: false
  useHealthCheck: true
  healthStatuses:
  - passing
  consistencyMode: stale
storage:
  kind: io.l5d.inMemory
  namespaces:
    1: /svc => /#/io.l5d.consul/dc1/service-a # names and namespaces are made up just for the example
    2: /svc => /#/io.l5d.consul/dc1/service-b
    3: /svc => /#/io.l5d.consul/dc1/service-c
The problem
Every time we restart Namerd, some random Linkerd instances get "stuck" with old addresses of remote services. Once we start releasing our services and they relocate to other VMs with new IPs and ports, some random Linkerd starts emitting service creation failures and the calling service gets the following error from Linkerd: com.twitter.finagle.naming.buoyant.RichConnectionFailedExceptionWithPath: Unable to establish connection to 10.x.x.x:25404. (the IPs and ports vary)
Simplified version of service communication: Service A calls Service B via Linkerd using Thrift protocol:
+-----------+ thrift +---------+ thrift +-----------+
| Service A | <------> | Linkerd | <------> | Service B |
+-----------+ +---------+ +-----------+
I decided to catch this bug in production, since it is impossible to reproduce in QA.
Log of events:
Nov 18th 16:40 - Service B was released in Nomad so all of its instances (it has dozens of them) moved to other VMs and got new IPs and ports.
Nov 18th 16:50 - Namerd was redeployed (6 instances), so it also moved to new VMs and got new IPs. The config.yaml of Namerd was not updated and we did not add any new services, so it was just a regular restart.
Nov 19th 09:39 - Service B was released in Nomad, and again all of its 81 instances got new IPs and ports.
Nov 19th 09:45 - Service A started to produce exceptions like com.twitter.finagle.naming.buoyant.RichConnectionFailedExceptionWithPath: Unable to establish connection to 10.x.x.x:25404., but only on one VM (and therefore only on one Linkerd, since we deploy one Linkerd per host).
I downloaded all error logs of Service A, extracted the ip:port pairs it was complaining about, and checked them against the ip:port pairs of Service B. They all matched the addresses Service B had on Nov 18th, i.e. before its release on Nov 19th. So it looks like this particular Linkerd instance did not get any updates about Service B.
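The cross-check itself was simple; here is a rough sketch of what it amounted to (the regex, file names and output format are illustrative, not our actual tooling):

import scala.io.Source

object CrossCheckAddrs {
  def main(args: Array[String]): Unit = {
    // Pull the "ip:port" out of each RichConnectionFailedExceptionWithPath error line.
    val addrPattern = """Unable to establish connection to (\d+\.\d+\.\d+\.\d+:\d+)""".r
    val errorAddrs = Source.fromFile("service-a-errors.log").getLines()
      .flatMap(line => addrPattern.findFirstMatchIn(line).map(_.group(1)))
      .toSet
    // One "ip:port" per line, dumped from the current Consul registration of Service B.
    val currentAddrs = Source.fromFile("service-b-current-addrs.txt").getLines().toSet
    println(s"stale addresses in the errors: ${errorAddrs -- currentAddrs}")
  }
}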
Nov 19th 10:00 - I started a tcpdump on port 4100 (the port Namerd uses for the thrift interpreter) on two VMs running Linkerd: a "good" one and a "broken" one (the one where we saw the service creation failures and the exceptions above).
Nov 19th 10:05 - Redeployed Service B to trigger yet another relocation of all of its instances.
When Service B finished its release, I stopped tcpdump on both VMs and looked at the captures.
tcpdump result of a "good" Linkerd
I was not able to fully decode the hex data in the TCP packets, but it was clear enough which thrift method and arguments Linkerd was sending to Namerd.
The "good" Linkerd instance was sending a lot of packets with the addr message to Namerd. Those packets contained the dtab path and namespace of Service B (I replaced the real service name with "Service B" here).
Namerd was replying to that message with the addresses (IPs and ports) of the newly released instances of Service B. Since I could not decode the thrift messages, I at least managed to find all those addresses using a Wireshark filter like this: data contains 0A:00:00:01 and data contains 6D:AD (this IP is fake; I was searching for every combination of Service B IP and port that I knew at the time).
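For anyone trying the same approach: the filter bytes are just the IP octets and the big-endian port. A small sketch of how an ip:port pair maps to those patterns (the 10.0.0.1:28077 example is made up, like the one above):

object AddrToHexFilter {
  // Turn "ip:port" into the hex byte patterns for a Wireshark "data contains" filter.
  def toFilter(addr: String): String = {
    val Array(ip, port) = addr.split(":")
    val ipHex = ip.split("\\.").map(o => f"${o.toInt}%02X").mkString(":")
    val p = port.toInt
    val portHex = f"${(p >> 8) & 0xFF}%02X:${p & 0xFF}%02X"
    s"data contains $ipHex and data contains $portHex"
  }

  def main(args: Array[String]): Unit =
    // Prints: data contains 0A:00:00:01 and data contains 6D:AD
    println(toFilter("10.0.0.1:28077"))
}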
tcpdump of a "bad" Linkerd
"bad" Linkerd instance for some reason did not send that to Namerd. At least not for the Service B. In all the packets that contained addr thrift method it was only requesting for a completely different service. However at the same time there were services trying to talk to Service B via Linkerd and they kept getting errors.
Both Linkerd instances, the "good" and the "bad" one, had the same last log message at that point, from Nov 18th:
E 1118 16:52:26.846 GMT THREAD529 TraceId:41b99eb3a22972c7: addr on /#/io.l5d.consul/dc1/service-b
Failure(syscall:getsockopt(..) failed: Connection refused: namerd.service.consul/10.x.x.x:4100 at remote address: namerd.service.consul/10.x.x.x:4100. Remote Info: Not Available, flags=0x18) with RemoteInfo -> Upstream Address: /127.0.0.1:42708, Upstream id: Not Available, Downstream Address: namerd.service.consul/10.x.x.x:4100, Downstream label: interpreter/io.buoyant.namerd.iface.NamerdInterpreterConfig, Trace Id: 554952f66844976c.554952f66844976c<:554952f66844976c with Service -> interpreter/io.buoyant.namerd.iface.NamerdInterpreterConfig
Caused by: com.twitter.finagle.ConnectionFailedException: syscall:getsockopt(..) failed: Connection refused: namerd.service.consul/10.x.x.x:4100 at remote address: namerd.service.consul/10.x.x.x:4100. Remote Info: Not Available
at com.twitter.finagle.netty4.ConnectionBuilder$$anon$1.operationComplete(ConnectionBuilder.scala:100)
at com.twitter.finagle.netty4.ConnectionBuilder$$anon$1.operationComplete(ConnectionBuilder.scala:79)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:629)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:648)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:522)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:423)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:330)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at com.twitter.finagle.util.BlockingTimeTrackingThreadFactory$$anon$1.run(BlockingTimeTrackingThreadFactory.scala:23)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:819)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: namerd.service.consul/10.x.x.x:4100
at io.netty.channel.unix.Socket.finishConnect(..)(Unknown Source)
Caused by: io.netty.channel.unix.Errors$NativeConnectException: syscall:getsockopt(..) failed: Connection refused
... 1 more
Summary
So it looks like the "bad" Linkerd instance got stuck with stale information about Service B and stopped requesting any updates for it, even though it did keep requesting updates for another service. To fix this we had to restart that Linkerd, and then it started working correctly.
We have never been able to reproduce this in QA. It seems to be a very rare condition, so you need a big cluster with lots of services moving around to catch it. We also noticed that we run into this issue much more often since we grew our cluster by 25%. Previously we only hit it occasionally; now it happens almost every time Namerd is restarted.
Next steps
We suspect the problem may be somewhere in the ThriftNamerClient (probably in the loop method). We are going to try switching to the mesh interpreter to see if it fixes the problem.
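To make the suspicion more concrete: the interpreter keeps a long-poll style watch per name, re-issuing an addr request whenever the previous one completes. The following is only an illustrative sketch of that pattern, not the actual ThriftNamerClient code, and the names and failure handling in it are my assumptions. The point is that if some failure path (e.g. a connection refused while Namerd is relocating) ever completes without re-scheduling the next poll, that one watch goes silent while every other watch keeps working, which is exactly what the tcpdump showed.

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

// Illustrative watch loop for a single name (NOT the real ThriftNamerClient).
// If the Failure branch forgot to re-schedule (or handled only some errors),
// the watch would silently stop and the cached addresses would stay stale forever.
class AddrWatch(pollAddr: Long => Future[(Long, Set[String])]) {
  @volatile var current: Set[String] = Set.empty

  def loop(version: Long): Unit =
    pollAddr(version).onComplete {
      case Success((newVersion, addrs)) =>
        current = addrs
        loop(newVersion) // keep watching for the next change
      case Failure(_) =>
        loop(version)    // correct behavior: retry; skipping this call is the suspected bug
    }
}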
@cpretzer Sorry, forgot to put it in. We use Linkerd 1.7.1. We are going to upgrade, but it will take some time.
Just to mention: this bug has existed in every version of Linkerd we have run over the last 2 years - at least, we have seen it ever since we started using Linkerd 2 years ago, and every time we assumed we should just upgrade. Back then our clusters were also smaller, so we had less chance of hitting it. Now, due to the size of our Nomad clusters, we run into it on every Namerd restart.