Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support for trace logs for all outbound and inbound communication in linkerd and namerd #2245

Open
1 of 2 tasks
thedebugger opened this issue Mar 6, 2019 · 4 comments
Open
1 of 2 tasks

Comments

@thedebugger
Copy link

Thanks for your help improving the project!

Getting Help

Github issues are for bug reports and feature requests. For questions about
Linkerd, how to use it, or debugging assistance, start by
asking a question in the forums or join us on
Slack.

Full details at CONTRIBUTING.md.

Filing a Linkerd issue

Issue Type:

  • Bug report
  • Feature request

What happened:
We are seeing numerous cases where 1) namerd state diverge from k8s apiserver - dtabs and endpoint 2) linkerd state diverge from namerd e.g. linkerd thinks only 4 pods are available, but namerd sees 10. We suspect events are getting dropped in namerd and linkerd. To figure out where things are going wrong, linkerd and namerd should trace all inbound and outbound watch (or relevant) events.

What you expected to happen:
Namerd and linkerd should log all inbound and outbound events. E.g.

  1. Namerd: received X event from apiserver
  2. Namerd: making X API call to apiserver
  3. Namerd: sending event X to linkerd Y along with success/failure
  4. Namerd: received an open stream from Linkerd X
  5. Linkerd: making X api call to namerd
  6. Linkerd: received X api call to namerd

If we can also log on namerd that linkerd with "hostname" is connected to namerd, that would be really helpful too. IPs are NATed. So we lose the observability what linkerd is connected to what namerd.

How to reproduce it (as minimally and precisely as possible):
N/A

Anything else we need to know?:
Let me know if you know more information

Environment:

  • linkerd/namerd version, config files: linkerd 1.5.2, namerd 1.5.2, http2 mesh, thriftmux
  • Platform, version, and config files (Kubernetes, DC/OS, etc): 1.10.5
  • Cloud provider or hardware configuration: not relevant
@thedebugger thedebugger changed the title support for trace logs for all outbound communication support for trace logs for all outbound and inbound communication in linkerd and namerd Mar 6, 2019
@dadjeibaah
Copy link
Contributor

@thedebugger I've begun working on this and it looks the k8s namer has some reach logging statements that are available. For example, The log lines below show Linkerd with a k8s namer set to trace logging. The namer is watching a service whose endpoints refresh every 30 seconds.

D 0517 15:37:19.813 PDT THREAD105 TraceId:36560018d6e610fa: k8s event on '/api/v1/watch/namespaces/default/endpoints/hello1' received 'EndpointsModified(Endpoints(Some(List(EndpointSubset(None,Some(List(EndpointAddress(10.1.1.55,Some(docker-for-desktop),Some(ObjectReference(Some(Pod),Some(default),Some(hello1-5c4b78f69-tkqld),Some(5317b137-78f4-11e9-b3b7-025000000001),None,Some(173776),None))))),Some(List(EndpointPort(7777,Some(http),Some(TCP))))))),Some(Endpoints),Some(ObjectMeta(Some(hello1),None,Some(default),Some(/api/v1/namespaces/default/endpoints/hello1),Some(b7e56b41-78ec-11e9-b3b7-025000000001),Some(173777),None,Some(2019-05-17T21:42:50Z),None,None,None)),Some(v1)))
D 0517 15:37:19.813 PDT THREAD105 TraceId:36560018d6e610fa: k8s watch on '/api/v1/watch/namespaces/default/endpoints/hello1' registered event with same or newer resource version Some(173777) (older resource version was Some(173759))
D 0517 15:37:19.813 PDT THREAD105 TraceId:36560018d6e610fa: k8s ns default service hello1 modified endpoints
D 0517 15:37:19.813 PDT THREAD105 TraceId:36560018d6e610fa: k8s ns default service hello1 added Endpoint(/10.1.1.55,Some(docker-for-desktop))
D 0517 15:37:19.813 PDT THREAD105 TraceId:36560018d6e610fa: k8s ns default service hello1 mapped port http to 7777
D 0517 15:37:47.726 PDT THREAD105 TraceId:36560018d6e610fa: k8s event on '/api/v1/watch/namespaces/default/endpoints/hello1' received 'EndpointsModified(Endpoints(None,Some(Endpoints),Some(ObjectMeta(Some(hello1),None,Some(default),Some(/api/v1/namespaces/default/endpoints/hello1),Some(b7e56b41-78ec-11e9-b3b7-025000000001),Some(173833),None,Some(2019-05-17T21:42:50Z),None,None,None)),Some(v1)))
D 0517 15:37:47.726 PDT THREAD105 TraceId:36560018d6e610fa: k8s watch on '/api/v1/watch/namespaces/default/endpoints/hello1' registered event with same or newer resource version Some(173833) (older resource version was Some(173777))
D 0517 15:37:47.726 PDT THREAD105 TraceId:36560018d6e610fa: k8s ns default service hello1 modified endpoints
D 0517 15:37:47.727 PDT THREAD105 TraceId:36560018d6e610fa: k8s ns default service hello1 removed Endpoint(/10.1.1.55,Some(docker-for-desktop))
D 0517 15:37:47.727 PDT THREAD105 TraceId:36560018d6e610fa: k8s ns default service hello1 removed port mapping from http to 7777
D 0517 15:37:49.696 PDT THREAD105 TraceId:36560018d6e610fa: k8s event on '/api/v1/watch/namespaces/default/endpoints/hello1' received 'EndpointsModified(Endpoints(Some(List(EndpointSubset(None,Some(List(EndpointAddress(10.1.1.56,Some(docker-for-desktop),Some(ObjectReference(Some(Pod),Some(default),Some(hello1-6b8f7994c8-d5kpp),Some(6515bb6b-78f4-11e9-b3b7-025000000001),None,Some(173843),None))))),Some(List(EndpointPort(7777,Some(http),Some(TCP))))))),Some(Endpoints),Some(ObjectMeta(Some(hello1),None,Some(default),Some(/api/v1/namespaces/default/endpoints/hello1),Some(b7e56b41-78ec-11e9-b3b7-025000000001),Some(173845),None,Some(2019-05-17T21:42:50Z),None,None,None)),Some(v1)))
D 0517 15:37:49.696 PDT THREAD105 TraceId:36560018d6e610fa: k8s watch on '/api/v1/watch/namespaces/default/endpoints/hello1' registered event with same or newer resource version Some(173845) (older resource version was Some(173833))
D 0517 15:37:49.697 PDT THREAD105 TraceId:36560018d6e610fa: k8s ns default service hello1 modified endpoints
D 0517 15:37:49.697 PDT THREAD105 TraceId:36560018d6e610fa: k8s ns default service hello1 added Endpoint(/10.1.1.56,Some(docker-for-desktop))
D 0517 15:37:49.697 PDT THREAD105 TraceId:36560018d6e610fa: k8s ns default service hello1 mapped port http to 7777
...

Note: There is a new log line ( k8s event...) that I added in a dev branch, however, the other log lines are currently available on master and I think are sufficient to track the state of endpoints being watched by Namerd.

What we would need to add is logging between Linkerd and Namerd's hop, which is everything from points 3 to 6. Thoughts?

@cheriot
Copy link

cheriot commented May 20, 2019

Hi @dadjeibaah, thanks for taking a look. The existing logging for 1 and 2 looks good for our purposes. The biggest ongoing problem we have is linkerd state diverging from namerd. A namerd release with [points 3 and 4] logs identifying the route and downstream linkerd might be the key.

@dadjeibaah
Copy link
Contributor

@cheriot, got it. I wonder if you could infer which downstream Linkerd Namerd is sending discovery information by running tracing logging in the h2 module. At this log level, Namerd will log HTTP/2 frames like this:

D 0521 11:35:55.692 PDT THREAD95: [L:/127.0.0.1:4321 R:/127.0.0.1:64000] write: DATA
D 0521 11:35:55.693 PDT THREAD115: [L:/127.0.0.1:4321 R:/127.0.0.1:64000] wrote: DATA: Return(())
D 0521 11:35:55.693 PDT THREAD115: [L:/127.0.0.1:4321 R:/127.0.0.1:64000] wrote: DATA: Return(())

In the example above, you can infer the downstream Linkerd client, which in this case is the ip at R:/127.0.0.1:64000, There is also a PR #2255, that will give access log style output to give you route information. Would this info satisfy requirements 3 and 4?

@cheriot
Copy link

cheriot commented May 22, 2019

I don't think the H2 frames will get us there. Is there any identifier available that's more clear than ip? That will be difficult to trace back (in some datacenters there's a NAT). Let me describe the problem more. The two biggest issues are:

  1. Some linkerds routing to an endpoint that is gone after other linkerds have gotten the updated endpoints.
  2. Some linkerds missing a dtab change.

In both cases, we've been identifying the stale linkerds from metrics and restarting them. With the linkerd per host model that's pretty impactful.

What we need on the namerd side is some identifier for the linkerd receiving the update, the route being updated, and an indication of whether this is a dtab change, an endpoint change, or something else.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

3 participants