
Traefik doesn't reconnect to Jaeger when connection lost #6093

Closed
Pehesi97 opened this issue Dec 26, 2019 · 7 comments
Labels
area/middleware/tracing kind/bug/confirmed a confirmed bug (reproducible). priority/P2 need to be fixed in the future status/5-frozen-due-to-age

Comments

@Pehesi97

Do you want to request a feature or report a bug?

Report a bug.

What did you do?

Configured Traefik to output traces to a Jaeger instance running on the same Kubernetes cluster using the following arguments:

- --tracing
- --tracing.jaeger.samplingServerURL=jaeger-agent.tracing:5778/sampling
- --tracing.jaeger.localAgentHostPort=jaeger-agent.tracing:6831

It worked correctly until my Jaeger Agent pod restarted. Traefik never reconnected to the service, so my traces stopped being written to Jaeger.

What did you expect to see?

Traefik reconnecting to Jaeger Agent when it was available again.

What did you see instead?

Traefik didn't reconnect to Jaeger Agent when it was available again.

Output of traefik version: (What version of Traefik are you using?)

Version:      2.0.4
Codename:     montdor
Go version:   go1.13.3
Built:        2019-10-28T20:23:57Z
OS/Arch:      linux/amd64

What is your environment & configuration (arguments, toml, provider, platform, ...)?

(screenshot of the Traefik arguments attached in the original issue)

@juliens juliens added area/middleware/tracing kind/bug/possible a possible bug that needs analysis before it is confirmed or fixed. and removed status/0-needs-triage labels Dec 26, 2019
@ddtmachado
Contributor

I can confirm the issue on Kubernetes, and probably on any other platform where a service name is used as the Jaeger host endpoint.

To me it looks like the issue is caused by Traefik not recreating the Jaeger client, because the transport is a UDP connection (no failure feedback). For example:

  • Traefik (actually the imported Jaeger client) resolves the IP address behind the service name during the tracing middleware setup
  • Kubernetes updates the internal DNS records for the service when a pod is restarted
  • Since it is a UDP connection, it doesn't matter that the other side is no longer listening: Traefik keeps sending to the old pod address and never recreates the Jaeger client

Not sure what we could do in this case though

@mpl mpl added priority/P2 need to be fixed in the future kind/bug/confirmed a confirmed bug (reproducible). and removed kind/bug/possible a possible bug that needs analysis before it is confirmed or fixed. labels Jan 2, 2020
@ct27stf

ct27stf commented Mar 25, 2020

The samplingServerURL is an HTTP endpoint.
Also, the jaeger-agent exposes port 14271 (HTTP) with a healthcheck at / and metrics at /metrics.

@kevtainer
Contributor

I believe this is an issue with the jaeger-client-go library (jaegertracing/jaeger-client-go#403); a fix was proposed but rejected as creating too much overhead.

@dtomcej
Contributor

dtomcej commented May 12, 2020

@Pehesi97 in this case, isn't jaeger-agent.tracing a Service in the tracing namespace, similar to the one in the all-in-one template (https://github.com/jaegertracing/jaeger-kubernetes/blob/master/all-in-one/jaeger-all-in-one-template.yml#L108)?

If so, the Service IP should not have changed; restarting a pod makes no difference to the Service IP.

Traefik would never have connected directly to a pod IP, as pods in Kubernetes do not normally get a DNS record. There are exceptions, but this does not look like one of them.

Is it possible that your pod did not pass the readiness checks, and did not get re-added as a service endpoint?

Or is there another reason that your service IP would have changed?

@terev

terev commented Jun 27, 2020

@dtomcej this can happen because that template deploys a headless service for the agent. This means no proxying is done; the service merely provides a DNS convenience for round-robinning across the pods it selects. I've submitted a PR to the jaeger-client-go library (jaegertracing/jaeger-client-go#520) that should resolve this issue. Would anyone mind having a look and possibly giving it a 👍 for more exposure?
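For readers unfamiliar with the distinction: a headless Service is declared with `clusterIP: None`, so DNS for the service name returns the pod IPs directly instead of a stable virtual IP. A minimal sketch (hypothetical names; the agent's compact-thrift UDP port 6831 matches the issue's configuration) looks like:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: jaeger-agent
  namespace: tracing
spec:
  clusterIP: None        # headless: DNS returns pod IPs, no virtual IP
  selector:
    app: jaeger
  ports:
    - name: agent-compact
      port: 6831
      protocol: UDP
```

With this shape, a client that resolves `jaeger-agent.tracing` once at startup pins a pod IP, and a pod restart leaves it sending to a dead address, which is exactly the failure mode reported above.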

@terev

terev commented Aug 17, 2020

FYI: if you upgrade your jaeger-client-go dependency now, you should see that this issue is resolved by the linked PR.

@ldez ldez added this to the 2.3 milestone Aug 20, 2020
@ldez
Member

ldez commented Aug 20, 2020

Closed by #7198

@ldez ldez closed this as completed Aug 20, 2020
v2 automation moved this from issues to Done Aug 20, 2020
@traefik traefik locked and limited conversation to collaborators Sep 20, 2020
10 participants