
Large percentage of spans captured by jaeger_tracer_reporter_spans_total metric are resulting in error #821

Closed
naveedyahyazadeh opened this issue Dec 20, 2021 · 3 comments

naveedyahyazadeh commented Dec 20, 2021

Describe the bug
Some of our instances running with Jaeger tracing enabled report a high proportion of jaeger_tracer_reporter_spans_total{result="err"} occurrences. At times the error percentage exceeds 10% or even 30%, but it never reaches 100% (which is what we would expect in the case of, say, an agent outage or networking failure). The percentage is fairly consistent, but we sometimes see it drop to ~0% after fully restarting our application. The image below shows that behavior in action:

[Screenshot: error percentage for jaeger_tracer_reporter_spans_total over time, holding steady above 0% and dropping to ~0% after an application restart]

Expected behavior
We expect to have a much lower percentage of spans failing to report on our instances, ideally near 0%.

Version (please complete the following information):

  • OS: CentOS 7 Linux, kernel 3.10.0-1160.49.1.el7.x86_64 (per uname -r)
  • Jaeger version: observed on both 1.1 and 1.6
  • Deployment: EC2 + docker, and K8s

What troubleshooting steps did you try?

  • Enabled debug logging for the RemoteReporter and saw the error message posted in a comment below
  • After seeing this change get merged and released, we tried upgrading to Jaeger client 1.6.0 to see if that would resolve the issue we were experiencing. Unfortunately, it did not, and I believe that is because of our network configuration for the Jaeger client and agent: the client runs in one docker container and communicates with the agent through the host.docker.internal host, so I don't believe there is any DNS resolution happening during client/agent communications. (A sketch of roughly how that sender configuration looks is included below.)
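For context, a minimal sketch of how a tracer pointed at the agent via host.docker.internal might be configured with the jaeger-client-java Configuration API; the service name, port, and sampler settings are illustrative assumptions, not copied from our deployment:

import io.jaegertracing.Configuration;
import io.jaegertracing.Configuration.ReporterConfiguration;
import io.jaegertracing.Configuration.SamplerConfiguration;
import io.jaegertracing.Configuration.SenderConfiguration;
import io.opentracing.Tracer;

public class TracerSetup {
    public static void main(String[] args) {
        // UDP sender pointed at the agent running on the docker host.
        SenderConfiguration sender = new SenderConfiguration()
                .withAgentHost("host.docker.internal")
                .withAgentPort(6831); // compact-thrift UDP port (assumed default)

        ReporterConfiguration reporter = new ReporterConfiguration()
                .withSender(sender)
                .withLogSpans(true); // optionally log spans as they are reported

        SamplerConfiguration sampler = new SamplerConfiguration()
                .withType("const")
                .withParam(1); // sample everything; placeholder setting

        Tracer tracer = new Configuration("my-service") // placeholder service name
                .withReporter(reporter)
                .withSampler(sampler)
                .getTracer();

        tracer.buildSpan("smoke-test").start().finish();
    }
}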

Previous Gitter Inquiries
More context can be found regarding our environment and the tests that we tried in two small discussions we had on Gitter:
https://gitter.im/jaegertracing/Lobby?at=5eced9bc89941d051a28aa0d
https://gitter.im/jaegertracing/Lobby?at=603fe580d1aee44e2dc0ed11

naveedyahyazadeh (Author) commented Dec 20, 2021

Additionally, we ran a few tests on live instances to see if we could reproduce this "bad state" of persistent client reporting failures:

  1. Restarting the Jaeger Agent
  2. Blocking ports 6831/6832 on the instance to mock a networking failure
  3. Throttling bandwidth on the network
  4. Adding latency on the network

The first two tests resulted in 100% of spans reporting result="err" while the agent was down or the ports were blocked; however, the metric returned to 0% result="err" after the test concluded and the instance was back in a normal "healthy" state.
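For anyone who wants to repeat the port-blocking test without a full tracer, a small standalone probe along these lines can show whether sends to the agent port surface errors on the client side (how the port is blocked matters: errors only show up if ICMP feedback reaches the client). The host, port, and payload size are assumptions, and connecting the socket is done here only so that send failures become visible; it is not claimed to match the Jaeger transport exactly.

import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetSocketAddress;

public class AgentUdpProbe {
    public static void main(String[] args) throws Exception {
        // Host and port are assumptions matching the setup described above.
        InetSocketAddress agent = new InetSocketAddress("host.docker.internal", 6831);
        try (DatagramSocket socket = new DatagramSocket()) {
            // A connected UDP socket lets "port unreachable" errors surface as
            // IOExceptions on later sends instead of being silently dropped.
            socket.connect(agent);
            byte[] payload = new byte[512]; // dummy payload, not a valid Thrift batch
            for (int i = 0; i < 20; i++) {
                try {
                    socket.send(new DatagramPacket(payload, payload.length));
                    System.out.println("send " + i + ": ok");
                } catch (IOException e) {
                    System.out.println("send " + i + ": failed - " + e);
                }
                Thread.sleep(500);
            }
        }
    }
}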

mehta-ankit (Member) commented:

Another thread about it on Jaeger Slack: https://cloud-native.slack.com/archives/CGG7NFUJ3/p1615574357080000

naveedyahyazadeh (Author) commented:

This is the error that we are seeing from the RemoteReporter logs:

2021-12-21 20:32:29,235 [jaeger.RemoteReporter-QueueProcessor] WARN  io.jaegertracing.internal.reporters.RemoteReporter - FlushCommand execution failed! Repeated errors of this command will not be logged.
io.jaegertracing.internal.exceptions.SenderException: Failed to flush spans.
        at io.jaegertracing.thrift.internal.senders.ThriftSender.flush(ThriftSender.java:116)
        at io.jaegertracing.internal.reporters.RemoteReporter$FlushCommand.execute(RemoteReporter.java:160)
        at io.jaegertracing.internal.reporters.RemoteReporter$QueueProcessor.run(RemoteReporter.java:182)
        at java.lang.Thread.run(Thread.java:748)
Caused by: io.jaegertracing.internal.exceptions.SenderException: Could not send 104 spans
        at io.jaegertracing.thrift.internal.senders.UdpSender.send(UdpSender.java:86)
        at io.jaegertracing.thrift.internal.senders.ThriftSender.flush(ThriftSender.java:114)
        ... 3 more
Caused by: org.apache.thrift.transport.TTransportException: Cannot flush closed transport
        at io.jaegertracing.thrift.internal.reporters.protocols.ThriftUdpTransport.flush(ThriftUdpTransport.java:151)
        at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:73)
        at org.apache.thrift.TServiceClient.sendBaseOneway(TServiceClient.java:66)
        at io.jaegertracing.agent.thrift.Agent$Client.send_emitBatch(Agent.java:70)
        at io.jaegertracing.agent.thrift.Agent$Client.emitBatch(Agent.java:63)
        at io.jaegertracing.thrift.internal.senders.UdpSender.send(UdpSender.java:84)
        ... 4 more
Caused by: java.io.IOException: No buffer space available (sendto failed)
        at java.net.PlainDatagramSocketImpl.send(Native Method)
        at java.net.DatagramSocket.send(DatagramSocket.java:693)
        at io.jaegertracing.thrift.internal.reporters.protocols.ThriftUdpTransport.flush(ThriftUdpTransport.java:149)
        ... 9 more
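
Not a fix, but for reference: the innermost cause above is sendto() failing with "No buffer space available" (ENOBUFS), and on the client side the knobs that change how much data each flush pushes through the UDP socket are the sender's max packet size and the reporter's flush interval and queue size. A hedged sketch of wiring those up with jaeger-client-java follows; the host, port, service name, and the specific values are assumptions rather than recommendations, and whether tuning them avoids this error is untested:

import io.jaegertracing.internal.JaegerTracer;
import io.jaegertracing.internal.reporters.RemoteReporter;
import io.jaegertracing.internal.samplers.ConstSampler;
import io.jaegertracing.spi.Reporter;
import io.jaegertracing.spi.Sender;
import io.jaegertracing.thrift.internal.senders.UdpSender;

public class ReporterTuning {
    public static void main(String[] args) {
        // Smaller UDP packets mean each sendto() call writes less into the socket buffer.
        Sender sender = new UdpSender("host.docker.internal", 6831, 16384 /* maxPacketSize, bytes */);

        Reporter reporter = new RemoteReporter.Builder()
                .withSender(sender)
                .withFlushInterval(2000)  // ms between scheduled flushes
                .withMaxQueueSize(1000)   // spans buffered in the client before being dropped
                .build();

        JaegerTracer tracer = new JaegerTracer.Builder("my-service") // placeholder name
                .withReporter(reporter)
                .withSampler(new ConstSampler(true))
                .build();

        tracer.buildSpan("smoke-test").start().finish();
        tracer.close(); // flushes and closes the reporter and sender
    }
}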
