
Large percentage of spans captured by jaeger_tracer_reporter_spans_total metric are resulting in error #821

Closed
naveedyahyazadeh opened this issue Dec 20, 2021 · 3 comments

naveedyahyazadeh commented Dec 20, 2021

Describe the bug
Some of our instances running with Jaeger tracing enabled report a high proportion of jaeger_tracer_reporter_spans_total{result="err"} occurrences. At times the error percentage exceeds 10% or even 30%, but it never reaches 100% (which is what we would expect in the case of, say, an agent outage or networking failure). The percentage is fairly consistent, but we sometimes see it drop to ~0% after fully restarting our application. The image below shows that behavior in action:

[Screenshot: error percentage for jaeger_tracer_reporter_spans_total over time, holding steady above 0% and dropping to ~0% after an application restart]

Expected behavior
We expect to have a much lower percentage of spans failing to report on our instances, ideally near 0%.

Version (please complete the following information):

  • OS: CentOS 7 Linux, kernel 3.10.0-1160.49.1.el7.x86_64 (per uname -r)
  • Jaeger version: observed on both 1.1 and 1.6
  • Deployment: EC2 + docker, and K8s

What troubleshooting steps did you try?

  • Enabled debug logging for the RemoteReporter and saw the error message posted in a comment below
  • After seeing this change get merged and released, we tried upgrading to Jaeger client 1.6.0 to see if that would resolve the issue we were experiencing. Unfortunately, it did not, and I believe that is because of our network configuration for the Jaeger client and agent: the client runs in one docker container and communicates with the agent through the host.docker.internal host, so I don't believe there is any DNS resolution happening during client/agent communications. (A sketch of roughly how that sender configuration looks is included below.)
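For context, a minimal sketch of how a tracer pointed at the agent via host.docker.internal might be configured with the jaeger-client-java Configuration API; the service name, port, and sampler settings are illustrative assumptions, not copied from our deployment:

import io.jaegertracing.Configuration;
import io.jaegertracing.Configuration.ReporterConfiguration;
import io.jaegertracing.Configuration.SamplerConfiguration;
import io.jaegertracing.Configuration.SenderConfiguration;
import io.opentracing.Tracer;

public class TracerSetup {
    public static void main(String[] args) {
        // UDP sender pointed at the agent running on the docker host.
        SenderConfiguration sender = new SenderConfiguration()
                .withAgentHost("host.docker.internal")
                .withAgentPort(6831); // compact-thrift UDP port (assumed default)

        ReporterConfiguration reporter = new ReporterConfiguration()
                .withSender(sender)
                .withLogSpans(true); // optionally log spans as they are reported

        SamplerConfiguration sampler = new SamplerConfiguration()
                .withType("const")
                .withParam(1); // sample everything; placeholder setting

        Tracer tracer = new Configuration("my-service") // placeholder service name
                .withReporter(reporter)
                .withSampler(sampler)
                .getTracer();

        tracer.buildSpan("smoke-test").start().finish();
    }
}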

Previous Gitter Inquiries
More context can be found regarding our environment and the tests that we tried in two small discussions we had on Gitter:
https://gitter.im/jaegertracing/Lobby?at=5eced9bc89941d051a28aa0d
https://gitter.im/jaegertracing/Lobby?at=603fe580d1aee44e2dc0ed11

naveedyahyazadeh (Author) commented Dec 20, 2021

Additionally, we ran a few tests on live instances to see if we could reproduce this "bad state" of persistent client reporting failures:

  1. Restarting the Jaeger Agent
  2. Blocking ports 6831/6832 on the instance to mock a networking failure
  3. Throttling bandwidth on the network
  4. Adding latency on the network

The first two tests resulted in 100% of spans reporting result="err" while the agent was down or the ports were blocked; however, the metric returned to 0% result="err" after the test concluded and the instance was back in a normal "healthy" state.
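For anyone who wants to repeat the port-blocking test without a full tracer, a small standalone probe along these lines can show whether sends to the agent port surface errors on the client side (how the port is blocked matters: errors only show up if ICMP feedback reaches the client). The host, port, and payload size are assumptions, and connecting the socket is done here only so that send failures become visible; it is not claimed to match the Jaeger transport exactly.

import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetSocketAddress;

public class AgentUdpProbe {
    public static void main(String[] args) throws Exception {
        // Host and port are assumptions matching the setup described above.
        InetSocketAddress agent = new InetSocketAddress("host.docker.internal", 6831);
        try (DatagramSocket socket = new DatagramSocket()) {
            // A connected UDP socket lets "port unreachable" errors surface as
            // IOExceptions on later sends instead of being silently dropped.
            socket.connect(agent);
            byte[] payload = new byte[512]; // dummy payload, not a valid Thrift batch
            for (int i = 0; i < 20; i++) {
                try {
                    socket.send(new DatagramPacket(payload, payload.length));
                    System.out.println("send " + i + ": ok");
                } catch (IOException e) {
                    System.out.println("send " + i + ": failed - " + e);
                }
                Thread.sleep(500);
            }
        }
    }
}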

mehta-ankit (Member) commented:

Another thread about it on Jaeger Slack: https://cloud-native.slack.com/archives/CGG7NFUJ3/p1615574357080000

naveedyahyazadeh (Author) commented:

This is the error that we are seeing from the RemoteReporter logs:

2021-12-21 20:32:29,235 [jaeger.RemoteReporter-QueueProcessor] WARN  io.jaegertracing.internal.reporters.RemoteReporter - FlushCommand execution failed! Repeated errors of this command will not be logged.
io.jaegertracing.internal.exceptions.SenderException: Failed to flush spans.
        at io.jaegertracing.thrift.internal.senders.ThriftSender.flush(ThriftSender.java:116)
        at io.jaegertracing.internal.reporters.RemoteReporter$FlushCommand.execute(RemoteReporter.java:160)
        at io.jaegertracing.internal.reporters.RemoteReporter$QueueProcessor.run(RemoteReporter.java:182)
        at java.lang.Thread.run(Thread.java:748)
Caused by: io.jaegertracing.internal.exceptions.SenderException: Could not send 104 spans
        at io.jaegertracing.thrift.internal.senders.UdpSender.send(UdpSender.java:86)
        at io.jaegertracing.thrift.internal.senders.ThriftSender.flush(ThriftSender.java:114)
        ... 3 more
Caused by: org.apache.thrift.transport.TTransportException: Cannot flush closed transport
        at io.jaegertracing.thrift.internal.reporters.protocols.ThriftUdpTransport.flush(ThriftUdpTransport.java:151)
        at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:73)
        at org.apache.thrift.TServiceClient.sendBaseOneway(TServiceClient.java:66)
        at io.jaegertracing.agent.thrift.Agent$Client.send_emitBatch(Agent.java:70)
        at io.jaegertracing.agent.thrift.Agent$Client.emitBatch(Agent.java:63)
        at io.jaegertracing.thrift.internal.senders.UdpSender.send(UdpSender.java:84)
        ... 4 more
Caused by: java.io.IOException: No buffer space available (sendto failed)
        at java.net.PlainDatagramSocketImpl.send(Native Method)
        at java.net.DatagramSocket.send(DatagramSocket.java:693)
        at io.jaegertracing.thrift.internal.reporters.protocols.ThriftUdpTransport.flush(ThriftUdpTransport.java:149)
        ... 9 more
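
Not a fix, but for reference: the innermost cause above is sendto() failing with "No buffer space available" (ENOBUFS), and on the client side the knobs that change how much data each flush pushes through the UDP socket are the sender's max packet size and the reporter's flush interval and queue size. A hedged sketch of wiring those up with jaeger-client-java follows; the host, port, service name, and the specific values are assumptions rather than recommendations, and whether tuning them avoids this error is untested:

import io.jaegertracing.internal.JaegerTracer;
import io.jaegertracing.internal.reporters.RemoteReporter;
import io.jaegertracing.internal.samplers.ConstSampler;
import io.jaegertracing.spi.Reporter;
import io.jaegertracing.spi.Sender;
import io.jaegertracing.thrift.internal.senders.UdpSender;

public class ReporterTuning {
    public static void main(String[] args) {
        // Smaller UDP packets mean each sendto() call writes less into the socket buffer.
        Sender sender = new UdpSender("host.docker.internal", 6831, 16384 /* maxPacketSize, bytes */);

        Reporter reporter = new RemoteReporter.Builder()
                .withSender(sender)
                .withFlushInterval(2000)  // ms between scheduled flushes
                .withMaxQueueSize(1000)   // spans buffered in the client before being dropped
                .build();

        JaegerTracer tracer = new JaegerTracer.Builder("my-service") // placeholder name
                .withReporter(reporter)
                .withSampler(new ConstSampler(true))
                .build();

        tracer.buildSpan("smoke-test").start().finish();
        tracer.close(); // flushes and closes the reporter and sender
    }
}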
