This repository has been archived by the owner on Jul 1, 2022. It is now read-only.

io.jaegertracing.jaeger-client 1.7.0 ICMP port unreachable if agent daemonset restarts #827

Closed
ghost opened this issue Feb 1, 2022 · 4 comments

Comments

@ghost

ghost commented Feb 1, 2022

We have Jaeger agents deployed as a DaemonSet on Kubernetes. If the agents restart for any reason, microservices that use io.jaegertracing.jaeger-client 1.7.0 cannot reconnect to the agent's UDP port at nodeIP:6831. The logs show this exception:

WARN     12:55:17.150    [jaeger.RemoteReporter-QueueProcessor] io.jaegertracing.internal.reporters.RemoteReporter    : FlushCommand execution failed! Repeated errors of this command will not be logged.
io.jaegertracing.internal.exceptions.SenderException: Failed to flush spans.
    at io.jaegertracing.thrift.internal.senders.ThriftSender.flush(ThriftSender.java:116)
    at io.jaegertracing.internal.reporters.RemoteReporter$FlushCommand.execute(RemoteReporter.java:158)
    at io.jaegertracing.internal.reporters.RemoteReporter$QueueProcessor.run(RemoteReporter.java:179)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.jaegertracing.internal.exceptions.SenderException: Could not send 1 spans
    at io.jaegertracing.thrift.internal.senders.UdpSender.send(UdpSender.java:86)
    at io.jaegertracing.thrift.internal.senders.ThriftSender.flush(ThriftSender.java:114)
    ... 3 common frames omitted
Caused by: org.apache.thrift.transport.TTransportException: Cannot flush closed transport
    at io.jaegertracing.thrift.internal.reporters.protocols.ThriftUdpTransport.flush(ThriftUdpTransport.java:151)
    at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:73)
    at org.apache.thrift.TServiceClient.sendBaseOneway(TServiceClient.java:66)
    at io.jaegertracing.agent.thrift.Agent$Client.send_emitBatch(Agent.java:70)
    at io.jaegertracing.agent.thrift.Agent$Client.emitBatch(Agent.java:63)
    at io.jaegertracing.thrift.internal.senders.UdpSender.send(UdpSender.java:84)
    ... 4 common frames omitted
Caused by: java.net.PortUnreachableException: ICMP Port Unreachable
    at java.net.PlainDatagramSocketImpl.send(Native Method)
    at java.net.DatagramSocket.send(DatagramSocket.java:693)
    at io.jaegertracing.thrift.internal.reporters.protocols.ThriftUdpTransport.flush(ThriftUdpTransport.java:149)
    ... 9 common frames omitted

If the microservice is restarted, everything works again.

Microservices are configured with:

env:
- name: JAEGER_AGENT_HOST
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP
- name: JAEGER_AGENT_PORT
  value: "6831"
- name: JAEGER_SAMPLER_MANAGER_HOST_PORT
  value: "$(JAEGER_AGENT_HOST):5778"
- name: JAEGER_SAMPLER_TYPE
  value: "remote"
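
For illustration, here is a minimal sketch of how a client process might turn these two variables into the UDP endpoint it sends spans to. The class name AgentEndpoint and the fallback values localhost and 6831 are assumptions made for this example; this is not the jaeger-client implementation.

```java
import java.net.InetSocketAddress;
import java.util.Map;

/** Hypothetical helper for this sketch; not part of jaeger-client. */
final class AgentEndpoint {
  // Assumed fallbacks for the example when a variable is not set.
  static final String DEFAULT_HOST = "localhost";
  static final int DEFAULT_PORT = 6831;

  /** Builds the agent address from an environment-style map. */
  static InetSocketAddress fromEnv(Map<String, String> env) {
    String host = env.getOrDefault("JAEGER_AGENT_HOST", DEFAULT_HOST);
    int port = Integer.parseInt(
        env.getOrDefault("JAEGER_AGENT_PORT", String.valueOf(DEFAULT_PORT)));
    // Unresolved on purpose: no DNS lookup happens until a socket connects.
    return InetSocketAddress.createUnresolved(host, port);
  }

  public static void main(String[] args) {
    System.out.println(fromEnv(System.getenv()));
  }
}
```

Because status.hostIP is the node's address, the endpoint computed at pod start stays the same across an agent restart; only the process listening on that port goes away and comes back.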

To Reproduce
Steps to reproduce the behavior:

  1. Restart the Jaeger agent DaemonSet.
  2. Microservices using the jaeger-client Java library report an ICMP port unreachable error. The client does not recover when UDP port 6831 becomes available again after the restart.
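
The error itself can be observed outside Jaeger with a plain connected DatagramSocket. The sketch below is a stand-alone probe, assuming a loopback target; the class name UdpUnreachableProbe and its method are hypothetical. It sends a few datagrams to a port and reports whether any send raised PortUnreachableException. The JDK does not guarantee that the exception is thrown, so the result for a closed port may differ between platforms.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.PortUnreachableException;

/** Hypothetical probe for this sketch; not part of jaeger-client. */
final class UdpUnreachableProbe {
  /** Returns true if any send on a connected socket raised PortUnreachableException. */
  static boolean sawPortUnreachable(InetAddress host, int port, int attempts)
      throws IOException, InterruptedException {
    byte[] payload = {0};
    try (DatagramSocket socket = new DatagramSocket()) {
      // Only a connected UDP socket can surface ICMP errors to the sender.
      socket.connect(host, port);
      for (int i = 0; i < attempts; i++) {
        try {
          socket.send(new DatagramPacket(payload, payload.length));
        } catch (PortUnreachableException e) {
          return true;
        }
        // The ICMP reply arrives asynchronously; wait before the next send.
        Thread.sleep(50);
      }
    }
    return false;
  }

  public static void main(String[] args) throws Exception {
    InetAddress loopback = InetAddress.getLoopbackAddress();
    int closedPort;
    // Bind and release an ephemeral port so that nothing is listening on it.
    try (DatagramSocket tmp = new DatagramSocket(0, loopback)) {
      closedPort = tmp.getLocalPort();
    }
    System.out.println("closed port reported unreachable: "
        + sawPortUnreachable(loopback, closedPort, 3));
  }
}
```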

Expected behavior
When the Jaeger agent DaemonSet restarts, jaeger-client should reconnect to the agent successfully.

Version (please complete the following information):

  • OS: Linux
  • Jaeger version: Jaeger Agent 1.7.0. Jaeger operator helm chart: 2.27.1
  • Deployment: Kubernetes

What troubleshooting steps did you try?

  • Ran nc -vuzw 3 6831 from the microservice pod. The port is reachable from the pod, but jaeger-client cannot connect to it after the agent DaemonSet restarts.
@ghost ghost added the bug label Feb 1, 2022
@yurishkuro
Member

There was an attempt to fix something similar in #726.

Note that this library is deprecated; there are no plans to fix any bugs unless they are security related. Please see the notice in the readme.

@ghost
Author

ghost commented Feb 2, 2022

Thanks. Can I ask a question here? Please ignore if it is not appropriate.

The readme says that the jaeger-client library is being deprecated and that the OpenTelemetry library should be used instead. I followed the suggested guide, https://medium.com/jaegertracing/migrating-from-jaeger-client-to-opentelemetry-sdk-bd337d796759, to implement the new OpenTelemetry client (with the latest versions of the BOMs).

Why and how is the OpenTelemetry library with the Jaeger bridge considered ready?

  1. According to the guide, a service needs an alpha version of a library for this to work. I used the latest one:
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>io.opentelemetry</groupId>
                <artifactId>opentelemetry-bom</artifactId>
                <version>1.10.1</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
            <dependency>
                <groupId>io.opentelemetry</groupId>
                <artifactId>opentelemetry-bom-alpha</artifactId>
                <version>1.10.1-alpha</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>
  2. This library does not support UDP communication between the client and the agent. The client connects directly to the Jaeger collector via gRPC. Agents are bypassed entirely and are no longer needed in the Jaeger architecture. How is this good? And why do the Jaeger documentation, Helm chart, etc. still say to use agents if they cannot be reached from the new client?

  3. And in the end it doesn't even work. I tried the OpenTelemetry client and I get the error below when starting the microservice that uses it:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000003efe, pid=1, tid=0x00007fc9a087cae8
#
# JRE version: OpenJDK Runtime Environment (8.0_171-b11) (build 1.8.0_171-b11)
# Java VM: OpenJDK 64-Bit Server VM (25.171-b11 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 3.8.0
# Distribution: Custom build (Wed Jun 13 18:28:11 UTC 2018)
# Problematic frame:
# C  0x0000000000003efe
#
# Core dump written. Default location: /xxxxx/core or core.1
#
# An error report file with more information is saved as:
# /xxxxx/hs_err_pid1.log
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
#   http://icedtea.classpath.org/bugzilla
#

@ghost
Author

ghost commented Feb 4, 2022

The reported ICMP port unreachable problem can be fixed by changing the flush method in the class io.jaegertracing.thrift.internal.reporters.protocols.ThriftUdpTransport.

If agents run as a DaemonSet rather than as sidecars, every redeployment or restart of the agents causes every jaeger-client in the services to receive ICMP port unreachable, which causes the java.net.DatagramSocket socket to be closed.

When the agents become available again, the client does not automatically resume sending spans to the agent, because the socket is closed.

This fix first tries to reconnect the socket and then flushes again.

I tested it and it works. If this is not the correct solution, it would be good if a maintainer could fix it the right way.

It is not a security bug, but it is a major problem that causes critical issues in production. For instance, we can't restart all services in production every time a Jaeger upgrade requires restarting the agent DaemonSet.

Fix in ThriftUdpTransport:

  // Needs imports for java.net.PortUnreachableException and java.net.InetSocketAddress
  // if the class does not already have them.
  @Override
  public void flush() throws TTransportException {
    if (this.writeBuffer != null) {
      // Copy the buffered bytes out so they can be re-sent after a reconnect.
      byte[] bytes = new byte[MAX_PACKET_SIZE];
      int len = this.writeBuffer.position();
      this.writeBuffer.flip();
      this.writeBuffer.get(bytes, 0, len);
      try {
        this.socket.send(new DatagramPacket(bytes, len));
      } catch (PortUnreachableException e) {
        // The agent port was unreachable: open a new socket and retry once.
        reconnectSocketAndFlush(bytes, len);
      } catch (IOException e) {
        throw new TTransportException(
            TTransportException.UNKNOWN, "Cannot flush closed transport", e);
      } finally {
        this.writeBuffer = null;
      }
    }
  }

  private void reconnectSocketAndFlush(byte[] bytes, int len) throws TTransportException {
    DatagramSocket fresh = null;
    try {
      fresh = new DatagramSocket(null);
      fresh.connect(new InetSocketAddress(host, port));
    } catch (SocketException se) {
      if (fresh != null) {
        fresh.close();
      }
      throw new TTransportException(
          TTransportException.UNKNOWN, "TUDPTransport cannot reconnect:", se);
    }
    // Swap only after the new socket is connected, and close the old one
    // so its file descriptor is not leaked.
    this.socket.close();
    this.socket = fresh;
    try {
      this.socket.send(new DatagramPacket(bytes, len));
    } catch (IOException ioe) {
      throw new TTransportException(
          TTransportException.UNKNOWN, "Cannot flush on reconnected transport", ioe);
    }
  }
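
The retry rule used by the fix above (reconnect and resend exactly once, and only for PortUnreachableException) can be isolated from Thrift and from real sockets so that it is easy to test. The following is a minimal sketch under that assumption; the names RetryOnUnreachable, IoAction and sendWithOneRetry are hypothetical and do not exist in jaeger-client.

```java
import java.io.IOException;
import java.net.PortUnreachableException;

/** Hypothetical helper for this sketch; not part of jaeger-client. */
final class RetryOnUnreachable {
  /** An I/O step that may fail. */
  interface IoAction {
    void run() throws IOException;
  }

  /**
   * Runs send; if it fails with PortUnreachableException, runs reconnect and
   * tries send exactly once more. Any other IOException propagates unchanged.
   */
  static void sendWithOneRetry(IoAction send, IoAction reconnect) throws IOException {
    try {
      send.run();
    } catch (PortUnreachableException first) {
      reconnect.run();
      try {
        send.run();
      } catch (IOException second) {
        // Keep the original failure visible in the stack trace.
        if (second != first) {
          second.addSuppressed(first);
        }
        throw second;
      }
    }
  }

  public static void main(String[] args) throws IOException {
    int[] calls = {0, 0}; // [sends, reconnects]
    sendWithOneRetry(
        () -> {
          // Simulate an agent that is unreachable on the first attempt only.
          if (calls[0]++ == 0) {
            throw new PortUnreachableException("simulated");
          }
        },
        () -> calls[1]++);
    System.out.println("sends=" + calls[0] + " reconnects=" + calls[1]); // sends=2 reconnects=1
  }
}
```

Limiting the retry to one attempt keeps the reporter thread from looping while the agent is still down; spans flushed during that window are still lost, as with any UDP sender.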

@yurishkuro
Member

Won't fix - this repository is being archived.
