Use exponential backoff when retrying connecting to statsd via UDP #4042

Open
wants to merge 1 commit into main

Conversation


@jon-signal jon-signal commented Aug 22, 2023

When connecting to a statsd server via UDP, StatsdMeterRegistry will retry the connection attempt indefinitely and without delay if the connection cannot be established immediately. When an application container launches alongside a statsd sidecar container, the statsd container may not be ready by the time the application container starts. This can lead to excessive CPU consumption and the logging of thousands or even millions of spurious warnings.

This change moves from an indefinite/immediate retry strategy to an indefinite/exponential backoff strategy. This should dramatically reduce resource consumption and log production in cases where statsd isn't immediately available, and it fixes #2624.
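For illustration, here is a minimal standalone sketch of the indefinite/exponential-backoff pattern described above, using Reactor's Retry.backoff. This is not the code changed in the PR; the always-failing source Mono and the 1-second/1-minute bounds are placeholders.

    import java.time.Duration;

    import reactor.core.publisher.Mono;
    import reactor.util.retry.Retry;

    public class BackoffRetrySketch {

        public static void main(String[] args) throws InterruptedException {
            // Stand-in for a connection attempt against a statsd daemon that is
            // not yet accepting packets; every subscription fails.
            Mono<Object> connectAttempt = Mono.error(new java.net.ConnectException("Connection refused"));

            connectAttempt
                    // Retry forever, but wait roughly 1s, 2s, 4s, ... (with jitter)
                    // between attempts, capping the delay at one minute, instead of
                    // resubscribing immediately after every failure.
                    .retryWhen(Retry.backoff(Long.MAX_VALUE, Duration.ofSeconds(1))
                            .maxBackoff(Duration.ofMinutes(1)))
                    .subscribe(connection -> System.out.println("connected: " + connection));

            // Keep the JVM alive long enough to observe a few delayed retry attempts.
            Thread.sleep(10_000);
        }
    }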

@pivotal-cla

@jon-signal Please sign the Contributor License Agreement!


@pivotal-cla

@jon-signal Thank you for signing the Contributor License Agreement!

@shakuzen shakuzen changed the title from "Use exponential backoff when connecting to statsd via UDP" to "Use exponential backoff when retrying connecting to statsd via UDP" Apr 17, 2024
@shakuzen shakuzen added the bug (A general bug) and registry: statsd (A StatsD Registry related issue) labels Apr 18, 2024
@shakuzen shakuzen added this to the 1.9.18 milestone Apr 18, 2024
@shakuzen
Member

Thanks for the PR. I was trying to reproduce the immediate/infinite retry behavior you mentioned (which I agree is what the code being changed appears to do), but I struggled to see that behavior in the logs. Running in a Linux container from my MacBook, these are the logs I get over about a minute of attempting to connect while no statsd daemon is available.

Before this PR (with Micrometer 1.12.4):

2024-04-18T09:35:10.703Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:20.643Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:20.655Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:20.660Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:30.646Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:30.655Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:30.658Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:40.644Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:40.653Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:40.657Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:50.660Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:50.669Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:36:00.646Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:36:00.650Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:36:00.653Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused

With this pull request:

2024-04-18T09:51:00.930Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:10.868Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:10.888Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:20.869Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:20.876Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:20.879Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:30.875Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:30.886Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:40.871Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:40.874Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:40.875Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:50.865Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:50.869Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:50.872Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused

The logs aren't reduced, and there aren't as many as would be expected for infinite immediate retries. I think this is due to the following (called from connectAndSubscribe, which is in turn called from the doOnDisconnected callback):

    private void retryReplaceClient(Mono<? extends Connection> connectMono) {
        connectMono.retryWhen(Retry.backoff(Long.MAX_VALUE, Duration.ofSeconds(1)).maxBackoff(Duration.ofMinutes(1)))
            .subscribe(connection -> {
                this.statsdConnection.replace(connection);
                // now that we're connected, start polling gauges and other pollable
                // meter types
                startPolling();
            });
    }

In fact, because that is what is being used, the code changed in this PR doesn't seem to come into play at all when starting an application with the statsd daemon down. I added .doAfterRetry(a -> System.out.println("new statsd udp retry")) to the changed retry policy and it never shows up in stdout. I believe the part changed in this PR is for resubscribing to the publisher that produces the statsd lines to be sent, which doesn't come into play until a connection has been established.
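A minimal standalone sketch of that kind of doAfterRetry diagnostic attached to a backoff retry spec (the always-failing source Mono here is an assumption standing in for the connect Mono; the bounds mirror retryReplaceClient above):

    import java.time.Duration;

    import reactor.core.publisher.Mono;
    import reactor.util.retry.Retry;

    public class RetryDiagnosticSketch {

        public static void main(String[] args) throws InterruptedException {
            // Assumed stand-in for the connect Mono; in the real registry this would
            // be the UDP connection attempt.
            Mono<Object> connectMono = Mono.error(new java.net.ConnectException("Connection refused"));

            connectMono
                    .retryWhen(Retry.backoff(Long.MAX_VALUE, Duration.ofSeconds(1))
                            .maxBackoff(Duration.ofMinutes(1))
                            // Prints once per retry attempt; if nothing shows up, this
                            // retry path is not the one being exercised.
                            .doAfterRetry(signal -> System.out
                                    .println("new statsd udp retry #" + signal.totalRetries())))
                    .subscribe();

            Thread.sleep(15_000);
        }
    }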

So I don't think this PR is addressing the issue unless I'm missing something. Did you see different behavior to what I described above?

@shakuzen shakuzen removed this from the 1.9.18 milestone Apr 18, 2024
@shakuzen shakuzen removed the bug (A general bug) label Apr 18, 2024
@jon-signal
Author

So I don't think this PR is addressing the issue unless I'm missing something. Did you see different behavior to what I described above?

Interesting! I'll have to take a second look; we absolutely saw the "millions of log lines" phenomenon in the past, and so I wonder if we may somehow go down different pathways depending on the precise failure mode 🤔

In any case, I'll need a little time before I can prioritize this, but I will follow up.

@shakuzen
Member

We've seen different behavior in the past depending on exactly how the connection to the backend fails, so I wouldn't be surprised if there is another path besides the one I hit. Ideally we would want to be able to test this; the challenge has been narrowing down exactly how to reproduce such scenarios. Any help with that would be greatly appreciated.

@jon-signal
Author

Well, I do have a live use case right now. I'll try to isolate it in a unit test, but our setup is that we're deploying an application to EKS/Fargate. Fargate doesn't support DaemonSets, so we have a pod with two containers: our application and a Datadog Agent container. At pod startup/shutdown, we see thousands of log entries per second per pod with:

java.net.PortUnreachableException: recvAddress(..) failed: Connection refused

As I said, I'll try to isolate that further, but I hope that's a helpful hint in the meantime.

@shakuzen
Member

@jon-signal Thanks for the follow-up. That part of the log looks the same as what I saw; the difference is in the frequency. That's what we need to understand in order to reproduce the problem and make sure it gets fixed.

Labels
registry: statsd (A StatsD Registry related issue)

Successfully merging this pull request may close these issues:
Statsd UDP epoll warn logs are verbose when statsd down

3 participants