Use exponential backoff when retrying connecting to statsd via UDP #4042

Open
wants to merge 1 commit into main

Conversation


@jon-signal jon-signal commented Aug 22, 2023

When connecting to a statsd server via UDP, StatsdMeterRegistry will retry the connection attempt indefinitely and without delay if the connection cannot be established immediately. When an application container launches alongside a statsd sidecar container, the statsd container may not be ready by the time the application container starts. This can lead to excessive CPU consumption and the logging of thousands or even millions of spurious warnings.

This change moves from an indefinite/immediate retry strategy to an indefinite/exponential backoff strategy. This should dramatically reduce resource consumption and log production in cases where statsd isn't immediately available, and it fixes #2624.
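For illustration, here is a minimal standalone sketch of the indefinite/exponential-backoff pattern described above, using Reactor's Retry.backoff. This is not the code changed in the PR; the always-failing source Mono and the 1-second/1-minute bounds are placeholders.

    import java.time.Duration;

    import reactor.core.publisher.Mono;
    import reactor.util.retry.Retry;

    public class BackoffRetrySketch {

        public static void main(String[] args) throws InterruptedException {
            // Stand-in for a connection attempt against a statsd daemon that is
            // not yet accepting packets; every subscription fails.
            Mono<Object> connectAttempt = Mono.error(new java.net.ConnectException("Connection refused"));

            connectAttempt
                    // Retry forever, but wait roughly 1s, 2s, 4s, ... (with jitter)
                    // between attempts, capping the delay at one minute, instead of
                    // resubscribing immediately after every failure.
                    .retryWhen(Retry.backoff(Long.MAX_VALUE, Duration.ofSeconds(1))
                            .maxBackoff(Duration.ofMinutes(1)))
                    .subscribe(connection -> System.out.println("connected: " + connection));

            // Keep the JVM alive long enough to observe a few delayed retry attempts.
            Thread.sleep(10_000);
        }
    }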

@pivotal-cla

@jon-signal Please sign the Contributor License Agreement!


@pivotal-cla

@jon-signal Thank you for signing the Contributor License Agreement!

@shakuzen shakuzen changed the title from "Use exponential backoff when connecting to statsd via UDP" to "Use exponential backoff when retrying connecting to statsd via UDP" Apr 17, 2024
@shakuzen shakuzen added the bug (A general bug) and registry: statsd (A StatsD Registry related issue) labels Apr 18, 2024
@shakuzen shakuzen added this to the 1.9.18 milestone Apr 18, 2024
@shakuzen
Member

Thanks for the PR. I was trying to reproduce the immediate/infinite retry behavior you mentioned (which I agree is what the code being changed appears to do), but I struggled to see that behavior in the logs. Running in a Linux container from my MacBook, these are the logs I get over about a minute of attempting to connect while no statsd daemon is available.

Before this PR (with Micrometer 1.12.4):

2024-04-18T09:35:10.703Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:20.643Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:20.655Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:20.660Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:30.646Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:30.655Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:30.658Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:40.644Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:40.653Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:40.657Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:50.660Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:35:50.669Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:36:00.646Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:36:00.650Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:36:00.653Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [c662394c, L:/127.0.0.1:52488 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused

With this pull request:

2024-04-18T09:51:00.930Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:10.868Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:10.888Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:20.869Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:20.876Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:20.879Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:30.875Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:30.886Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:40.871Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:40.874Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:40.875Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:50.865Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:50.869Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused
2024-04-18T09:51:50.872Z  WARN 1 --- [    udp-epoll-2] [                                                 ] i.m.s.reactor.netty.channel.FluxReceive  : [3f9d8414, L:/127.0.0.1:46846 - R:localhost/127.0.0.1:8125] An exception has been observed post termination, use DEBUG level to see the full stack: java.net.PortUnreachableException: recvAddress(..) failed: Connection refused

The logs aren't reduced, and there aren't as many as would be expected for infinite immediate retries. I think this is due to the following (called from connectAndSubscribe, which is in turn called from the doOnDisconnected callback):

    private void retryReplaceClient(Mono<? extends Connection> connectMono) {
        connectMono.retryWhen(Retry.backoff(Long.MAX_VALUE, Duration.ofSeconds(1)).maxBackoff(Duration.ofMinutes(1)))
            .subscribe(connection -> {
                this.statsdConnection.replace(connection);
                // now that we're connected, start polling gauges and other pollable
                // meter types
                startPolling();
            });
    }

In fact, because that is what is being used, the code changed in this PR doesn't seem to come into play at all when starting an application with the statsd daemon down. I added .doAfterRetry(a -> System.out.println("new statsd udp retry")) to the changed retry policy and it never shows up in stdout. I believe the part changed in this PR is for resubscribing to the publisher that produces the statsd lines to be sent, which doesn't come into play until a connection has been established.
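A minimal standalone sketch of that kind of doAfterRetry diagnostic attached to a backoff retry spec (the always-failing source Mono here is an assumption standing in for the connect Mono; the bounds mirror retryReplaceClient above):

    import java.time.Duration;

    import reactor.core.publisher.Mono;
    import reactor.util.retry.Retry;

    public class RetryDiagnosticSketch {

        public static void main(String[] args) throws InterruptedException {
            // Assumed stand-in for the connect Mono; in the real registry this would
            // be the UDP connection attempt.
            Mono<Object> connectMono = Mono.error(new java.net.ConnectException("Connection refused"));

            connectMono
                    .retryWhen(Retry.backoff(Long.MAX_VALUE, Duration.ofSeconds(1))
                            .maxBackoff(Duration.ofMinutes(1))
                            // Prints once per retry attempt; if nothing shows up, this
                            // retry path is not the one being exercised.
                            .doAfterRetry(signal -> System.out
                                    .println("new statsd udp retry #" + signal.totalRetries())))
                    .subscribe();

            Thread.sleep(15_000);
        }
    }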

So I don't think this PR is addressing the issue unless I'm missing something. Did you see different behavior to what I described above?

@shakuzen shakuzen removed this from the 1.9.18 milestone Apr 18, 2024
@shakuzen shakuzen removed the bug (A general bug) label Apr 18, 2024
@jon-signal
Author

So I don't think this PR is addressing the issue unless I'm missing something. Did you see different behavior to what I described above?

Interesting! I'll have to take a second look; we absolutely saw the "millions of log lines" phenomenon in the past, and so I wonder if we may somehow go down different pathways depending on the precise failure mode 🤔

In any case, I'll need a little time before I can prioritize this, but I will follow up.

@shakuzen
Member

We've seen different behavior in the past depending on exactly how the connection to the backend fails, so I wouldn't be surprised if there is another path besides the one I hit. Ideally we would want to be able to test this; the challenge has been narrowing down exactly how to reproduce such scenarios. Any help with that would be greatly appreciated.

@jon-signal
Author

Well, I do have a live use case right now. I'll try to isolate it in a unit test, but our setup is that we're deploying an application to EKS/Fargate. Fargate doesn't support DaemonSets, so we have a pod with two containers: our application and a Datadog Agent container. At pod startup/shutdown, we see thousands of log entries per second per pod with:

java.net.PortUnreachableException: recvAddress(..) failed: Connection refused

As I said, I'll try to isolate that further, but I hope that's a helpful hint in the meantime.

@shakuzen
Member

@jon-signal Thanks for the follow-up. That part of the log looks the same as what I saw; the difference is in the frequency. That's what we need to understand in order to reproduce the problem and make sure it gets fixed.

Labels
registry: statsd (A StatsD Registry related issue)

Successfully merging this pull request may close these issues:
Statsd UDP epoll warn logs are verbose when statsd down

3 participants