Use exponential backoff when retrying connecting to statsd via UDP #4042
Conversation
Thanks for the PR. I was trying to reproduce the behavior you mentioned of immediate/infinite retries (which I agree is what the changed code looks like it does), but I struggled to see that behavior in the logs. Running in a Linux container from my MacBook, these are the logs I get after about a minute of attempting to connect while no statsd daemon is available. Before this PR (with Micrometer 1.12.4):
With this pull request:
The logs aren't reduced by this PR, and there aren't as many as would be expected for infinite immediate retries. I think this is due to the following (called from Lines 295 to 304 in a637ab6):
In fact, because that is being used, the code changed in this PR doesn't seem to come into play at all when starting an application with the statsd daemon down. So I don't think this PR is addressing the issue, unless I'm missing something. Did you see different behavior from what I described above?
Interesting! I'll have to take a second look; we absolutely saw the "millions of log lines" phenomenon in the past, and so I wonder if we may somehow go down different pathways depending on the precise failure mode 🤔 In any case, I'll need a little time before I can prioritize this, but I will follow up.
We've seen different behavior in the past depending on exactly how the connection to the backend fails, so I wouldn't be surprised if there is another path besides the one I hit. Ideally we would want to be able to test this, and the challenge has been narrowing down exactly how to reproduce such scenarios. Any help with doing that would be greatly appreciated.
Well, I do have a live use case right now. I'll try to isolate it in a unit test, but our setup is that we're deploying an application to EKS/Fargate. Fargate doesn't support DaemonSets, so we have a pod with two containers: our application and a Datadog Agent container. At pod startup/shutdown, we see thousands of log entries per second per pod with:
As I said, I'll try to isolate that further, but I hope that's a helpful hint in the meantime.
@jon-signal Thanks for the follow-up. That part of the log looks the same as what I saw, but the difference is in the frequency of the logging. We need to understand what is different there in order to reproduce it and make sure it gets fixed.
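For anyone trying to reproduce this, a rough starting point is to start a StatsdMeterRegistry pointed at a port where no statsd daemon is listening and watch how often the connection warnings are logged. This is only a sketch using the public StatsdConfig/StatsdMeterRegistry API; the class name, port number, and timings here are arbitrary, and whether it triggers the log flood likely depends on the exact failure mode discussed above.

```java
import io.micrometer.core.instrument.Clock;
import io.micrometer.statsd.StatsdConfig;
import io.micrometer.statsd.StatsdMeterRegistry;

public class StatsdDownRepro {

    public static void main(String[] args) throws InterruptedException {
        StatsdConfig config = new StatsdConfig() {
            @Override
            public String get(String key) {
                return null; // fall back to defaults for anything not overridden
            }

            @Override
            public int port() {
                return 9999; // arbitrary port where no statsd daemon is listening
            }
        };

        // Starting the registry with the daemon down should exercise the reconnect path;
        // observe how frequently the connection warnings appear in the logs.
        StatsdMeterRegistry registry = new StatsdMeterRegistry(config, Clock.SYSTEM);
        registry.counter("repro.counter").increment();

        Thread.sleep(60_000); // watch the logging for about a minute
        registry.close();
    }
}
```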
When connecting to a statsd server via UDP, StatsdMeterRegistry will retry the connection attempt indefinitely and without delay if the connection cannot be established immediately. In a scenario where an application container is launching alongside a statsd sidecar container, the statsd container may not be ready when the application container starts. This can lead to excessive CPU consumption and the logging of thousands or even millions of spurious warnings.

This change moves from an indefinite/immediate retry strategy to an indefinite/exponential-backoff strategy. This should dramatically reduce resource consumption and log production in cases where statsd isn't immediately available, and it fixes #2624.
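To illustrate the retry strategy the description refers to, here is a minimal, self-contained sketch of indefinite retry with exponential backoff, expressed with Project Reactor's retry support (which the statsd registry builds on). It is not the code changed in this PR; the class, the attemptConnect() placeholder, and the backoff values (100 ms initial delay, 30 s cap) are assumptions for illustration only.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;

import reactor.core.publisher.Mono;
import reactor.util.retry.Retry;

public class BackoffSketch {

    private static final AtomicInteger ATTEMPTS = new AtomicInteger();

    public static void main(String[] args) {
        Boolean connected = Mono.fromCallable(BackoffSketch::attemptConnect)
                // Retry indefinitely, starting at 100 ms and roughly doubling
                // the delay on each failure, capped at 30 seconds.
                .retryWhen(Retry.backoff(Long.MAX_VALUE, Duration.ofMillis(100))
                        .maxBackoff(Duration.ofSeconds(30)))
                .block();
        System.out.println("connected after " + ATTEMPTS.get() + " attempts: " + connected);
    }

    // Placeholder for the real UDP connection attempt: fails a few times, then succeeds.
    private static boolean attemptConnect() {
        if (ATTEMPTS.incrementAndGet() < 5) {
            throw new IllegalStateException("statsd daemon not reachable yet");
        }
        return true;
    }
}
```

With the pre-PR behavior described above, the equivalent loop would retry immediately in a tight loop, which is what produces the log flood and CPU consumption this change is meant to avoid.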