
notifier: stop queue filling due to single failed AM #14099

Draft: wants to merge 1 commit into main
Conversation

krajorama (Member)

WIP

Adds a unit test to emulate the throughput drop reported in #7676.

Solution ideas (not implemented yet):

  1. Put failed Alertmanagers into a quarantine for some time; this would preserve throughput much better. Possibly use exponential back-off to determine when we next try to contact the Alertmanager, and reset the back-off once no live Alertmanagers are left (a rough sketch follows after this list).
  2. Separate queues?
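
A minimal sketch of what idea 1 could look like, assuming a standalone helper keyed by Alertmanager URL; the names (amQuarantine, ShouldTry, RecordFailure, RecordSuccess, Reset) are invented for illustration and are not part of the notifier's current API:

```go
// Hypothetical quarantine helper with exponential back-off. Not the
// notifier's actual code; names and structure are illustrative only.
package notifier

import (
	"sync"
	"time"
)

type amState struct {
	failures int       // consecutive failures for this Alertmanager
	retryAt  time.Time // do not contact the Alertmanager before this time
}

type amQuarantine struct {
	mtx       sync.Mutex
	states    map[string]*amState // keyed by Alertmanager URL
	baseDelay time.Duration       // first back-off interval, e.g. 1s
	maxDelay  time.Duration       // upper bound, e.g. 1m
}

func newAMQuarantine(base, max time.Duration) *amQuarantine {
	return &amQuarantine{states: map[string]*amState{}, baseDelay: base, maxDelay: max}
}

// ShouldTry reports whether the Alertmanager at url may be contacted now.
func (q *amQuarantine) ShouldTry(url string, now time.Time) bool {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	s, ok := q.states[url]
	return !ok || !now.Before(s.retryAt)
}

// RecordFailure doubles the back-off for url, capped at maxDelay.
func (q *amQuarantine) RecordFailure(url string, now time.Time) {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	s, ok := q.states[url]
	if !ok {
		s = &amState{}
		q.states[url] = s
	}
	s.failures++
	delay := q.baseDelay << uint(s.failures-1)
	if delay <= 0 || delay > q.maxDelay {
		delay = q.maxDelay // also covers shift overflow
	}
	s.retryAt = now.Add(delay)
}

// RecordSuccess removes url from quarantine.
func (q *amQuarantine) RecordSuccess(url string) {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	delete(q.states, url)
}

// Reset clears all back-off state, e.g. when no Alertmanager is left alive.
func (q *amQuarantine) Reset() {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	q.states = map[string]*amState{}
}
```

A sendAll-style loop would then skip members for which ShouldTry returns false and call RecordFailure or RecordSuccess after each attempt.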

Fixes: #7676

Ref: prometheus#7676

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
nielsole commented May 15, 2024

Nice work on the unit test.

In principle I'd prefer the 2nd option, separate queues. It would allow completely independent processing and overflowing of the queues.
Unfortunately, when I looked into this it seemed like quite a large lift, since functions like sendAll and the exported metrics assume that a single queue is worked off in lockstep. We would need to report all metrics separately per Alertmanager instance, which is IMO the right thing to do but might be considered a breaking change.
On the bright side, having separate queues might accidentally also fix #13676.
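
To make the lift concrete, a very rough sketch of what per-Alertmanager queues could look like, assuming a bounded channel per instance drained by its own goroutine; everything here (perAMQueue, Enqueue, Run, the sendFn signature) is hypothetical and glosses over the batching and metrics rework mentioned above:

```go
// Hypothetical per-Alertmanager queue: a full or dead AM only overflows
// its own queue. Illustrative only; not the notifier's current structure.
package notifier

import "context"

type Alert struct{} // stand-in for the notifier's alert type

type perAMQueue struct {
	url    string
	ch     chan *Alert // bounded queue for this Alertmanager only
	sendFn func(ctx context.Context, url string, alerts ...*Alert) error
}

func newPerAMQueue(url string, capacity int,
	send func(ctx context.Context, url string, alerts ...*Alert) error) *perAMQueue {
	return &perAMQueue{url: url, ch: make(chan *Alert, capacity), sendFn: send}
}

// Enqueue adds an alert, evicting this queue's oldest alert if it is full.
// Queues of other Alertmanagers are unaffected.
func (q *perAMQueue) Enqueue(a *Alert) (dropped bool) {
	for {
		select {
		case q.ch <- a:
			return dropped
		default:
			select {
			case <-q.ch: // drop the oldest alert and retry
				dropped = true
			default:
			}
		}
	}
}

// Run drains this Alertmanager's queue until ctx is cancelled; a failure
// only stalls this queue.
func (q *perAMQueue) Run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case a := <-q.ch:
			// A real implementation would batch alerts and record
			// per-AM metrics here.
			_ = q.sendFn(ctx, q.url, a)
		}
	}
}
```

A manager would fan each incoming alert out to every perAMQueue, which is also where the per-instance metrics (and the potential breaking change) would come in.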

@nielsole

An alternative 3rd solution with a shared queue I was considering is a ring buffer, with one goroutine per Alertmanager holding a pointer into the buffer. Whenever the insert operation into the ring buffer catches up with an Alertmanager's pointer, it would move that pointer forward, effectively dropping that Alertmanager's oldest alert.
This might allow us to keep the existing metrics, and thus backward compatibility, but it doesn't feel like idiomatic Go.
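
A minimal sketch of that ring-buffer idea, assuming one read cursor per Alertmanager goroutine; all names are made up for illustration:

```go
// Hypothetical shared ring buffer with one read cursor per Alertmanager.
// When the writer laps a cursor, that cursor is bumped forward, i.e. that
// Alertmanager silently drops its oldest alert. Illustrative only.
package notifier

import "sync"

type Alert struct{} // stand-in for the notifier's alert type

type alertRing struct {
	mtx     sync.Mutex
	buf     []*Alert
	head    int   // absolute index of the next write
	cursors []int // absolute read position of each Alertmanager
}

func newAlertRing(size, numAMs int) *alertRing {
	return &alertRing{buf: make([]*Alert, size), cursors: make([]int, numAMs)}
}

// Push appends an alert. Any cursor the writer is about to lap is moved
// forward by one, dropping the oldest alert for that Alertmanager only.
func (r *alertRing) Push(a *Alert) {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	for i, c := range r.cursors {
		if r.head-c == len(r.buf) { // writer caught up with this reader
			r.cursors[i]++
		}
	}
	r.buf[r.head%len(r.buf)] = a
	r.head++
}

// Pop returns the next unread alert for Alertmanager am, or nil if that
// Alertmanager is fully caught up.
func (r *alertRing) Pop(am int) *Alert {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	if r.cursors[am] == r.head {
		return nil
	}
	a := r.buf[r.cursors[am]%len(r.buf)]
	r.cursors[am]++
	return a
}
```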

machine424 (Collaborator)

Thanks for this! (I always appreciate creative tests)
We do trim the queue before each sendAll iteration in nextBatch; if we couldn't send the alerts to any AM, we increase a "dropped alerts" metric, but the alerts are already gone from the queue. I see the unit test you added doesn't take nextBatch into account.
Also, given the current implementation, I don't see why we set the timeout to 1y and expect it not to hang. (I don't think the timeout was set to 1y in #7676.)
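
A simplified, self-contained stand-in for the flow described above (the names nextBatch, sendAll and maxBatchSize mirror the notifier, but the types and bodies below are illustrative, not the real code):

```go
// Simplified stand-in: the batch is removed from the queue up front, so when
// sendAll fails for every AM the alerts are only counted as dropped, never
// re-queued. Not the notifier's actual implementation.
package notifier

type Alert struct{}

const maxBatchSize = 64

type manager struct {
	queue   []*Alert
	dropped int
	sendAll func(...*Alert) bool // true if at least one AM accepted the batch
}

// nextBatch removes up to maxBatchSize alerts from the head of the queue.
func (m *manager) nextBatch() []*Alert {
	n := len(m.queue)
	if n > maxBatchSize {
		n = maxBatchSize
	}
	batch := append([]*Alert(nil), m.queue[:n]...)
	m.queue = m.queue[n:]
	return batch
}

func (m *manager) iterate() {
	alerts := m.nextBatch()
	if !m.sendAll(alerts...) {
		// All Alertmanagers failed: the alerts are already gone from the
		// queue, so they are only accounted for as dropped.
		m.dropped += len(alerts)
	}
}
```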

In an ideal world, an SD (service discovery) would be used for those AMs instead of static config, so the faulty ones are excluded and the notifier doesn't have to worry about that. I'm afraid a "quarantine" logic would look like re-implementing the SD logic...
