
notifier: stop queue filling due to single failed AM #14099

Draft: wants to merge 1 commit into main
Conversation

krajorama (Member)

WIP

Adds a unit test to emulate the throughput drop reported in #7676.

Solution ideas (not implemented yet):

  1. Put failed Alertmanagers into a quarantine for some time; this would preserve throughput much better. Possibly use exponential back-off to determine when we next try to contact the Alertmanager, and reset the back-off once no live Alertmanagers are left (a rough sketch follows after this list).
  2. Separate queues?
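
A minimal sketch of what idea 1 could look like, assuming a standalone helper keyed by Alertmanager URL; the names (amQuarantine, ShouldTry, RecordFailure, RecordSuccess, Reset) are invented for illustration and are not part of the notifier's current API:

```go
// Hypothetical quarantine helper with exponential back-off. Not the
// notifier's actual code; names and structure are illustrative only.
package notifier

import (
	"sync"
	"time"
)

type amState struct {
	failures int       // consecutive failures for this Alertmanager
	retryAt  time.Time // do not contact the Alertmanager before this time
}

type amQuarantine struct {
	mtx       sync.Mutex
	states    map[string]*amState // keyed by Alertmanager URL
	baseDelay time.Duration       // first back-off interval, e.g. 1s
	maxDelay  time.Duration       // upper bound, e.g. 1m
}

func newAMQuarantine(base, max time.Duration) *amQuarantine {
	return &amQuarantine{states: map[string]*amState{}, baseDelay: base, maxDelay: max}
}

// ShouldTry reports whether the Alertmanager at url may be contacted now.
func (q *amQuarantine) ShouldTry(url string, now time.Time) bool {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	s, ok := q.states[url]
	return !ok || !now.Before(s.retryAt)
}

// RecordFailure doubles the back-off for url, capped at maxDelay.
func (q *amQuarantine) RecordFailure(url string, now time.Time) {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	s, ok := q.states[url]
	if !ok {
		s = &amState{}
		q.states[url] = s
	}
	s.failures++
	delay := q.baseDelay << uint(s.failures-1)
	if delay <= 0 || delay > q.maxDelay {
		delay = q.maxDelay // also covers shift overflow
	}
	s.retryAt = now.Add(delay)
}

// RecordSuccess removes url from quarantine.
func (q *amQuarantine) RecordSuccess(url string) {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	delete(q.states, url)
}

// Reset clears all back-off state, e.g. when no Alertmanager is left alive.
func (q *amQuarantine) Reset() {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	q.states = map[string]*amState{}
}
```

A sendAll-style loop would then skip members for which ShouldTry returns false and call RecordFailure or RecordSuccess after each attempt.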

Fixes: #7676

Ref: prometheus#7676

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
nielsole commented May 15, 2024

Nice work on the unit test.

In principle I'd prefer the 2nd option, separate queues. It would allow completely independent processing and overflowing of the queues.
Unfortunately, when I looked into this it seemed like quite a large lift, since functions like sendAll and the exported metrics assume that a single queue is worked off in lockstep. We would need to report all metrics separately per Alertmanager instance, which is IMO the right thing to do but might be considered a breaking change.
On the bright side, having separate queues might accidentally also fix #13676.
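
To make the lift concrete, a very rough sketch of what per-Alertmanager queues could look like, assuming a bounded channel per instance drained by its own goroutine; everything here (perAMQueue, Enqueue, Run, the sendFn signature) is hypothetical and glosses over the batching and metrics rework mentioned above:

```go
// Hypothetical per-Alertmanager queue: a full or dead AM only overflows
// its own queue. Illustrative only; not the notifier's current structure.
package notifier

import "context"

type Alert struct{} // stand-in for the notifier's alert type

type perAMQueue struct {
	url    string
	ch     chan *Alert // bounded queue for this Alertmanager only
	sendFn func(ctx context.Context, url string, alerts ...*Alert) error
}

func newPerAMQueue(url string, capacity int,
	send func(ctx context.Context, url string, alerts ...*Alert) error) *perAMQueue {
	return &perAMQueue{url: url, ch: make(chan *Alert, capacity), sendFn: send}
}

// Enqueue adds an alert, evicting this queue's oldest alert if it is full.
// Queues of other Alertmanagers are unaffected.
func (q *perAMQueue) Enqueue(a *Alert) (dropped bool) {
	for {
		select {
		case q.ch <- a:
			return dropped
		default:
			select {
			case <-q.ch: // drop the oldest alert and retry
				dropped = true
			default:
			}
		}
	}
}

// Run drains this Alertmanager's queue until ctx is cancelled; a failure
// only stalls this queue.
func (q *perAMQueue) Run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case a := <-q.ch:
			// A real implementation would batch alerts and record
			// per-AM metrics here.
			_ = q.sendFn(ctx, q.url, a)
		}
	}
}
```

A manager would fan each incoming alert out to every perAMQueue, which is also where the per-instance metrics (and the potential breaking change) would come in.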

@nielsole

An alternative 3rd solution with a shared queue I was considering is a ring buffer, with one goroutine per Alertmanager holding a pointer into the buffer. Whenever the insert operation into the ring buffer catches up with an Alertmanager's pointer, it would move that pointer forward, effectively dropping that Alertmanager's oldest alert.
This might allow us to keep the existing metrics, and thus backward compatibility, but it doesn't feel like idiomatic Go.
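
A minimal sketch of that ring-buffer idea, assuming one read cursor per Alertmanager goroutine; all names are made up for illustration:

```go
// Hypothetical shared ring buffer with one read cursor per Alertmanager.
// When the writer laps a cursor, that cursor is bumped forward, i.e. that
// Alertmanager silently drops its oldest alert. Illustrative only.
package notifier

import "sync"

type Alert struct{} // stand-in for the notifier's alert type

type alertRing struct {
	mtx     sync.Mutex
	buf     []*Alert
	head    int   // absolute index of the next write
	cursors []int // absolute read position of each Alertmanager
}

func newAlertRing(size, numAMs int) *alertRing {
	return &alertRing{buf: make([]*Alert, size), cursors: make([]int, numAMs)}
}

// Push appends an alert. Any cursor the writer is about to lap is moved
// forward by one, dropping the oldest alert for that Alertmanager only.
func (r *alertRing) Push(a *Alert) {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	for i, c := range r.cursors {
		if r.head-c == len(r.buf) { // writer caught up with this reader
			r.cursors[i]++
		}
	}
	r.buf[r.head%len(r.buf)] = a
	r.head++
}

// Pop returns the next unread alert for Alertmanager am, or nil if that
// Alertmanager is fully caught up.
func (r *alertRing) Pop(am int) *Alert {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	if r.cursors[am] == r.head {
		return nil
	}
	a := r.buf[r.cursors[am]%len(r.buf)]
	r.cursors[am]++
	return a
}
```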

machine424 (Collaborator)

Thanks for this! (I always appreciate creative tests)
We do trim the queue before each sendAll iteration in nextBatch; if we couldn't send the alerts to any AM, we increase a "dropped alerts" metric, but the alerts are already gone from the queue. I see the unit test you added doesn't take nextBatch into account.
Also, given the current implementation, I don't see why we set the timeout to 1y and expect it not to hang. (I don't think the timeout was set to 1y in #7676.)
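
A simplified, self-contained stand-in for the flow described above (the names nextBatch, sendAll and maxBatchSize mirror the notifier, but the types and bodies below are illustrative, not the real code):

```go
// Simplified stand-in: the batch is removed from the queue up front, so when
// sendAll fails for every AM the alerts are only counted as dropped, never
// re-queued. Not the notifier's actual implementation.
package notifier

type Alert struct{}

const maxBatchSize = 64

type manager struct {
	queue   []*Alert
	dropped int
	sendAll func(...*Alert) bool // true if at least one AM accepted the batch
}

// nextBatch removes up to maxBatchSize alerts from the head of the queue.
func (m *manager) nextBatch() []*Alert {
	n := len(m.queue)
	if n > maxBatchSize {
		n = maxBatchSize
	}
	batch := append([]*Alert(nil), m.queue[:n]...)
	m.queue = m.queue[n:]
	return batch
}

func (m *manager) iterate() {
	alerts := m.nextBatch()
	if !m.sendAll(alerts...) {
		// All Alertmanagers failed: the alerts are already gone from the
		// queue, so they are only accounted for as dropped.
		m.dropped += len(alerts)
	}
}
```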

In an ideal world, an SD (service discovery) would be used for those AMs instead of static config, so the faulty ones are excluded and the notifier doesn't have to worry about that. I'm afraid a "quarantine" logic would look like re-implementing the SD logic...
