
Rule Manager: Only query once per alert rule when restoring alert state #13980

Merged: 13 commits, Apr 30, 2024

Conversation

gotjosh (Member) commented Apr 23, 2024

Prometheus restores alert state between restarts and updates. For each rule, it looks at the alerts that are meant to be active and then queries the ALERTS_FOR_STATE series for each alert within the rules.

If the alert rule has 120 instances (or series) it'll execute the same query with slightly different labels.

This PR changes the approach so that we only query once per alert rule and then match the corresponding alert that we're about to restore against the series set. While the approach might use a bit more memory at start-up (if even?), the restore process is only run once per restart, so I'd consider this a big win.

This builds on top of #13974

rules/group.go Outdated
return
// Find the series for the given alert from the set.
for sset.Next() {
if sset.At().Labels().Hash() == a.Labels.Hash() {
Contributor:
Once the hashes are the same, should we also compare the actual labels to avoid any hash collision issue? The hash is generated using xxhash, which is not cryptographically secure.

gotjosh (Member, Author):

Thanks for this 🙏, I have used the string representation of the labels, but at this point I think I have more questions than answers - do I need to sort the labels on each side, or am I guaranteed to get the same order? Is this a good idea?

Contributor:

do I need to sort the labels on each side or am I guarantee to get the same order?

No, you don't need to do it, because they're already sorted by contract:

// Names are in order.

rules/group.go Outdated
)
return
// Find the series for the given alert from the set.
for sset.Next() {
Contributor:
I don't fully understand how this works. Aren't we looking for matches, given that sset is never reset to the beginning when we look for each alert? How can we assume the alerts are iterated in the same order as the series in the response?

gotjosh (Member, Author) commented Apr 24, 2024:

You're right; thanks for pointing out my silly mistake. It didn't work correctly to begin with, as I realised while fixing the tests.

How can we assume the alerts are iterated in the same order of the series in the response?

I think I get where you're going with this (but please correct me if I'm wrong). You want to correlate the queried series and the "active alerts" by position to avoid the nested loops, but I don't think we can. There's no guarantee that the first evaluation after a restart (which is what we use to populate "active alerts") and whatever we have stored in ALERTS_FOR_STATE will be the same results.

That being said - I'm pretty inexperienced when it comes to querying semantics in Prometheus, so if you have a better approach to do this, I'm more than happy to implement it.

Contributor:

You want to correlate the queried series and the "active alerts" by position to avoid the nested loops

Sorry for the misunderstanding. I wasn't suggesting doing it. My open question was "I think you're assuming they're in the same order, but how is that possible?" - but saying "we" instead of "you". Anyway, I see you've already changed the code, and I think the approach you took (using a map) makes sense.


Signed-off-by: gotjosh <josue.abreu@gmail.com>
- Improve variable name of the map produced by the series set
@gotjosh gotjosh force-pushed the gotjosh/restore-only-with-rule-query branch from dbfe392 to 2de2fee Compare April 24, 2024 18:10
@gotjosh gotjosh marked this pull request as ready for review April 25, 2024 08:54
pracucci (Contributor) left a comment:
LGTM. I left a few minor comments.

Resolved review threads: rules/alerting_test.go, rules/alerting.go, rules/group.go, rules/manager_test.go (x2)
dimitarvdimitrov (Contributor) left a comment:

LGTM

Resolved review thread: rules/alerting_test.go
@gotjosh gotjosh merged commit 1dd0bff into main Apr 30, 2024
40 checks passed
@gotjosh gotjosh deleted the gotjosh/restore-only-with-rule-query branch April 30, 2024 14:29
gotjosh added a commit that referenced this pull request May 3, 2024
In #13980 I introduced a change to reduce the number of queries executed when we restore alert statuses.

With this, the querying semantics changed: we now need to go through all series before we enter the alert restoration loop. I missed the fact that exiting early when there are no rules to restore would lead to an incomplete restoration.

An alert being restored is used as a proxy for "we're now ready to write `ALERTS`/`ALERTS_FOR_STATE` metrics", so as a result we weren't writing the series if we didn't restore anything the first time around.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
gotjosh added a commit that referenced this pull request May 3, 2024
* BUGFIX: Mark the rule's restoration process as completed always

(Same description as the commit above.)
Signed-off-by: gotjosh <josue.abreu@gmail.com>
3 participants