
Rule Manager: Only query once per alert rule when restoring alert state #13980

Merged: 13 commits, Apr 30, 2024

Conversation

gotjosh (Member) commented Apr 23, 2024

Prometheus restores alert state between restarts and updates. For each rule, it looks at the alerts that are meant to be active and then queries the ALERTS_FOR_STATE series for each alert within the rules.

If the alert rule has 120 instances (or series) it'll execute the same query with slightly different labels.

This PR changes the approach so that we only query once per alert rule and then match the corresponding alert that we're about to restore against the series set. While the approach might use a bit more memory at start-up (if even?), the restore process is only run once per restart, so I'd consider this a big win.

This builds on top of #13974

rules/group.go Outdated
return
// Find the series for the given alert from the set.
for sset.Next() {
if sset.At().Labels().Hash() == a.Labels.Hash() {
Contributor:
Once the hashes are the same, should we also compare the actual labels to avoid any hash collision issue? The hash is generated using xxhash, which is not cryptographically secure.

gotjosh (Member, Author):

Thanks for this 🙏, I have used the string representation of the labels, but at this point I think I have more questions than answers - do I need to sort the labels on each side, or am I guaranteed to get the same order? Is this a good idea?

Contributor:

do I need to sort the labels on each side or am I guarantee to get the same order?

No, you don't need to do it, because they're already sorted by contract:

// Names are in order.

rules/group.go Outdated
)
return
// Find the series for the given alert from the set.
for sset.Next() {
Contributor:
I don't fully understand how this works. Aren't we looking for matches, given that sset is never reset to the beginning when we look for each alert? How can we assume the alerts are iterated in the same order as the series in the response?

gotjosh (Member, Author) commented Apr 24, 2024:

You're right; thanks for pointing out my silly mistake. It didn't work correctly to begin with, as I realised while fixing the tests.

How can we assume the alerts are iterated in the same order of the series in the response?

I think I get where you're going with this (but please correct me if I'm wrong). You want to correlate the queried series and the "active alerts" by position to avoid the nested loops, but I don't think we can. There's no guarantee that the first evaluation after a restart (which is what we use to populate "active alerts") and whatever we have stored in ALERTS_FOR_STATE will be the same results.

That being said - I'm pretty inexperienced when it comes to querying semantics in Prometheus, so if you have a better approach to do this, I'm more than happy to implement it.

Contributor:

You want to correlate the queried series and the "active alerts" by position to avoid the nested loops

Sorry for the misunderstanding. I wasn't suggesting doing it. My open question was "I think you're assuming they're in the same order, but how is that possible?" - but saying "we" instead of "you". Anyway, I see you've already changed the code, and I think the approach you took (using a map) makes sense.


Signed-off-by: gotjosh <josue.abreu@gmail.com>
- Improve variable name of the map produced by the series set
@gotjosh gotjosh force-pushed the gotjosh/restore-only-with-rule-query branch from dbfe392 to 2de2fee Compare April 24, 2024 18:10
@gotjosh gotjosh marked this pull request as ready for review April 25, 2024 08:54
pracucci (Contributor) left a comment:
LGTM. I left a few minor comments.

Resolved review threads: rules/alerting_test.go, rules/alerting.go, rules/group.go, rules/manager_test.go (x2)
dimitarvdimitrov (Contributor) left a comment:

LGTM

Resolved review thread: rules/alerting_test.go
@gotjosh gotjosh merged commit 1dd0bff into main Apr 30, 2024
40 checks passed
@gotjosh gotjosh deleted the gotjosh/restore-only-with-rule-query branch April 30, 2024 14:29
gotjosh added a commit that referenced this pull request May 3, 2024
In #13980 I introduced a change to reduce the number of queries executed when we restore alert statuses.

With this, the querying semantics changed: we now need to go through all series before we enter the alert restoration loop. I missed the fact that exiting early when there are no rules to restore would lead to an incomplete restoration.

An alert being restored is used as a proxy for "we're now ready to write `ALERTS`/`ALERTS_FOR_STATE` metrics", so as a result we weren't writing the series if we didn't restore anything the first time around.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
gotjosh added a commit that referenced this pull request May 3, 2024
* BUGFIX: Mark the rule's restoration process as completed always

(Same description as the commit above.)
Signed-off-by: gotjosh <josue.abreu@gmail.com>
3 participants