New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rule Manager: Only query once per alert rule when restoring alert state #13980
Conversation
rules/group.go
Outdated
return | ||
// Find the series for the given alert from the set. | ||
for sset.Next() { | ||
if sset.At().Labels().Hash() == a.Labels.Hash() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once the hash the same, should we also compare the actual labels to avoid any hash collision issue? Hash is generated using xxhash, which is not cryptographically secure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this 🙏, I have used the string representation of the labels but at this point I think I have more questions than answers - do I need to sort the labels on each side or am I guarantee to get the same order? Is this a good idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do I need to sort the labels on each side or am I guarantee to get the same order?
No you don't need to do it because they're already sorted by contract:
// Names are in order. |
rules/group.go
Outdated
) | ||
return | ||
// Find the series for the given alert from the set. | ||
for sset.Next() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand how this works. Aren't we looking matches, given sset
is never reset to "beginning" each when we look for each alert? How can we assume the alerts are iterated in the same order of the series in the response?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right; thanks for pointing it out my silly mistake. It didn't work correctly to begin with, as I realised while fixing the tests.
How can we assume the alerts are iterated in the same order of the series in the response?
I think I get where you're going with this (but please correct me if I'm wrong). You want to co-relate the queried series and the "active alerts" by position to avoid the nested loops, but I don't think we can. There's no guarantee that the first evaluation after a restart (which is what we use to populate "active alerts") and whatever we have stored in ALERT_FOR_SERIES
will be the same results.
That being said - I'm pretty inexperienced when it comes to querying semantics in Prometheus, so if you have a better approach to do this, I'm more than happy to implement it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You want to co-relate the queried series and the "active alerts" by position to avoid the nested loops
Sorry for the misunderstanding. I wasn't suggesting doing it. My open question was "I think you're assuming they're in the same order but how is it possible?" but saying "we" instead of "you". Anyway, I see you've already changed the code and I think the approach you took (using a map) makes sense.
Prometheus restores alert state between restarts and updates. For each rule, it looks at the alerts that are meant to be active and then queries the `ALERTS_FOR_STATE` series for _each_ alert within the rules. If the alert rule has 120 instances (or series) it'll execute the same query with slightly different labels. This PR changes the approach so that we only query once per alert rule and then match the corresponding alert that we're about to restore against the series-set. While the approach might use a bit more memory at start-up (if even?) the restore proccess is only ran once per restart so I'd consider this a big win. This builds on top of #13974 Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
- Improve variable name of the map produced by the series set Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
dbfe392
to
2de2fee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I left few minor comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
In #13980 I introduced a change to reduce the number of queries executed when we restore alert statuses. With this, the querying semantics changed as we now need to go through all series before we enter the alert restoration loop and I missed the fact that exiting early when there are no rules to restore would lead to an incomplete restoration. An alert being restored is used as a proxy for "we're now ready to write `ALERTS/ALERTS_FOR_SERIES` metrics" so as a result we weren't writing the series if we didn't restore anything the first time around. Signed-off-by: gotjosh <josue.abreu@gmail.com>
* BUGFIX: Mark the rule's restoration process as completed always In #13980 I introduced a change to reduce the number of queries executed when we restore alert statuses. With this, the querying semantics changed as we now need to go through all series before we enter the alert restoration loop and I missed the fact that exiting early when there are no rules to restore would lead to an incomplete restoration. An alert being restored is used as a proxy for "we're now ready to write `ALERTS/ALERTS_FOR_SERIES` metrics" so as a result we weren't writing the series if we didn't restore anything the first time around. --------- Signed-off-by: gotjosh <josue.abreu@gmail.com>
Prometheus restores alert state between restarts and updates. For each rule, it looks at the alerts that are meant to be active and then queries the
ALERTS_FOR_STATE
series for each alert within the rules.If the alert rule has 120 instances (or series) it'll execute the same query with slightly different labels.
This PR changes the approach so that we only query once per alert rule and then match the corresponding alert that we're about to restore against the series-set. While the approach might use a bit more memory at start-up (if even?) the restore proccess is only ran once per restart so I'd consider this a big win.
This builds on top of #13974