Metric alertmanager_alerts report incorrect number of alerts #2619

Closed

gotjosh opened this issue Jun 15, 2021 · 1 comment

Comments

gotjosh (Member) commented Jun 15, 2021

What did you do?

Set up Alertmanager with clustering and sent alerts to all Alertmanagers.

What did you expect to see?

The `alertmanager_alerts` metric should always converge across replicas.

What did you see instead? Under which circumstances?

As time passes, the `alertmanager_alerts` metric drifts apart across replicas, although there is no difference in API results - they are consistent.

As you can see in the screenshot, as time progresses the `alertmanager_alerts` values drift apart by tens of alerts per replica.

[Screenshot 2021-06-15 at 11:36:08: alertmanager_alerts values diverging across replicas]

Verifying the API responses across replicas, I can see that no inconsistent results are returned:

$ cat alertmanager-0 | jq '.data | length'
956
$ cat alertmanager-1 | jq '.data | length'
956

Looking at the metric implementation, the Marker is responsible for reporting the current alerts, while the in-memory alert store (memAlerts) sets the callback that keeps the Marker in sync with the alerts held in the Store. However, the Store executes that callback ONLY when we garbage collect and NOT when we delete an alert directly.

We believe the fix is to also execute the callback when an alert is directly deleted (see the sketch after the code below).

// gc removes resolved alerts from the store and passes them to the callback.
func (a *Alerts) gc() {
	a.Lock()
	defer a.Unlock()

	var resolved []*types.Alert
	for fp, alert := range a.c {
		if alert.Resolved() {
			delete(a.c, fp)
			resolved = append(resolved, alert)
		}
	}
	a.cb(resolved)
}

// Get returns the Alert with the matching fingerprint, or an error if it is
// not found.
func (a *Alerts) Get(fp model.Fingerprint) (*types.Alert, error) {
	a.Lock()
	defer a.Unlock()

	alert, prs := a.c[fp]
	if !prs {
		return nil, ErrNotFound
	}
	return alert, nil
}

// Set unconditionally sets the alert in memory.
func (a *Alerts) Set(alert *types.Alert) error {
	a.Lock()
	defer a.Unlock()

	a.c[alert.Fingerprint()] = alert
	return nil
}

// Delete removes the Alert with the matching fingerprint from the store.
// Note: the callback is NOT invoked here, which is what causes the drift.
func (a *Alerts) Delete(fp model.Fingerprint) error {
	a.Lock()
	defer a.Unlock()

	delete(a.c, fp)
	return nil
}
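
A minimal sketch of what that could look like, assuming the cb callback accepts the slice of removed alerts (the same signature gc uses); the actual fix may differ:

// Sketch of the proposed change (not the final implementation): Delete reports
// the removed alert back to the callback, just as gc does for resolved alerts.
func (a *Alerts) Delete(fp model.Fingerprint) error {
	a.Lock()
	defer a.Unlock()

	alert, prs := a.c[fp]
	if !prs {
		return nil // nothing to delete, nothing to notify about
	}

	delete(a.c, fp)
	a.cb([]*types.Alert{alert}) // assumption: cb takes the removed alerts, as in gc()
	return nil
}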

gotjosh added a commit to gotjosh/alertmanager that referenced this issue Jun 15, 2021

The garbage collection process within the store is in charge of
determining if an alert is resolved, deleting it, and then communicating
this back to the callback set.

When an alert was explicitly deleted, these were not being communicated
back to the callback and caused the metric to report incorrect results.

Fixes prometheus#2619

Signed-off-by: gotjosh <josue@grafana.com>
gotjosh added a commit to gotjosh/alertmanager that referenced this issue Jun 15, 2022

Fixes prometheus#1439 and prometheus#2619.

The previous metric is not _technically_ reporting incorrect results as the alerts _are_ still around and will be re-used if that same alert (equal fingerprint) is received before it is GCed. Therefore, I have kept the old metric under a new name `alertmanager_marked_alerts` and repurpose the current metric to match what the user sees in the UI.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
roidelapluie pushed a commit that referenced this issue Jun 16, 2022
Alert metric reports different results to what the user sees via API (#2943)

Fixes #1439 and #2619.

The previous metric is not _technically_ reporting incorrect results as the alerts _are_ still around and will be re-used if that same alert (equal fingerprint) is received before it is GCed. Therefore, I have kept the old metric under a new name `alertmanager_marked_alerts` and repurpose the current metric to match what the user sees in the UI.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
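
For illustration only, a rough sketch of how the two gauges could be wired with client_golang; countStoredAlerts and countMarkedAlerts are hypothetical helpers standing in for the store count (what the API and UI show) and the Marker count, and the real metrics are additionally labelled by alert state:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// registerAlertGauges is a sketch only; the helper names are hypothetical.
func registerAlertGauges(reg prometheus.Registerer, countStoredAlerts, countMarkedAlerts func() float64) {
	// Repurposed metric: number of alerts currently held in the store,
	// matching what the user sees via the API/UI.
	reg.MustRegister(prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "alertmanager_alerts",
		Help: "Number of alerts currently held in the store.",
	}, countStoredAlerts))

	// Old behaviour kept under a new name: alerts tracked by the Marker,
	// including resolved alerts that have not been garbage collected yet.
	reg.MustRegister(prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "alertmanager_marked_alerts",
		Help: "Number of alerts tracked by the marker, including resolved alerts awaiting GC.",
	}, countMarkedAlerts))
}
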
gotjosh (Member, Author) commented Jun 16, 2022

Fixed by #2943

gotjosh closed this as completed Jun 16, 2022
qinxx108 pushed a commit to qinxx108/alertmanager that referenced this issue Dec 13, 2022
Alert metric reports different results to what the user sees via API (prometheus#2943)

Fixes prometheus#1439 and prometheus#2619.

The previous metric is not _technically_ reporting incorrect results as the alerts _are_ still around and will be re-used if that same alert (equal fingerprint) is received before it is GCed. Therefore, I have kept the old metric under a new name `alertmanager_marked_alerts` and repurpose the current metric to match what the user sees in the UI.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: Yijie Qin <qinyijie@amazon.com>