Metric alertmanager_alerts report incorrect number of alerts #2619

Closed

gotjosh opened this issue Jun 15, 2021 · 1 comment

Comments

gotjosh (Member) commented Jun 15, 2021

What did you do?

Set up Alertmanager with clustering and sent alerts to all Alertmanagers.

What did you expect to see?

The `alertmanager_alerts` metric should always converge across replicas.

What did you see instead? Under which circumstances?

As time passes, the `alertmanager_alerts` metric drifts apart across replicas, although there is no difference in API results - they are consistent.

As you can see in the screenshot, as time progresses the `alertmanager_alerts` values drift apart by tens of alerts per replica.

[Screenshot 2021-06-15 at 11:36:08: alertmanager_alerts values diverging across replicas]

Verifying the API responses across replicas, I can see that no inconsistent results are returned:

$ cat alertmanager-0 | jq '.data | length'
956
$ cat alertmanager-1 | jq '.data | length'
956

Looking at the metric implementation, the Marker is responsible for reporting the current alerts, while the in-memory alert store (memAlerts) sets the callback that keeps the Marker in sync with the alerts held in the Store. However, the Store executes that callback ONLY when we garbage collect and NOT when we delete an alert directly.

We believe the fix is to also execute the callback when an alert is directly deleted (see the sketch after the code below).

// gc removes resolved alerts from the store and passes them to the callback.
func (a *Alerts) gc() {
	a.Lock()
	defer a.Unlock()

	var resolved []*types.Alert
	for fp, alert := range a.c {
		if alert.Resolved() {
			delete(a.c, fp)
			resolved = append(resolved, alert)
		}
	}
	a.cb(resolved)
}

// Get returns the Alert with the matching fingerprint, or an error if it is
// not found.
func (a *Alerts) Get(fp model.Fingerprint) (*types.Alert, error) {
	a.Lock()
	defer a.Unlock()

	alert, prs := a.c[fp]
	if !prs {
		return nil, ErrNotFound
	}
	return alert, nil
}

// Set unconditionally sets the alert in memory.
func (a *Alerts) Set(alert *types.Alert) error {
	a.Lock()
	defer a.Unlock()

	a.c[alert.Fingerprint()] = alert
	return nil
}

// Delete removes the Alert with the matching fingerprint from the store.
// Note: the callback is NOT invoked here, which is what causes the drift.
func (a *Alerts) Delete(fp model.Fingerprint) error {
	a.Lock()
	defer a.Unlock()

	delete(a.c, fp)
	return nil
}
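
A minimal sketch of what that could look like, assuming the cb callback accepts the slice of removed alerts (the same signature gc uses); the actual fix may differ:

// Sketch of the proposed change (not the final implementation): Delete reports
// the removed alert back to the callback, just as gc does for resolved alerts.
func (a *Alerts) Delete(fp model.Fingerprint) error {
	a.Lock()
	defer a.Unlock()

	alert, prs := a.c[fp]
	if !prs {
		return nil // nothing to delete, nothing to notify about
	}

	delete(a.c, fp)
	a.cb([]*types.Alert{alert}) // assumption: cb takes the removed alerts, as in gc()
	return nil
}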

gotjosh added a commit to gotjosh/alertmanager that referenced this issue Jun 15, 2021

The garbage collection process within the store is in charge of
determining if an alert is resolved, deleting it, and then communicating
this back to the callback set.

When an alert was explicitly deleted, these were not being communicated
back to the callback and caused the metric to report incorrect results.

Fixes prometheus#2619

Signed-off-by: gotjosh <josue@grafana.com>
gotjosh added a commit to gotjosh/alertmanager that referenced this issue Jun 15, 2022

Fixes prometheus#1439 and prometheus#2619.

The previous metric is not _technically_ reporting incorrect results as the alerts _are_ still around and will be re-used if that same alert (equal fingerprint) is received before it is GCed. Therefore, I have kept the old metric under a new name `alertmanager_marked_alerts` and repurpose the current metric to match what the user sees in the UI.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
roidelapluie pushed a commit that referenced this issue Jun 16, 2022
Alert metric reports different results to what the user sees via API (#2943)

Fixes #1439 and #2619.

The previous metric is not _technically_ reporting incorrect results as the alerts _are_ still around and will be re-used if that same alert (equal fingerprint) is received before it is GCed. Therefore, I have kept the old metric under a new name `alertmanager_marked_alerts` and repurpose the current metric to match what the user sees in the UI.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
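
For illustration only, a rough sketch of how the two gauges could be wired with client_golang; countStoredAlerts and countMarkedAlerts are hypothetical helpers standing in for the store count (what the API and UI show) and the Marker count, and the real metrics are additionally labelled by alert state:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// registerAlertGauges is a sketch only; the helper names are hypothetical.
func registerAlertGauges(reg prometheus.Registerer, countStoredAlerts, countMarkedAlerts func() float64) {
	// Repurposed metric: number of alerts currently held in the store,
	// matching what the user sees via the API/UI.
	reg.MustRegister(prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "alertmanager_alerts",
		Help: "Number of alerts currently held in the store.",
	}, countStoredAlerts))

	// Old behaviour kept under a new name: alerts tracked by the Marker,
	// including resolved alerts that have not been garbage collected yet.
	reg.MustRegister(prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "alertmanager_marked_alerts",
		Help: "Number of alerts tracked by the marker, including resolved alerts awaiting GC.",
	}, countMarkedAlerts))
}
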
gotjosh (Member, Author) commented Jun 16, 2022

Fixed by #2943

gotjosh closed this as completed Jun 16, 2022
qinxx108 pushed a commit to qinxx108/alertmanager that referenced this issue Dec 13, 2022
Alert metric reports different results to what the user sees via API (prometheus#2943)

Fixes prometheus#1439 and prometheus#2619.

The previous metric is not _technically_ reporting incorrect results as the alerts _are_ still around and will be re-used if that same alert (equal fingerprint) is received before it is GCed. Therefore, I have kept the old metric under a new name `alertmanager_marked_alerts` and repurpose the current metric to match what the user sees in the UI.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: Yijie Qin <qinyijie@amazon.com>