The 'alertmanager_max_alerts_count' is not functioning properly #5720

Closed
damnever opened this issue Dec 28, 2023 · 8 comments · Fixed by #5956

Comments

@damnever
Contributor

Describe the bug

Due to a race condition in Alertmanager, the `alertmanager_max_alerts_count` limit is not functioning properly: the internal counter behind `cortex_alertmanager_alerts_limiter_current_alerts` may keep increasing until alerts become limited. To address this, I have sent a patch upstream: prometheus/alertmanager#3648

To Reproduce

See prometheus/alertmanager#3648
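
To make the race concrete, here is a minimal, self-contained Go sketch (hypothetical names, not the actual Alertmanager code) of how a stale `existing` flag can inflate a separately tracked count:

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified stand-in for the limiter: `count` is tracked separately
// from the `sizes` map, mirroring the buggy pattern.
type limiter struct {
	mx    sync.Mutex
	sizes map[string]int
	count int
}

// postStore mirrors PostStore: the caller passes `existing`, which it
// computed *before* taking this lock, so it can be stale.
func (l *limiter) postStore(fp string, size int, existing bool) {
	l.mx.Lock()
	defer l.mx.Unlock()
	if !existing {
		l.count++ // double-counted if two callers both saw existing=false
	}
	l.sizes[fp] = size
}

func main() {
	l := &limiter{sizes: map[string]int{}}
	var wg sync.WaitGroup
	// Two goroutines race to store the same alert; both observed
	// existing=false before either one stored it.
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.postStore("alert-a", 100, false)
		}()
	}
	wg.Wait()
	// count=2 but len(sizes)=1; a single delete later leaves count stuck
	// at 1, so the metric drifts upward over time.
	fmt.Printf("count=%d len(sizes)=%d\n", l.count, len(l.sizes))
}
```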

Expected behavior

The limiter's count should track the number of alerts actually stored, so `cortex_alertmanager_alerts_limiter_current_alerts` should not drift upward over time.
Additional Context

I have also noticed that upstream Alertmanager has removed support for the v1 API; perhaps Cortex should consider deprecating its v1 API support as well.

@damnever
Contributor Author

The temporary fix is as follows:

```diff
--- a/pkg/alertmanager/alertmanager.go
+++ b/pkg/alertmanager/alertmanager.go
@@ -615,7 +615,6 @@ type alertsLimiter struct {
 
 	mx        sync.Mutex
 	sizes     map[model.Fingerprint]int
-	count     int
 	totalSize int
 }
 
@@ -664,7 +663,8 @@ func (a *alertsLimiter) PreStore(alert *types.Alert, existing bool) error {
 	a.mx.Lock()
 	defer a.mx.Unlock()
 
-	if !existing && countLimit > 0 && (a.count+1) > countLimit {
+	_, existing = a.sizes[fp]
+	if !existing && countLimit > 0 && len(a.sizes)+1 > countLimit {
 		a.failureCounter.Inc()
 		return fmt.Errorf(errTooManyAlerts, countLimit)
 	}
@@ -692,11 +692,7 @@ func (a *alertsLimiter) PostStore(alert *types.Alert, existing bool) {
 	a.mx.Lock()
 	defer a.mx.Unlock()
 
-	if existing {
-		a.totalSize -= a.sizes[fp]
-	} else {
-		a.count++
-	}
+	a.totalSize -= a.sizes[fp]
 	a.sizes[fp] = newSize
 	a.totalSize += newSize
 }
@@ -713,14 +709,13 @@ func (a *alertsLimiter) PostDelete(alert *types.Alert) {
 
 	a.totalSize -= a.sizes[fp]
 	delete(a.sizes, fp)
-	a.count--
 }
 
 func (a *alertsLimiter) currentStats() (count, totalSize int) {
 	a.mx.Lock()
 	defer a.mx.Unlock()
 
-	return a.count, a.totalSize
+	return len(a.sizes), a.totalSize
 }
```
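
The idea behind the patch, sketched below with simplified, hypothetical names (not the Cortex code): deriving the count from the `sizes` map under the limiter's own lock makes duplicate stores idempotent, so the count can no longer drift.

```go
package main

import (
	"fmt"
	"sync"
)

// fixedLimiter sketches the patched invariant: the count is always
// len(sizes), computed under the same lock that guards the map.
type fixedLimiter struct {
	mx    sync.Mutex
	sizes map[string]int
}

func (l *fixedLimiter) postStore(fp string, size int) {
	l.mx.Lock()
	defer l.mx.Unlock()
	l.sizes[fp] = size // storing the same fingerprint twice is a no-op for the count
}

func (l *fixedLimiter) count() int {
	l.mx.Lock()
	defer l.mx.Unlock()
	return len(l.sizes)
}

func main() {
	l := &fixedLimiter{sizes: map[string]int{}}
	l.postStore("alert-a", 100)
	l.postStore("alert-a", 120) // stale duplicate store
	fmt.Println(l.count())      // prints 1: the count cannot drift
}
```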

@yeya24
Collaborator

yeya24 commented Jan 16, 2024

@qinxx108 @alvinlin123 Would you mind taking a look at this issue?

@qinxx108
Contributor

Hi @damnever, is the temporary fix a cleanup, with the real fix in the Prometheus repo?

@damnever
Contributor Author

@qinxx108 yes, the real fix is in the Prometheus repo.

@yeya24
Collaborator

yeya24 commented Mar 8, 2024

I guess this has been fixed in upstream Alertmanager by prometheus/alertmanager#3715? @rajagopalanand Can you help confirm? We need to update the Alertmanager dependency to pull in the fix.

@damnever
Contributor Author

prometheus/alertmanager#3648 fixes this issue, but it appears that Alertmanager is not being actively maintained.

@yeya24
Collaborator

yeya24 commented Mar 12, 2024

Is prometheus/alertmanager#3715 fixing different things than prometheus/alertmanager#3648?

@damnever
Contributor Author

Quoting from prometheus/alertmanager#3715 (comment):

> This is a duplicate of https://github.com/prometheus/alertmanager/pull/3648/files, right?

> Was not aware of this, but I will take a look.

> The other PR addresses more than just the deadlock issue mentioned in #3682. If this can be reviewed and merged, then when the other PR gets reviewed and merged it can address the race conditions and retain a test from this PR.
