
Fix race conditions in the memory alerts store #3648

Merged: 6 commits into prometheus:main on May 16, 2024

Conversation

damnever (Contributor)

The main branch readily fails the newly added test case:

=== RUN   TestAlertsConcurrently
    mem_test.go:565:
                Error Trace:    /prometheus-io/alertmanager/provider/mem/mem_test.go:565
                Error:          Not equal:
                                expected: 0
                                actual  : -171
                Test:           TestAlertsConcurrently

There are multiple race conditions in provider/mem:

  1. Any Put or GC operation that occurs concurrently with this code block will introduce a race condition (see the sketch after this list):

         existing := false
         // Check that there's an alert existing within the store before
         // trying to merge.
         if old, err := a.alerts.Get(fp); err == nil {
             existing = true
             // Merge alerts if there is an overlap in activity range.
             if (alert.EndsAt.After(old.StartsAt) && alert.EndsAt.Before(old.EndsAt)) ||
                 (alert.StartsAt.After(old.StartsAt) && alert.StartsAt.Before(old.EndsAt)) {
                 alert = old.Merge(alert)
             }
         }
         if err := a.callback.PreStore(alert, existing); err != nil {
             level.Error(a.logger).Log("msg", "pre-store callback returned error on set alert", "err", err)
             continue
         }
         if err := a.alerts.Set(alert); err != nil {
             level.Error(a.logger).Log("msg", "error on set alert", "err", err)
             continue
         }
         a.callback.PostStore(alert, existing)

  2. A race condition between Put() and Subscribe() can cause some newly added Alerts to be missed.
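For illustration, here is a rough sketch of how holding mem.Alerts.mtx across the whole read-merge-write sequence closes race 1. This is a simplified sketch, not the exact diff in this PR; the helper name putOne is made up, and the log-and-continue handling is reduced to plain returns:

// Sketch only: the lookup, merge, callbacks and Set all run while a.mtx is
// held, so no concurrent Put or GC can interleave between Get and Set.
func (a *Alerts) putOne(alert *types.Alert) error {
	a.mtx.Lock()
	defer a.mtx.Unlock()

	fp := alert.Fingerprint()
	existing := false
	if old, err := a.alerts.Get(fp); err == nil {
		existing = true
		// Merge alerts if there is an overlap in activity range.
		if (alert.EndsAt.After(old.StartsAt) && alert.EndsAt.Before(old.EndsAt)) ||
			(alert.StartsAt.After(old.StartsAt) && alert.StartsAt.Before(old.EndsAt)) {
			alert = old.Merge(alert)
		}
	}
	if err := a.callback.PreStore(alert, existing); err != nil {
		return err
	}
	if err := a.alerts.Set(alert); err != nil {
		return err
	}
	a.callback.PostStore(alert, existing)
	return nil
}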

@simonpasquier @w0rm @gotjosh please take a look

(Review thread on provider/mem/mem.go, outdated and resolved.)
@@ -100,37 +101,52 @@ func NewAlerts(ctx context.Context, m types.Marker, intervalGC time.Duration, al
		logger:   log.With(l, "component", "provider"),
		callback: alertCallback,
	}
	a.alerts.SetGCCallback(func(alerts []*types.Alert) {
Contributor:

The SetGCCallback and Run methods on the Alerts store are now unused. Should they be deleted?

Contributor:

I think Run is called by the Inhibitor, but SetGCCallback is unused in AM. Since it is a public method, though, deleting it might break other things. Could the Inhibitor also be changed to use the new GC so that Run can be removed?

damnever (Contributor, Author) commented Mar 5, 2024:

@simonpasquier @w0rm @gotjosh please take a look

beorn7 (Member) commented Apr 30, 2024:

@simonpasquier @gotjosh is this on your radar?

damnever (Contributor, Author) commented May 6, 2024:

@grobinson-grafana would you also mind taking a look at this?

grobinson-grafana (Contributor):

> A race condition between Put() and Subscribe() can cause some newly added Alerts to be missed.

Apologies for asking lots of questions, but I would like to understand this case. I can see there is a race condition between the Get and Set operations in Put because (1) two or more goroutines can call Put at the same time, and (2) the calls to Get and Set can be interleaved with a gc operation on the store, causing the alert to be deleted and then added back. I agree this needs to be fixed.

What I would still like to understand is how the alert would be missed. The mutex is acquired both in Subscribe and before the alert is sent to the listeners, so it should not happen that a new listener misses an alert? Would it be possible to show how this happens with a test?
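For reference, Subscribe takes a snapshot of the store and registers the listener while holding that same mutex, roughly like this (a simplified sketch written from memory, not the exact mem.go code; names such as listeningAlerts, alertChannelLength and provider.NewAlertIterator should be treated as approximate):

func (a *Alerts) Subscribe() provider.AlertIterator {
	a.mtx.Lock()
	defer a.mtx.Unlock()

	alerts := a.alerts.List()
	done := make(chan struct{})
	ch := make(chan *types.Alert, max(len(alerts), alertChannelLength))

	// Replay the current contents of the store to the new listener...
	for _, alert := range alerts {
		ch <- alert
	}
	// ...then register the listener so future Put calls reach it as well.
	a.listeners[a.next] = listeningAlerts{alerts: ch, done: done}
	a.next++

	return provider.NewAlertIterator(ch, done)
}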

(Two review threads on provider/mem/mem.go, outdated and resolved.)
grobinson-grafana added a commit to grobinson-grafana/alertmanager that referenced this pull request on May 13, 2024:

This commit removes the GC and callback function from store.go to address a number of data races that have occurred in the past (prometheus#2040 and prometheus#3648). The store is no longer responsible for removing resolved alerts after some elapsed period of time; that responsibility is instead deferred to the consumer of the store (as done in prometheus#2040 and prometheus#3648).

Signed-off-by: George Robinson <george.robinson@grafana.com>
grobinson-grafana (Contributor):

I also opened a draft PR #3840 that builds on this fix. It removes gc and the callback from store.go.

damnever (Contributor, Author):

> > A race condition between Put() and Subscribe() can cause some newly added Alerts to be missed.
>
> Apologies for asking lots of questions, but I would like to understand this case. I can see there is a race condition between the Get and Set operations in Put because (1) two or more goroutines can call Put at the same time, and (2) the calls to Get and Set can be interleaved with a gc operation on the store, causing the alert to be deleted and then added back. I agree this needs to be fixed.
>
> What I would still like to understand is how the alert would be missed. The mutex is acquired both in Subscribe and before the alert is sent to the listeners, so it should not happen that a new listener misses an alert? Would it be possible to show how this happens with a test?

These issues come together; it is not an independent case. The reason is that alerts can be deleted from store.Alerts alone, without holding mem.Alerts.mtx. Imagine the gc happening right after the Put (before the fix): the listener might miss some alerts due to the race condition.
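To make that concrete, here is a simplified sketch of the pre-fix arrangement (not the exact store.go code): the store runs its own periodic gc while holding only its internal lock, so deletions are never serialized with the a.mtx that provider/mem uses around Put and Subscribe.

// Sketch of the store's own GC loop before this PR; it never touches
// mem.Alerts.mtx, which is the root of the race.
func (a *Alerts) Run(ctx context.Context, d time.Duration) {
	t := time.NewTicker(d)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			a.gc() // takes only store.Alerts' internal lock
		}
	}
}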

grobinson-grafana (Contributor):


> The reason is that alerts can be deleted from store.Alerts alone, without holding mem.Alerts.mtx.

I agree!

> Imagine the gc happening right after the Put (before the fix): the listener might miss some alerts due to the race condition.

In this case the listener will receive two events: an event for the Put and a second event for the GC? It can receive them out of order, but I don't think events can be lost?

The reason I think that is that the alerts sent to a.listeners in Put come from the alerts argument, not from the alerts in the mem store:

func (a *Alerts) Put(alerts ...*types.Alert) error {
	for _, alert := range alerts {
		...
		a.mtx.Lock()
		for _, l := range a.listeners {
			select {
			case l.alerts <- alert:
			case <-l.done:
			}
		}
		a.mtx.Unlock()
	}

That means even when there is a gc between Set and range a.listeners, the gc operation cannot stop these alerts from being sent to the listeners. The listeners might receive the events out of order, but I still don't see how an event can be missed?

damnever (Contributor, Author):

Okay, they cannot be stopped, but this is still a race.

damnever added 5 commits. Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>
damnever (Contributor, Author):

@grobinson-grafana I have rebased onto the main branch and changed some locks to use defer. Please take another look. If there are no major change requests, I believe we should merge this first and make further improvements as needed.
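For example, instead of pairing every Lock with explicit Unlock calls on each code path, the critical sections now follow the usual defer pattern (sketch only; the exact call sites are in the diff):

	a.mtx.Lock()
	defer a.mtx.Unlock()
	// ... read, merge and write the alert while the lock is held ...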

grobinson-grafana (Contributor) left a review:

LGTM. Just one comment!

(Review thread on provider/mem/mem.go, outdated and resolved.)
grobinson-grafana (Contributor) left a review:

LGTM!

@@ -90,6 +91,7 @@ func (a *Alerts) gc() {
	}
	a.Unlock()
	a.cb(resolved)
Contributor:

@gotjosh I want to remove the callback in a future PR, so I'm not too worried about both returning resolved and passing it to the callback.
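For context, after this change the store's garbage collection both returns the resolved alerts and still passes them to the callback, roughly like this (a simplified sketch, not the exact store.go code; the field names c and cb are approximate):

func (a *Alerts) gc() []*types.Alert {
	a.Lock()
	var resolved []*types.Alert
	for fp, alert := range a.c {
		if alert.Resolved() {
			delete(a.c, fp)
			resolved = append(resolved, alert)
		}
	}
	a.Unlock()
	// The callback still fires for now; the caller also gets the resolved
	// alerts back so it can react to them under its own lock.
	a.cb(resolved)
	return resolved
}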

damnever added a commit. Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>
gotjosh (Member) left a review:

LGTM

			delete(a.listeners, i)
			close(l.alerts)
		default:
			// listener is not closed yet, hence proceed.
Member:

Suggested change:
-	// listener is not closed yet, hence proceed.
+	// Listener is not closed yet, hence proceed.

You can address it in the next PR.

gotjosh merged commit 91a94f0 into prometheus:main on May 16, 2024
11 checks passed
gotjosh (Member) commented May 16, 2024:

Thank you very much for your contribution!

grobinson-grafana added commits to grobinson-grafana/alertmanager that referenced this pull request on May 23 and May 26, 2024, with the same commit message and sign-off as above.