-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race conditions in the memory alerts store #3648
Conversation
2c6748d
to
90b6352
Compare
provider/mem/mem.go
Outdated
@@ -100,37 +101,52 @@ func NewAlerts(ctx context.Context, m types.Marker, intervalGC time.Duration, al | |||
logger: log.With(l, "component", "provider"), | |||
callback: alertCallback, | |||
} | |||
a.alerts.SetGCCallback(func(alerts []*types.Alert) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SetGCCallback
and Run
method on Alerts store are now unused. Should those be deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Run
is called by Inhibitor but SetGCCallback
is unused in AM but since it's a public method, deleting it might break other things. Could Inhibitor be also changed to use the new GC
and Run
can be removed?
@simonpasquier @w0rm @gotjosh please take a look |
ca4810d
to
e86379e
Compare
e86379e
to
5f20c84
Compare
@simonpasquier @gotjosh is this on your radar? |
@grobinson-grafana would you also mind taking a look at this? |
Apologies for asking lots of questions, but I would like to understand this case. I can see there is a race condition between the What I would still like to understand is how would the alert be missed? The mutex is acquired in |
This commit removes the GC and callback function from store.go to address a number of data races that have occurred in the past (prometheus#2040 and prometheus#3648). The store is no longer responsible for removing resolved alerts after some elapsed period of time, and is instead deferred to the consumer of the store (as done in prometheus#2040 and prometheus#3648). Signed-off-by: George Robinson <george.robinson@grafana.com>
I also opened a draft PR #3840 that builds on this fix. It removes |
This all comes together, not as an independent case. The reason is that we can delete alerts solely in store.Alerts without the lock from mem.Alerts.mtx. Imagine if the gc happens right after the Put(before the fix), then the listener might miss some alerts due to the race condition. |
I agree!
In this case the listener will receive two events: an event for the Put and a second event for the GC? It can receive them out of order, but I don't think events can be lost? The reason I think that is the alerts that are sent to func (a *Alerts) Put(alerts ...*types.Alert) error {
for _, alert := range alerts {
...
a.mtx.Lock()
for _, l := range a.listeners {
select {
case l.alerts <- alert:
case <-l.done:
}
}
a.mtx.Unlock()
} That means even when there is a gc between |
okay, they cannot be stopped, but this is a race. |
Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>
364c50c
to
00cd93d
Compare
Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>
00cd93d
to
6f686b3
Compare
@grobinson-grafana I have rebased with the main branch and changed some locks to use defer. Please take another look. If there are no major change requirements, I believe we should merge this first and make further improvements as needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Just one comment!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@@ -90,6 +91,7 @@ func (a *Alerts) gc() { | |||
} | |||
a.Unlock() | |||
a.cb(resolved) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gotjosh I want to remove the callback in a future PR, so I'm not too worried about both returning resolved
and passing it to the callback.
Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>
10a0b11
to
1d72d10
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
delete(a.listeners, i) | ||
close(l.alerts) | ||
default: | ||
// listener is not closed yet, hence proceed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// listener is not closed yet, hence proceed. | |
// Listener is not closed yet, hence proceed. |
You can address it in the next PR.
Thank you very much for your contribution! |
This commit removes the GC and callback function from store.go to address a number of data races that have occurred in the past (prometheus#2040 and prometheus#3648). The store is no longer responsible for removing resolved alerts after some elapsed period of time, and is instead deferred to the consumer of the store (as done in prometheus#2040 and prometheus#3648). Signed-off-by: George Robinson <george.robinson@grafana.com>
This commit removes the GC and callback function from store.go to address a number of data races that have occurred in the past (prometheus#2040 and prometheus#3648). The store is no longer responsible for removing resolved alerts after some elapsed period of time, and is instead deferred to the consumer of the store (as done in prometheus#2040 and prometheus#3648). Signed-off-by: George Robinson <george.robinson@grafana.com>
The main branch will easily fail the newly added test case:
There are multiple race conditions in the provider/mem:
alertmanager/provider/mem/mem.go
Lines 204 to 228 in c920b60
@simonpasquier @w0rm @gotjosh please take a look