
Persist alert 'keep_firing_for' state across restarts #13957

Open
mustafain117 opened this issue Apr 18, 2024 · 6 comments
mustafain117 commented Apr 18, 2024

Proposal

Currently, if Prometheus restarts, we lose the 'keep_firing_for' state for firing alerts. This is a gap in the feature: alerts that should keep firing are prematurely resolved when Prometheus restarts (e.g. during a deployment or after an OOM kill).

It would be good to persist this state in some way, so that alerts don't resolve before the keep-firing duration / stabilization delay expires.

What alert information needs to be persisted?

  • state
  • labels
  • annotations
  • activeAt
  • keepFiringSince
  • value

The problem described above can be reproduced by:

  1. Create an alerting rule with keep_firing_for defined.
  2. Restart the server while the alert expression is resolved and the keep_firing_for duration is in use.
  3. The alert is resolved when Prometheus restarts, since the expression resolved prior to the restart and the rule query no longer produces a result.
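For step 1, a rule file along these lines reproduces the setup (the file name is hypothetical; the rule mirrors the Alert rule shown in the API output below):

```yaml
# /etc/prometheus/rules/rules2.yaml (hypothetical path)
groups:
  - name: test
    rules:
      - alert: Alert
        expr: absent(recording)
        keep_firing_for: 1h
        labels:
          event: recording
          severity: major
        annotations:
          summary: absent recording from prometheus
```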

Example:
Alerting rule test is firing with the keepFiringSince timestamp set, implying that the keep_firing_for duration is in use.

{
    "status": "success",
    "data": {
        "groups": [{
            "name": "recording",
            "file": "/etc/prometheus/rules/kodkxfqmyiipgsa.yml",
            "rules": [{
                "name": "recording",
                "query": "time() % 3600",
                "labels": {
                    "severity": "warning"
                },
                "health": "ok",
                "evaluationTime": 0.000487083,
                "lastEvaluation": "2024-04-18T20:22:49.82698325Z",
                "type": "recording"
            }],
            "interval": 15,
            "limit": 0,
            "evaluationTime": 0.000498208,
            "lastEvaluation": "2024-04-18T20:22:49.826974708Z"
        }, {
            "name": "test",
            "file": "/etc/prometheus/rules/rules2.yaml",
            "rules": [{
                "state": "firing",
                "name": "Alert",
                "query": "absent(recording)",
                "duration": 0,
                "keepFiringFor": 3600,
                "labels": {
                    "event": "recording",
                    "severity": "major"
                },
                "annotations": {
                    "description": "absent recording from from prometheus",
                    "summary": "absent recording from prometheus"
                },
                "alerts": [{
                    "labels": {
                        "alertname": "Alert",
                        "event": "recording",
                        "severity": "major"
                    },
                    "annotations": {
                        "description": "absent recording from from prometheus",
                        "summary": "absent recording from prometheus"
                    },
                    "state": "firing",
                    "activeAt": "2024-04-18T20:19:20.052884355Z",
                    "keepFiringSince": "2024-04-18T20:22:20.052884355Z",
                    "value": "1e+00"
                }],
                "health": "ok",
                "evaluationTime": 0.000622041,
                "lastEvaluation": "2024-04-18T20:22:54.31640421Z",
                "type": "alerting"
            }],
            "interval": 15,
            "limit": 0,
            "evaluationTime": 0.000654333,
            "lastEvaluation": "2024-04-18T20:22:54.316376794Z"
        }]
    }
}

After restart:
Alerting rule test is no longer firing, even though lastEvaluation < keepFiringSince + keep_firing_for.

{
    "status": "success",
    "data": {
        "groups": [{
            "name": "recording",
            "file": "/etc/prometheus/rules/kodkxfqmyiipgsa.yml",
            "rules": [{
                "name": "recording",
                "query": "time() % 3600",
                "labels": {
                    "severity": "warning"
                },
                "health": "ok",
                "evaluationTime": 0.004894042,
                "lastEvaluation": "2024-04-18T20:25:19.825592167Z",
                "type": "recording"
            }],
            "interval": 15,
            "limit": 0,
            "evaluationTime": 0.008571708,
            "lastEvaluation": "2024-04-18T20:25:19.821916792Z"
        }, {
            "name": "test",
            "file": "/etc/prometheus/rules/rules2.yaml",
            "rules": [{
                "state": "inactive",
                "name": "Alert",
                "query": "absent(recording)",
                "duration": 0,
                "keepFiringFor": 3600,
                "labels": {
                    "event": "recording",
                    "severity": "major"
                },
                "annotations": {
                    "description": "absent recording from from prometheus",
                    "summary": "absent recording from prometheus"
                },
                "alerts": [],
                "health": "ok",
                "evaluationTime": 0.000352792,
                "lastEvaluation": "2024-04-18T20:25:20.060370417Z",
                "type": "alerting"
            }],
            "interval": 15,
            "limit": 0,
            "evaluationTime": 0.000387791,
            "lastEvaluation": "2024-04-18T20:25:20.060339Z"
        }]
    }
}

Expected behavior: after the server restarts, the test alerting rule should keep firing until the keep_firing_for duration expires.


dragoangel commented Apr 22, 2024

@mustafain117 have you considered running Prometheus and Alertmanager in HA mode? That is easily achievable and would not require any special "magic"; you just need to make sure that Prometheus is always running.

@dragoangel

One thing I'm thinking of is an annotation keep_firing_for: x, just to inform the receiver that this alert has it set. It could be useful for informational purposes. For now I use an extra label keep_firing_for, which I set on alerts additionally, but as said, such information could be part of the annotations.


mustafain117 commented Apr 22, 2024

> @mustafain117 have you considered running Prometheus and Alertmanager in HA mode? That is easily achievable and would not require any special "magic"; you just need to make sure that Prometheus is always running.

@dragoangel how do you handle the scenario where all replicas/instances restart, e.g. when there is a deployment?

@dragoangel

> @dragoangel how do you handle the scenario where all replicas/instances restart, e.g. when there is a deployment?

The healthcheck should pass only after the service is fully loaded, but maybe you are right about the rollout restart. Still, I'm not sure this is such a big issue, and I'm also not sure it could be fixed easily with the current design.

@roidelapluie (Member)

Hello,

I agree that the keep_firing_for state should be kept across restarts. I would accept a pull request addressing this.

@mustafain117 (Author)

@roidelapluie Can you please take a look at this draft PR: #14018
