
Persist alert 'keep_firing_for' state across restarts #13957

Open
mustafain117 opened this issue Apr 18, 2024 · 6 comments
mustafain117 commented Apr 18, 2024

Proposal

Currently, if Prometheus restarts, we lose the 'keep_firing_for' state for firing alerts. This is a gap in the feature: alerts that should keep firing are prematurely resolved when Prometheus restarts (e.g. during a deployment or after an OOM kill).

It would be good to persist this state in some way, so that alerts don't resolve before the keep-firing duration / stabilization delay expires.

What alert information needs to be persisted?

  • state
  • labels
  • annotations
  • activeAt
  • keepFiringSince
  • value

The problem described above can be reproduced by:

  1. Create an alerting rule with keep_firing_for defined.
  2. Restart the server while the alert expression is resolved and the keep_firing_for duration is in use.
  3. The alert is resolved when Prometheus restarts, since the expression resolved prior to the restart and the rule query no longer produces a result.
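For step 1, a rule file along these lines reproduces the setup (the file name is hypothetical; the rule mirrors the Alert rule shown in the API output below):

```yaml
# /etc/prometheus/rules/rules2.yaml (hypothetical path)
groups:
  - name: test
    rules:
      - alert: Alert
        expr: absent(recording)
        keep_firing_for: 1h
        labels:
          event: recording
          severity: major
        annotations:
          summary: absent recording from prometheus
```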

Example:
Alerting rule test is firing with the keepFiringSince timestamp set, implying that the keep_firing_for duration is in use.

{
    "status": "success",
    "data": {
        "groups": [{
            "name": "recording",
            "file": "/etc/prometheus/rules/kodkxfqmyiipgsa.yml",
            "rules": [{
                "name": "recording",
                "query": "time() % 3600",
                "labels": {
                    "severity": "warning"
                },
                "health": "ok",
                "evaluationTime": 0.000487083,
                "lastEvaluation": "2024-04-18T20:22:49.82698325Z",
                "type": "recording"
            }],
            "interval": 15,
            "limit": 0,
            "evaluationTime": 0.000498208,
            "lastEvaluation": "2024-04-18T20:22:49.826974708Z"
        }, {
            "name": "test",
            "file": "/etc/prometheus/rules/rules2.yaml",
            "rules": [{
                "state": "firing",
                "name": "Alert",
                "query": "absent(recording)",
                "duration": 0,
                "keepFiringFor": 3600,
                "labels": {
                    "event": "recording",
                    "severity": "major"
                },
                "annotations": {
                    "description": "absent recording from from prometheus",
                    "summary": "absent recording from prometheus"
                },
                "alerts": [{
                    "labels": {
                        "alertname": "Alert",
                        "event": "recording",
                        "severity": "major"
                    },
                    "annotations": {
                        "description": "absent recording from from prometheus",
                        "summary": "absent recording from prometheus"
                    },
                    "state": "firing",
                    "activeAt": "2024-04-18T20:19:20.052884355Z",
                    "keepFiringSince": "2024-04-18T20:22:20.052884355Z",
                    "value": "1e+00"
                }],
                "health": "ok",
                "evaluationTime": 0.000622041,
                "lastEvaluation": "2024-04-18T20:22:54.31640421Z",
                "type": "alerting"
            }],
            "interval": 15,
            "limit": 0,
            "evaluationTime": 0.000654333,
            "lastEvaluation": "2024-04-18T20:22:54.316376794Z"
        }]
    }
}

After restart:
Alerting rule test is no longer firing, even though lastEvaluation < keepFiringSince + keep_firing_for.

{
    "status": "success",
    "data": {
        "groups": [{
            "name": "recording",
            "file": "/etc/prometheus/rules/kodkxfqmyiipgsa.yml",
            "rules": [{
                "name": "recording",
                "query": "time() % 3600",
                "labels": {
                    "severity": "warning"
                },
                "health": "ok",
                "evaluationTime": 0.004894042,
                "lastEvaluation": "2024-04-18T20:25:19.825592167Z",
                "type": "recording"
            }],
            "interval": 15,
            "limit": 0,
            "evaluationTime": 0.008571708,
            "lastEvaluation": "2024-04-18T20:25:19.821916792Z"
        }, {
            "name": "test",
            "file": "/etc/prometheus/rules/rules2.yaml",
            "rules": [{
                "state": "inactive",
                "name": "Alert",
                "query": "absent(recording)",
                "duration": 0,
                "keepFiringFor": 3600,
                "labels": {
                    "event": "recording",
                    "severity": "major"
                },
                "annotations": {
                    "description": "absent recording from from prometheus",
                    "summary": "absent recording from prometheus"
                },
                "alerts": [],
                "health": "ok",
                "evaluationTime": 0.000352792,
                "lastEvaluation": "2024-04-18T20:25:20.060370417Z",
                "type": "alerting"
            }],
            "interval": 15,
            "limit": 0,
            "evaluationTime": 0.000387791,
            "lastEvaluation": "2024-04-18T20:25:20.060339Z"
        }]
    }
}

Expected behavior: after the server restarts, the test alerting rule should keep firing until the keep_firing_for duration expires.


dragoangel commented Apr 22, 2024

@mustafain117 have you considered running Prometheus and Alertmanager in HA mode? That is easily achievable and would not require any special "magic"; you just need to make sure that Prometheus is always running.

@dragoangel

One thing I'm thinking of is an annotation keep_firing_for: x, just to inform the receiver that this alert has it set. It could be useful for informational purposes. For now I use an extra label keep_firing_for, which I set on alerts additionally, but as said, such information could be part of the annotations.


mustafain117 commented Apr 22, 2024

> @mustafain117 have you considered running Prometheus and Alertmanager in HA mode? That is easily achievable and would not require any special "magic"; you just need to make sure that Prometheus is always running.

@dragoangel how do you handle the scenario where all replicas/instances restart, e.g. when there is a deployment?

@dragoangel

> @dragoangel how do you handle the scenario where all replicas/instances restart, e.g. when there is a deployment?

The healthcheck should pass only after the service is fully loaded, but maybe you are right about the rollout restart. Still, I'm not sure this is such a big issue, and I'm also not sure it could be fixed easily with the current design.

@roidelapluie (Member)

Hello,

I agree that the keep_firing_for state should be kept across restarts. I would accept a pull request addressing this.

@mustafain117 (Author)

@roidelapluie Can you please take a look at this draft PR: #14018
