Persist alert 'keep_firing_for' state across restarts #13957
Comments
@mustafain117 Have you considered running Prometheus and Alertmanager in HA mode? This is easy to achieve and does not require any special "magic"; you just need to make sure that a Prometheus instance is always running.
One thing I'm thinking about is an annotation:
@dragoangel How do you handle the scenario where all replicas/instances restart, e.g. during a deployment?
The health check should only pass after the service is fully loaded, but maybe you are right about the rollout restart. Still, I'm not sure this is such a big issue. I'm also not sure it could be fixed easily with the current design.
Hello, I agree that the keep_firing_for state should be kept across restarts. I would accept a pull request addressing this.
@roidelapluie Can you please take a look at this draft PR: #14018
Proposal
Currently, if Prometheus restarts, we lose the 'keep_firing_for' state for firing alerts. This is a gap in the feature: alerts that should keep firing are prematurely resolved when Prometheus restarts (for example, during a deployment or after an OOM).
It'd be good to persist this state in some way, so that alerts don't resolve before the keep firing duration (stabilization delay) expires.
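To make the idea concrete, here is a minimal sketch of what persisting and restoring this state could look like. It is only an illustration under assumptions: the `persistedAlert` type, its field names, the JSON encoding, and the `encodeAlerts`/`decodeAlerts` helpers are all hypothetical and are not Prometheus's actual storage format or API.

```go
package main

import "encoding/json"

// persistedAlert is a hypothetical record of the state worth saving.
// The field names are assumptions for illustration only.
type persistedAlert struct {
	Rule            string            `json:"rule"`
	Labels          map[string]string `json:"labels"`
	KeepFiringSince int64             `json:"keepFiringSince"` // Unix seconds; 0 means unset
}

// encodeAlerts serializes the keep-firing state so it can be written
// somewhere durable before shutdown.
func encodeAlerts(alerts []persistedAlert) ([]byte, error) {
	return json.Marshal(alerts)
}

// decodeAlerts restores state previously produced by encodeAlerts.
func decodeAlerts(data []byte) ([]persistedAlert, error) {
	var alerts []persistedAlert
	if err := json.Unmarshal(data, &alerts); err != nil {
		return nil, err
	}
	return alerts, nil
}
```

On startup, the restored KeepFiringSince value would be reattached to the matching alert (same rule and label set) before the first evaluation, so the keep firing window continues from where it left off instead of being reset.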
What alert information needs to be persisted?
The problem described above can be reproduced by:
Example:
Alerting rule `test` is firing with a `KeepFiringSince` timestamp set, implying that the keep_firing_for duration is being used.
After restart:
Alerting rule `test` is no longer firing, even though lastEvaluation < KeepFiringSince + keep_firing_for duration.
Expected behavior: After the server restarts, alerting rule `test` should keep firing until the duration expires.