OOM crashloop auto-recovery #13939

Open
bwplotka opened this issue Apr 16, 2024 · 1 comment

Proposal

I would love to start a discussion on a better mechanism for OOM crashloop auto-recovery for users, in both agent and TSDB mode.

Motivation

I think there is some confusion around OOM cases, which has led to many reports about replay overhead that might actually have been caused by the "classic" OOM crashloop.

To me, the definition of an OOM crashloop is when we let Prometheus scrape too many series/samples (or series that are too high-cardinality) for the resources we have on the node (or other memory limits). Because both agent and TSDB modes are highly durable, with WAL and checkpointing, replay adds the same samples that caused the initial OOM back into memory, repeating the situation. As a result, there is little point in reducing WAL replay memory use below ~105% of normal Prometheus use. Note that persisted TSDB blocks and head chunks do not contribute to subsequent OOMs (unless you add query load, but that traffic is memory-mapped, so it generally does not use much heap).

This is an important and common issue, because once the user gets into an OOM state (e.g. one target exposed too many metrics), there are two common ways out of it:

A) Increase memory limits (if possible), even just for the 2h window needed to recover, or only long enough to learn which target is broken.
B) Delete the WAL directory (or rename/back it up first and then delete it, though it is likely ignored later anyway), losing the last ~2h of data. Ideally, at the same time you fix your target (e.g. limit its metrics) or use the next few minutes to detect where the cardinality growth happened, then make the change afterwards (with a couple more OOM and manual WAL removal cycles).

Generally, this is not ideal. A is not always possible (and not indefinitely); it can potentially be automated with a VPA, but in practice it is manual.
B is extremely manual and not easy to automate in many environments.

Breaking B down further, it consists of three manual steps:

B1) "detection": Find the source of the OOM. Is it a single target or the number of targets? Is it one metric? Is it one label that consumes an unexpected amount of memory?
B2) "limit": Limit the source manually, be it via a target sample limit, filtering out a label, or disabling the target (note: a recording rule won't work here).
B3) "recover": Delete the problematic data. There is no mechanism to delete partial data on replay (yet), plus we care about the fastest startup possible.

Again, people often do B3 first and then B1 and B2, because there is no other way.
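
For illustration, here is a minimal sketch of what the manual B3 step looks like when scripted, assuming the default layout where the WAL lives under the data directory; the /prometheus path and the backup naming are hypothetical. It moves the WAL aside (the rename/backup variant of B) rather than deleting it outright:

```go
// Sketch of the manual B3 "recover" step: move the WAL directory out of
// the way so the next Prometheus start does not replay the data that
// caused the OOM. Run while Prometheus is stopped.
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"
)

func main() {
	dataDir := "/prometheus"                // hypothetical data directory
	walDir := filepath.Join(dataDir, "wal") // WAL lives under the data dir

	if _, err := os.Stat(walDir); os.IsNotExist(err) {
		log.Printf("no WAL at %s, nothing to do", walDir)
		return
	}

	// Rename rather than delete, so the ~2h of data can still be inspected
	// (or discarded later once the broken target is identified and limited).
	backup := fmt.Sprintf("%s.bak-%s", walDir, time.Now().Format("20060102-150405"))
	if err := os.Rename(walDir, backup); err != nil {
		log.Fatalf("failed to move WAL aside: %v", err)
	}
	log.Printf("moved %s to %s; Prometheus will start with an empty WAL", walDir, backup)
}
```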

Ideas

My motivation is to make B easier, or fully automated, and I would start with B3.

Generally, the Prometheus project cares about user data; it's the number one priority, so maintainers have generally been against any data-destroying mechanisms. I wonder if there is room to revisit this, because at this point the majority of users (every user I talked to about this issue) literally do B3, i.e. remove the WAL data manually. Yes, it destroys 2h of data, but that is better than crashlooping and not monitoring at all for a few hours or days until we find time to fix it.

For agent mode, we could start with a "dangerous" flag that removes the WAL files on every restart; then we could talk about some "no-wal" mode for those who prefer availability over persistence.
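
As a rough sketch of what such an agent-mode flag could do at startup (the function name, flag wiring, and logging below are assumptions for discussion, not existing Prometheus code):

```go
// Hypothetical startup hook for agent mode: if the (assumed)
// "delete WAL on start" flag is set, remove the agent's WAL directory
// before the storage is opened, trading the buffered ~2h of data for a
// guaranteed clean start (e.g. to break an OOM crashloop).
package agentwal

import (
	"log"
	"os"
	"path/filepath"
)

// MaybeDeleteWALOnStart is a sketch; the real wiring would live next to
// the agent storage open path and use Prometheus' own logger and flag set.
func MaybeDeleteWALOnStart(dataDir string, deleteOnStart bool) error {
	if !deleteOnStart {
		return nil
	}
	walDir := filepath.Join(dataDir, "wal")
	log.Printf("WARNING: deleting %s on start as requested; buffered data will be lost", walDir)
	return os.RemoveAll(walDir)
}
```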

For TSDB mode, we could have similar flag for WAL removal, but only when OOM is detected (how though?). Lot's of questions but even aiming first Kubernetes env would help. Perhaps users had something like that already scripted/via sidecar. Or perhaps we could innovate with some WAL backup mechanisms first (e.g. to object storage or simply renaming) when OOM happens?
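
On the "how do we detect an OOM?" question: at least on Kubernetes, one option is a small wrapper around the Prometheus entrypoint (or a sidecar) that inspects the pod's own status, since after an OOM kill the restarted container sees lastState.terminated.reason == "OOMKilled" for itself. Below is a sketch using client-go; the pod/container names, env vars, and data directory are assumptions, and a rename or object-storage upload could replace the plain removal:

```go
// Sketch of an OOM-detection wrapper for Kubernetes: before starting
// Prometheus, check whether this container's previous termination was an
// OOM kill and, only then, drop (or back up) the WAL.
package main

import (
	"context"
	"log"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// lastRestartWasOOM reports whether the named container in this pod was
// last terminated with reason "OOMKilled".
func lastRestartWasOOM(ctx context.Context, namespace, podName, container string) (bool, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return false, err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return false, err
	}
	pod, err := client.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name == container && cs.LastTerminationState.Terminated != nil {
			return cs.LastTerminationState.Terminated.Reason == "OOMKilled", nil
		}
	}
	return false, nil
}

func main() {
	// POD_NAMESPACE and POD_NAME would be injected via the downward API.
	namespace, podName := os.Getenv("POD_NAMESPACE"), os.Getenv("POD_NAME")
	oom, err := lastRestartWasOOM(context.Background(), namespace, podName, "prometheus")
	if err != nil {
		log.Fatalf("could not inspect pod status: %v", err)
	}
	if !oom {
		return // normal restart: keep the WAL and replay as usual
	}
	walDir := filepath.Join("/prometheus", "wal") // hypothetical data directory
	log.Printf("previous restart was OOMKilled; removing %s to break the crashloop", walDir)
	// A rename or an upload to object storage could happen here instead,
	// to keep the ~2h of data for later inspection.
	if err := os.RemoveAll(walDir); err != nil {
		log.Fatalf("failed to remove WAL: %v", err)
	}
	// ...then exec the real Prometheus binary with its usual arguments.
}
```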

Further Context

To add more context: at Google, for one very common GKE use case, we run a heavily tuned Prometheus in TSDB mode (we plan to move to agent mode at some point) with ephemeral storage. Still, due to Kubernetes semantics, the data directory persists across container restarts, so we are adding --storage.tsdb.delete-data-on-start and --storage.agent.delete-data-on-start (to our fork for now) to forcefully delete all data files, for consistency with a normal pod rollout (which is ephemeral), to get ultra-fast startup, and to allow recovery in OOM cases. We are also working on other automation around memory limits (A), and scrape-limit automation would be nice as well (B1 and B2), so I'm keen to discuss the potential solution more here (:
