Memory usage spikes during WAL replay to more than normal usage #6934

Open · erkexzcx opened this issue Mar 5, 2020 · 140 comments

@erkexzcx commented Mar 5, 2020

What did you do?
Tried to start Prometheus.

What did you expect to see?
Prometheus up & running, web interface showing up.

What did you see instead? Under which circumstances?
Prometheus runs out of RAM during "WAL segment loaded" process.

Environment
Debian 9

  • System information:
    Linux 4.9.0-11-amd64 x86_64

  • Prometheus version:

prometheus, version 2.16.0 (branch: HEAD, revision: b90be6f32a33c03163d700e1452b54454ddce0ec)
  build user:       root@7ea0ae865f12
  build date:       20200213-23:50:02
  go version:       go1.13.8
  • Prometheus configuration file:
global:
  evaluation_interval: 60s
  scrape_interval: 60s
...
...
...
  • Logs:

This is what happens during startup, after roughly 10 minutes:

... prometheus[39101]: level=info ts=2020-03-05T14:02:26.811Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=41869 maxSegment=41871
... prometheus[39101]: level=info ts=2020-03-05T14:02:26.812Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=41870 maxSegment=41871
... prometheus[39101]: level=info ts=2020-03-05T14:02:26.812Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=41871 maxSegment=41871
... prometheus[39101]: fatal error: runtime: out of memory
... prometheus[39101]: runtime stack:
... prometheus[39101]: runtime.throw(0x253885d, 0x16)
... prometheus[39101]:         /usr/local/go/src/runtime/panic.go:774 +0x72
... prometheus[39101]: runtime.sysMap(0xce78000000, 0x14000000, 0x3f5bc78)
... prometheus[39101]:         /usr/local/go/src/runtime/mem_linux.go:169 +0xc5
... prometheus[39101]: runtime.(*mheap).sysAlloc(0x3f432c0, 0x11de6000, 0xc000, 0x4373e7)
... prometheus[39101]:         /usr/local/go/src/runtime/malloc.go:701 +0x1cd
... prometheus[39101]: runtime.(*mheap).grow(0x3f432c0, 0x8ef3, 0xffffffff)
... prometheus[39101]:         /usr/local/go/src/runtime/mheap.go:1255 +0xa3
... prometheus[39101]: runtime.(*mheap).allocSpanLocked(0x3f432c0, 0x8ef3, 0x3f5bc88, 0x20339d00000000)
... prometheus[39101]:         /usr/local/go/src/runtime/mheap.go:1170 +0x266
... prometheus[39101]: runtime.(*mheap).alloc_m(0x3f432c0, 0x8ef3, 0x101, 0x7f5861cc3fff)
... prometheus[39101]:         /usr/local/go/src/runtime/mheap.go:1022 +0xc2
... prometheus[39101]: runtime.(*mheap).alloc.func1()
... prometheus[39101]:         /usr/local/go/src/runtime/mheap.go:1093 +0x4c
... prometheus[39101]: runtime.(*mheap).alloc(0x3f432c0, 0x8ef3, 0x7f5861010101, 0x7f5861d11008)
... prometheus[39101]:         /usr/local/go/src/runtime/mheap.go:1092 +0x8a
... prometheus[39101]: runtime.largeAlloc(0x11de5ec0, 0x450101, 0x7f5861d11008)
... prometheus[39101]:         /usr/local/go/src/runtime/malloc.go:1138 +0x97
... prometheus[39101]: runtime.mallocgc.func1()
... prometheus[39101]:         /usr/local/go/src/runtime/malloc.go:1033 +0x46
... prometheus[39101]: runtime.systemstack(0x0)
... prometheus[39101]:         /usr/local/go/src/runtime/asm_amd64.s:370 +0x66
... prometheus[39101]: runtime.mstart()
... prometheus[39101]:         /usr/local/go/src/runtime/proc.go:1146
... prometheus[39101]: goroutine 225 [running]:
... prometheus[39101]: runtime.systemstack_switch()
... prometheus[39101]:         /usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc0022234c0 sp=0xc0022234b8 pc=0x45d180
... prometheus[39101]: runtime.mallocgc(0x11de5ec0, 0x1fd78e0, 0x5949d401, 0xc0003631d0)
... prometheus[39101]:         /usr/local/go/src/runtime/malloc.go:1032 +0x895 fp=0xc002223560 sp=0xc0022234c0 pc=0x40c755
... prometheus[39101]: runtime.makeslice(0x1fd78e0, 0x0, 0x23bcbd8, 0xd)
... prometheus[39101]:         /usr/local/go/src/runtime/slice.go:49 +0x6c fp=0xc002223590 sp=0xc002223560 pc=0x445bac
... prometheus[39101]: github.com/prometheus/prometheus/tsdb/index.(*MemPostings).Delete(0xc000f29a70, 0xce71eef470)

This is what the systemd service looks like:

[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
User=prometheus
Restart=always
RestartSec=5s
LimitNOFILE=infinity
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/data \
    --web.listen-address="127.0.0.1:1234" \
    --web.external-url="https://example.com" \
    --web.enable-admin-api \
    --storage.tsdb.retention.time=30d
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

Here is the RAM usage of the server (past 1 hour) - note that RAM fills up, Prometheus runs out of memory, the service gets killed and is restarted:
[screenshot: RAM usage graph over the past hour]

Please advise how I can troubleshoot this issue further.

@cstyan (Member) commented Mar 6, 2020

Your Prometheus instance either needs more RAM, or you can delete the WAL directory (you will lose some data).

I don't think there have been any recent changes that would affect replay memory usage. cc @codesome
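
A minimal sketch of that workaround for a systemd-managed host like the one in the report above; the data path is taken from the ExecStart line and is an assumption for other setups. Anything not yet compacted into blocks (roughly the last 0-2 hours) is lost:

# stop Prometheus, drop the write-ahead log, start again
systemctl stop prometheus
rm -rf /var/lib/prometheus/data/wal
systemctl start prometheus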

@roidelapluie (Member)

How many targets / timeseries do you have? For only 3 segments that's A LOT of memory.

@erkexzcx (Author) commented Mar 9, 2020

Hi, thanks. That's not just 3 segments - I cut everything else out of the log. I deleted the WAL files and Prometheus started just fine. Since the WAL files are gone, this issue can no longer be reproduced, so I'm closing it.

@erkexzcx erkexzcx closed this as completed Mar 9, 2020
@jugani commented Nov 11, 2020

We are also facing the same issue.
level=info ts=2020-11-11T11:34:43.556Z caller=repair.go:59 component=tsdb msg="Found healthy block" mint=1605081600000 maxt=1605088800000 ulid=01EPVFCPX3M7FESTRSB5Q2TF51
level=info ts=2020-11-11T11:34:43.985Z caller=head.go:575 component=tsdb msg="Replaying WAL, this may take awhile"
level=info ts=2020-11-11T11:34:58.706Z caller=head.go:600 component=tsdb msg="WAL checkpoint loaded"
level=info ts=2020-11-11T11:35:01.019Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2294 maxSegment=2319
level=info ts=2020-11-11T11:35:08.143Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2295 maxSegment=2319
level=info ts=2020-11-11T11:35:11.164Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2296 maxSegment=2319
level=info ts=2020-11-11T11:35:14.823Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2297 maxSegment=2319
level=info ts=2020-11-11T11:35:18.494Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2298 maxSegment=2319
level=info ts=2020-11-11T11:35:22.327Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2299 maxSegment=2319
level=info ts=2020-11-11T11:35:28.307Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2300 maxSegment=2319
level=info ts=2020-11-11T11:35:31.207Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2301 maxSegment=2319
level=info ts=2020-11-11T11:35:34.117Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2302 maxSegment=2319
level=info ts=2020-11-11T11:35:37.019Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2303 maxSegment=2319
level=info ts=2020-11-11T11:35:39.883Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2304 maxSegment=2319
level=info ts=2020-11-11T11:35:43.309Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2305 maxSegment=2319
level=info ts=2020-11-11T11:35:48.017Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2306 maxSegment=2319

We are running 2 Prometheus pods with a memory limit of 10 GiB and a memory request of 6 GiB.

@jsravn (Contributor) commented Nov 12, 2020

I'm also observing this issue. @cstyan is there anything we can do to debug further? In my case, WAL replay seems to require 2-3x as much memory as the running Prometheus instance. So while the instance happily runs at around 30Gi of memory, replay uses over 50Gi, which leads to it being OOMKilled at startup. I can't simply increase the RAM limit here, since it requires an excessive amount of memory just to start. Is that expected? It seems like it is keeping too much in memory while doing the replay.

@codesome (Member)

I have also seen the memory spike during WAL replay to almost double the consumption before shutdown (no remote write). Ideally WAL replay should not take so much memory, so I believe there is some room for improvement here. I am re-opening this issue and updating the title (maybe the snapshotting work will help here).

@codesome codesome reopened this Nov 18, 2020
@codesome codesome changed the title Prometheus fails to start - runs out of memory during "WAL segment loaded" Memory usage spikes during WAL replay to more than normal usage Nov 18, 2020
@brian-brazil (Contributor)

I don't think that snapshotting would make a difference, because we're still putting the same data in the head one way or the other. I think we should figure out why this is happening first.

@yechiel-optimalq

Is there any way to disable the WAL completely? This is now reproducing on a daily basis, and the memory spike in Prometheus is causing us real issues.

@jsravn (Contributor) commented Dec 8, 2020

@yechiel-optimalq as a workaround you can remove the WAL folder in an init container. You will lose the data in the WAL, of course.
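
A minimal sketch of what such an init container (or a pre-start hook) could run against the Prometheus data volume; the /prometheus path is an assumption based on the common container layout, and a later comment in this thread notes the sibling chunks_head directory may also need clearing:

# wipe the WAL (and optionally chunks_head) before the main container starts
rm -rf /prometheus/wal /prometheus/chunks_head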

@yechiel-optimalq

Yes, I'm doing that, but it has become a daily thing.

@roidelapluie (Member) commented Dec 8, 2020

Is there any chance you could set up conprof to help take relevant memory dumps at startup?

@yechiel-optimalq

We are running with the Helm chart; if you have some instructions for conprof with the Helm chart, I can try.

@yechiel-optimalq

This of course continues to happen for us - the OOM killer fires during WAL replay. If I delete the WAL, all is good.

@pdruzhinin

Hello,

Has anyone been able to resolve it? I keep having to delete the 'wal' folder every few weeks.

@roidelapluie (Member)

Can we get the full logs for the first hour?

@codesome (Member) commented Jan 7, 2021

We would need some profiling of WAL replay to find the pieces that are holding onto memory. I think this is a common occurrence. While it is in my interest to profile and investigate, I don't think I will have time for that anytime soon.

@meredrica

We've been having this issue for a long time now too. Prometheus runs fine until it gets evicted, and then we have to go and delete the WAL.
We increased the RAM to insane amounts just for startup, and still run into this problem.

@codesome (Member)

I did a quick profiling of WAL replay yesterday and nothing stood out as an obvious thing to fix/optimise. I am expecting #7229 to help with this.

@brian-brazil (Contributor)

I don't think it'll make a difference, we still have all the exact same data structures being built up one way or the other.

@codesome (Member)

I don't think it'll make a difference

Building a chunk from samples takes more allocations and temporary memory than simply taking encoded chunk bytes and sticking them into memory. We will know more after some benchmarks.

@brian-brazil (Contributor)

Which is the same as the RAM it took when it was originally being ingested, and thus shouldn't cause any extra RAM requirements during replay. This feels like a red herring to me.

@gimpiron
+1

1 similar comment

@lwsanty commented Feb 17, 2021
+1

@rasheedamir

We have 4 instances of the exact same version of Prometheus, and only one of them shows this behaviour. Any clue how to debug it further? We tried deleting the WAL directory, but the problem comes back.

@bboreham (Member)

Can you try the release candidate? Might be better than hoping!

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jun 23, 2023
prometheus/prometheus#6934 is
the core problem, and we can afford to lose a tiny bit of
data during restarts by killing the wal at startup instead of
giving prometheus a lot of memory to read it all in. We
lose data during the WAL reading phase anyway, and the amount
of memory required seems ginormous (the linked issue has folks
topping out at 64G). Apparently 2.45 is supposed to help,
but it's not out yet (see
prometheus/prometheus#6934 (comment)).

We should only do the wal deletions on prometheii that have this
specific problem.

I've removed the higher memory request for this, and brought it
in line with our other prometheii - and it seems to work fine
without any issues. I've thus also reduced the size of the
core node, practically reverting 2i2c-org#2701
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jun 26, 2023 (same commit message as above)
@consideRatio

Issue discussion summary

I read through this issue, and here are some takeaways.

  1. The wal directory can be cleared during startup, as can the sibling chunks_head directory.
  2. To clear directories on startup in a k8s context, an init container or an emptyDir volume won't do the trick when only the container restarts - they only take effect when the entire pod restarts.
  3. Some settings can be tweaked, for example setting storage.tsdb.min-block-duration and storage.tsdb.max-block-duration to 30m (illustrated in the sketch after this list).
  4. Indications of possible issues before they arise can be seen from a running prometheus via specific log lines:
    caller=db.go:1075 level=warn component=tsdb msg="Head compaction took longer than the block time range, compactions are falling behind and won't be able to catch up" duration=3h2m17.340717958s block_range=7200000
    
  5. Prometheus 2.45 is released now and includes a promising improvement by @bboreham.
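
A hedged illustration of item 3 above. These are advanced flags, the 30m value is simply the one mentioned in the summary (not a recommendation), and shorter blocks trade replay memory for more frequent compaction:

# example flags only; keep the rest of your existing command line
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.min-block-duration=30m \
  --storage.tsdb.max-block-duration=30m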

@bboreham (Member) commented Jul 6, 2023

Has anyone got any results to report from version 2.45 ?

@shixuyue commented Jul 6, 2023

Has anyone got any results to report from version 2.45 ?

I'm trying 2.45.
So far so good, but I don't know if there is a way to trigger a massive WAL replay to test it.

@Shahard2 commented Jul 7, 2023

Has anyone got any results to report from version 2.45 ?

We are on 2.45

Still having restarts and crashes..

@wedge76 commented Jul 11, 2023

Same here.

@awx-fuyuanchu

same here.

@neno13453

Same here, we still encounter the issue.

@Shahard2

Is anyone going to push this forward?

We suffer badly from this issue!!

@levya4 commented Jul 30, 2023

Same issue here !!

@alon-dotan-starkware
Same issue here

1 similar comment

@narendiran11
Same issue here

@aleg-malinovsky

So... did 2.45 fix the WAL loading memory spike?

@bboreham (Member) commented Sep 5, 2023

A number of people reported no improvement.

It would be helpful to supply details, e.g. a memory profile when the process gets large.
Best to open a new issue if you do have details, so we can concentrate on the specific symptoms there.

@aleg-malinovsky

A number of people reported no improvement.

It would be helpful to supply details, e.g. a memory profile when the process gets large. Best to open a new issue if you do have details, so we can concentrate on the specific symptoms there.

Sure, I will keep an eye out for when the loop of death happens again.

Could you please specify exactly what details you are looking for and how to get them?
Specific commands to run would be even better.

For context, we're running Prometheus inside k8s with the operator.

Thanks!

@bboreham (Member) commented Sep 6, 2023

Prometheus (like most Go programs) has an HTTP endpoint that lets you fetch profiles.

The command will vary according to your particular setup, but basically you want to get access to Prometheus' http port (perhaps via kubectl port-forward) and then:

curl http://localhost:9090/debug/pprof/heap > heap.pprof.gz

If the output is less than 1KB it is probably an error; look at the output and try to resolve whatever it says.

The standard issue template asks for other details such as config and logs.
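
For a Kubernetes setup, a hedged sketch of those steps; the pod and namespace names are placeholders, and go tool pprof is only needed if you want to inspect the profile locally rather than attach it to an issue:

kubectl -n monitoring port-forward pod/prometheus-0 9090:9090 &
curl -s http://localhost:9090/debug/pprof/heap > heap.pprof.gz
go tool pprof -top heap.pprof.gz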

@Stefan-Mihail

I would limit the storage.tsdb.retention.size to a lower value as a temporary fix

@john-jam commented Oct 5, 2023

Same issue with 2.47.

@bboreham (Member) commented Oct 5, 2023

Seems I need to say it louder: adding "me too" with no details (a) emails everyone who commented so far and (b) gets us nowhere. Please don't do this. Clicking 👍 on the top description can indicate your desire for improvement.

#6934 (comment) for further instructions.

@aakarim commented Nov 7, 2023

I hope this isn't considered a "me too" comment - we've been having this issue too, but as a fix we've dropped our storage.tsdb.retention.time to 2d and we're now running Mimir on top for long-term storage and querying. So far it seems to have solved the problem.

We're using a bare-bones Mimir deployment that uses about as many resources as one of our Prometheus instances. It's not ideal for our use case - we have fairly low metrics throughput requirements (around 500-600k series), so I don't see why we need a totally separate system just for long-term storage - but WAL replays were really killing us on costs!

@eumel8 commented Dec 14, 2023

Prometheus v2.48.1
Kubernetes 1.25
Longhorn RWO PVC 100GB

We experienced this error in one of our clusters and stumbled on this issue. What we have discovered so far:

  1. No block (chunk) directories are created under the /prometheus mount; we only have wal and chunks_head.
  2. In wal there are various files accumulated since the Prometheus deployment:
$ ls -l wal/
total 1581024
-rw-rw-r--    1 1000     2000     134217728 Dec 13 13:33 00000000
-rw-rw-r--    1 1000     2000     134152192 Dec 13 13:38 00000001
-rw-rw-r--    1 1000     2000     134119424 Dec 13 13:43 00000002
-rw-rw-r--    1 1000     2000     134217728 Dec 13 13:48 00000003
-rw-rw-r--    1 1000     2000     134184960 Dec 13 13:54 00000004
-rw-rw-r--    1 1000     2000     134152192 Dec 13 14:00 00000005
-rw-rw-r--    1 1000     2000      76480512 Dec 13 14:03 00000006
-rw-rw-r--    1 1000     2000     134184960 Dec 13 14:13 00000007
-rw-rw-r--    1 1000     2000     134217728 Dec 13 14:19 00000008
-rw-rw-r--    1 1000     2000     133890048 Dec 13 14:25 00000009
-rw-rw-r--    1 1000     2000     134217728 Dec 13 14:30 00000010
-rw-rw-r--    1 1000     2000      73119226 Dec 13 14:33 00000011
-rw-rw-r--    1 1000     2000       9865525 Dec 13 14:36 00000012
-rw-rw-r--    1 1000     2000       9240843 Dec 13 14:40 00000013
-rw-rw-r--    1 1000     2000       8248627 Dec 13 14:44 00000014
-rw-rw-r--    1 1000     2000       9105569 Dec 13 14:48 00000015
-rw-rw-r--    1 1000     2000       7974709 Dec 13 14:52 00000016
-rw-rw-r--    1 1000     2000       6973024 Dec 13 14:58 00000017
-rw-rw-r--    1 1000     2000       8545568 Dec 13 15:06 00000018
-rw-rw-r--    1 1000     2000       7791205 Dec 13 15:14 00000019
-rw-rw-r--    1 1000     2000       6654355 Dec 13 15:23 00000020
-rw-rw-r--    1 1000     2000       6631211 Dec 13 15:31 00000021
-rw-rw-r--    1 1000     2000       4125743 Dec 13 15:39 00000022
-rw-rw-r--    1 1000     2000       3452263 Dec 13 15:47 00000023
-rw-rw-r--    1 1000     2000       4566063 Dec 13 15:55 00000024
-rw-rw-r--    1 1000     2000       3913707 Dec 13 16:03 00000025
-rw-rw-r--    1 1000     2000       3219429 Dec 13 16:11 00000026
-rw-rw-r--    1 1000     2000       6302202 Dec 13 16:19 00000027
-rw-rw-r--    1 1000     2000       3860841 Dec 13 16:28 00000028
-rw-rw-r--    1 1000     2000       3039857 Dec 13 16:36 00000029
-rw-rw-r--    1 1000     2000        314631 Dec 13 16:44 00000030
-rw-rw-r--    1 1000     2000             0 Dec 13 16:49 00000031
-rw-rw-r--    1 1000     2000        382659 Dec 13 17:00 00000032
-rw-rw-r--    1 1000     2000        553411 Dec 13 17:08 00000033
-rw-rw-r--    1 1000     2000             0 Dec 13 17:13 00000034
-rw-rw-r--    1 1000     2000             0 Dec 13 17:22 00000035
-rw-rw-r--    1 1000     2000             0 Dec 13 17:30 00000036
-rw-rw-r--    1 1000     2000        821334 Dec 13 17:40 00000037
-rw-rw-r--    1 1000     2000             0 Dec 13 17:46 00000038
-rw-rw-r--    1 1000     2000        381565 Dec 13 17:57 00000039
-rw-rw-r--    1 1000     2000           272 Dec 13 18:05 00000040
-rw-rw-r--    1 1000     2000             0 Dec 13 18:10 00000041
-rw-rw-r--    1 1000     2000       1108154 Dec 13 18:16 00000042
-rw-rw-r--    1 1000     2000             0 Dec 13 18:21 00000043
-rw-rw-r--    1 1000     2000        791809 Dec 13 18:33 00000044
-rw-rw-r--    1 1000     2000       1173119 Dec 13 18:40 00000045
-rw-rw-r--    1 1000     2000             0 Dec 13 18:46 00000046
-rw-rw-r--    1 1000     2000        951617 Dec 13 18:56 00000047
-rw-rw-r--    1 1000     2000             0 Dec 13 19:02 00000048
-rw-rw-r--    1 1000     2000        504870 Dec 13 19:12 00000049
-rw-rw-r--    1 1000     2000             0 Dec 13 19:17 00000050
-rw-rw-r--    1 1000     2000             0 Dec 13 19:25 00000051
-rw-rw-r--    1 1000     2000             0 Dec 13 19:33 00000052
-rw-rw-r--    1 1000     2000        985188 Dec 13 19:43 00000053
-rw-rw-r--    1 1000     2000             0 Dec 13 19:48 00000054
-rw-rw-r--    1 1000     2000        917504 Dec 13 20:02 00000055
-rw-rw-r--    1 1000     2000             0 Dec 13 20:02 00000056
-rw-rw-r--    1 1000     2000             0 Dec 13 20:07 00000057
-rw-rw-r--    1 1000     2000        239463 Dec 13 20:18 00000058
-rw-rw-r--    1 1000     2000             0 Dec 13 20:23 00000059
-rw-rw-r--    1 1000     2000             0 Dec 13 20:31 00000060
-rw-rw-r--    1 1000     2000             0 Dec 13 20:34 00000061
-rw-rw-r--    1 1000     2000             0 Dec 13 20:42 00000062
-rw-rw-r--    1 1000     2000             0 Dec 13 20:50 00000063
-rw-rw-r--    1 1000     2000        925886 Dec 13 21:00 00000064
-rw-rw-r--    1 1000     2000          9973 Dec 13 21:08 00000065
-rw-rw-r--    1 1000     2000        484348 Dec 13 21:16 00000066
-rw-rw-r--    1 1000     2000             0 Dec 13 21:22 00000067
-rw-rw-r--    1 1000     2000             0 Dec 13 21:24 00000068
-rw-rw-r--    1 1000     2000             0 Dec 13 21:32 00000069
-rw-rw-r--    1 1000     2000         94128 Dec 13 21:43 00000070
-rw-rw-r--    1 1000     2000             0 Dec 13 21:48 00000071
-rw-rw-r--    1 1000     2000        383118 Dec 13 21:58 00000072
-rw-rw-r--    1 1000     2000             0 Dec 13 22:03 00000073
-rw-rw-r--    1 1000     2000         14733 Dec 13 22:09 00000074
-rw-rw-r--    1 1000     2000             0 Dec 13 22:14 00000075
-rw-rw-r--    1 1000     2000        121003 Dec 13 22:24 00000076
-rw-rw-r--    1 1000     2000             0 Dec 13 22:29 00000077
-rw-rw-r--    1 1000     2000             0 Dec 13 22:37 00000078
-rw-rw-r--    1 1000     2000             0 Dec 13 22:45 00000079
-rw-rw-r--    1 1000     2000             0 Dec 13 22:53 00000080
-rw-rw-r--    1 1000     2000             0 Dec 13 23:01 00000081
-rw-rw-r--    1 1000     2000        564763 Dec 13 23:11 00000082
-rw-rw-r--    1 1000     2000             0 Dec 13 23:16 00000083
-rw-rw-r--    1 1000     2000             0 Dec 13 23:24 00000084
-rw-rw-r--    1 1000     2000             0 Dec 13 23:26 00000085
-rw-rw-r--    1 1000     2000             0 Dec 13 23:34 00000086
-rw-rw-r--    1 1000     2000             0 Dec 13 23:42 00000087
-rw-rw-r--    1 1000     2000             0 Dec 13 23:49 00000088
-rw-rw-r--    1 1000     2000             0 Dec 13 23:56 00000089
-rw-rw-r--    1 1000     2000             0 Dec 14 00:04 00000090
-rw-rw-r--    1 1000     2000             0 Dec 14 00:11 00000091
-rw-rw-r--    1 1000     2000             0 Dec 14 00:14 00000092
-rw-rw-r--    1 1000     2000             0 Dec 14 00:21 00000093
-rw-rw-r--    1 1000     2000             0 Dec 14 00:28 00000094
-rw-rw-r--    1 1000     2000             0 Dec 14 00:36 00000095
-rw-rw-r--    1 1000     2000             0 Dec 14 00:43 00000096
-rw-rw-r--    1 1000     2000             0 Dec 14 00:50 00000097
-rw-rw-r--    1 1000     2000             0 Dec 14 00:58 00000098
-rw-rw-r--    1 1000     2000             0 Dec 14 01:00 00000099
-rw-rw-r--    1 1000     2000             0 Dec 14 01:08 00000100
-rw-rw-r--    1 1000     2000             0 Dec 14 01:15 00000101
-rw-rw-r--    1 1000     2000             0 Dec 14 01:22 00000102
-rw-rw-r--    1 1000     2000             0 Dec 14 01:30 00000103
-rw-rw-r--    1 1000     2000             0 Dec 14 01:37 00000104
-rw-rw-r--    1 1000     2000             0 Dec 14 01:44 00000105
-rw-rw-r--    1 1000     2000             0 Dec 14 01:47 00000106
-rw-rw-r--    1 1000     2000             0 Dec 14 01:54 00000107
-rw-rw-r--    1 1000     2000             0 Dec 14 02:02 00000108
-rw-rw-r--    1 1000     2000             0 Dec 14 02:09 00000109
-rw-rw-r--    1 1000     2000             0 Dec 14 02:16 00000110
-rw-rw-r--    1 1000     2000             0 Dec 14 02:24 00000111
-rw-rw-r--    1 1000     2000             0 Dec 14 02:31 00000112
-rw-rw-r--    1 1000     2000             0 Dec 14 02:33 00000113
-rw-rw-r--    1 1000     2000             0 Dec 14 02:41 00000114
-rw-rw-r--    1 1000     2000             0 Dec 14 02:48 00000115
-rw-rw-r--    1 1000     2000             0 Dec 14 02:55 00000116
-rw-rw-r--    1 1000     2000             0 Dec 14 03:03 00000117
-rw-rw-r--    1 1000     2000             0 Dec 14 03:10 00000118
-rw-rw-r--    1 1000     2000             0 Dec 14 03:18 00000119
-rw-rw-r--    1 1000     2000             0 Dec 14 03:20 00000120
-rw-rw-r--    1 1000     2000             0 Dec 14 03:28 00000121
-rw-rw-r--    1 1000     2000             0 Dec 14 03:35 00000122
-rw-rw-r--    1 1000     2000             0 Dec 14 03:43 00000123
-rw-rw-r--    1 1000     2000             0 Dec 14 03:51 00000124
-rw-rw-r--    1 1000     2000             0 Dec 14 03:58 00000125
-rw-rw-r--    1 1000     2000             0 Dec 14 04:05 00000126
-rw-rw-r--    1 1000     2000             0 Dec 14 04:08 00000127
-rw-rw-r--    1 1000     2000        213558 Dec 14 04:19 00000128
-rw-rw-r--    1 1000     2000             0 Dec 14 04:24 00000129
-rw-rw-r--    1 1000     2000             0 Dec 14 04:32 00000130
-rw-rw-r--    1 1000     2000             0 Dec 14 04:40 00000131
-rw-rw-r--    1 1000     2000             0 Dec 14 04:47 00000132
-rw-rw-r--    1 1000     2000             0 Dec 14 04:49 00000133
-rw-rw-r--    1 1000     2000             0 Dec 14 04:57 00000134
-rw-rw-r--    1 1000     2000             0 Dec 14 05:05 00000135
-rw-rw-r--    1 1000     2000             0 Dec 14 05:13 00000136
-rw-rw-r--    1 1000     2000             0 Dec 14 05:21 00000137
-rw-rw-r--    1 1000     2000             0 Dec 14 05:29 00000138
-rw-rw-r--    1 1000     2000             0 Dec 14 05:32 00000139
-rw-rw-r--    1 1000     2000             0 Dec 14 05:40 00000140
-rw-rw-r--    1 1000     2000          2363 Dec 14 05:51 00000141
-rw-rw-r--    1 1000     2000        423895 Dec 14 05:59 00000142
-rw-rw-r--    1 1000     2000             0 Dec 14 06:04 00000143
-rw-rw-r--    1 1000     2000             0 Dec 14 06:12 00000144
-rw-rw-r--    1 1000     2000             0 Dec 14 06:20 00000145
-rw-rw-r--    1 1000     2000             0 Dec 14 06:28 00000146
-rw-rw-r--    1 1000     2000       1549923 Dec 14 06:39 00000147
-rw-rw-r--    1 1000     2000             0 Dec 14 06:44 00000148
-rw-rw-r--    1 1000     2000             0 Dec 14 06:47 00000149
-rw-rw-r--    1 1000     2000             0 Dec 14 06:55 00000150
-rw-rw-r--    1 1000     2000             0 Dec 14 07:03 00000151
-rw-rw-r--    1 1000     2000             0 Dec 14 07:11 00000152
-rw-rw-r--    1 1000     2000             0 Dec 14 07:19 00000153
-rw-rw-r--    1 1000     2000             0 Dec 14 07:27 00000154
-rw-rw-r--    1 1000     2000             0 Dec 14 07:30 00000155
-rw-rw-r--    1 1000     2000        290538 Dec 14 07:35 00000156
-rw-rw-r--    1 1000     2000             0 Dec 14 07:36 00000157
-rw-rw-r--    1 1000     2000             0 Dec 14 07:38 00000158
-rw-rw-r--    1 1000     2000             0 Dec 14 07:41 00000159
-rw-rw-r--    1 1000     2000             0 Dec 14 07:44 00000160
-rw-rw-r--    1 1000     2000             0 Dec 14 07:46 00000161
-rw-rw-r--    1 1000     2000             0 Dec 14 07:48 00000162
-rw-rw-r--    1 1000     2000             0 Dec 14 07:51 00000163
-rw-rw-r--    1 1000     2000             0 Dec 14 07:54 00000164
-rw-rw-r--    1 1000     2000             0 Dec 14 07:57 00000165
-rw-rw-r--    1 1000     2000          3548 Dec 14 08:03 00000166
-rw-rw-r--    1 1000     2000             0 Dec 14 08:03 00000167
-rw-rw-r--    1 1000     2000             0 Dec 14 08:05 00000168
-rw-rw-r--    1 1000     2000             0 Dec 14 08:08 00000169
-rw-rw-r--    1 1000     2000             0 Dec 14 08:11 00000170
-rw-rw-r--    1 1000     2000             0 Dec 14 08:15 00000171
-rw-rw-r--    1 1000     2000             0 Dec 14 08:19 00000172
-rw-rw-r--    1 1000     2000             0 Dec 14 08:22 00000173
-rw-rw-r--    1 1000     2000             0 Dec 14 08:25 00000174
-rw-rw-r--    1 1000     2000             0 Dec 14 08:28 00000175
-rw-rw-r--    1 1000     2000             0 Dec 14 08:31 00000176
-rw-rw-r--    1 1000     2000             0 Dec 14 08:36 00000177
-rw-rw-r--    1 1000     2000             0 Dec 14 08:39 00000178
-rw-rw-r--    1 1000     2000             0 Dec 14 08:47 00000179
-rw-rw-r--    1 1000     2000             0 Dec 14 08:54 00000180
-rw-rw-r--    1 1000     2000             0 Dec 14 09:02 00000181
-rw-rw-r--    1 1000     2000             0 Dec 14 09:10 00000182
-rw-r--r--    1 1000     2000             0 Dec 14 09:15 00000183
-rw-r--r--    1 1000     2000             0 Dec 14 09:18 00000184
  3. Starting up Prometheus:
/prometheus $ /bin/prometheus --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --config.file=/etc/prometheus/config_out/prometheus.env.yaml --web.enable-lifecycle --web.external-url=http://rancher-monitoring-prometheus.cattle-monitoring-system:9090 --web.route-prefix=/ --log.level=debug --storage.tsdb.retention.time=9d --storage.tsdb.retention.size=80GiB --storage.tsdb.path=/prometheus --no-storage.tsdb.wal-compression --web.config.file=/etc/prometheus/web_config/web-config.yaml
ts=2023-12-14T12:16:44.255Z caller=main.go:583 level=info msg="Starting Prometheus Server" mode=server version="(version=2.48.1, branch=HEAD, revision=63894216648f0d6be310c9d16fb48293c45c9310)"
ts=2023-12-14T12:16:44.255Z caller=main.go:588 level=info build_context="(go=go1.21.5, platform=linux/amd64, user=root@71f108ff5632, date=20231208-23:33:22, tags=netgo,builtinassets,stringlabels)"
ts=2023-12-14T12:16:44.255Z caller=main.go:589 level=info host_details="(Linux 5.15.0-1049-aws #54~20.04.1-Ubuntu SMP Fri Oct 6 22:04:33 UTC 2023 x86_64 profile-pod (none))"
ts=2023-12-14T12:16:44.255Z caller=main.go:590 level=info fd_limits="(soft=131072, hard=131072)"
ts=2023-12-14T12:16:44.256Z caller=main.go:591 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2023-12-14T12:16:44.259Z caller=web.go:566 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2023-12-14T12:16:44.260Z caller=main.go:1024 level=info msg="Starting TSDB ..."
ts=2023-12-14T12:16:44.261Z caller=tls_config.go:274 level=info component=web msg="Listening on" address=[::]:9090
ts=2023-12-14T12:16:44.261Z caller=tls_config.go:313 level=info component=web msg="TLS is disabled." http2=false address=[::]:9090
ts=2023-12-14T12:16:44.262Z caller=dir_locker.go:77 level=warn component=tsdb msg="A lockfile from a previous execution already existed. It was replaced" file=/prometheus/lock
ts=2023-12-14T12:16:44.349Z caller=head.go:601 level=info component=tsdb msg="Replaying on-disk memory mappable chunks if any"
ts=2023-12-14T12:16:50.187Z caller=head.go:682 level=info component=tsdb msg="On-disk memory mappable chunks replay completed" duration=5.838263399s
ts=2023-12-14T12:16:50.187Z caller=head.go:690 level=info component=tsdb msg="Replaying WAL, this may take a while"
ts=2023-12-14T12:17:01.196Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=0 maxSegment=162
ts=2023-12-14T12:17:13.639Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=1 maxSegment=162
ts=2023-12-14T12:17:25.896Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=2 maxSegment=162
ts=2023-12-14T12:17:33.895Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=3 maxSegment=162
ts=2023-12-14T12:17:46.520Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=4 maxSegment=162
ts=2023-12-14T12:17:49.037Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=5 maxSegment=162
ts=2023-12-14T12:18:04.179Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=6 maxSegment=162
ts=2023-12-14T12:18:16.921Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=7 maxSegment=162
ts=2023-12-14T12:18:25.047Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=8 maxSegment=162
ts=2023-12-14T12:18:34.056Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=9 maxSegment=162
ts=2023-12-14T12:18:52.463Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=10 maxSegment=162
ts=2023-12-14T12:18:59.604Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=11 maxSegment=162
ts=2023-12-14T12:19:02.300Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=12 maxSegment=162
ts=2023-12-14T12:19:03.715Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=13 maxSegment=162
ts=2023-12-14T12:19:04.557Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=14 maxSegment=162
Killed

Memory limit reached with 32GB usage

  4. Shortly before the kill we are able to collect a debug log with promtool debug all http://127.0.0.1:9090, attached here.
  5. After removing some files in the wal directory, Prometheus starts up, builds the TSDB, and creates chunks.

I think the problematic part of the process is that the WAL files are accumulated in memory as they are replayed, rather than being processed one file at a time to save memory. There seems to be no option to do this manually.

debug.tar.gz

@eumel8 commented Dec 15, 2023

After another day of debugging, we are drifting towards the theory that a specific type of metric could be causing the break between WAL file creation and writing blocks to the TSDB. According to the documentation this should happen at the latest after 2 hours; in our scenario it did not happen for 4 days.
The last service we connected was Kepler. We are not sure whether Kepler exposes some dangerous metrics or abnormal gauge values; in any case Kepler exposes 75 metrics per Pod, and with 1500 Pods running in the cluster that is 112,500 series, plus node metrics. So this could be overloading Prometheus, but there is nothing to see in the log files. In other discussions I saw hints about too many open files, but that did not happen here (at least not on Ubuntu 20.04 machines with a 131072 file limit).

We saved the WAL files before deleting them and brought Prometheus back into service. Investigating the saved WAL files could be an option; we tried ldb dump_wal from the rocksdb-tools package without success.
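
A side note on that attempt: the Prometheus WAL is Prometheus's own segment format, not RocksDB, so rocksdb's ldb will not read it. A hedged sketch of what might work instead, assuming a reasonably recent promtool and that the saved files are placed back under a TSDB-style data directory:

promtool tsdb list /path/to/saved-data-dir      # list blocks in the directory
promtool tsdb analyze /path/to/saved-data-dir   # cardinality breakdown of the most recent block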

Next plans:

  • add/remove Kepler to investigate block creating in TSDB
  • separate this kind of metrics to second Prometheus instance
  • (mid-term) remove PVC from Prometheus, set retention to 2h, extend service with Thanos storage backend

cc: @puffitos @CerRegulus

@bwplotka (Member) commented Apr 16, 2024

Hello from the bug scrub session!

I would love to clarify things -- it might have been the case that in the past WAL replay caused dramatic overhead that led to unexpected OOMs, e.g. your Prometheus runs at 70% of its memory limit, and if the replay overhead was 300% more, you would be affected. There is also the case of more work being done at the same time, so more CPU is used during replay. If you are low on CPU in your environment (e.g. a Kubernetes CPU limit), then everything is slower, including the GC, so memory is not released fast enough.

However, I was recently benchmarking our Prometheus use at Google on recent versions, and I cannot reproduce a major memory overhead when replaying (looking at heap and working set; there is a 2x CPU bump, though). If there is any, it's 1-5%.

I wonder if what many users are reporting here is a symptom of a different problem, and the replay OOM only surfaced that problem. Two very common scenarios right now are:

A. Your Prometheus setup scraped too many series/samples in the first place and OOMed. Obviously replay will reload all those samples back, causing a classic "OOM crashloop" situation. No matter if replay will use 1% of normal mem use, or 200% the outcome would be the same.
B. Your Prometheus setup scraped too many series/samples in the first place and is running close to the limit (e.g. 95% of memory limit). This causes exactly the same "OOM crashloop" situation when e.g. you restart Prometheus to upgrade the version etc, because there might be 1-5% overhead due to the GC.

Generally I see the following next steps:

  1. Verifying if the memory overhead (how many % more of memory is allocated during replay vs after replay Prometheus) is not larger than let's say 10-15%. This is because running Prometheus even close to 90% of memory limit is simply dangerous, because of GC dynamicity, no room for unexpected cardinality bumps and little room for queries/remote write. I think we are within this limit, but if not, let's optimize it!
  2. Optimize Prometheus storage/scraping/remote write to use less memory. We do it on every release (huge thanks to @bboreham here), so make sure to upgrade often. Still the amount of data you collect matters, even with perfectly optimized system. Plan ahead wisely. There might be some room for more improvements for agent modes, but for TSDB it's tricky.
  3. Handling (ideally auto-recovering from) the OOM crashloop. It is pretty sad that once you hit the OOM case (e.g. you already scraped a broken target), the only way to recover is to delete the WAL, and the binary does not do that for you. It's a separate issue, but I plan to propose some mitigation plans, because the majority if not ALL users HAVE to delete the WAL anyway - or they back the WAL up and never use it. So why not introduce a mode that deletes the WAL on OOM, if people opt in (we do that at Google btw - but for agent mode). Started OOM crashloop auto-recovery #13939 for that.
  4. Better, forceful scrape limits to avoid this situation. We were always on the safe side, we didn't want to cause partial data issues, plus the eventual GC issue, so that's why we don't have a feature that would stop certain targets from being scraped if we have X memory use. OpenTelemetry has some features around that, which literally stop incoming pushes with metrics if the collector memory use is higher than X, which has other consequences (e.g. you have run dozens of those collectors and/or buffer that memory on clients). The point is we could innovate and introduce those kind of features for scrapes! Also deserves another issue.

Summary

For this particular issue, ideally please only add a comment if case 1 occurs. To tell, you need:

  • "normal" heap use, just before the kill (working set or heap memory)
  • ideally the cause of restart (e.g. that it was not OOM)
  • how much memory replay took, ideally with a profile

All, so we can ensure the least overhead possible 💪🏽
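
A hedged sketch of how those numbers could be collected. The metric names and endpoints are standard Prometheus ones, but the localhost:9090 wiring and the assumption that the debug endpoints respond during replay depend on your setup:

# steady-state numbers shortly before a restart
curl -s 'http://localhost:9090/api/v1/query?query=process_resident_memory_bytes'
curl -s 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_head_series'

# heap profiles every 15s while the WAL is replaying; keep the largest one
while true; do
  curl -s http://localhost:9090/debug/pprof/heap > "heap-$(date +%s).pprof.gz"
  sleep 15
done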

@p53 commented Apr 17, 2024

Hi,

We are using Prometheus 2.48 and trying to run it with the Vertical Pod Autoscaler, but we have a problem: each time the VPA changes resources and recreates the pod, there is a big CPU/memory spike at Prometheus startup. That spike feeds into the metrics the VPA uses, creating a much higher average than is actually needed, and the pod is recreated every few minutes because it cannot stabilise (the spike is around 40G of memory and 3-4 cores, and after each restart peak utilisation drops back to about 8GB and less than 1 core), so we cannot scale down. I also cannot use a wider window for the VPA's metric evaluation, because then it wouldn't react promptly enough to sharp changes in the number of our pods. We don't want to remove the WAL, as we don't want to lose data. It also looks like the size of the peak depends on the number of targets that existed previously (but haven't existed for several hours), so it seems like something is not being cleaned up?

[screenshot: memory and CPU usage showing the startup spike after each restart]
