Memory usage spikes during WAL replay to more than normal usage #6934

Open · erkexzcx opened this issue Mar 5, 2020 · 140 comments

@erkexzcx commented Mar 5, 2020

What did you do?
Tried to start Prometheus.

What did you expect to see?
Prometheus up & running, web interface showing up.

What did you see instead? Under which circumstances?
Prometheus runs out of RAM during "WAL segment loaded" process.

Environment
Debian 9

  • System information:
    Linux 4.9.0-11-amd64 x86_64

  • Prometheus version:

prometheus, version 2.16.0 (branch: HEAD, revision: b90be6f32a33c03163d700e1452b54454ddce0ec)
  build user:       root@7ea0ae865f12
  build date:       20200213-23:50:02
  go version:       go1.13.8
  • Prometheus configuration file:
global:
  evaluation_interval: 60s
  scrape_interval: 60s
...
...
...
  • Logs:

This is what happens during startup, after roughly 10 minutes:

... prometheus[39101]: level=info ts=2020-03-05T14:02:26.811Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=41869 maxSegment=41871
... prometheus[39101]: level=info ts=2020-03-05T14:02:26.812Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=41870 maxSegment=41871
... prometheus[39101]: level=info ts=2020-03-05T14:02:26.812Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=41871 maxSegment=41871
... prometheus[39101]: fatal error: runtime: out of memory
... prometheus[39101]: runtime stack:
... prometheus[39101]: runtime.throw(0x253885d, 0x16)
... prometheus[39101]:         /usr/local/go/src/runtime/panic.go:774 +0x72
... prometheus[39101]: runtime.sysMap(0xce78000000, 0x14000000, 0x3f5bc78)
... prometheus[39101]:         /usr/local/go/src/runtime/mem_linux.go:169 +0xc5
... prometheus[39101]: runtime.(*mheap).sysAlloc(0x3f432c0, 0x11de6000, 0xc000, 0x4373e7)
... prometheus[39101]:         /usr/local/go/src/runtime/malloc.go:701 +0x1cd
... prometheus[39101]: runtime.(*mheap).grow(0x3f432c0, 0x8ef3, 0xffffffff)
... prometheus[39101]:         /usr/local/go/src/runtime/mheap.go:1255 +0xa3
... prometheus[39101]: runtime.(*mheap).allocSpanLocked(0x3f432c0, 0x8ef3, 0x3f5bc88, 0x20339d00000000)
... prometheus[39101]:         /usr/local/go/src/runtime/mheap.go:1170 +0x266
... prometheus[39101]: runtime.(*mheap).alloc_m(0x3f432c0, 0x8ef3, 0x101, 0x7f5861cc3fff)
... prometheus[39101]:         /usr/local/go/src/runtime/mheap.go:1022 +0xc2
... prometheus[39101]: runtime.(*mheap).alloc.func1()
... prometheus[39101]:         /usr/local/go/src/runtime/mheap.go:1093 +0x4c
... prometheus[39101]: runtime.(*mheap).alloc(0x3f432c0, 0x8ef3, 0x7f5861010101, 0x7f5861d11008)
... prometheus[39101]:         /usr/local/go/src/runtime/mheap.go:1092 +0x8a
... prometheus[39101]: runtime.largeAlloc(0x11de5ec0, 0x450101, 0x7f5861d11008)
... prometheus[39101]:         /usr/local/go/src/runtime/malloc.go:1138 +0x97
... prometheus[39101]: runtime.mallocgc.func1()
... prometheus[39101]:         /usr/local/go/src/runtime/malloc.go:1033 +0x46
... prometheus[39101]: runtime.systemstack(0x0)
... prometheus[39101]:         /usr/local/go/src/runtime/asm_amd64.s:370 +0x66
... prometheus[39101]: runtime.mstart()
... prometheus[39101]:         /usr/local/go/src/runtime/proc.go:1146
... prometheus[39101]: goroutine 225 [running]:
... prometheus[39101]: runtime.systemstack_switch()
... prometheus[39101]:         /usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc0022234c0 sp=0xc0022234b8 pc=0x45d180
... prometheus[39101]: runtime.mallocgc(0x11de5ec0, 0x1fd78e0, 0x5949d401, 0xc0003631d0)
... prometheus[39101]:         /usr/local/go/src/runtime/malloc.go:1032 +0x895 fp=0xc002223560 sp=0xc0022234c0 pc=0x40c755
... prometheus[39101]: runtime.makeslice(0x1fd78e0, 0x0, 0x23bcbd8, 0xd)
... prometheus[39101]:         /usr/local/go/src/runtime/slice.go:49 +0x6c fp=0xc002223590 sp=0xc002223560 pc=0x445bac
... prometheus[39101]: github.com/prometheus/prometheus/tsdb/index.(*MemPostings).Delete(0xc000f29a70, 0xce71eef470)

This is what the systemd service looks like:

[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
User=prometheus
Restart=always
RestartSec=5s
LimitNOFILE=infinity
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/data \
    --web.listen-address="127.0.0.1:1234" \
    --web.external-url="https://example.com" \
    --web.enable-admin-api \
    --storage.tsdb.retention.time=30d
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

Here is the RAM usage of the server (past 1 hour) - note that RAM fills up, Prometheus runs out of memory, the service gets killed and is restarted:
[screenshot: RAM usage graph over the past hour]

Please advise how I can troubleshoot this issue further.

@cstyan (Member) commented Mar 6, 2020

Your Prometheus instance either needs more RAM, or you can delete the WAL directory (you will lose some data).

I don't think there have been any recent changes that would affect replay memory usage. cc @codesome
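
A minimal sketch of that workaround for a systemd-managed host like the one in the report above; the data path is taken from the ExecStart line and is an assumption for other setups. Anything not yet compacted into blocks (roughly the last 0-2 hours) is lost:

# stop Prometheus, drop the write-ahead log, start again
systemctl stop prometheus
rm -rf /var/lib/prometheus/data/wal
systemctl start prometheus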

@roidelapluie (Member)

How many targets / timeseries do you have? For only 3 segments that's A LOT of memory.

@erkexzcx (Author) commented Mar 9, 2020

Hi, thanks. That's not just 3 segments - I cut everything else out of the log. I deleted the WAL files and Prometheus started just fine. Since the WAL files are gone, this issue can no longer be reproduced, so I'm closing it.

@erkexzcx erkexzcx closed this as completed Mar 9, 2020
@jugani commented Nov 11, 2020

We are also facing the same issue.
level=info ts=2020-11-11T11:34:43.556Z caller=repair.go:59 component=tsdb msg="Found healthy block" mint=1605081600000 maxt=1605088800000 ulid=01EPVFCPX3M7FESTRSB5Q2TF51
level=info ts=2020-11-11T11:34:43.985Z caller=head.go:575 component=tsdb msg="Replaying WAL, this may take awhile"
level=info ts=2020-11-11T11:34:58.706Z caller=head.go:600 component=tsdb msg="WAL checkpoint loaded"
level=info ts=2020-11-11T11:35:01.019Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2294 maxSegment=2319
level=info ts=2020-11-11T11:35:08.143Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2295 maxSegment=2319
level=info ts=2020-11-11T11:35:11.164Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2296 maxSegment=2319
level=info ts=2020-11-11T11:35:14.823Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2297 maxSegment=2319
level=info ts=2020-11-11T11:35:18.494Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2298 maxSegment=2319
level=info ts=2020-11-11T11:35:22.327Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2299 maxSegment=2319
level=info ts=2020-11-11T11:35:28.307Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2300 maxSegment=2319
level=info ts=2020-11-11T11:35:31.207Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2301 maxSegment=2319
level=info ts=2020-11-11T11:35:34.117Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2302 maxSegment=2319
level=info ts=2020-11-11T11:35:37.019Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2303 maxSegment=2319
level=info ts=2020-11-11T11:35:39.883Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2304 maxSegment=2319
level=info ts=2020-11-11T11:35:43.309Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2305 maxSegment=2319
level=info ts=2020-11-11T11:35:48.017Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=2306 maxSegment=2319

We are running 2 Prometheus pods with a memory limit of 10 GiB and a memory request of 6 GiB.

@jsravn (Contributor) commented Nov 12, 2020

I'm also observing this issue. @cstyan is there anything we can do to debug further? In my case, WAL replay seems to require 2-3x as much memory as the running Prometheus instance. So while the instance happily runs at around 30Gi of memory, replay uses over 50Gi, which leads to it being OOMKilled at startup. I can't simply increase the RAM limit here, since it requires an excessive amount of memory just to start. Is that expected? It seems like it is keeping too much in memory while doing the replay.

@codesome (Member)

I have also seen the memory spike during WAL replay to almost double the consumption before shutdown (no remote write). Ideally WAL replay should not take so much memory, so I believe there is some room for improvement here. I am re-opening this issue and updating the title (maybe the snapshotting work will help here).

@codesome codesome reopened this Nov 18, 2020
@codesome codesome changed the title Prometheus fails to start - runs out of memory during "WAL segment loaded" Memory usage spikes during WAL replay to more than normal usage Nov 18, 2020
@brian-brazil (Contributor)

I don't think that snapshotting would make a difference, because we're still putting the same data in the head one way or the other. I think we should figure out why this is happening first.

@yechiel-optimalq

Is there any way to disable the WAL completely? This is now reproducing on a daily basis, and the memory spike in Prometheus is causing us real issues.

@jsravn (Contributor) commented Dec 8, 2020

@yechiel-optimalq as a workaround you can remove the WAL folder in an init container. You will lose the data in the WAL, of course.
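
A minimal sketch of what such an init container (or a pre-start hook) could run against the Prometheus data volume; the /prometheus path is an assumption based on the common container layout, and a later comment in this thread notes the sibling chunks_head directory may also need clearing:

# wipe the WAL (and optionally chunks_head) before the main container starts
rm -rf /prometheus/wal /prometheus/chunks_head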

@yechiel-optimalq

Yes, I'm doing that, but it has become a daily thing.

@roidelapluie (Member) commented Dec 8, 2020

Is there any chance you could set up conprof to help take relevant memory dumps at startup?

@yechiel-optimalq

We are running with the Helm chart; if you have some instructions for conprof with the Helm chart, I can try.

@yechiel-optimalq

This of course continues to happen for us - the OOM killer fires during WAL replay. If I delete the WAL, all is good.

@pdruzhinin

Hello,

Has anyone been able to resolve it? I keep having to delete the 'wal' folder every few weeks.

@roidelapluie (Member)

Can we get the full logs for the first hour?

@codesome (Member) commented Jan 7, 2021

We would need some profiling of WAL replay to find the pieces that are holding onto memory. I think this is a common occurrence. While it is in my interest to profile and investigate, I don't think I will have time for that anytime soon.

@meredrica

We've been having this issue for a long time now too. Prometheus runs fine until it gets evicted, and then we have to go and delete the WAL.
We increased the RAM to insane amounts just for startup, and still run into this problem.

@codesome (Member)

I did a quick profiling of WAL replay yesterday and nothing stood out as an obvious thing to fix/optimise. I am expecting #7229 to help with this.

@brian-brazil (Contributor)

I don't think it'll make a difference, we still have all the exact same data structures being built up one way or the other.

@codesome (Member)

I don't think it'll make a difference

Building a chunk from samples takes more allocations and temporary memory than simply taking encoded chunk bytes and sticking them into memory. We will know more after some benchmarks.

@brian-brazil (Contributor)

Which is the same as the RAM it took when it was originally being ingested, and thus shouldn't cause any extra RAM requirements during replay. This feels like a red herring to me.

@gimpiron
+1

1 similar comment

@lwsanty commented Feb 17, 2021
+1

@rasheedamir

We have 4 instances of the exact same version of Prometheus, and only one of them shows this behaviour. Any clue how to debug it further? We tried deleting the WAL directory, but the problem comes back.

@bboreham (Member)

Can you try the release candidate? Might be better than hoping!

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jun 23, 2023
prometheus/prometheus#6934 is
the core problem, and we can afford to lose a tiny bit of
data during restarts by killing the wal at startup instead of
giving prometheus a lot of memory to read it all in. We
lose data during the WAL reading phase anyway, and the amount
of memory required seems ginormous (the linked issue has folks
topping out at 64G). Apparently 2.45 is supposed to help,
but it's not out yet (see
prometheus/prometheus#6934 (comment)).

We should only do the wal deletions on prometheii that have this
specific problem.

I've removed the higher memory request for this, and brought it
in line with our other prometheii - and it seems to work fine
without any issues. I've thus also reduced the size of the
core node, practically reverting 2i2c-org#2701
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jun 26, 2023 (same commit message as above)
@consideRatio

Issue discussion summary

I read through this issue, and here are some takeaways.

  1. The wal directory can be cleared during startup, as can the sibling chunks_head directory.
  2. To clear directories on startup in a k8s context, an init container or an emptyDir volume won't do the trick when only the container restarts - they only take effect when the entire pod restarts.
  3. Some settings can be tweaked, for example setting storage.tsdb.min-block-duration and storage.tsdb.max-block-duration to 30m (illustrated in the sketch after this list).
  4. Indications of possible issues before they arise can be seen from a running prometheus via specific log lines:
    caller=db.go:1075 level=warn component=tsdb msg="Head compaction took longer than the block time range, compactions are falling behind and won't be able to catch up" duration=3h2m17.340717958s block_range=7200000
    
  5. Prometheus 2.45 is released now and includes a promising improvement by @bboreham.
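
A hedged illustration of item 3 above. These are advanced flags, the 30m value is simply the one mentioned in the summary (not a recommendation), and shorter blocks trade replay memory for more frequent compaction:

# example flags only; keep the rest of your existing command line
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.min-block-duration=30m \
  --storage.tsdb.max-block-duration=30m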

@bboreham (Member) commented Jul 6, 2023

Has anyone got any results to report from version 2.45 ?

@shixuyue commented Jul 6, 2023

Has anyone got any results to report from version 2.45 ?

I'm trying 2.45.
So far so good, but I don't know if there is a way to trigger a massive WAL replay to test it.

@Shahard2 commented Jul 7, 2023

Has anyone got any results to report from version 2.45 ?

We are on 2.45

Still having restarts and crashes..

@wedge76 commented Jul 11, 2023

Same here.

@awx-fuyuanchu

same here.

@neno13453

Same here, we still encounter the issue.

@Shahard2

Is anyone going to push this forward?

We suffer badly from this issue!!

@levya4 commented Jul 30, 2023

Same issue here !!

@alon-dotan-starkware
Same issue here

1 similar comment

@narendiran11
Same issue here

@aleg-malinovsky

So... did 2.45 fix the WAL loading memory spike?

@bboreham (Member) commented Sep 5, 2023

A number of people reported no improvement.

It would be helpful to supply details, e.g. a memory profile when the process gets large.
Best to open a new issue if you do have details, so we can concentrate on the specific symptoms there.

@aleg-malinovsky

A number of people reported no improvement.

It would be helpful to supply details, e.g. a memory profile when the process gets large. Best to open a new issue if you do have details, so we can concentrate on the specific symptoms there.

Sure, I will keep an eye out for when the loop of death happens again.

Could you please specify exactly what details you are looking for and how to get them?
Specific commands to run would be even better.

For context, we're running Prometheus inside k8s with the operator.

Thanks!

@bboreham (Member) commented Sep 6, 2023

Prometheus (like most Go programs) has an HTTP endpoint that lets you fetch profiles.

The command will vary according to your particular setup, but basically you want to get access to Prometheus' http port (perhaps via kubectl port-forward) and then:

curl http://localhost:9090/debug/pprof/heap > heap.pprof.gz

If the output is less than 1KB it is probably an error; look at the output and try to resolve whatever it says.

The standard issue template asks for other details such as config and logs.
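
For a Kubernetes setup, a hedged sketch of those steps; the pod and namespace names are placeholders, and go tool pprof is only needed if you want to inspect the profile locally rather than attach it to an issue:

kubectl -n monitoring port-forward pod/prometheus-0 9090:9090 &
curl -s http://localhost:9090/debug/pprof/heap > heap.pprof.gz
go tool pprof -top heap.pprof.gz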

@Stefan-Mihail

I would limit the storage.tsdb.retention.size to a lower value as a temporary fix

@john-jam commented Oct 5, 2023

Same issue with 2.47.

@bboreham (Member) commented Oct 5, 2023

Seems I need to say it louder: adding "me too" with no details (a) emails everyone who commented so far and (b) gets us nowhere. Please don't do this. Clicking 👍 on the top description can indicate your desire for improvement.

#6934 (comment) for further instructions.

@aakarim commented Nov 7, 2023

I hope this isn't considered a "me too" comment - we've been having this issue too, but as a fix we've dropped our storage.tsdb.retention.time to 2d and we're now running Mimir on top for long-term storage and querying. So far it seems to have solved the problem.

We're using a bare-bones Mimir deployment that uses about as many resources as one of our Prometheus instances. It's not ideal for our use case - we have fairly low metrics throughput requirements (around 500-600k series), so I don't see why we need a totally separate system just for long-term storage - but WAL replays were really killing us on costs!

@eumel8 commented Dec 14, 2023

Prometheus v2.48.1
Kubernetes 1.25
Longhorn RWO PVC 100GB

We experienced this error in one of our clusters and stumbled on this issue. What we have discovered so far:

  1. No block (chunk) directories are created under the /prometheus mount; we only have wal and chunks_head.
  2. In wal there are various files accumulated since the Prometheus deployment:
$ ls -l wal/
total 1581024
-rw-rw-r--    1 1000     2000     134217728 Dec 13 13:33 00000000
-rw-rw-r--    1 1000     2000     134152192 Dec 13 13:38 00000001
-rw-rw-r--    1 1000     2000     134119424 Dec 13 13:43 00000002
-rw-rw-r--    1 1000     2000     134217728 Dec 13 13:48 00000003
-rw-rw-r--    1 1000     2000     134184960 Dec 13 13:54 00000004
-rw-rw-r--    1 1000     2000     134152192 Dec 13 14:00 00000005
-rw-rw-r--    1 1000     2000      76480512 Dec 13 14:03 00000006
-rw-rw-r--    1 1000     2000     134184960 Dec 13 14:13 00000007
-rw-rw-r--    1 1000     2000     134217728 Dec 13 14:19 00000008
-rw-rw-r--    1 1000     2000     133890048 Dec 13 14:25 00000009
-rw-rw-r--    1 1000     2000     134217728 Dec 13 14:30 00000010
-rw-rw-r--    1 1000     2000      73119226 Dec 13 14:33 00000011
-rw-rw-r--    1 1000     2000       9865525 Dec 13 14:36 00000012
-rw-rw-r--    1 1000     2000       9240843 Dec 13 14:40 00000013
-rw-rw-r--    1 1000     2000       8248627 Dec 13 14:44 00000014
-rw-rw-r--    1 1000     2000       9105569 Dec 13 14:48 00000015
-rw-rw-r--    1 1000     2000       7974709 Dec 13 14:52 00000016
-rw-rw-r--    1 1000     2000       6973024 Dec 13 14:58 00000017
-rw-rw-r--    1 1000     2000       8545568 Dec 13 15:06 00000018
-rw-rw-r--    1 1000     2000       7791205 Dec 13 15:14 00000019
-rw-rw-r--    1 1000     2000       6654355 Dec 13 15:23 00000020
-rw-rw-r--    1 1000     2000       6631211 Dec 13 15:31 00000021
-rw-rw-r--    1 1000     2000       4125743 Dec 13 15:39 00000022
-rw-rw-r--    1 1000     2000       3452263 Dec 13 15:47 00000023
-rw-rw-r--    1 1000     2000       4566063 Dec 13 15:55 00000024
-rw-rw-r--    1 1000     2000       3913707 Dec 13 16:03 00000025
-rw-rw-r--    1 1000     2000       3219429 Dec 13 16:11 00000026
-rw-rw-r--    1 1000     2000       6302202 Dec 13 16:19 00000027
-rw-rw-r--    1 1000     2000       3860841 Dec 13 16:28 00000028
-rw-rw-r--    1 1000     2000       3039857 Dec 13 16:36 00000029
-rw-rw-r--    1 1000     2000        314631 Dec 13 16:44 00000030
-rw-rw-r--    1 1000     2000             0 Dec 13 16:49 00000031
-rw-rw-r--    1 1000     2000        382659 Dec 13 17:00 00000032
-rw-rw-r--    1 1000     2000        553411 Dec 13 17:08 00000033
-rw-rw-r--    1 1000     2000             0 Dec 13 17:13 00000034
-rw-rw-r--    1 1000     2000             0 Dec 13 17:22 00000035
-rw-rw-r--    1 1000     2000             0 Dec 13 17:30 00000036
-rw-rw-r--    1 1000     2000        821334 Dec 13 17:40 00000037
-rw-rw-r--    1 1000     2000             0 Dec 13 17:46 00000038
-rw-rw-r--    1 1000     2000        381565 Dec 13 17:57 00000039
-rw-rw-r--    1 1000     2000           272 Dec 13 18:05 00000040
-rw-rw-r--    1 1000     2000             0 Dec 13 18:10 00000041
-rw-rw-r--    1 1000     2000       1108154 Dec 13 18:16 00000042
-rw-rw-r--    1 1000     2000             0 Dec 13 18:21 00000043
-rw-rw-r--    1 1000     2000        791809 Dec 13 18:33 00000044
-rw-rw-r--    1 1000     2000       1173119 Dec 13 18:40 00000045
-rw-rw-r--    1 1000     2000             0 Dec 13 18:46 00000046
-rw-rw-r--    1 1000     2000        951617 Dec 13 18:56 00000047
-rw-rw-r--    1 1000     2000             0 Dec 13 19:02 00000048
-rw-rw-r--    1 1000     2000        504870 Dec 13 19:12 00000049
-rw-rw-r--    1 1000     2000             0 Dec 13 19:17 00000050
-rw-rw-r--    1 1000     2000             0 Dec 13 19:25 00000051
-rw-rw-r--    1 1000     2000             0 Dec 13 19:33 00000052
-rw-rw-r--    1 1000     2000        985188 Dec 13 19:43 00000053
-rw-rw-r--    1 1000     2000             0 Dec 13 19:48 00000054
-rw-rw-r--    1 1000     2000        917504 Dec 13 20:02 00000055
-rw-rw-r--    1 1000     2000             0 Dec 13 20:02 00000056
-rw-rw-r--    1 1000     2000             0 Dec 13 20:07 00000057
-rw-rw-r--    1 1000     2000        239463 Dec 13 20:18 00000058
-rw-rw-r--    1 1000     2000             0 Dec 13 20:23 00000059
-rw-rw-r--    1 1000     2000             0 Dec 13 20:31 00000060
-rw-rw-r--    1 1000     2000             0 Dec 13 20:34 00000061
-rw-rw-r--    1 1000     2000             0 Dec 13 20:42 00000062
-rw-rw-r--    1 1000     2000             0 Dec 13 20:50 00000063
-rw-rw-r--    1 1000     2000        925886 Dec 13 21:00 00000064
-rw-rw-r--    1 1000     2000          9973 Dec 13 21:08 00000065
-rw-rw-r--    1 1000     2000        484348 Dec 13 21:16 00000066
-rw-rw-r--    1 1000     2000             0 Dec 13 21:22 00000067
-rw-rw-r--    1 1000     2000             0 Dec 13 21:24 00000068
-rw-rw-r--    1 1000     2000             0 Dec 13 21:32 00000069
-rw-rw-r--    1 1000     2000         94128 Dec 13 21:43 00000070
-rw-rw-r--    1 1000     2000             0 Dec 13 21:48 00000071
-rw-rw-r--    1 1000     2000        383118 Dec 13 21:58 00000072
-rw-rw-r--    1 1000     2000             0 Dec 13 22:03 00000073
-rw-rw-r--    1 1000     2000         14733 Dec 13 22:09 00000074
-rw-rw-r--    1 1000     2000             0 Dec 13 22:14 00000075
-rw-rw-r--    1 1000     2000        121003 Dec 13 22:24 00000076
-rw-rw-r--    1 1000     2000             0 Dec 13 22:29 00000077
-rw-rw-r--    1 1000     2000             0 Dec 13 22:37 00000078
-rw-rw-r--    1 1000     2000             0 Dec 13 22:45 00000079
-rw-rw-r--    1 1000     2000             0 Dec 13 22:53 00000080
-rw-rw-r--    1 1000     2000             0 Dec 13 23:01 00000081
-rw-rw-r--    1 1000     2000        564763 Dec 13 23:11 00000082
-rw-rw-r--    1 1000     2000             0 Dec 13 23:16 00000083
-rw-rw-r--    1 1000     2000             0 Dec 13 23:24 00000084
-rw-rw-r--    1 1000     2000             0 Dec 13 23:26 00000085
-rw-rw-r--    1 1000     2000             0 Dec 13 23:34 00000086
-rw-rw-r--    1 1000     2000             0 Dec 13 23:42 00000087
-rw-rw-r--    1 1000     2000             0 Dec 13 23:49 00000088
-rw-rw-r--    1 1000     2000             0 Dec 13 23:56 00000089
-rw-rw-r--    1 1000     2000             0 Dec 14 00:04 00000090
-rw-rw-r--    1 1000     2000             0 Dec 14 00:11 00000091
-rw-rw-r--    1 1000     2000             0 Dec 14 00:14 00000092
-rw-rw-r--    1 1000     2000             0 Dec 14 00:21 00000093
-rw-rw-r--    1 1000     2000             0 Dec 14 00:28 00000094
-rw-rw-r--    1 1000     2000             0 Dec 14 00:36 00000095
-rw-rw-r--    1 1000     2000             0 Dec 14 00:43 00000096
-rw-rw-r--    1 1000     2000             0 Dec 14 00:50 00000097
-rw-rw-r--    1 1000     2000             0 Dec 14 00:58 00000098
-rw-rw-r--    1 1000     2000             0 Dec 14 01:00 00000099
-rw-rw-r--    1 1000     2000             0 Dec 14 01:08 00000100
-rw-rw-r--    1 1000     2000             0 Dec 14 01:15 00000101
-rw-rw-r--    1 1000     2000             0 Dec 14 01:22 00000102
-rw-rw-r--    1 1000     2000             0 Dec 14 01:30 00000103
-rw-rw-r--    1 1000     2000             0 Dec 14 01:37 00000104
-rw-rw-r--    1 1000     2000             0 Dec 14 01:44 00000105
-rw-rw-r--    1 1000     2000             0 Dec 14 01:47 00000106
-rw-rw-r--    1 1000     2000             0 Dec 14 01:54 00000107
-rw-rw-r--    1 1000     2000             0 Dec 14 02:02 00000108
-rw-rw-r--    1 1000     2000             0 Dec 14 02:09 00000109
-rw-rw-r--    1 1000     2000             0 Dec 14 02:16 00000110
-rw-rw-r--    1 1000     2000             0 Dec 14 02:24 00000111
-rw-rw-r--    1 1000     2000             0 Dec 14 02:31 00000112
-rw-rw-r--    1 1000     2000             0 Dec 14 02:33 00000113
-rw-rw-r--    1 1000     2000             0 Dec 14 02:41 00000114
-rw-rw-r--    1 1000     2000             0 Dec 14 02:48 00000115
-rw-rw-r--    1 1000     2000             0 Dec 14 02:55 00000116
-rw-rw-r--    1 1000     2000             0 Dec 14 03:03 00000117
-rw-rw-r--    1 1000     2000             0 Dec 14 03:10 00000118
-rw-rw-r--    1 1000     2000             0 Dec 14 03:18 00000119
-rw-rw-r--    1 1000     2000             0 Dec 14 03:20 00000120
-rw-rw-r--    1 1000     2000             0 Dec 14 03:28 00000121
-rw-rw-r--    1 1000     2000             0 Dec 14 03:35 00000122
-rw-rw-r--    1 1000     2000             0 Dec 14 03:43 00000123
-rw-rw-r--    1 1000     2000             0 Dec 14 03:51 00000124
-rw-rw-r--    1 1000     2000             0 Dec 14 03:58 00000125
-rw-rw-r--    1 1000     2000             0 Dec 14 04:05 00000126
-rw-rw-r--    1 1000     2000             0 Dec 14 04:08 00000127
-rw-rw-r--    1 1000     2000        213558 Dec 14 04:19 00000128
-rw-rw-r--    1 1000     2000             0 Dec 14 04:24 00000129
-rw-rw-r--    1 1000     2000             0 Dec 14 04:32 00000130
-rw-rw-r--    1 1000     2000             0 Dec 14 04:40 00000131
-rw-rw-r--    1 1000     2000             0 Dec 14 04:47 00000132
-rw-rw-r--    1 1000     2000             0 Dec 14 04:49 00000133
-rw-rw-r--    1 1000     2000             0 Dec 14 04:57 00000134
-rw-rw-r--    1 1000     2000             0 Dec 14 05:05 00000135
-rw-rw-r--    1 1000     2000             0 Dec 14 05:13 00000136
-rw-rw-r--    1 1000     2000             0 Dec 14 05:21 00000137
-rw-rw-r--    1 1000     2000             0 Dec 14 05:29 00000138
-rw-rw-r--    1 1000     2000             0 Dec 14 05:32 00000139
-rw-rw-r--    1 1000     2000             0 Dec 14 05:40 00000140
-rw-rw-r--    1 1000     2000          2363 Dec 14 05:51 00000141
-rw-rw-r--    1 1000     2000        423895 Dec 14 05:59 00000142
-rw-rw-r--    1 1000     2000             0 Dec 14 06:04 00000143
-rw-rw-r--    1 1000     2000             0 Dec 14 06:12 00000144
-rw-rw-r--    1 1000     2000             0 Dec 14 06:20 00000145
-rw-rw-r--    1 1000     2000             0 Dec 14 06:28 00000146
-rw-rw-r--    1 1000     2000       1549923 Dec 14 06:39 00000147
-rw-rw-r--    1 1000     2000             0 Dec 14 06:44 00000148
-rw-rw-r--    1 1000     2000             0 Dec 14 06:47 00000149
-rw-rw-r--    1 1000     2000             0 Dec 14 06:55 00000150
-rw-rw-r--    1 1000     2000             0 Dec 14 07:03 00000151
-rw-rw-r--    1 1000     2000             0 Dec 14 07:11 00000152
-rw-rw-r--    1 1000     2000             0 Dec 14 07:19 00000153
-rw-rw-r--    1 1000     2000             0 Dec 14 07:27 00000154
-rw-rw-r--    1 1000     2000             0 Dec 14 07:30 00000155
-rw-rw-r--    1 1000     2000        290538 Dec 14 07:35 00000156
-rw-rw-r--    1 1000     2000             0 Dec 14 07:36 00000157
-rw-rw-r--    1 1000     2000             0 Dec 14 07:38 00000158
-rw-rw-r--    1 1000     2000             0 Dec 14 07:41 00000159
-rw-rw-r--    1 1000     2000             0 Dec 14 07:44 00000160
-rw-rw-r--    1 1000     2000             0 Dec 14 07:46 00000161
-rw-rw-r--    1 1000     2000             0 Dec 14 07:48 00000162
-rw-rw-r--    1 1000     2000             0 Dec 14 07:51 00000163
-rw-rw-r--    1 1000     2000             0 Dec 14 07:54 00000164
-rw-rw-r--    1 1000     2000             0 Dec 14 07:57 00000165
-rw-rw-r--    1 1000     2000          3548 Dec 14 08:03 00000166
-rw-rw-r--    1 1000     2000             0 Dec 14 08:03 00000167
-rw-rw-r--    1 1000     2000             0 Dec 14 08:05 00000168
-rw-rw-r--    1 1000     2000             0 Dec 14 08:08 00000169
-rw-rw-r--    1 1000     2000             0 Dec 14 08:11 00000170
-rw-rw-r--    1 1000     2000             0 Dec 14 08:15 00000171
-rw-rw-r--    1 1000     2000             0 Dec 14 08:19 00000172
-rw-rw-r--    1 1000     2000             0 Dec 14 08:22 00000173
-rw-rw-r--    1 1000     2000             0 Dec 14 08:25 00000174
-rw-rw-r--    1 1000     2000             0 Dec 14 08:28 00000175
-rw-rw-r--    1 1000     2000             0 Dec 14 08:31 00000176
-rw-rw-r--    1 1000     2000             0 Dec 14 08:36 00000177
-rw-rw-r--    1 1000     2000             0 Dec 14 08:39 00000178
-rw-rw-r--    1 1000     2000             0 Dec 14 08:47 00000179
-rw-rw-r--    1 1000     2000             0 Dec 14 08:54 00000180
-rw-rw-r--    1 1000     2000             0 Dec 14 09:02 00000181
-rw-rw-r--    1 1000     2000             0 Dec 14 09:10 00000182
-rw-r--r--    1 1000     2000             0 Dec 14 09:15 00000183
-rw-r--r--    1 1000     2000             0 Dec 14 09:18 00000184
  3. Starting up Prometheus:
/prometheus $ /bin/prometheus --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --config.file=/etc/prometheus/config_out/prometheus.env.yaml --web.enable-lifecycle --web.external-url=http://rancher-monitoring-prometheus.cattle-monitoring-system:9090 --web.route-prefix=/ --log.level=debug --storage.tsdb.retention.time=9d --storage.tsdb.retention.size=80GiB --storage.tsdb.path=/prometheus --no-storage.tsdb.wal-compression --web.config.file=/etc/prometheus/web_config/web-config.yaml
ts=2023-12-14T12:16:44.255Z caller=main.go:583 level=info msg="Starting Prometheus Server" mode=server version="(version=2.48.1, branch=HEAD, revision=63894216648f0d6be310c9d16fb48293c45c9310)"
ts=2023-12-14T12:16:44.255Z caller=main.go:588 level=info build_context="(go=go1.21.5, platform=linux/amd64, user=root@71f108ff5632, date=20231208-23:33:22, tags=netgo,builtinassets,stringlabels)"
ts=2023-12-14T12:16:44.255Z caller=main.go:589 level=info host_details="(Linux 5.15.0-1049-aws #54~20.04.1-Ubuntu SMP Fri Oct 6 22:04:33 UTC 2023 x86_64 profile-pod (none))"
ts=2023-12-14T12:16:44.255Z caller=main.go:590 level=info fd_limits="(soft=131072, hard=131072)"
ts=2023-12-14T12:16:44.256Z caller=main.go:591 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2023-12-14T12:16:44.259Z caller=web.go:566 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2023-12-14T12:16:44.260Z caller=main.go:1024 level=info msg="Starting TSDB ..."
ts=2023-12-14T12:16:44.261Z caller=tls_config.go:274 level=info component=web msg="Listening on" address=[::]:9090
ts=2023-12-14T12:16:44.261Z caller=tls_config.go:313 level=info component=web msg="TLS is disabled." http2=false address=[::]:9090
ts=2023-12-14T12:16:44.262Z caller=dir_locker.go:77 level=warn component=tsdb msg="A lockfile from a previous execution already existed. It was replaced" file=/prometheus/lock
ts=2023-12-14T12:16:44.349Z caller=head.go:601 level=info component=tsdb msg="Replaying on-disk memory mappable chunks if any"
ts=2023-12-14T12:16:50.187Z caller=head.go:682 level=info component=tsdb msg="On-disk memory mappable chunks replay completed" duration=5.838263399s
ts=2023-12-14T12:16:50.187Z caller=head.go:690 level=info component=tsdb msg="Replaying WAL, this may take a while"
ts=2023-12-14T12:17:01.196Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=0 maxSegment=162
ts=2023-12-14T12:17:13.639Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=1 maxSegment=162
ts=2023-12-14T12:17:25.896Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=2 maxSegment=162
ts=2023-12-14T12:17:33.895Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=3 maxSegment=162
ts=2023-12-14T12:17:46.520Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=4 maxSegment=162
ts=2023-12-14T12:17:49.037Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=5 maxSegment=162
ts=2023-12-14T12:18:04.179Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=6 maxSegment=162
ts=2023-12-14T12:18:16.921Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=7 maxSegment=162
ts=2023-12-14T12:18:25.047Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=8 maxSegment=162
ts=2023-12-14T12:18:34.056Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=9 maxSegment=162
ts=2023-12-14T12:18:52.463Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=10 maxSegment=162
ts=2023-12-14T12:18:59.604Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=11 maxSegment=162
ts=2023-12-14T12:19:02.300Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=12 maxSegment=162
ts=2023-12-14T12:19:03.715Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=13 maxSegment=162
ts=2023-12-14T12:19:04.557Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=14 maxSegment=162
Killed

Memory limit reached with 32GB usage

  4. Shortly before the kill we are able to collect a debug log with promtool debug all http://127.0.0.1:9090, attached here.
  5. After removing some files in the wal directory, Prometheus starts up, builds the TSDB, and creates chunks.

I think the problematic part of the process is that the WAL files are accumulated in memory as they are replayed, rather than being processed one file at a time to save memory. There seems to be no option to do this manually.

debug.tar.gz

@eumel8 commented Dec 15, 2023

After another day of debugging, we are drifting towards the theory that a specific type of metric could be causing the break between WAL file creation and writing blocks to the TSDB. According to the documentation this should happen at the latest after 2 hours; in our scenario it did not happen for 4 days.
The last service we connected was Kepler. We are not sure whether Kepler exposes some dangerous metrics or abnormal gauge values; in any case Kepler exposes 75 metrics per Pod, and with 1500 Pods running in the cluster that is 112,500 series, plus node metrics. So this could be overloading Prometheus, but there is nothing to see in the log files. In other discussions I saw hints about too many open files, but that did not happen here (at least not on Ubuntu 20.04 machines with a 131072 file limit).

We saved the WAL files before deleting them and brought Prometheus back into service. Investigating the saved WAL files could be an option; we tried ldb dump_wal from the rocksdb-tools package without success.
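
A side note on that attempt: the Prometheus WAL is Prometheus's own segment format, not RocksDB, so rocksdb's ldb will not read it. A hedged sketch of what might work instead, assuming a reasonably recent promtool and that the saved files are placed back under a TSDB-style data directory:

promtool tsdb list /path/to/saved-data-dir      # list blocks in the directory
promtool tsdb analyze /path/to/saved-data-dir   # cardinality breakdown of the most recent block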

Next plans:

  • add/remove Kepler to investigate block creating in TSDB
  • separate this kind of metrics to second Prometheus instance
  • (mid-term) remove PVC from Prometheus, set retention to 2h, extend service with Thanos storage backend

cc: @puffitos @CerRegulus

@bwplotka (Member) commented Apr 16, 2024

Hello from the bug scrub session!

I would love to clarify things -- it might have been the case that in the past WAL replay caused dramatic overhead that led to unexpected OOMs, e.g. your Prometheus runs at 70% of its memory limit, and if the replay overhead was 300% more, you would be affected. There is also the case of more work being done at the same time, so more CPU is used during replay. If you are low on CPU in your environment (e.g. a Kubernetes CPU limit), then everything is slower, including the GC, so memory is not released fast enough.

However, I was recently benchmarking our Prometheus use at Google on recent versions, and I cannot reproduce a major memory overhead when replaying (looking at heap and working set; there is a 2x CPU bump, though). If there is any, it's 1-5%.

I wonder if what many users are reporting here is a symptom of a different problem, and the replay OOM only surfaced that problem. Two very common scenarios right now are:

A. Your Prometheus setup scraped too many series/samples in the first place and OOMed. Obviously replay will reload all those samples back, causing a classic "OOM crashloop" situation. No matter if replay will use 1% of normal mem use, or 200% the outcome would be the same.
B. Your Prometheus setup scraped too many series/samples in the first place and is running close to the limit (e.g. 95% of memory limit). This causes exactly the same "OOM crashloop" situation when e.g. you restart Prometheus to upgrade the version etc, because there might be 1-5% overhead due to the GC.

Generally I see the following next steps:

  1. Verifying if the memory overhead (how many % more of memory is allocated during replay vs after replay Prometheus) is not larger than let's say 10-15%. This is because running Prometheus even close to 90% of memory limit is simply dangerous, because of GC dynamicity, no room for unexpected cardinality bumps and little room for queries/remote write. I think we are within this limit, but if not, let's optimize it!
  2. Optimize Prometheus storage/scraping/remote write to use less memory. We do it on every release (huge thanks to @bboreham here), so make sure to upgrade often. Still the amount of data you collect matters, even with perfectly optimized system. Plan ahead wisely. There might be some room for more improvements for agent modes, but for TSDB it's tricky.
  3. Handling (ideally auto-recovering from) the OOM crashloop. It is pretty sad that once you hit the OOM case (e.g. you already scraped a broken target), the only way to recover is to delete the WAL, and the binary does not do that for you. It's a separate issue, but I plan to propose some mitigation plans, because the majority if not ALL users HAVE to delete the WAL anyway - or they back the WAL up and never use it. So why not introduce a mode that deletes the WAL on OOM, if people opt in (we do that at Google btw - but for agent mode). Started OOM crashloop auto-recovery #13939 for that.
  4. Better, forceful scrape limits to avoid this situation. We were always on the safe side, we didn't want to cause partial data issues, plus the eventual GC issue, so that's why we don't have a feature that would stop certain targets from being scraped if we have X memory use. OpenTelemetry has some features around that, which literally stop incoming pushes with metrics if the collector memory use is higher than X, which has other consequences (e.g. you have run dozens of those collectors and/or buffer that memory on clients). The point is we could innovate and introduce those kind of features for scrapes! Also deserves another issue.

Summary

For this particular issue, ideally please only add a comment if case 1 occurs. To tell, you need:

  • "normal" heap use, just before the kill (working set or heap memory)
  • ideally the cause of restart (e.g. that it was not OOM)
  • how much memory replay took, ideally with a profile

All, so we can ensure the least overhead possible 💪🏽
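
A hedged sketch of how those numbers could be collected. The metric names and endpoints are standard Prometheus ones, but the localhost:9090 wiring and the assumption that the debug endpoints respond during replay depend on your setup:

# steady-state numbers shortly before a restart
curl -s 'http://localhost:9090/api/v1/query?query=process_resident_memory_bytes'
curl -s 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_head_series'

# heap profiles every 15s while the WAL is replaying; keep the largest one
while true; do
  curl -s http://localhost:9090/debug/pprof/heap > "heap-$(date +%s).pprof.gz"
  sleep 15
done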

@p53 commented Apr 17, 2024

Hi,

We are using Prometheus 2.48 and trying to run it with the Vertical Pod Autoscaler, but we have a problem: each time the VPA changes resources and recreates the pod, there is a big CPU/memory spike at Prometheus startup. That spike feeds into the metrics the VPA uses, creating a much higher average than is actually needed, and the pod is recreated every few minutes because it cannot stabilise (the spike is around 40G of memory and 3-4 cores, and after each restart peak utilisation drops back to about 8GB and less than 1 core), so we cannot scale down. I also cannot use a wider window for the VPA's metric evaluation, because then it wouldn't react promptly enough to sharp changes in the number of our pods. We don't want to remove the WAL, as we don't want to lose data. It also looks like the size of the peak depends on the number of targets that existed previously (but haven't existed for several hours), so it seems like something is not being cleaned up?

[screenshot: memory and CPU usage showing the startup spike after each restart]
