
Memory leak of TCP connections #8219

Closed
eminence opened this issue Jun 26, 2021 · 32 comments · Fixed by #8263
Labels
kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization

@eminence
Contributor

Version information:

go-ipfs version: 0.9.0-179d1d150
Repo version: 11
System version: amd64/linux
Golang version: go1.15.2

Description:

IPFS 0.9 seems to be showing increasing memory usage over time. On a small VPS (with only 2 GB of memory), this results in IPFS being killed every few days by the kernel OOM killer.

On a much bigger machine, I've been graphing the memory usage over the past few days:

[graph: mem_rss]

[graph: mem_vm]

On the small VPS, there are no experimental features enabled. On the bigger machine, I have:

  "Experimental": {
    "AcceleratedDHTClient": true,
    "FilestoreEnabled": true,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": false,
    "P2pHttpProxy": false,
    "ShardingEnabled": false,
    "StrategicProviding": false,
    "UrlstoreEnabled": false
  }

(The RRD graphs are configured to store up to a week of data, so I'll keep monitoring things for a few more days)
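
For anyone wanting to compare runs with and without the accelerated client, a minimal sketch of toggling the flag from the CLI (the daemon needs a restart afterwards):

  # show the current value
  ipfs config Experimental.AcceleratedDHTClient

  # disable it (boolean values need --json)
  ipfs config --json Experimental.AcceleratedDHTClient false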

@eminence eminence added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels Jun 26, 2021
@Jorropo
Contributor

Jorropo commented Jun 26, 2021

@eminence on my machines IPFS ceilings at around 13 GB (with AcceleratedDHTClient); I guess yours lack enough memory to run it.

Do you know if IPFS does the same without AcceleratedDHTClient?

What datastore are you using (this could also be Badger using a lot of memory)?
(And dump the whole config too; use ipfs config show to avoid leaking your keys.)

What are you using these nodes for?
I ask because the graphs, even though different, are surprisingly similar, for example the flat portion near the end. Do you know what you were doing the rest of the time but not while it was flat?

Can you send a profile of IPFS when its memory usage is high?
See github.com/ipfs/go-ipfs/blob/master/docs/debug-guide.md#beginning for how to do this with the bin/collect-profiles.sh script.
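
If the script isn't handy, a rough equivalent (assuming the default API address of 127.0.0.1:5001) is to pull the pprof endpoints directly:

  curl -o heap.pprof 'http://127.0.0.1:5001/debug/pprof/heap'
  curl -o goroutines.txt 'http://127.0.0.1:5001/debug/pprof/goroutine?debug=2'
  curl -o cpu.pprof 'http://127.0.0.1:5001/debug/pprof/profile?seconds=30'
  ipfs version --all > version.txt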

@akhilman

I have a similar problem. Recently, ipfs began to eat up the 1 GB of memory allocated to it within half a day and stop responding to requests. Before that, everything worked great.
I am using the latest tag from Docker Hub.

@aschmahmann
Contributor

aschmahmann commented Jun 28, 2021

@akhilman @eminence, thanks for the reports. If you could, as suggested above, report your config and collect a profile, that would greatly assist in looking into this.

@akhilman
Here is all requested data:
https://ipfs.io/ipfs/QmNyQJc2brAjd917opV8X4p91ih3Ra8cu27pyaim6AG41J

@aschmahmann
Contributor

aschmahmann commented Jun 28, 2021

@akhilman a few questions:

  1. Is this snapshot taken when the node is performing as expected or not?
    • I've pinged your node with bitswap requests and it seems to be responding fine.
  2. Is it eating up more RAM over time (the initial report here) or are you just receiving a bunch of requests?
  3. How confident are you that this is actually different if you run v0.8.0 rather than v0.9.0? Have you tried running v0.8.0 to see if you get the same behavior?
    • It's possible you're just receiving a spike in requests, independent of which version of go-ipfs you're running

Note: if your node is running in a low-powered environment you may want to consider the lowpower profile and/or decreasing your High/Low water peer counts. For context, IPFS Desktop uses 50 and 300 as the low and high water marks.
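
For example, a sketch using the built-in lowpower profile or the connection manager keys directly (restart the daemon after changing them):

  # apply the built-in lowpower profile
  ipfs config profile apply lowpower

  # or just lower the connection manager water marks
  ipfs config --json Swarm.ConnMgr.LowWater 50
  ipfs config --json Swarm.ConnMgr.HighWater 300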

@akhilman

1. Is this snapshot taken when the node is performing as expected or not?
      * I've pinged your node with bitswap requests and it seems to be responding fine.

The data was collected while the node was slow. Then I restarted my node to upload those files; without restarting, uploading was too slow.
I'm sure it will slow down again within 12 hours.

I have a cron job that restarts ipfs every day to avoid these slowdowns.

2. Is it eating up more RAM over time (the initial report here) or are you just receiving a bunch of requests?

My node is almost idle, with no requests, and it eats memory over time. Until recently, 1 GB of memory was enough for several days before it ran out.

3. How confident are you that this is actually different if you run v0.8.0 rather than v0.9.0? Have you tried running v0.8.0 to see if you get the same behavior?

I'll try v0.8.0.

  • It's possible you're just receiving a spike in requests, independent of which version of go-ipfs you're running

My node is almost idle.

Note: if your node is running in a low-powered environment you may want to consider the lowpower profile and/or decreasing your High/Low water peer counts. For context, IPFS Desktop uses 50 and 300 as the low and high water marks.

Now I use 300-600, but I used to have 600-900 without any problem for a long time.

@eminence
Contributor Author

Hi all, thanks for the good questions @Jorropo

Unfortunately my graphing tool stopped working after a NAS hiccup, so the above graphs are all I have for now in terms of pictures. But the node is now up to 14 GB of virtual memory (compared to ~8 GB at the last datapoint in the graphs I uploaded).

Do you know if IPFS does the same without AcceleratedDHTClient?

No, but I will spend several days trying this.

What datastore are you using (this could also be Badger using a lot of memory)?
(And dump the whole config too; use ipfs config show to avoid leaking your keys.)

I'm using whatever the default datastore is.
See my config at https://ipfs.io/ipfs/QmZxJkKWszoTLUAQW1FFBsTz82YF4FPHGfUgJWWrhBZL7Y

I haven't yet collected a profile, but I will later.

What are you using these nodes for?

The node wasn't entirely idle. It was likely serving some content to the network, and every now and again it would be adding content (both new stuff and rehashing stuff it already had in its store).

To be clear, the node seems to be performing just fine, except for the memory usage.

It may be true that the memory usage just hasn't hit its ceiling yet, but I feel pretty confident in saying that the 0.9 memory usage is higher than 0.8. I will also try to downgrade to 0.8 to confirm.

Thanks!

@akhilman

akhilman commented Jun 29, 2021

v0.8.0 has been up for 22 hours. Bpytop reports 440 MB of memory usage; all works fine, no slowdowns.
Here is the debug data: https://ipfs.io/ipfs/QmVcoCAQhqX9bEtL7ZkzYxzpWcSskyoDDQL84EwwfFmFMw

@Jorropo
Contributor

Jorropo commented Jun 29, 2021

I'm happy to announce that my node has broken through its ceiling too, and it does seem that there is a memory leak in 0.9.0 which wasn't in 0.8.0.
I'll be able to debug with a buggy node on hand. :D

@Jorropo
Contributor

Jorropo commented Jun 29, 2021

This only affects AcceleratedDHTClient.
I have 2 nodes with identical ipfs configs, one with it and one without; the one without is doing fine at ~2.5 GB of RAM (mostly badgerds). (The smaller one gets far fewer requests, but over the time I've been running it I think it would still have restarted due to OOM if it were affected; it's only a 4 GB VPS.)

@eminence eminence changed the title Ever increasing memory usage Ever increasing memory usage when using AcceleratedDHTClient Jun 29, 2021
@eminence
Contributor Author

Ok, good info, thanks! I'm still having problems with 0.9.0 (with AcceleratedDHTClient disabled) running out of memory on a 2GB VPS. I'm now trying 0.9.0 with the "lowpower" config to see if that helps

@RubenKelevra
Contributor

RubenKelevra commented Jun 30, 2021

Saw some strange behavior on one of my nodes as well. While the system had 2 GB of free memory and 9 GB of ZFS ARC memory left, the memsome and memfull pressure indicators were above 95%, which indicates heavy memory I/O - which doesn't make much sense.

The system specs:

32 GB memory
6 cores of an AMD EPYC 7702P
SSD RAID10
Arch Linux

IPFS:

go-ipfs version: 0.9.0-rc2
Repo version: 11
System version: amd64/linux
Golang version: go1.16.4

Datastore:

flatfs


Maybe have a look at the memory IO with atop on your systems as well?
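
For a quick look without atop, the kernel's PSI counters expose the same some/full memory pressure numbers (a sketch, assuming a kernel with PSI enabled and the daemon process named ipfs):

  # instantaneous memory pressure; avg10/avg60/avg300 are percentages
  cat /proc/pressure/memory

  # watch it alongside the daemon's RSS (in kB)
  watch -n 5 'cat /proc/pressure/memory; ps -o rss= -C ipfs'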

@qwertyforce

Ok, good info, thanks! I'm still having problems with 0.9.0 (with AcceleratedDHTClient disabled) running out of memory on a 2GB VPS. I'm now trying 0.9.0 with the "lowpower" config to see if that helps

I had the same problem on 0.9.0 (AcceleratedDHTClient disabled). I rolled back to 0.8.0 and now memory consumption grows a lot more slowly.

@Jorropo
Contributor

Jorropo commented Jul 4, 2021

@qwertyforce how much RAM do you have? My 4 GB VPS on 0.9.0 without AcceleratedDHTClient hasn't OOMed even once.

@qwertyforce

@Jorropo 1 GB (RPi 3).
0.9.0 - the ipfs daemon is OOMed after ~24 hours (Routing.Type == dhtclient; I also switched off bandwidth metrics and lowered ConnMgr.HighWater).
0.8.0 - the ipfs daemon consumes 30% of RAM after 24 hours (Routing.Type == dhtserver; bandwidth metrics switched off; LowWater 250, HighWater 500). Memory consumption was already good enough on 0.8.0 with Routing.Type == dhtclient that I switched it to dhtserver.
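
For reference, a sketch of the settings mentioned above (the daemon needs a restart after changing them; the values are just the ones I used):

  ipfs config Routing.Type dhtclient   # or dhtserver
  ipfs config --bool Swarm.DisableBandwidthMetrics true
  ipfs config --json Swarm.ConnMgr.LowWater 250
  ipfs config --json Swarm.ConnMgr.HighWater 500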

@eminence
Contributor Author

eminence commented Jul 5, 2021

I've had to downgrade another node (on a VPS with 4GB memory) from v0.9.0 to v0.8.0 due to OOM issues (happened 3 times, about once every 26 hours). It's running a default config (so running the default profile, and no tuning done to things like ConnMgr configs)

@RubenKelevra
Contributor

Okay, hard to debug something without any debug info. Can you guys please read the debug manual and attach the debug info in the cases where you think the memory consumption is too high?

@RubenKelevra
Contributor

Here is all requested data:
https://ipfs.io/ipfs/QmNyQJc2brAjd917opV8X4p91ih3Ra8cu27pyaim6AG41J

This link doesn't work.

@RubenKelevra
Contributor

v0.8.0 has been up for 22 hours. Bpytop reports 440 MB of memory usage; all works fine, no slowdowns.
Here is the debug data: https://ipfs.io/ipfs/QmVcoCAQhqX9bEtL7ZkzYxzpWcSskyoDDQL84EwwfFmFMw

Is this the debug data for 0.8 which works fine? If so, why?

@eminence
Contributor Author

eminence commented Jul 6, 2021

Can you guys please read the debug manual and attach the debug info in the cases where you think the memory consumption is too high?

Are you looking for debug info for nodes using AcceleratedDHTClient, or not (or both)?

@aschmahmann
Contributor

This issue is a bit confusing, with comments being fairly anecdotal and not easy to analyze. If people who are experiencing increasing memory usage over time could drop profiles here (as mentioned above in #8219 (comment)), along with config files, that would make it much easier to analyze.

I've done a little bit of local testing (lots-of-noise-ram.zip) and it seems that v0.9.0 has some suspicious memory usage: it appears to keep holding onto buffers from Noise connections for a long time, even with low connection limits. This was with the standard config (i.e. AcceleratedDHTClient disabled) on Windows. I haven't tested this against v0.8.0 yet, but the resource usage does appear to climb over time even when the number of connections remains the same.

If other people can post their pprof dumps from when they have high RAM usage it'll be easier to tell if these problems are similar or not.

@akhilman

akhilman commented Jul 6, 2021

Here is all requested data:
https://ipfs.io/ipfs/QmNyQJc2brAjd917opV8X4p91ih3Ra8cu27pyaim6AG41J

This link doesn't work.

Here are those files archived.
QmNyQJc2brAjd917opV8X4p91ih3Ra8cu27pyaim6AG41J.tar.gz

v0.8.0 has been up for 22 hours. Bpytop reports 440 MB of memory usage; all works fine, no slowdowns.
Here is the debug data: https://ipfs.io/ipfs/QmVcoCAQhqX9bEtL7ZkzYxzpWcSskyoDDQL84EwwfFmFMw

Is this the debug data for 0.8 which works fine? If so, why?

For comparison with the anomalous data.

@BigLep BigLep added this to In Progress in Go IPFS Roadmap Jul 6, 2021
@BigLep BigLep added this to the go-ipfs 0.9.1 milestone Jul 8, 2021
@aschmahmann
Contributor

Pretty sure this is related to libp2p/go-tcp-transport#81.

One way to test this is to periodically do curl http://localhost:5001/debug/metrics/prometheus (i.e. manually do some periodic metrics collection) and see if memory usage decreases. If this works then the problem should be resolved when that issue gets closed out next week.
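
A throwaway loop for testing this workaround might look like this (assuming the default API address; the rss.log file name is just illustrative):

  # hit the metrics endpoint every 5 minutes and log the daemon's RSS for comparison
  while true; do
    curl -s -o /dev/null http://localhost:5001/debug/metrics/prometheus
    ps -o rss= -C ipfs >> rss.log
    sleep 300
  done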

@aschmahmann aschmahmann changed the title Ever increasing memory usage when using AcceleratedDHTClient Ever increasing memory usage Jul 9, 2021
@eminence
Contributor Author

eminence commented Jul 10, 2021

Thanks, nice find @aschmahmann. I curled the prometheus endpoint; it took a full 30 seconds the first time I did it (so I'm guessing the bulk of those 30 seconds was the metrics collector doing some cleanup). It didn't immediately reduce memory usage, but it started a downward trend in RSS memory usage.

[graph: mem_rss]
[graph: mem_vm]

I've also captured some debug info (run after curling the prometheus endpoint), they are available here: https://ipfs.io/ipfs/QmTacvQzyHMwcRzciejfTT38ECU1n8MdbaeXxppYv77B4b

(Also, I had totally forgotten about the prometheus endpoint, so I'll start graphing that data now, thanks for reminding me about it)

@RubenKelevra
Contributor

I think #8195 is related

@aschmahmann
Contributor

@eminence is that profile after restarting the node? Do things seem to be stabilizing over time now? It'd be good to know if the workaround (and the fix once one is out) is enough to resolve the issue.

Side note: your most recent profile shows about 1 GB of heap size, but a lot of memory not yet reclaimed by the OS. I'm not sure if this is just a combination of Go taking some time to GC (since it does it in pieces) and your OS being slow to reclaim the memory Go relinquishes, or something else.
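
One guess (an assumption, not something verified here): with Go 1.15 on Linux the runtime returns freed memory via MADV_FREE by default, so RSS only drops once the kernel is under memory pressure. That is cheap to rule out by forcing the older behavior:

  # ask the Go runtime to release memory with MADV_DONTNEED so RSS drops promptly
  GODEBUG=madvdontneed=1 ipfs daemon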

@eminence
Contributor Author

@eminence is that profile after restarting the node? Do things seem to be stabilizing over time now? It'd be good to know if the workaround (and the fix once one is out) is enough to resolve the issue.

No. This profile was taken after IPFS had been running for about 10 days. Unfortunately, I've since restarted the node (in an attempt to debug some unrelated content-resolution problems), so I can't tell you how things have stabilized.

I am now collecting the prometheus metrics every 15 seconds (and graphing them), so I expect this will give (1) more detailed metrics and (2) be a good test of the workaround.

I'm not sure if this is just a combination of Go taking some time to GC,

I see that there is a go_gc_duration_seconds metric exposed. I'm not very familiar with this metric or how to understand it, but maybe it can give some insight into the GC performance.
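
For a quick look at those GC numbers without a full Prometheus setup, something like this should do (go_gc_duration_seconds summarizes recent GC pause times; the go_memstats_* gauges show heap sizes):

  curl -s http://localhost:5001/debug/metrics/prometheus \
    | grep -E 'go_gc_duration_seconds|go_memstats_(heap_alloc|heap_idle|heap_released)_bytes'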

@RubenKelevra
Contributor

@eminence you can turn on debugging of the garbage collector by setting gctrace to 1.

See here for more details.
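
For example, a sketch of running the daemon with GC tracing enabled (each line the runtime prints shows heap sizes before/after a collection and the pause times):

  # gctrace output goes to stderr; keep it in a separate file
  GODEBUG=gctrace=1 ipfs daemon 2> gctrace.log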

You can also follow the debug guide of go-ipfs and have a look at the memory consumption by allocations to routines and libraries. There are several programs designed to plot them out in graphs.


I mean this issue is somewhat recent. It didn't happen in 0.7. So we could just bisect the history to see when it starts happening? 🤔

The only issue here is the repo version; I'm not sure if I can do this on my main machine, where it happens.
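
If someone does want to bisect, here's a rough sketch that avoids touching the main repo (the throwaway IPFS_PATH and paths are just illustrative):

  git clone https://github.com/ipfs/go-ipfs && cd go-ipfs
  git bisect start
  git bisect bad v0.9.0
  git bisect good v0.8.0
  # at each step: build and run against a scratch repo so the real repo
  # never gets migrated to a newer repo version
  make build
  IPFS_PATH=/tmp/ipfs-bisect ./cmd/ipfs/ipfs init
  IPFS_PATH=/tmp/ipfs-bisect ./cmd/ipfs/ipfs daemon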

@RubenKelevra
Contributor

RubenKelevra commented Jul 11, 2021

Well, so the real memory usage of IPFS increases (without shared memory):

[screenshot: memory usage without shared memory]

... and systemd reports much more memory consumption due to highly shared memory(?):

[screenshot: systemd memory accounting]

These graphs all cover the same timeframe, starting from the last time ipfs was started:

[screenshots: 16 metric graphs covering the same timeframe]

When systemd is asked to contain ipfs within 12 GB of memory, ipfs basically comes to a halt trying to stay within that amount of memory. See #8195 for the symptoms - there is also a set of debug info available there for the current startup of ipfs, along with my config.

I'll close the other ticket since I think this is the same issue.

Go IPFS Roadmap automation moved this from In Progress to Done Jul 16, 2021
@aschmahmann
Contributor

If people could try building from master and seeing if the issue remains, that would be great; if so, please post new pprofs for analysis so we can reopen.

@aschmahmann aschmahmann mentioned this issue Jul 16, 2021
@eminence
Contributor Author

Thanks! I'm now testing 9599ad5. For the record (and for future comparisons), here's a visual graph of memory usage captured over a few days running 0.9.0.

@aschmahmann aschmahmann changed the title Ever increasing memory usage Memory leak of TCP connections Jul 20, 2021
@RubenKelevra
Contributor

@aschmahmann looks way way better now! Thank you for the fast fix :)
