
Memory leak of TCP connections #8219

Closed
eminence opened this issue Jun 26, 2021 · 32 comments · Fixed by #8263
Labels
kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization

@eminence
Contributor

Version information:

go-ipfs version: 0.9.0-179d1d150
Repo version: 11
System version: amd64/linux
Golang version: go1.15.2

Description:

IPFS 0.9 seems to be showing increasing memory usage over time. On a small VPS (with only 2 GB of memory), this results in IPFS being killed every few days by the kernel OOM killer.

On a much bigger machine, I've been graphing the memory usage over the past few days:

[graph: mem_rss]

[graph: mem_vm]

On the small VPS, there are no experimental features enabled. On the bigger machine, I have:

  "Experimental": {
    "AcceleratedDHTClient": true,
    "FilestoreEnabled": true,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": false,
    "P2pHttpProxy": false,
    "ShardingEnabled": false,
    "StrategicProviding": false,
    "UrlstoreEnabled": false
  }

(The RRD graphs are configured to store up to a week of data, so I'll keep monitoring things for a few more days)
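
For anyone wanting to compare runs with and without the accelerated client, a minimal sketch of toggling the flag from the CLI (the daemon needs a restart afterwards):

  # show the current value
  ipfs config Experimental.AcceleratedDHTClient

  # disable it (boolean values need --json)
  ipfs config --json Experimental.AcceleratedDHTClient false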

@eminence eminence added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels Jun 26, 2021
@Jorropo
Contributor

Jorropo commented Jun 26, 2021

@eminence on my machines IPFS ceilings at around 13 GB (with AcceleratedDHTClient); I guess yours lack enough memory to run it.

Do you know if IPFS does the same without AcceleratedDHTClient?

What datastore are you using (this could also be Badger using a lot of memory)?
(And dump the whole config too; use ipfs config show to avoid leaking your keys.)

What are you using these nodes for?
I ask because the graphs, even though different, are surprisingly similar, for example the flat portion near the end. Do you know what you were doing the rest of the time but not while it was flat?

Can you send a profile of IPFS when its memory usage is high?
See github.com/ipfs/go-ipfs/blob/master/docs/debug-guide.md#beginning for how to do this with the bin/collect-profiles.sh script.
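
If the script isn't handy, a rough equivalent (assuming the default API address of 127.0.0.1:5001) is to pull the pprof endpoints directly:

  curl -o heap.pprof 'http://127.0.0.1:5001/debug/pprof/heap'
  curl -o goroutines.txt 'http://127.0.0.1:5001/debug/pprof/goroutine?debug=2'
  curl -o cpu.pprof 'http://127.0.0.1:5001/debug/pprof/profile?seconds=30'
  ipfs version --all > version.txt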

@akhilman

I have a similar problem. Recently, ipfs began to eat up the 1 GB of memory allocated to it within half a day and stop responding to requests. Before that, everything worked great.
I am using the latest tag from Docker Hub.

@aschmahmann
Contributor

aschmahmann commented Jun 28, 2021

@akhilman @eminence, thanks for the reports. If you could, as suggested above, report your config and collect a profile, that would greatly assist in looking into this.

@akhilman
Here is all requested data:
https://ipfs.io/ipfs/QmNyQJc2brAjd917opV8X4p91ih3Ra8cu27pyaim6AG41J

@aschmahmann
Contributor

aschmahmann commented Jun 28, 2021

@akhilman a few questions:

  1. Is this snapshot taken when the node is performing as expected or not?
    • I've pinged your node with bitswap requests and it seems to be responding fine.
  2. Is it eating up more RAM over time (the initial report here) or are you just receiving a bunch of requests?
  3. How confident are you that this is actually different if you run v0.8.0 rather than v0.9.0? Have you tried running v0.8.0 to see if you get the same behavior?
    • It's possible you're just receiving a spike in requests, independent of which version of go-ipfs you're running

Note: if your node is running in a low-powered environment you may want to consider the lowpower profile and/or decreasing your High/Low water peer counts. For context, IPFS Desktop uses 50 and 300 as the low and high water marks.
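
For example, a sketch using the built-in lowpower profile or the connection manager keys directly (restart the daemon after changing them):

  # apply the built-in lowpower profile
  ipfs config profile apply lowpower

  # or just lower the connection manager water marks
  ipfs config --json Swarm.ConnMgr.LowWater 50
  ipfs config --json Swarm.ConnMgr.HighWater 300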

@akhilman

1. Is this snapshot taken when the node is performing as expected or not?
      * I've pinged your node with bitswap requests and it seems to be responding fine.

The data was collected while the node was slow. Then I restarted my node to upload those files; without restarting, uploading was too slow.
I'm sure it will slow down again within 12 hours.

I have a cron job that restarts ipfs every day to avoid these slowdowns.

2. Is it eating up more RAM over time (the initial report here) or are you just receiving a bunch of requests?

My node is almost idle, with no requests, and it eats memory over time. Until recently, 1 GB of memory was enough for several days before it ran out.

3. How confident are you that this is actually different if you run v0.8.0 rather than v0.9.0? Have you tried running v0.8.0 to see if you get the same behavior?

I'll try v0.8.0.

  • It's possible you're just receiving a spike in requests, independent of which version of go-ipfs you're running

My node is almost idle.

Note: if your node is running in a low-powered environment you may want to consider the lowpower profile and/or decreasing your High/Low water peer counts. For context, IPFS Desktop uses 50 and 300 as the low and high water marks.

Now I use 300-600, but I used to have 600-900 without any problem for a long time.

@eminence
Contributor Author

Hi all, thanks for the good questions @Jorropo

Unfortunately my graphing tool stopped working after a NAS hiccup, so the above graphs are all I have for now in terms of pictures. But the node is now up to 14 GB of virtual memory (compared to ~8 GB at the last datapoint in the graphs I uploaded).

Do you know if IPFS does the same without AcceleratedDHTClient?

No, but I will spend several days trying this.

What datastore are you using (this could also be Badger using a lot of memory)?
(And dump the whole config too; use ipfs config show to avoid leaking your keys.)

I'm using whatever the default datastore is.
See my config at https://ipfs.io/ipfs/QmZxJkKWszoTLUAQW1FFBsTz82YF4FPHGfUgJWWrhBZL7Y

I haven't yet collected a profile, but I will later.

What are you using these nodes for?

The node wasn't entirely idle. It was likely serving some content to the network, and every now and again it would be adding content (both new stuff and rehashing stuff it already had in its store).

To be clear, the node seems to be performing just fine, except for the memory usage.

It may be true that the memory usage just hasn't hit its ceiling yet, but I feel pretty confident in saying that the 0.9 memory usage is higher than 0.8. I will also try to downgrade to 0.8 to confirm.

Thanks!

@akhilman

akhilman commented Jun 29, 2021

v0.8.0 has been up for 22 hours. Bpytop reports 440 MB of memory usage; all works fine, no slowdowns.
Here is the debug data: https://ipfs.io/ipfs/QmVcoCAQhqX9bEtL7ZkzYxzpWcSskyoDDQL84EwwfFmFMw

@Jorropo
Contributor

Jorropo commented Jun 29, 2021

I'm happy to announce that my node has broken through its ceiling too, and it does seem that there is a memory leak in 0.9.0 which wasn't in 0.8.0.
I'll be able to debug with a buggy node on hand. :D

@Jorropo
Contributor

Jorropo commented Jun 29, 2021

This only affects AcceleratedDHTClient.
I have 2 nodes with identical ipfs configs, one with it and one without; the one without is doing fine at ~2.5 GB of RAM (mostly badgerds). (The smaller one gets far fewer requests, but over the time I've been running it I think it would still have restarted due to OOM if it were affected; it's only a 4 GB VPS.)

@eminence eminence changed the title Ever increasing memory usage Ever increasing memory usage when using AcceleratedDHTClient Jun 29, 2021
@eminence
Contributor Author

Ok, good info, thanks! I'm still having problems with 0.9.0 (with AcceleratedDHTClient disabled) running out of memory on a 2GB VPS. I'm now trying 0.9.0 with the "lowpower" config to see if that helps

@RubenKelevra
Contributor

RubenKelevra commented Jun 30, 2021

Saw some strange behavior on one of my nodes as well. While the system had 2 GB of free memory and 9 GB of ZFS ARC memory left, the memsome and memfull pressure indicators were above 95%, which indicates heavy memory I/O - which doesn't make much sense.

The system specs:

32 GB memory
6 cores of an AMD EPYC 7702P
SSD RAID10
Arch Linux

IPFS:

go-ipfs version: 0.9.0-rc2
Repo version: 11
System version: amd64/linux
Golang version: go1.16.4

Datastore:

flatfs


Maybe have a look at the memory IO with atop on your systems as well?
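
For a quick look without atop, the kernel's PSI counters expose the same some/full memory pressure numbers (a sketch, assuming a kernel with PSI enabled and the daemon process named ipfs):

  # instantaneous memory pressure; avg10/avg60/avg300 are percentages
  cat /proc/pressure/memory

  # watch it alongside the daemon's RSS (in kB)
  watch -n 5 'cat /proc/pressure/memory; ps -o rss= -C ipfs'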

@qwertyforce

Ok, good info, thanks! I'm still having problems with 0.9.0 (with AcceleratedDHTClient disabled) running out of memory on a 2GB VPS. I'm now trying 0.9.0 with the "lowpower" config to see if that helps

I had the same problem on 0.9.0 (AcceleratedDHTClient disabled). I rolled back to 0.8.0 and now memory consumption grows a lot more slowly.

@Jorropo
Contributor

Jorropo commented Jul 4, 2021

@qwertyforce how much RAM do you have? My 4 GB VPS on 0.9.0 without AcceleratedDHTClient hasn't OOMed even once.

@qwertyforce

@Jorropo 1 GB (RPi 3).
0.9.0 - the ipfs daemon is OOMed after ~24 hours (Routing.Type == dhtclient; I also switched off bandwidth metrics and lowered ConnMgr.HighWater).
0.8.0 - the ipfs daemon consumes 30% of RAM after 24 hours (Routing.Type == dhtserver; bandwidth metrics switched off; LowWater 250, HighWater 500). Memory consumption was already good enough on 0.8.0 with Routing.Type == dhtclient that I switched it to dhtserver.
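
For reference, a sketch of the settings mentioned above (the daemon needs a restart after changing them; the values are just the ones I used):

  ipfs config Routing.Type dhtclient   # or dhtserver
  ipfs config --bool Swarm.DisableBandwidthMetrics true
  ipfs config --json Swarm.ConnMgr.LowWater 250
  ipfs config --json Swarm.ConnMgr.HighWater 500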

@eminence
Contributor Author

eminence commented Jul 5, 2021

I've had to downgrade another node (on a VPS with 4GB memory) from v0.9.0 to v0.8.0 due to OOM issues (happened 3 times, about once every 26 hours). It's running a default config (so running the default profile, and no tuning done to things like ConnMgr configs)

@RubenKelevra
Contributor

Okay, hard to debug something without any debug info. Can you guys please read the debug manual and attach the debug info in the cases where you think the memory consumption is too high?

@RubenKelevra
Contributor

Here is all requested data:
https://ipfs.io/ipfs/QmNyQJc2brAjd917opV8X4p91ih3Ra8cu27pyaim6AG41J

This link doesn't work.

@RubenKelevra
Contributor

v0.8.0 has been up for 22 hours. Bpytop reports 440 MB of memory usage; all works fine, no slowdowns.
Here is the debug data: https://ipfs.io/ipfs/QmVcoCAQhqX9bEtL7ZkzYxzpWcSskyoDDQL84EwwfFmFMw

Is this the debug data for 0.8 which works fine? If so, why?

@eminence
Contributor Author

eminence commented Jul 6, 2021

Can you guys please read the debug manual and attach the debug info in the cases where you think the memory consumption is too high?

Are you looking for debug info for nodes using AcceleratedDHTClient, or not (or both)?

@aschmahmann
Contributor

This issue is a bit confusing, with comments being fairly anecdotal and not easy to analyze. If people who are experiencing increasing memory usage over time could drop profiles here (as mentioned above in #8219 (comment)), along with config files, that would make it much easier to analyze.

I've done a little bit of local testing (lots-of-noise-ram.zip) and it seems that v0.9.0 has some suspicious memory usage: it appears to keep holding onto buffers from Noise connections for a long time, even with low connection limits. This was with the standard config (i.e. AcceleratedDHTClient disabled) on Windows. I haven't tested this against v0.8.0 yet, but the resource usage does appear to climb over time even when the number of connections remains the same.

If other people can post their pprof dumps from when they have high RAM usage it'll be easier to tell if these problems are similar or not.

@akhilman

akhilman commented Jul 6, 2021

Here is all requested data:
https://ipfs.io/ipfs/QmNyQJc2brAjd917opV8X4p91ih3Ra8cu27pyaim6AG41J

This link doesn't work.

Here are those files archived.
QmNyQJc2brAjd917opV8X4p91ih3Ra8cu27pyaim6AG41J.tar.gz

v0.8.0 has been up for 22 hours. Bpytop reports 440 MB of memory usage; all works fine, no slowdowns.
Here is the debug data: https://ipfs.io/ipfs/QmVcoCAQhqX9bEtL7ZkzYxzpWcSskyoDDQL84EwwfFmFMw

Is this the debug data for 0.8 which works fine? If so, why?

For comparison with the anomalous data.

@BigLep BigLep added this to In Progress in Go IPFS Roadmap Jul 6, 2021
@BigLep BigLep added this to the go-ipfs 0.9.1 milestone Jul 8, 2021
@aschmahmann
Contributor

Pretty sure this is related to libp2p/go-tcp-transport#81.

One way to test this is to periodically do curl http://localhost:5001/debug/metrics/prometheus (i.e. manually do some periodic metrics collection) and see if memory usage decreases. If this works then the problem should be resolved when that issue gets closed out next week.
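
A throwaway loop for testing this workaround might look like this (assuming the default API address; the rss.log file name is just illustrative):

  # hit the metrics endpoint every 5 minutes and log the daemon's RSS for comparison
  while true; do
    curl -s -o /dev/null http://localhost:5001/debug/metrics/prometheus
    ps -o rss= -C ipfs >> rss.log
    sleep 300
  done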

@aschmahmann aschmahmann changed the title Ever increasing memory usage when using AcceleratedDHTClient Ever increasing memory usage Jul 9, 2021
@eminence
Contributor Author

eminence commented Jul 10, 2021

Thanks, nice find @aschmahmann. I curled the prometheus endpoint; it took a full 30 seconds the first time I did it (so I'm guessing the bulk of those 30 seconds was the metrics collector doing some cleanup). It didn't immediately reduce memory usage, but it started a downward trend in RSS memory usage.

[graph: mem_rss]
[graph: mem_vm]

I've also captured some debug info (run after curling the prometheus endpoint), they are available here: https://ipfs.io/ipfs/QmTacvQzyHMwcRzciejfTT38ECU1n8MdbaeXxppYv77B4b

(Also, I had totally forgotten about the prometheus endpoint, so I'll start graphing that data now, thanks for reminding me about it)

@RubenKelevra
Contributor

I think #8195 is related

@aschmahmann
Contributor

@eminence is that profile after restarting the node? Do things seem to be stabilizing over time now? It'd be good to know if the workaround (and the fix once one is out) is enough to resolve the issue.

Side note: your most recent profile shows about 1 GB of heap size, but a lot of memory not yet reclaimed by the OS. I'm not sure if this is just a combination of Go taking some time to GC (since it does it in pieces) and your OS being slow to reclaim the memory Go relinquishes, or something else.
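
One guess (an assumption, not something verified here): with Go 1.15 on Linux the runtime returns freed memory via MADV_FREE by default, so RSS only drops once the kernel is under memory pressure. That is cheap to rule out by forcing the older behavior:

  # ask the Go runtime to release memory with MADV_DONTNEED so RSS drops promptly
  GODEBUG=madvdontneed=1 ipfs daemon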

@eminence
Contributor Author

@eminence is that profile after restarting the node? Do things seem to be stabilizing over time now? It'd be good to know if the workaround (and the fix once one is out) is enough to resolve the issue.

No. This profile was taken after IPFS had been running for about 10 days. Unfortunately, I've since restarted the node (in an attempt to debug some unrelated content-resolution problems), so I can't tell you how things have stabilized.

I am now collecting the prometheus metrics every 15 seconds (and graphing them), so I expect this will give (1) more detailed metrics and (2) be a good test of the workaround.

I'm not sure if this is just a combination of Go taking some time to GC,

I see that there is a go_gc_duration_seconds metric exposed. I'm not very familiar with this metric or how to understand it, but maybe it can give some insight into the GC performance.
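
For a quick look at those GC numbers without a full Prometheus setup, something like this should do (go_gc_duration_seconds summarizes recent GC pause times; the go_memstats_* gauges show heap sizes):

  curl -s http://localhost:5001/debug/metrics/prometheus \
    | grep -E 'go_gc_duration_seconds|go_memstats_(heap_alloc|heap_idle|heap_released)_bytes'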

@RubenKelevra
Contributor

@eminence you can turn on debugging of the garbage collector by setting gctrace to 1.

See here for more details.
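
For example, a sketch of running the daemon with GC tracing enabled (each line the runtime prints shows heap sizes before/after a collection and the pause times):

  # gctrace output goes to stderr; keep it in a separate file
  GODEBUG=gctrace=1 ipfs daemon 2> gctrace.log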

You can also follow the debug guide of go-ipfs and have a look at the memory consumption by allocations to routines and libraries. There are several programs designed to plot them out in graphs.


I mean this issue is somewhat recent. It didn't happen in 0.7. So we could just bisect the history to see when it starts happening? 🤔

The only issue here is the repo version; I'm not sure if I can do this on my main machine, where it happens.
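
If someone does want to bisect, here's a rough sketch that avoids touching the main repo (the throwaway IPFS_PATH and paths are just illustrative):

  git clone https://github.com/ipfs/go-ipfs && cd go-ipfs
  git bisect start
  git bisect bad v0.9.0
  git bisect good v0.8.0
  # at each step: build and run against a scratch repo so the real repo
  # never gets migrated to a newer repo version
  make build
  IPFS_PATH=/tmp/ipfs-bisect ./cmd/ipfs/ipfs init
  IPFS_PATH=/tmp/ipfs-bisect ./cmd/ipfs/ipfs daemon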

@RubenKelevra
Contributor

RubenKelevra commented Jul 11, 2021

Well, so the real memory usage of IPFS increases (without shared memory):

[screenshot: memory usage without shared memory]

... and systemd reports much more memory consumption due to highly shared memory(?):

[screenshot: systemd memory accounting]

These graphs all cover the same timeframe, starting from the last time ipfs was started:

[screenshots: 16 metric graphs covering the same timeframe]

When systemd is asked to contain ipfs within 12 GB of memory, ipfs basically comes to a halt trying to stay within that amount of memory. See #8195 for the symptoms - there is also a set of debug info available there for the current startup of ipfs, along with my config.

I'll close the other ticket since I think this is the same issue.

Go IPFS Roadmap automation moved this from In Progress to Done Jul 16, 2021
@aschmahmann
Contributor

If people could try building from master and seeing if the issue remains, that would be great; if so, please post new pprofs for analysis so we can reopen.

@aschmahmann aschmahmann mentioned this issue Jul 16, 2021
@eminence
Contributor Author

Thanks! I'm now testing 9599ad5. For the record (and for future comparisons), here's a visual graph of memory usage captured over a few days running 0.9.0.

@aschmahmann aschmahmann changed the title Ever increasing memory usage Memory leak of TCP connections Jul 20, 2021
@RubenKelevra
Contributor

@aschmahmann looks way way better now! Thank you for the fast fix :)
