[Access] Frequent previewnet AN OOM #5798
Comments
I noticed that we missed upgrading 3 more ipfs libraries that were moved to a new repo. I upgraded one of the previewnet ANs to this new version. Here is the memory utilization since the upgrade (7 day chart). Note that both of the ANs are showing lower memory spikes now even though I only upgraded one node. Update: a few hours later the memory baseline increased by ~20%, then a spike caused the node to OOM again.
I enabled debug logging for the dht library on AN2 and looked at a 1 second interval of the logs. Since there are a small number of nodes on the network, and the network is only generating 3 execution data blobs per second, the number of DHT requests in that interval is very high. My current hypothesis is that it's related to "reproviding" older blobs. This is done by bitswap to ensure that nodes can find peers that have the data.
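For context, here is a minimal sketch of what a reprovide cycle looks like. This is not the actual bitswap/provider code and the names are hypothetical; the point is the scaling behavior: every cycle re-announces every blob the node still stores, so the announcement volume grows with the total number of stored blobs rather than with the 3 blobs/sec production rate.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical sketch of a reprovide loop. The real logic lives in the ipfs
// provider/bitswap libraries, but the growth pattern is the same: every
// cycle re-announces *all* locally stored blobs to the DHT.
type reprovider struct {
	storedBlobs []string // CIDs of execution data blobs held locally
}

func (r *reprovider) announce(ctx context.Context, cid string) {
	// In the real system this publishes a DHT provider record, which fans
	// out to the closest peers for that key.
	fmt.Println("providing", cid)
}

func (r *reprovider) run(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// O(total blobs) announcements per cycle: at 3 blobs/sec with the
			// default ~12h reprovide interval, each cycle touches >100k keys,
			// and the set only grows as the chain advances.
			for _, cid := range r.storedBlobs {
				r.announce(ctx, cid)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	r := &reprovider{storedBlobs: []string{"cid-1", "cid-2", "cid-3"}}
	r.run(ctx, 250*time.Millisecond)
}
```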
After reviewing the DHT and bitswap documentation further, I don't think we gain much by using the DHT. Its main intention is to make it efficient to disseminate the routing table of which blocks of data are stored on which nodes. The basic design makes a few assumptions:
On the staked bitswap network, none of those assumptions are true.
Additionally, bitswap already has a built-in mechanism for discovering peers that have the data the client wants. This mechanism is used before looking at the DHT, so the DHT is rarely used in practice.
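To make that lookup order concrete, here is a simplified illustration; this is not the real go-bitswap internals, and the helper functions are made up for the sketch:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

type block []byte

// Bitswap's built-in discovery: broadcast a want to peers we are already
// connected to and fetch from whoever has the block. (Stubbed out here.)
func askConnectedPeers(ctx context.Context, cid string) (block, bool) {
	return block("data for " + cid), true
}

// Content-routing fallback: walk the DHT looking for provider records.
// (Stubbed out here.)
func findProvidersViaDHT(ctx context.Context, cid string) ([]string, error) {
	return nil, errors.New("no providers found")
}

// fetchBlock shows the ordering: on a small, fully connected staked network
// the first step almost always succeeds, so the DHT path is rarely taken.
func fetchBlock(ctx context.Context, cid string) (block, error) {
	if blk, ok := askConnectedPeers(ctx, cid); ok {
		return blk, nil
	}
	providers, err := findProvidersViaDHT(ctx, cid)
	if err != nil {
		return nil, err
	}
	// Connect to the providers and retry the bitswap exchange with them.
	_ = providers
	return nil, errors.New("block not found")
}

func main() {
	blk, err := fetchBlock(context.Background(), "cid-1")
	fmt.Println(string(blk), err)
}
```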
I upgraded AN2 around 11am yesterday with a build from this branch. It makes the following changes:
Here are the CPU and memory utilization graphs compared to AN1, which is running the original code. Interestingly, the baseline CPU and memory on AN1 is lower than in previous days (2 day chart). Also interesting that the memory spikes attributed to badger compaction are gone after disabling the DHT.
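For reference, here is a sketch of how the DHT can be taken out of the bitswap path. It assumes the go-bitswap / go-libp2p module APIs of that era (exact package paths and signatures vary by version), and is not necessarily the exact change made on the branch:

```go
package main

import (
	"context"

	bitswap "github.com/ipfs/go-bitswap"
	bsnet "github.com/ipfs/go-bitswap/network"
	datastore "github.com/ipfs/go-datastore"
	dssync "github.com/ipfs/go-datastore/sync"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
	"github.com/libp2p/go-libp2p"
	routinghelpers "github.com/libp2p/go-libp2p-routing-helpers"
)

func main() {
	ctx := context.Background()

	host, err := libp2p.New()
	if err != nil {
		panic(err)
	}

	bstore := blockstore.NewBlockstore(dssync.MutexWrap(datastore.NewMapDatastore()))

	// routinghelpers.Null is a content router that never finds or publishes
	// anything, so bitswap relies purely on peers it is already connected to.
	network := bsnet.NewFromIpfsHost(host, routinghelpers.Null{})

	// Disabling providing stops the node from announcing (and re-announcing)
	// every stored blob, which is where the DHT traffic and memory came from.
	_ = bitswap.New(ctx, network, bstore, bitswap.ProvideEnabled(false))
}
```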
Profile comparisons between the 2 nodes. Note: there is very little API traffic on this network, so these show a steady state.
inuse space: note the large amount of memory used by the DHT.
alloc space: note that all of the allocs on the left half of the AN1 profile are missing on AN2. These are related to the DHT.
cpu
🐞 Bug Report
Note: this is copied from an internal issue.
The access nodes on previewnet1 are frequently crashing due to OOM. After crashing, the nodes will often hang for several minutes before coming back up resulting in extended periods of downtime. On April 22, both ANs crashed within a few minutes of each other and took over an hour to recover, resulting in the network going into EFM.
Observations
Looking at the nodes on a 7 day timescale, their memory utilization tends to have a similar shape:
Note that the general shape is the same for both nodes, and that it sometimes, but not always, lines up in time.
There are similar periodic patterns for CPU (below) as well as for Rx and Tx.
Periodic resource issues like this are often related to badger DB compaction, but these nodes only have 75 GB data drives and 64 GB of memory, so it's unlikely that compaction on a drive that size would cause these large 40% spikes in memory.
Profiles
Here are a couple profiles taken during spikes:
inuse space
alloc space
The vast majority of live memory is used by the DHT client. The allocations are spread between execution data decoding and DHT.
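For anyone pulling the same data, these are standard Go heap profiles. A sketch of exposing them (assuming the node doesn't already serve pprof; the port is arbitrary) and of the difference between the two views:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Capture and inspect heap profiles with, for example:
	//   go tool pprof -sample_index=inuse_space http://localhost:6060/debug/pprof/heap
	//   go tool pprof -sample_index=alloc_space http://localhost:6060/debug/pprof/heap
	// inuse_space shows memory still live at capture time (dominated by the
	// DHT client here); alloc_space shows cumulative allocations since start
	// (spread between execution data decoding and the DHT).
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```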
The first place to look is the DHT client. Why is it using so much live memory?
Threads to pull:
A large amount of inuse heap memory is used by the DHT.
The steep memory spikes appear to be caused by badger compaction. I captured a trace during and after a spike. During the spike, compaction used ~9 GB of memory (30% of the total available).
heap_during_spike.pb.gz
heap_after_spike.pb.gz
This is odd because the node is only using 88 GB of its data disk.
Update: after looking at more traces, the compaction was coincidental. I found other cases where there was a large 20 GB spike and drop in the size of the in-memory DHT provider DB.
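A rough model of why that in-memory provider DB can swing by tens of GB: this is a hypothetical sketch, not the actual go-libp2p-kad-dht provider store, but the shape is the same: one record per (CID, peer) pair, refreshed on every reprovide and only released when expired records are garbage collected in bulk.

```go
package dhtproviders

import (
	"sync"
	"time"
)

// Hypothetical in-memory provider record store.
type providerRecord struct {
	peerID  string
	expires time.Time
}

type providerStore struct {
	mu      sync.Mutex
	records map[string][]providerRecord // keyed by CID
}

func newProviderStore() *providerStore {
	return &providerStore{records: make(map[string][]providerRecord)}
}

// addProvider is called for every provide/reprovide, so the map holds an
// entry for every blob the network has announced and not yet seen expire.
func (s *providerStore) addProvider(cid, peerID string, ttl time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	recs := s.records[cid]
	for i := range recs {
		if recs[i].peerID == peerID {
			recs[i].expires = time.Now().Add(ttl) // refresh on reprovide
			return
		}
	}
	s.records[cid] = append(recs, providerRecord{peerID: peerID, expires: time.Now().Add(ttl)})
}

// gc drops expired records; when it finally sweeps millions of entries the
// store shrinks sharply, matching the spike-and-drop seen in the traces.
func (s *providerStore) gc(now time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for cid, recs := range s.records {
		kept := recs[:0]
		for _, r := range recs {
			if r.expires.After(now) {
				kept = append(kept, r)
			}
		}
		if len(kept) == 0 {
			delete(s.records, cid)
		} else {
			s.records[cid] = kept
		}
	}
}
```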