[Access] Frequent previewnet AN OOM #5798

Open
peterargue opened this issue Apr 28, 2024 · 7 comments

Labels: Bug (Something isn't working), S-Access
Assignee: peterargue

@peterargue (Contributor) commented Apr 28, 2024

🐞 Bug Report

Note: this is copied from an internal issue.

The access nodes on previewnet1 are frequently crashing due to OOM. After crashing, the nodes often hang for several minutes before coming back up, resulting in extended periods of downtime. On April 22, both ANs crashed within a few minutes of each other and took over an hour to recover, sending the network into EFM.

Observations

Looking at the nodes on a 7-day timescale, they tend to have a similar shape to their memory utilization:
[screenshot: 7-day memory utilization for both ANs]

Note that the general shape is the same for both nodes, and that it sometimes, but not always, lines up in time.

There are similar periodic patterns for CPU (below), as well as Rx and Tx:
[screenshot: 7-day CPU utilization]

Periodic resource issues like this are often related to badger DB compaction, but these nodes only have 75 GB data drives and 64 GB of memory, so it's unlikely that compaction on a drive that size would cause these large ~40% spikes in memory.

Profiles

Here are a couple profiles taken during spikes:

inuse space
[screenshot: inuse space heap profile]

alloc space
[screenshot: alloc space heap profile]

The vast majority of live memory is used by the DHT client. The allocations are spread between execution data decoding and DHT.

The first place to look is the DHT client. Why is it using so much live memory?

Threads to pull:

A large amount of inuse heap memory is used by the DHT.

  • What tuning can we do? Our usage pattern of every node having all the data doesn't align with the basic assumptions for using a DHT for content routing.
  • Currently it's configured with an in-memory db. Can we move this to an on-disk badger implementation? (See the sketch after this list.)
  • Nodes seem to add a large number of providers per second (2500 in the comment below). Why?
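
On the second bullet, here is a rough sketch of what backing the DHT's records (including provider records) with an on-disk Badger datastore could look like, using go-libp2p-kad-dht's Datastore option and go-ds-badger. The path and wiring are illustrative assumptions, not the node's actual configuration:

```go
package main

import (
	"context"

	badgerds "github.com/ipfs/go-ds-badger"
	dht "github.com/libp2p/go-libp2p-kad-dht"
	"github.com/libp2p/go-libp2p/core/host"
)

// newDiskBackedDHT builds a DHT whose record store (including provider
// records) is an on-disk Badger datastore instead of the default
// in-memory map. dbPath is illustrative, not the node's actual layout.
func newDiskBackedDHT(ctx context.Context, h host.Host, dbPath string) (*dht.IpfsDHT, error) {
	store, err := badgerds.NewDatastore(dbPath, &badgerds.DefaultOptions)
	if err != nil {
		return nil, err
	}
	return dht.New(ctx, h,
		// Provider records are persisted to disk rather than held in RAM.
		dht.Datastore(store),
	)
}
```

Note this would only move where provider records live; it wouldn't reduce the volume of ADD_PROVIDER traffic discussed below.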

The steep memory spikes appear to be caused by badger compaction. I captured a trace during and after a spike. During the spike, compaction used ~9 GB of memory (30% of total available):
heap_during_spike.pb.gz
heap_after_spike.pb.gz

This is odd because the node is only using 88 GB of its data disk.

Update: after looking at more traces, the compaction was coincidental. I found other cases where there was a large 20 GB spike and drop in the size of the in-memory DHT provider DB.
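
For reference, heap snapshots like the attached .pb.gz files can be captured from any running Go process with the standard runtime/pprof package; this is a generic sketch, not necessarily the exact tooling used on these nodes:

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

// writeHeapProfile dumps the current heap profile to path in the
// gzip-compressed protobuf format that `go tool pprof` reads.
func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	runtime.GC() // run a GC so in-use statistics are up to date
	return pprof.Lookup("heap").WriteTo(f, 0) // debug=0 writes the compressed proto format
}
```

In practice the same data is typically pulled from a node's net/http/pprof endpoint with `go tool pprof`.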

peterargue added the Bug (Something isn't working) and S-Access labels on Apr 28, 2024
peterargue self-assigned this on Apr 28, 2024
@peterargue (Contributor, Author) commented:

I noticed that we missed upgrading 3 more ipfs libraries that were moved to the ipfs/boxo repo, and opened PRs to upgrade:
#5774 (master)
#5777 (feature/stable-cadence)

I upgraded one of the previewnet ANs to this new version. Here is the memory utilization since the upgrade:
[screenshot: memory utilization since the upgrade]

This is a 7-day chart. Note that both ANs are showing lower memory spikes now, even though I only upgraded one node.

Update: a few hours later the memory baseline increased by 20%, then a spike caused the node to OOM again.

@peterargue (Contributor, Author) commented:

I enabled debug logging for the dht library on AN2. Looking at the 1-second interval between 2024-04-25T03:17:45.000 and 2024-04-25T03:17:46.000, the node handled 5035 DHT messages and added a provider 2521 times. Of the add-provider keys, 2513 were unique values.
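
For anyone reproducing this: the dht subsystem's log level can be raised through the go-log facility that go-libp2p-kad-dht uses. A minimal sketch, assuming the standard "dht" logger name:

```go
package main

import (
	logging "github.com/ipfs/go-log/v2"
)

// enableDHTDebugLogging raises the "dht" subsystem (the logger name used by
// go-libp2p-kad-dht) to debug level so individual DHT messages are logged.
func enableDHTDebugLogging() error {
	return logging.SetLogLevel("dht", "debug")
}
```

The GOLOG_LOG_LEVEL environment variable (e.g. GOLOG_LOG_LEVEL="dht=debug") should achieve the same without a code change.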

Since there are a small number of nodes on the network, and the network is only generating 3 execution data blobs per second, this number is very high. My current hypothesis is that it's related to "reproviding" older blobs. This is done by bitswap to ensure that nodes can find peers that have the data.

@peterargue (Contributor, Author) commented:

After reviewing the DHT and bitswap documentation further, I don't think we gain much by using it. Its main purpose is to make it efficient to discover which blocks of data are stored on which nodes. The basic design makes a few assumptions:

  1. Nodes are connected to a subset of peers on the network
  2. Nodes only host a subset of data
  3. Peers are joining and leaving the network regularly, thus it's necessary to remind the network which blocks of data you have.
  4. Data is equally relevant over time, so we should remind peers of all data we have

On the staked bitswap network, none of those assumptions are true.

  1. Participating nodes are connected to all other participating peers
  2. All nodes have all recent data
  3. Staked nodes will generally be available throughout an epoch
  4. Only the most recent data is generally needed by peers

Additionally, bitswap already has a built-in mechanism for discovering peers that have data the client wants. This mechanism is tried first, before falling back to the DHT, so in practice the DHT is rarely used.

@peterargue (Contributor, Author) commented Apr 28, 2024

I upgraded AN2 around 11am yesterday with a build from this branch:
#5795

This makes the following changes:

  • Disable DHT
  • Add caching to bitswap's blockstore
  • Upgrade several ipfs/boxo libraries
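
Conceptually, the first two bullets amount to giving bitswap a no-op content router and a cached blockstore. A rough sketch of the idea using boxo and libp2p primitives; the actual PR wires this through flow-go's own components, and exact constructor signatures vary between boxo versions, so treat this as an illustration rather than the real change:

```go
package main

import (
	"context"

	"github.com/ipfs/boxo/bitswap"
	bsnet "github.com/ipfs/boxo/bitswap/network"
	"github.com/ipfs/boxo/blockstore"
	routinghelpers "github.com/libp2p/go-libp2p-routing-helpers"
	"github.com/libp2p/go-libp2p/core/host"
)

// newBitswapWithoutDHT relies only on bitswap's built-in peer discovery:
// content routing is a no-op (routinghelpers.Null), so no DHT is ever
// consulted or populated, and the blockstore is wrapped in an in-memory
// cache so repeated has/get checks don't hit badger every time.
func newBitswapWithoutDHT(ctx context.Context, h host.Host, store blockstore.Blockstore) (*bitswap.Bitswap, error) {
	cached, err := blockstore.CachedBlockstore(ctx, store, blockstore.DefaultCacheOpts())
	if err != nil {
		return nil, err
	}
	net := bsnet.NewFromIpfsHost(h, routinghelpers.Null{})
	return bitswap.New(ctx, net, cached), nil
}
```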

Here are the CPU and memory utilization graphs compared to AN1, which is running the original code:
[screenshot: CPU and memory utilization, AN2 vs AN1]

Interestingly, the baseline CPU and memory on AN1 are lower than in previous days (2-day chart):
[screenshot: AN1 CPU and memory utilization, 2-day chart]

It's also interesting that the memory spikes attributed to badger compaction are gone after disabling the DHT.

@peterargue (Contributor, Author) commented Apr 28, 2024

Profile comparisons between the two nodes. Note: there is very little API traffic on this network, so these show a steady state.

inuse space

AN2
[screenshot: AN2 inuse space profile]

AN1
[screenshot: AN1 inuse space profile]

Note the large amount of memory used by ProviderManager on AN1. This module has an in-memory DB that's used to store a list of providers for each blob of data provided on the network. Growth in the memory used by this DB correlates with spikes in messages and streams used by the dht protocol.

alloc space

AN2
[screenshot: AN2 alloc space profile]

AN1
[screenshot: AN1 alloc space profile]

Note that all of the allocs on the left half of the AN1 profile are missing on AN2. These are related to:

  1. DHT management and messaging
  2. Bitswap (resolved by adding caching)

CPU

AN2
[screenshot: AN2 CPU profile]

AN1
[screenshot: AN1 CPU profile]

@peterargue (Contributor, Author) commented Apr 28, 2024

Network metrics:

Inbound and outbound libp2p streams over the last 7 days on DHT-enabled nodes:
[screenshot: libp2p streams, DHT-enabled nodes]

Inbound and outbound libp2p streams over the last 7 days on AN2. Note: the boxo libraries were upgraded on Apr 25, and the DHT was disabled on Apr 27.
[screenshot: libp2p streams, AN2]

@peterargue (Contributor, Author) commented:

Execution data syncing

Downloads
[screenshot: execution data download rate]

Note: AN2 keeps up fine downloading data, and even appears to keep up better than AN1.

Duplicate data %
[screenshot: duplicate data percentage]

Note that ~2.5% of the data received by AN2 is duplicates, vs. 15% for AN1.
