[Access] Frequent previewnet AN OOM #5798

Open
peterargue opened this issue Apr 28, 2024 · 7 comments

Labels: Bug (Something isn't working), S-Access
Assignee: peterargue

@peterargue (Contributor) commented Apr 28, 2024

🐞 Bug Report

Note: this is copied from an internal issue.

The access nodes on previewnet1 are frequently crashing due to OOM. After crashing, the nodes often hang for several minutes before coming back up, resulting in extended periods of downtime. On April 22, both ANs crashed within a few minutes of each other and took over an hour to recover, sending the network into EFM.

Observations

Looking at the nodes on a 7-day timescale, they tend to have a similar shape to their memory utilization:
[screenshot: 7-day memory utilization for both ANs]

Note that the general shape is the same for both nodes, and that it sometimes, but not always, lines up in time.

There are similar periodic patterns for CPU (below), as well as Rx and Tx:
[screenshot: 7-day CPU utilization]

Periodic resource issues like this are often related to badger DB compaction, but these nodes only have 75 GB data drives and 64 GB of memory, so it's unlikely that compaction on a drive that size would cause these large ~40% spikes in memory.

Profiles

Here are a couple profiles taken during spikes:

inuse space
[screenshot: inuse space heap profile]

alloc space
[screenshot: alloc space heap profile]

The vast majority of live memory is used by the DHT client. The allocations are spread between execution data decoding and DHT.

The first place to look is the DHT client. Why is it using so much live memory?

Threads to pull:

A large amount of inuse heap memory is used by the DHT.

  • What tuning can we do? Our usage pattern of every node having all the data doesn't align with the basic assumptions for using a DHT for content routing.
  • Currently it's configured with an in-memory db. Can we move this to an on-disk badger implementation? (See the sketch after this list.)
  • Nodes seem to add a large number of providers per second (2500 in the comment below). Why?
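
On the second bullet, here is a rough sketch of what backing the DHT's records (including provider records) with an on-disk Badger datastore could look like, using go-libp2p-kad-dht's Datastore option and go-ds-badger. The path and wiring are illustrative assumptions, not the node's actual configuration:

```go
package main

import (
	"context"

	badgerds "github.com/ipfs/go-ds-badger"
	dht "github.com/libp2p/go-libp2p-kad-dht"
	"github.com/libp2p/go-libp2p/core/host"
)

// newDiskBackedDHT builds a DHT whose record store (including provider
// records) is an on-disk Badger datastore instead of the default
// in-memory map. dbPath is illustrative, not the node's actual layout.
func newDiskBackedDHT(ctx context.Context, h host.Host, dbPath string) (*dht.IpfsDHT, error) {
	store, err := badgerds.NewDatastore(dbPath, &badgerds.DefaultOptions)
	if err != nil {
		return nil, err
	}
	return dht.New(ctx, h,
		// Provider records are persisted to disk rather than held in RAM.
		dht.Datastore(store),
	)
}
```

Note this would only move where provider records live; it wouldn't reduce the volume of ADD_PROVIDER traffic discussed below.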

The steep memory spikes appear to be caused by badger compaction. I captured a trace during and after a spike. During the spike, compaction used ~9 GB of memory (30% of total available):
heap_during_spike.pb.gz
heap_after_spike.pb.gz

This is odd because the node is only using 88 GB of its data disk.

Update: after looking at more traces, the compaction was coincidental. I found other cases where there was a large 20 GB spike and drop in the size of the in-memory DHT provider DB.
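
For reference, heap snapshots like the attached .pb.gz files can be captured from any running Go process with the standard runtime/pprof package; this is a generic sketch, not necessarily the exact tooling used on these nodes:

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

// writeHeapProfile dumps the current heap profile to path in the
// gzip-compressed protobuf format that `go tool pprof` reads.
func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	runtime.GC() // run a GC so in-use statistics are up to date
	return pprof.Lookup("heap").WriteTo(f, 0) // debug=0 writes the compressed proto format
}
```

In practice the same data is typically pulled from a node's net/http/pprof endpoint with `go tool pprof`.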

peterargue added the Bug (Something isn't working) and S-Access labels on Apr 28, 2024
peterargue self-assigned this on Apr 28, 2024
@peterargue (Contributor, Author) commented:

I noticed that we missed upgrading 3 more ipfs libraries that were moved to the ipfs/boxo repo, and opened PRs to upgrade:
#5774 (master)
#5777 (feature/stable-cadence)

I upgraded one of the previewnet ANs to this new version. Here is the memory utilization since the upgrade:
[screenshot: memory utilization since the upgrade]

This is a 7-day chart. Note that both ANs are showing lower memory spikes now, even though I only upgraded one node.

Update: a few hours later the memory baseline increased by 20%, then a spike caused the node to OOM again.

@peterargue (Contributor, Author) commented:

I enabled debug logging for the dht library on AN2. Looking at the 1-second interval between 2024-04-25T03:17:45.000 and 2024-04-25T03:17:46.000, the node handled 5035 DHT messages and added a provider 2521 times. Of the add-provider keys, 2513 were unique values.
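
For anyone reproducing this: the dht subsystem's log level can be raised through the go-log facility that go-libp2p-kad-dht uses. A minimal sketch, assuming the standard "dht" logger name:

```go
package main

import (
	logging "github.com/ipfs/go-log/v2"
)

// enableDHTDebugLogging raises the "dht" subsystem (the logger name used by
// go-libp2p-kad-dht) to debug level so individual DHT messages are logged.
func enableDHTDebugLogging() error {
	return logging.SetLogLevel("dht", "debug")
}
```

The GOLOG_LOG_LEVEL environment variable (e.g. GOLOG_LOG_LEVEL="dht=debug") should achieve the same without a code change.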

Since there are a small number of nodes on the network, and the network is only generating 3 execution data blobs per second, this number is very high. My current hypothesis is that it's related to "reproviding" older blobs. This is done by bitswap to ensure that nodes can find peers that have the data.

@peterargue (Contributor, Author) commented:

After reviewing the DHT and bitswap documentation further, I don't think we gain much by using it. Its main purpose is to make it efficient to discover which blocks of data are stored on which nodes. The basic design makes a few assumptions:

  1. Nodes are connected to a subset of peers on the network
  2. Nodes only host a subset of data
  3. Peers are joining and leaving the network regularly, thus it's necessary to remind the network which blocks of data you have.
  4. Data is equally relevant over time, so we should remind peers of all data we have

On the staked bitswap network, none of those assumptions are true.

  1. Participating nodes are connected to all other participating peers
  2. All nodes have all recent data
  3. Staked nodes will generally be available throughout an epoch
  4. Only the most recent data is generally needed by peers

Additionally, bitswap already has a built-in mechanism for discovering peers that have data the client wants. This mechanism is tried first, before falling back to the DHT, so in practice the DHT is rarely used.

@peterargue (Contributor, Author) commented Apr 28, 2024

I upgraded AN2 around 11am yesterday with a build from this branch:
#5795

This makes the following changes:

  • Disable DHT
  • Add caching to bitswap's blockstore
  • Upgrade several ipfs/boxo libraries
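
Conceptually, the first two bullets amount to giving bitswap a no-op content router and a cached blockstore. A rough sketch of the idea using boxo and libp2p primitives; the actual PR wires this through flow-go's own components, and exact constructor signatures vary between boxo versions, so treat this as an illustration rather than the real change:

```go
package main

import (
	"context"

	"github.com/ipfs/boxo/bitswap"
	bsnet "github.com/ipfs/boxo/bitswap/network"
	"github.com/ipfs/boxo/blockstore"
	routinghelpers "github.com/libp2p/go-libp2p-routing-helpers"
	"github.com/libp2p/go-libp2p/core/host"
)

// newBitswapWithoutDHT relies only on bitswap's built-in peer discovery:
// content routing is a no-op (routinghelpers.Null), so no DHT is ever
// consulted or populated, and the blockstore is wrapped in an in-memory
// cache so repeated has/get checks don't hit badger every time.
func newBitswapWithoutDHT(ctx context.Context, h host.Host, store blockstore.Blockstore) (*bitswap.Bitswap, error) {
	cached, err := blockstore.CachedBlockstore(ctx, store, blockstore.DefaultCacheOpts())
	if err != nil {
		return nil, err
	}
	net := bsnet.NewFromIpfsHost(h, routinghelpers.Null{})
	return bitswap.New(ctx, net, cached), nil
}
```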

Here are the CPU and memory utilization graphs compared to AN1, which is running the original code:
[screenshot: CPU and memory utilization, AN2 vs AN1]

Interestingly, the baseline CPU and memory on AN1 are lower than in previous days (2-day chart):
[screenshot: AN1 CPU and memory utilization, 2-day chart]

It's also interesting that the memory spikes attributed to badger compaction are gone after disabling the DHT.

@peterargue (Contributor, Author) commented Apr 28, 2024

Profile comparisons between the two nodes. Note: there is very little API traffic on this network, so these show a steady state.

inuse space

AN2
[screenshot: AN2 inuse space profile]

AN1
[screenshot: AN1 inuse space profile]

Note the large amount of memory used by ProviderManager on AN1. This module has an in-memory DB that's used to store a list of providers for each blob of data provided on the network. Growth in the memory used by this DB correlates with spikes in messages and streams used by the dht protocol.

alloc space

AN2
[screenshot: AN2 alloc space profile]

AN1
[screenshot: AN1 alloc space profile]

Note that all of the allocs on the left half of the AN1 profile are missing on AN2. These are related to:

  1. DHT management and messaging
  2. Bitswap (resolved by adding caching)

CPU

AN2
[screenshot: AN2 CPU profile]

AN1
[screenshot: AN1 CPU profile]

@peterargue (Contributor, Author) commented Apr 28, 2024

Network metrics:

Inbound and outbound libp2p streams over the last 7 days on DHT-enabled nodes:
[screenshot: libp2p streams, DHT-enabled nodes]

Inbound and outbound libp2p streams over the last 7 days on AN2. Note: the boxo libraries were upgraded on Apr 25, and the DHT was disabled on Apr 27.
[screenshot: libp2p streams, AN2]

@peterargue (Contributor, Author) commented:

Execution data syncing

Downloads
[screenshot: execution data download rate]

Note: AN2 keeps up fine downloading data, and even appears to keep up better than AN1.

Duplicate data %
[screenshot: duplicate data percentage]

Note that ~2.5% of the data received by AN2 is duplicates, vs. 15% for AN1.
