Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak in cosmic-swingset causes validator missed block #9316

Closed
mhofman opened this issue May 3, 2024 · 2 comments · Fixed by #9405
Closed

Memory leak in cosmic-swingset causes validator missed block #9316

mhofman opened this issue May 3, 2024 · 2 comments · Fixed by #9405
Assignees
Labels
bug Something isn't working cosmic-swingset package: cosmic-swingset performance Performance related issues SwingSet package: SwingSet

Comments

@mhofman
Copy link
Member

mhofman commented May 3, 2024

Describe the bug

There seem to be a leak in the Node.js side of the agd process (cosmic-swingset or swingset), which results in memory growth and slower and slower force GC after each block. Ultimately this causes validators to miss blocks if the GC takes more than the block voting period (5s).

To Reproduce

Run a node on mainnet, observe slow increase in Node heap usage and forced gc time.

Expected behavior

Flat memory usage.

This is likely a regression as we used to observe this metric (added in #5637) in previous upgrades after we had a large leak caused by Promise.race used in xsnap (#5643)

Platform Environment

upgrade-14 on mainnet, but upgrade-15 is likely affected as well

Workaround

Restarting agd will free the leaked memory, resetting consumption back to expected levels. This is considered better than allowing OOM mitigations to kill processes, but note that it may take a few blocks to catch back up.

Alternatively, we now have a branch with a patch for agoric-upgrade-15: mhofman/patch-u15-xsnap-leak

Additional context

Screenshots

@mhofman mhofman added bug Something isn't working SwingSet package: SwingSet cosmic-swingset package: cosmic-swingset performance Performance related issues labels May 3, 2024
@mhofman
Copy link
Member Author

mhofman commented May 25, 2024

This was an interesting scope leak in the kernel. Minimal repro in a gist.

There is an uprgrade-15 branch with a hotfix for this: mhofman/patch-u15-xsnap-leak

@gibson042
Copy link
Member

Node snapshot analysis travelogue:

# Inspect tarball contents.
tar tvzf node-heap-snapshots.tar.gz | awk '{ print $NF }' | sed 's/^/# /; 4q'
# node-heap-snapshots/
# node-heap-snapshots/Heap-269-1714766591308.heapsnapshot
# node-heap-snapshots/Heap-268-1715106920615.heapsnapshot
# node-heap-snapshots/Heap-270-1714822445711.heapsnapshot
# The base filename structure is `Heap-${process.pid}-${Date.now()}.heapsnapshot`.
# https://github.com/Agoric/agoric-sdk/blob/8343a0ca16603ce52aeb4818d4dbf2529b9a59f7/packages/cosmic-swingset/src/chain-main.js#L442
# Sort snapshots by timestamp.
tar tvzf node-heap-snapshots.tar.gz | awk '/Heap-/ { print $NF }' | sort -t- -k5n > snapshots.txt
# Check count.
wc -l snapshots.txt | sed 's/^/# /'
#     2061 snapshots.txt
# Analyze data.
cat snapshots.txt | node -e '(async () => {
  const reader = require("node:readline").createInterface({ input: process.stdin });
  const data = {
    count: 0,
    firstTimestamp: undefined,
    lastTimestamp: undefined,
    minGap: Infinity,
    maxGap: -Infinity,
    sum: 0,
    gaps: [],
    gapHighWatermarks: [],
  };
  for await (const line of reader) {
    const { epochms } = line.match(/Heap-(?<pid>[0-9]+)-(?<epochms>[0-9]+)/)?.groups || {};
    if (!epochms) throw Error(`expected Heap-$pid-$timestamp in ${line}`);
    const timestamp = +epochms;
    if (data.count === 0) {
      data.count++;
      data.firstTimestamp = data.lastTimestamp = timestamp;
      continue;
    }
    const { lastTimestamp } = data;
    const gap = (timestamp - lastTimestamp) / 1000;
    if (gap <= 0) throw Error(`expected timestamp advancement from ${lastTimestamp} in ${line}`);
    data.count++;
    data.lastTimestamp = timestamp;
    if (gap < data.minGap) data.minGap = gap;
    if (gap > data.maxGap) { data.gapHighWatermarks.push(line); data.maxGap = gap; }
    data.sum += gap;
    data.gaps.push(gap);
  }

  data.firstTimestamp = new Date(data.firstTimestamp).toISOString();
  data.lastTimestamp = new Date(data.lastTimestamp).toISOString();
  data.median = data.gaps[Math.floor(data.count / 2)];
  data.mean = data.sum / data.count;
  delete data.gaps;
  delete data.sum;
  console.log(JSON.stringify(data, null, 2));
})()' > snapshotData.json
cat snapshotData.json | sed 's/^/# /'
# {
#   "count": 2061,
#   "firstTimestamp": "2024-05-03T18:05:39.352Z",
#   "lastTimestamp": "2024-05-10T04:08:17.204Z",
#   "minGap": 29.997,
#   "maxGap": 2590.485,
#   "gapHighWatermarks": [
#     "node-heap-snapshots/Heap-269-1714759575998.heapsnapshot",
#     "node-heap-snapshots/Heap-269-1714759726799.heapsnapshot",
#     "node-heap-snapshots/Heap-269-1714760240502.heapsnapshot",
#     "node-heap-snapshots/Heap-269-1714760834503.heapsnapshot",
#     "node-heap-snapshots/Heap-269-1714761619303.heapsnapshot",
#     "node-heap-snapshots/Heap-269-1714791458519.heapsnapshot",
#     "node-heap-snapshots/Heap-269-1714800641991.heapsnapshot",
#     "node-heap-snapshots/Heap-270-1714820055013.heapsnapshot",
#     "node-heap-snapshots/Heap-268-1714877515018.heapsnapshot",
#     "node-heap-snapshots/Heap-268-1714884595207.heapsnapshot",
#     "node-heap-snapshots/Heap-271-1714930520894.heapsnapshot",
#     "node-heap-snapshots/Heap-271-1714967410109.heapsnapshot"
#   ],
#   "median": 591.525,
#   "mean": 269.07222319262434
# }
cat snapshotData.json | jq -r '.gapHighWatermarks[]' | grep -Fxf- -C2 snapshots.txt | awk '{
  len = split($0, parts, /[-.]/);
  t = parts[len - 1];
  if (t && lastT) $0 = sprintf("%s # %fs", $0, (t - lastT) / 1000);
  lastT = t;
  print;
}' | sed 's/^/# /'
# node-heap-snapshots/Heap-269-1714759539352.heapsnapshot
# node-heap-snapshots/Heap-269-1714759575998.heapsnapshot # 36.646000s
# node-heap-snapshots/Heap-269-1714759611201.heapsnapshot # 35.203000s
# node-heap-snapshots/Heap-269-1714759726799.heapsnapshot # 115.598000s
# node-heap-snapshots/Heap-269-1714759784401.heapsnapshot # 57.602000s
# node-heap-snapshots/Heap-269-1714759844212.heapsnapshot # 59.811000s
# node-heap-snapshots/Heap-269-1714759917117.heapsnapshot # 72.905000s
# node-heap-snapshots/Heap-269-1714759974505.heapsnapshot # 57.388000s
# node-heap-snapshots/Heap-269-1714760240502.heapsnapshot # 265.997000s
# node-heap-snapshots/Heap-269-1714760270990.heapsnapshot # 30.488000s
# node-heap-snapshots/Heap-269-1714760322999.heapsnapshot # 52.009000s
# --
# node-heap-snapshots/Heap-269-1714760441906.heapsnapshot
# node-heap-snapshots/Heap-269-1714760516101.heapsnapshot # 74.195000s
# node-heap-snapshots/Heap-269-1714760834503.heapsnapshot # 318.402000s
# node-heap-snapshots/Heap-269-1714760898690.heapsnapshot # 64.187000s
# node-heap-snapshots/Heap-269-1714761108286.heapsnapshot # 209.596000s
# node-heap-snapshots/Heap-269-1714761619303.heapsnapshot # 511.017000s
# node-heap-snapshots/Heap-269-1714761692798.heapsnapshot # 73.495000s
# node-heap-snapshots/Heap-269-1714762009507.heapsnapshot # 316.709000s
# --
# node-heap-snapshots/Heap-269-1714790142201.heapsnapshot
# node-heap-snapshots/Heap-269-1714790361994.heapsnapshot # 219.793000s
# node-heap-snapshots/Heap-269-1714791458519.heapsnapshot # 1096.525000s
# node-heap-snapshots/Heap-269-1714791520707.heapsnapshot # 62.188000s
# node-heap-snapshots/Heap-269-1714791562398.heapsnapshot # 41.691000s
# --
# node-heap-snapshots/Heap-269-1714798806585.heapsnapshot
# node-heap-snapshots/Heap-269-1714799482515.heapsnapshot # 675.930000s
# node-heap-snapshots/Heap-269-1714800641991.heapsnapshot # 1159.476000s
# node-heap-snapshots/Heap-269-1714801244989.heapsnapshot # 602.998000s
# node-heap-snapshots/Heap-269-1714801986606.heapsnapshot # 741.617000s
# --
# node-heap-snapshots/Heap-270-1714818789092.heapsnapshot
# node-heap-snapshots/Heap-270-1714818850091.heapsnapshot # 60.999000s
# node-heap-snapshots/Heap-270-1714820055013.heapsnapshot # 1204.922000s
# node-heap-snapshots/Heap-270-1714820710486.heapsnapshot # 655.473000s
# node-heap-snapshots/Heap-270-1714820803985.heapsnapshot # 93.499000s
# --
# node-heap-snapshots/Heap-268-1714876156797.heapsnapshot
# node-heap-snapshots/Heap-268-1714876301918.heapsnapshot # 145.121000s
# node-heap-snapshots/Heap-268-1714877515018.heapsnapshot # 1213.100000s
# node-heap-snapshots/Heap-268-1714877963998.heapsnapshot # 448.980000s
# node-heap-snapshots/Heap-268-1714878141009.heapsnapshot # 177.011000s
# --
# node-heap-snapshots/Heap-268-1714882851608.heapsnapshot
# node-heap-snapshots/Heap-268-1714882939101.heapsnapshot # 87.493000s
# node-heap-snapshots/Heap-268-1714884595207.heapsnapshot # 1656.106000s
# node-heap-snapshots/Heap-268-1714884625609.heapsnapshot # 30.402000s
# node-heap-snapshots/Heap-232-1714885200000.heapsnapshot # 574.391000s
# --
# node-heap-snapshots/Heap-271-1714928639012.heapsnapshot
# node-heap-snapshots/Heap-271-1714928810499.heapsnapshot # 171.487000s
# node-heap-snapshots/Heap-271-1714930520894.heapsnapshot # 1710.395000s
# node-heap-snapshots/Heap-271-1714930865789.heapsnapshot # 344.895000s
# node-heap-snapshots/Heap-271-1714931133408.heapsnapshot # 267.619000s
# --
# node-heap-snapshots/Heap-271-1714963686504.heapsnapshot
# node-heap-snapshots/Heap-271-1714964819624.heapsnapshot # 1133.120000s
# node-heap-snapshots/Heap-271-1714967410109.heapsnapshot # 2590.485000s
# node-heap-snapshots/Heap-271-1714967931888.heapsnapshot # 521.779000s
# node-heap-snapshots/Heap-271-1714968400997.heapsnapshot # 469.109000s
# Extract from the similarly-named neighborhood of the 1213.100000s high water mark.
gdate -ud @$(echo 1714877515018 | sed 's/...$//') | sed 's/^/# /'
# Sun May  5 02:51:55 UTC 2024
cat snapshots.txt | grep -e '-171487[67]' | sed 's/^/# /'
# node-heap-snapshots/Heap-268-1714876156797.heapsnapshot
# node-heap-snapshots/Heap-268-1714876301918.heapsnapshot
# node-heap-snapshots/Heap-268-1714877515018.heapsnapshot
# node-heap-snapshots/Heap-268-1714877963998.heapsnapshot
tar xvzf node-heap-snapshots.tar.gz '*-171487[67]*' 2>&1 | sed 's/^/# /'
# x node-heap-snapshots/Heap-268-1714877515018.heapsnapshot
# x node-heap-snapshots/Heap-268-1714876156797.heapsnapshot
# x node-heap-snapshots/Heap-268-1714877963998.heapsnapshot
# x node-heap-snapshots/Heap-268-1714876301918.heapsnapshot

Loading these four snapshots into Dev Tools and inspecting their differences fingered xsnap makeSnapshotInternal as relevant to the retention path of snapshots/sockets/etc.

@mergify mergify bot closed this as completed in #9405 May 28, 2024
mergify bot added a commit that referenced this issue May 28, 2024
Fixes #9316

## Description

Add testing code to observe memory leaks related to xsnap snapshots, and
fix the leak that retains them.

### Security Considerations

None.

### Scaling Considerations

None.

### Documentation Considerations

None.

### Testing Considerations

Covered!

### Upgrade Considerations

None (this is kernel code).
mhofman pushed a commit that referenced this issue May 30, 2024
Fixes #9316

Add testing code to observe memory leaks related to xsnap snapshots, and
fix the leak that retains them.

None.

None.

None.

Covered!

None (this is kernel code).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cosmic-swingset package: cosmic-swingset performance Performance related issues SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants