-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Memory leak in cosmic-swingset causes validator missed block #9316
Labels
bug
Something isn't working
cosmic-swingset
package: cosmic-swingset
performance
Performance related issues
SwingSet
package: SwingSet
Comments
mhofman
added
bug
Something isn't working
SwingSet
package: SwingSet
cosmic-swingset
package: cosmic-swingset
performance
Performance related issues
labels
May 3, 2024
gibson042
added a commit
that referenced
this issue
May 23, 2024
This was an interesting scope leak in the kernel. Minimal repro in a gist. There is an uprgrade-15 branch with a hotfix for this: mhofman/patch-u15-xsnap-leak |
gibson042
added a commit
that referenced
this issue
May 25, 2024
gibson042
added a commit
that referenced
this issue
May 28, 2024
gibson042
pushed a commit
that referenced
this issue
May 28, 2024
Node snapshot analysis travelogue: # Inspect tarball contents.
tar tvzf node-heap-snapshots.tar.gz | awk '{ print $NF }' | sed 's/^/# /; 4q'
# node-heap-snapshots/
# node-heap-snapshots/Heap-269-1714766591308.heapsnapshot
# node-heap-snapshots/Heap-268-1715106920615.heapsnapshot
# node-heap-snapshots/Heap-270-1714822445711.heapsnapshot # The base filename structure is `Heap-${process.pid}-${Date.now()}.heapsnapshot`.
# https://github.com/Agoric/agoric-sdk/blob/8343a0ca16603ce52aeb4818d4dbf2529b9a59f7/packages/cosmic-swingset/src/chain-main.js#L442
# Sort snapshots by timestamp.
tar tvzf node-heap-snapshots.tar.gz | awk '/Heap-/ { print $NF }' | sort -t- -k5n > snapshots.txt # Check count.
wc -l snapshots.txt | sed 's/^/# /'
# 2061 snapshots.txt # Analyze data.
cat snapshots.txt | node -e '(async () => {
const reader = require("node:readline").createInterface({ input: process.stdin });
const data = {
count: 0,
firstTimestamp: undefined,
lastTimestamp: undefined,
minGap: Infinity,
maxGap: -Infinity,
sum: 0,
gaps: [],
gapHighWatermarks: [],
};
for await (const line of reader) {
const { epochms } = line.match(/Heap-(?<pid>[0-9]+)-(?<epochms>[0-9]+)/)?.groups || {};
if (!epochms) throw Error(`expected Heap-$pid-$timestamp in ${line}`);
const timestamp = +epochms;
if (data.count === 0) {
data.count++;
data.firstTimestamp = data.lastTimestamp = timestamp;
continue;
}
const { lastTimestamp } = data;
const gap = (timestamp - lastTimestamp) / 1000;
if (gap <= 0) throw Error(`expected timestamp advancement from ${lastTimestamp} in ${line}`);
data.count++;
data.lastTimestamp = timestamp;
if (gap < data.minGap) data.minGap = gap;
if (gap > data.maxGap) { data.gapHighWatermarks.push(line); data.maxGap = gap; }
data.sum += gap;
data.gaps.push(gap);
}
data.firstTimestamp = new Date(data.firstTimestamp).toISOString();
data.lastTimestamp = new Date(data.lastTimestamp).toISOString();
data.median = data.gaps[Math.floor(data.count / 2)];
data.mean = data.sum / data.count;
delete data.gaps;
delete data.sum;
console.log(JSON.stringify(data, null, 2));
})()' > snapshotData.json
cat snapshotData.json | sed 's/^/# /'
# {
# "count": 2061,
# "firstTimestamp": "2024-05-03T18:05:39.352Z",
# "lastTimestamp": "2024-05-10T04:08:17.204Z",
# "minGap": 29.997,
# "maxGap": 2590.485,
# "gapHighWatermarks": [
# "node-heap-snapshots/Heap-269-1714759575998.heapsnapshot",
# "node-heap-snapshots/Heap-269-1714759726799.heapsnapshot",
# "node-heap-snapshots/Heap-269-1714760240502.heapsnapshot",
# "node-heap-snapshots/Heap-269-1714760834503.heapsnapshot",
# "node-heap-snapshots/Heap-269-1714761619303.heapsnapshot",
# "node-heap-snapshots/Heap-269-1714791458519.heapsnapshot",
# "node-heap-snapshots/Heap-269-1714800641991.heapsnapshot",
# "node-heap-snapshots/Heap-270-1714820055013.heapsnapshot",
# "node-heap-snapshots/Heap-268-1714877515018.heapsnapshot",
# "node-heap-snapshots/Heap-268-1714884595207.heapsnapshot",
# "node-heap-snapshots/Heap-271-1714930520894.heapsnapshot",
# "node-heap-snapshots/Heap-271-1714967410109.heapsnapshot"
# ],
# "median": 591.525,
# "mean": 269.07222319262434
# }
cat snapshotData.json | jq -r '.gapHighWatermarks[]' | grep -Fxf- -C2 snapshots.txt | awk '{
len = split($0, parts, /[-.]/);
t = parts[len - 1];
if (t && lastT) $0 = sprintf("%s # %fs", $0, (t - lastT) / 1000);
lastT = t;
print;
}' | sed 's/^/# /'
# node-heap-snapshots/Heap-269-1714759539352.heapsnapshot
# node-heap-snapshots/Heap-269-1714759575998.heapsnapshot # 36.646000s
# node-heap-snapshots/Heap-269-1714759611201.heapsnapshot # 35.203000s
# node-heap-snapshots/Heap-269-1714759726799.heapsnapshot # 115.598000s
# node-heap-snapshots/Heap-269-1714759784401.heapsnapshot # 57.602000s
# node-heap-snapshots/Heap-269-1714759844212.heapsnapshot # 59.811000s
# node-heap-snapshots/Heap-269-1714759917117.heapsnapshot # 72.905000s
# node-heap-snapshots/Heap-269-1714759974505.heapsnapshot # 57.388000s
# node-heap-snapshots/Heap-269-1714760240502.heapsnapshot # 265.997000s
# node-heap-snapshots/Heap-269-1714760270990.heapsnapshot # 30.488000s
# node-heap-snapshots/Heap-269-1714760322999.heapsnapshot # 52.009000s
# --
# node-heap-snapshots/Heap-269-1714760441906.heapsnapshot
# node-heap-snapshots/Heap-269-1714760516101.heapsnapshot # 74.195000s
# node-heap-snapshots/Heap-269-1714760834503.heapsnapshot # 318.402000s
# node-heap-snapshots/Heap-269-1714760898690.heapsnapshot # 64.187000s
# node-heap-snapshots/Heap-269-1714761108286.heapsnapshot # 209.596000s
# node-heap-snapshots/Heap-269-1714761619303.heapsnapshot # 511.017000s
# node-heap-snapshots/Heap-269-1714761692798.heapsnapshot # 73.495000s
# node-heap-snapshots/Heap-269-1714762009507.heapsnapshot # 316.709000s
# --
# node-heap-snapshots/Heap-269-1714790142201.heapsnapshot
# node-heap-snapshots/Heap-269-1714790361994.heapsnapshot # 219.793000s
# node-heap-snapshots/Heap-269-1714791458519.heapsnapshot # 1096.525000s
# node-heap-snapshots/Heap-269-1714791520707.heapsnapshot # 62.188000s
# node-heap-snapshots/Heap-269-1714791562398.heapsnapshot # 41.691000s
# --
# node-heap-snapshots/Heap-269-1714798806585.heapsnapshot
# node-heap-snapshots/Heap-269-1714799482515.heapsnapshot # 675.930000s
# node-heap-snapshots/Heap-269-1714800641991.heapsnapshot # 1159.476000s
# node-heap-snapshots/Heap-269-1714801244989.heapsnapshot # 602.998000s
# node-heap-snapshots/Heap-269-1714801986606.heapsnapshot # 741.617000s
# --
# node-heap-snapshots/Heap-270-1714818789092.heapsnapshot
# node-heap-snapshots/Heap-270-1714818850091.heapsnapshot # 60.999000s
# node-heap-snapshots/Heap-270-1714820055013.heapsnapshot # 1204.922000s
# node-heap-snapshots/Heap-270-1714820710486.heapsnapshot # 655.473000s
# node-heap-snapshots/Heap-270-1714820803985.heapsnapshot # 93.499000s
# --
# node-heap-snapshots/Heap-268-1714876156797.heapsnapshot
# node-heap-snapshots/Heap-268-1714876301918.heapsnapshot # 145.121000s
# node-heap-snapshots/Heap-268-1714877515018.heapsnapshot # 1213.100000s
# node-heap-snapshots/Heap-268-1714877963998.heapsnapshot # 448.980000s
# node-heap-snapshots/Heap-268-1714878141009.heapsnapshot # 177.011000s
# --
# node-heap-snapshots/Heap-268-1714882851608.heapsnapshot
# node-heap-snapshots/Heap-268-1714882939101.heapsnapshot # 87.493000s
# node-heap-snapshots/Heap-268-1714884595207.heapsnapshot # 1656.106000s
# node-heap-snapshots/Heap-268-1714884625609.heapsnapshot # 30.402000s
# node-heap-snapshots/Heap-232-1714885200000.heapsnapshot # 574.391000s
# --
# node-heap-snapshots/Heap-271-1714928639012.heapsnapshot
# node-heap-snapshots/Heap-271-1714928810499.heapsnapshot # 171.487000s
# node-heap-snapshots/Heap-271-1714930520894.heapsnapshot # 1710.395000s
# node-heap-snapshots/Heap-271-1714930865789.heapsnapshot # 344.895000s
# node-heap-snapshots/Heap-271-1714931133408.heapsnapshot # 267.619000s
# --
# node-heap-snapshots/Heap-271-1714963686504.heapsnapshot
# node-heap-snapshots/Heap-271-1714964819624.heapsnapshot # 1133.120000s
# node-heap-snapshots/Heap-271-1714967410109.heapsnapshot # 2590.485000s
# node-heap-snapshots/Heap-271-1714967931888.heapsnapshot # 521.779000s
# node-heap-snapshots/Heap-271-1714968400997.heapsnapshot # 469.109000s # Extract from the similarly-named neighborhood of the 1213.100000s high water mark.
gdate -ud @$(echo 1714877515018 | sed 's/...$//') | sed 's/^/# /'
# Sun May 5 02:51:55 UTC 2024
cat snapshots.txt | grep -e '-171487[67]' | sed 's/^/# /'
# node-heap-snapshots/Heap-268-1714876156797.heapsnapshot
# node-heap-snapshots/Heap-268-1714876301918.heapsnapshot
# node-heap-snapshots/Heap-268-1714877515018.heapsnapshot
# node-heap-snapshots/Heap-268-1714877963998.heapsnapshot
tar xvzf node-heap-snapshots.tar.gz '*-171487[67]*' 2>&1 | sed 's/^/# /'
# x node-heap-snapshots/Heap-268-1714877515018.heapsnapshot
# x node-heap-snapshots/Heap-268-1714876156797.heapsnapshot
# x node-heap-snapshots/Heap-268-1714877963998.heapsnapshot
# x node-heap-snapshots/Heap-268-1714876301918.heapsnapshot Loading these four snapshots into Dev Tools and inspecting their differences fingered xsnap |
mergify bot
added a commit
that referenced
this issue
May 28, 2024
Fixes #9316 ## Description Add testing code to observe memory leaks related to xsnap snapshots, and fix the leak that retains them. ### Security Considerations None. ### Scaling Considerations None. ### Documentation Considerations None. ### Testing Considerations Covered! ### Upgrade Considerations None (this is kernel code).
mhofman
pushed a commit
that referenced
this issue
May 30, 2024
Fixes #9316 Add testing code to observe memory leaks related to xsnap snapshots, and fix the leak that retains them. None. None. None. Covered! None (this is kernel code).
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
bug
Something isn't working
cosmic-swingset
package: cosmic-swingset
performance
Performance related issues
SwingSet
package: SwingSet
Describe the bug
There seem to be a leak in the Node.js side of the
agd
process (cosmic-swingset or swingset), which results in memory growth and slower and slower force GC after each block. Ultimately this causes validators to miss blocks if the GC takes more than the block voting period (5s).To Reproduce
Run a node on mainnet, observe slow increase in Node heap usage and forced gc time.
Expected behavior
Flat memory usage.
This is likely a regression as we used to observe this metric (added in #5637) in previous upgrades after we had a large leak caused by Promise.race used in xsnap (#5643)
Platform Environment
upgrade-14 on mainnet, but upgrade-15 is likely affected as well
Workaround
Restarting
agd
will free the leaked memory, resetting consumption back to expected levels. This is considered better than allowing OOM mitigations to kill processes, but note that it may take a few blocks to catch back up.Alternatively, we now have a branch with a patch for
agoric-upgrade-15
: mhofman/patch-u15-xsnap-leakAdditional context
Screenshots
The text was updated successfully, but these errors were encountered: