intermittent CI failure: test-terminate.js gets SIGKILL #5782
The actual command that test executes: …

Running this locally, I see about 8 …. If I run just ….

My guess is that the massive number of …. The immediate change that could have triggered this would be the addition of new test cases to …. If that hypothesis holds up, the fix will be to insert a bunch of …. We should probably look through all of our tests to identify the ones that start full kernels, limit the number of non-serialized kernel-starting tests in any one file, then multiply that by AVA's test-file parallelization factor, to meet some budget limit on the total number of simultaneous ….
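If the fix does turn out to be marking the kernel-starting cases as serial, a minimal sketch of what that could look like in AVA follows; the test names and the `buildKernelAndRun` stand-in are invented for illustration and are not the actual contents of `test-terminate.js`:

```js
// Sketch only: mark the expensive kernel-starting cases with test.serial so
// that, within this file, only one of them (and its workers) runs at a time.
// Test names and the buildKernelAndRun stand-in are invented.
import test from 'ava';

// stand-in for whatever helper actually boots a swingset kernel plus its
// worker processes and runs the scenario; invented for this illustration
const buildKernelAndRun = async mode => ({ status: 'terminated', mode });

// cheap tests can stay concurrent
test('non-kernel sanity check', t => {
  t.is(typeof buildKernelAndRun, 'function');
});

// expensive kernel-starting tests run one at a time, bounding peak memory
test.serial('terminate, failure path', async t => {
  const { status } = await buildKernelAndRun('failure');
  t.is(status, 'terminated');
});

test.serial('terminate, happy path', async t => {
  const { status } = await buildKernelAndRun('happy');
  t.is(status, 'terminated');
});
```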
Commit 9ef4941 (PR #5436) changed …. @FUDCo made a change (e8055a) to split the file into two pieces (…).

What I really want is a way to tell AVA "hey, there's this limited shared resource called 'xsnap processes', and you should never have more than N running at a time, and for each …".

I'll make a dummy PR that adds some ….
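Absent such a feature, the "at most N at a time" limit can be roughed out inside a single test file with an ordinary promise-based semaphore. This is only a sketch of the idea (the `makeSemaphore` helper and the kernel-starting body are invented), and it only bounds concurrency within one file, not across AVA's parallel test files:

```js
// Illustrative only: an in-process semaphore limiting how many
// kernel-starting tests run at once within a single test file.
import test from 'ava';

const makeSemaphore = max => {
  let active = 0;
  const waiters = [];
  const release = () => {
    active -= 1;
    const next = waiters.shift();
    if (next) {
      active += 1;
      next();
    }
  };
  const acquire = async () => {
    if (active < max) {
      active += 1;
    } else {
      await new Promise(resolve => waiters.push(resolve));
    }
    return release;
  };
  return { acquire };
};

// allow at most 2 simultaneous kernels in this file
const kernelSlots = makeSemaphore(2);

const withKernel = async run => {
  const release = await kernelSlots.acquire();
  try {
    return await run();
  } finally {
    release();
  }
};

test('termination case A', async t => {
  await withKernel(async () => {
    // hypothetical: const { kernel } = await startKernelForTest();
    t.pass();
  });
});
```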
https://github.com/avajs/ava/blob/main/docs/recipes/shared-workers.md might help. Would require upgrading to Ava 4 though.
I was able to reproduce this locally by running …. Reconfiguring Docker to provide 16GB RAM allowed the test to pass. On the 8GB configuration, by following the emulated kernel's log (…), ….

I'm not sure how things add up, but I know there's all sorts of overhead, so I'm not entirely surprised that 2GB RSS of workers on a nominally 8GB kernel was enough to trigger the OOM Killer.
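To see how the worker memory adds up during a run, one rough approach is to poll the total RSS of the worker processes while the test executes. This is an assumption-laden sketch: it relies on a POSIX `ps` being available and on the workers being identifiable by "xsnap" in their command line:

```js
// Rough memory-accounting helper: sum RSS (reported in KB by ps) of all
// processes whose command line mentions "xsnap", polled once per second.
// The "xsnap" match is a guess at how the workers show up in the list.
import { execFile } from 'child_process';

const sampleXsnapRss = () =>
  new Promise((resolve, reject) => {
    execFile('ps', ['-axo', 'rss=,command='], (err, stdout) => {
      if (err) return reject(err);
      const totalKb = stdout
        .split('\n')
        .filter(line => line.includes('xsnap'))
        .reduce((sum, line) => sum + Number(line.trim().split(/\s+/)[0] || 0), 0);
      resolve(totalKb / 1024); // MB
    });
  });

setInterval(async () => {
  const mb = await sampleXsnapRss();
  console.log(`total xsnap RSS: ${mb.toFixed(0)} MB`);
}, 1000);
```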
Changing all 32 ….
I reverted the split because the root cause of the previous OOM was that we left XS processes lying around. I, and I believe @FUDCo, considered that split a hack. It seems that we now have another OOM, but this time because the testing environment doesn't account for the resource cost of individual tests. A one-off split would work too, but as this issue suggests, a better fix would be to tell AVA to do less parallelization, possibly based on the amount of memory available.
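One possible shape for that: AVA's `concurrency` option controls how many test files run at once, and it can be computed from the machine's memory in `ava.config.js`. The 1 GiB-per-file budget below is a made-up figure, purely to illustrate the idea:

```js
// ava.config.js — sketch: derive AVA's test-file concurrency from available
// memory rather than CPU count alone. The 1 GiB-per-file budget is invented
// for illustration; the real per-file cost would need measuring.
import os from 'os';

const GiB = 1024 ** 3;
const perFileBudget = 1 * GiB;
const filesThatFitInMemory = Math.max(1, Math.floor(os.totalmem() / perFileBudget));

export default {
  // never exceed the CPU-based default, but shrink when memory is tight
  concurrency: Math.min(os.cpus().length, filesThatFitInMemory),
};
```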
Cool, thanks, yeah, I think ….
Both @turadg and @Chris-Hibbert have seen intermittent but fairly persistent CI failures in the last week, in the `test-swingset4 (xs)` job, when it runs `test-vat-admin/terminate/test-terminate.js`. The CI job ends with: …

This suggests that none of the `t.is`/etc test assertions within `test-terminate.js` reported failure (else we'd have seen a `NN tests failed` report), but the test process itself (the Node.js child process that AVA spawned as a worker, whose first action is to import `test-terminate.js`) was killed by something with SIGKILL.

I've seen SIGKILL used by the Linux kernel Out-Of-Memory handler (the "OOM Killer") when the host experiences memory pressure and chooses some likely target to kill. My current best hypothesis is that this one test process is getting into some sort of loop which consumes a whole lot of memory, and the host's OOM killer takes it out. The GitHub CI environment doesn't give us a lot of information about stuff like that.

I'll try running this test locally on a Linux box, to see if I can pay close attention to the memory it uses, to check if it seems excessive.

I've also seen it show up when the CI job is taking too long and the CI runner decides to give up on the job. This job terminates after only 8 minutes, and the timeouts/kills I've seen in the past have taken at least a few hours, so I doubt that's what's happening.

Since the test in question is `test-terminate.js`, another possibility is that something in the vat-warehouse is confused about which process to kill when the vat under test does something termination-worthy. I landed code last week which refactored the way vat termination is managed, so it's conceivable that there's a new bug, or an old one newly exposed, in which the vat warehouse goes to kill the worker process and manages to kill itself instead.