
intermittent CI failure: test-terminate.js gets SIGKILL #5782

Closed · warner opened this issue Jul 19, 2022 · 8 comments

warner commented Jul 19, 2022

Both @turadg and @Chris-Hibbert have seen intermittent but fairly persistent CI failures in the last week, in the test-swingset4 (xs) job, when it runs test-vat-admin/terminate/test-terminate.js. The CI job ends with:

  ✖ test/vat-admin/terminate/test-terminate.js exited due to SIGKILL
   ─
   253 tests passed
   3 known failures
   4 tests skipped

This suggests that none of the t.is/etc test assertions within test-terminate.js reported failure (else we'd have seen a NN tests failed report), but the test process itself (the Node.js child process that AVA spawned as a worker, whose first action is to import test-terminate.js) was killed by something with SIGKILL.

I've seen SIGKILL used by the Linux kernel Out-Of-Memory handler (the "OOM Killer") when the host experiences memory pressure and chooses some likely target to kill. My current best hypothesis is that this one test process is getting into some sort of loop which consumes a whole lot of memory, and the host's OOM killer takes it out. The GitHub CI environment doesn't give us a lot of information about stuff like that.

I'll try running this test locally on a Linux box, paying close attention to the memory it uses, to check whether it seems excessive.

I've also seen it show up when the CI job is taking too long and the CI runner decides to give up on the job. This job terminates after only 8 minutes, and the timeout/kills I've seen in the past have taken at least a few hours, so I doubt that's what's happening.

Since the test in question is test-terminate.js, another possibility is that something in the vat-warehouse is confused about which process to kill when the vat under test does something termination-worthy. I landed code last week which refactored the way vat termination is managed, so it's conceivable that there's a new bug, or an old one newly exposed, in which the vat warehouse goes to kill the worker process and manages to kill itself instead.

warner added the labels bug (Something isn't working) and SwingSet (package: SwingSet) on Jul 19, 2022
warner self-assigned this on Jul 19, 2022

warner commented Jul 19, 2022

The actual command that the CI job runs:

Run cd packages/SwingSet && yarn test:xs 'test/**/test-[S-Zs-z0-9]*.js'
  cd packages/SwingSet && yarn test:xs 'test/**/test-[S-Zs-z0-9]*.js'
  shell: /usr/bin/bash -e {0}
  env:
    ESM_DISABLE_CACHE: true
yarn run v1.22.19
$ SWINGSET_WORKER_TYPE=xs-worker ava 'test/**/test-[S-Zs-z0-9]*.js'


warner commented Jul 19, 2022

Running this locally, I see about 8 node_modules/ava/lib/worker/subprocess.js processes, each using about 11-12GB of vmem, and an overall load average of 14, so it's slamming my linux box pretty badly. I've got 32GB of physical memory and it isn't using any swap, and the average resident memory of each of those processes is about 400MB, so it's not exactly memory pressure from the Node.js parents (the test programs). But there are a lot of xsnap-worker processes running: many tests spin up a kernel that needs to launch 4 or 5 workers each, and this is running a lot of tests in parallel.

If I run just yarn test:xs test/vat-admin/terminate/test-terminate.js, I see:

  • physical memory usage rises by about 10GB (my computer seems to use 15GB at baseline, and it grows to about 25GB)
  • the one Node.js subprocess.js hits about 12GB virtual, 0.8GB resident
  • there are 172 xsnap-worker processes at the peak, each about 1.0GB virtual, 100MB resident

My guess is that the massive number of xsnap-worker processes, despite each being fairly small, adds up to enough RAM usage to trigger the CI machine's OOM killer, which looks around for a good candidate to kill, sees that the AVA subprocess.js is using both the most vmem and the most resident RAM, and kills it.

The immediate change that could have triggered this would be the addition of new test cases to test-terminate.js, which increases the opportunity for parallelism. AVA must impose some parallelism limit internally, but it doesn't know how expensive each test might be (especially if it spawns child processes like our XS tests do). test-terminate.js currently defines 30 test functions, all of which start a kernel. Each kernel appears to launch 5 workers.

If that hypothesis holds up, the fix will be to insert a bunch of test.serial in that file, to reduce the parallelism.
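
For illustration, a minimal sketch of that change; buildKernel and the test name here are hypothetical stand-ins, not the real contents of test-terminate.js:

  import test from 'ava';
  import { buildKernel } from './kernel-helpers.js'; // hypothetical helper that starts a kernel and its xsnap workers

  // Before: a plain test() body, which AVA is free to run concurrently with
  // every other non-serial test in the file, so ~30 kernels (each with ~5
  // xsnap workers) can be alive at once:
  //   test('terminates a critical vat', async t => { ... });
  //
  // After: test.serial() forces one-at-a-time execution within this file, so
  // only one kernel's worth of xsnap workers exists at any moment.
  test.serial('terminates a critical vat', async t => {
    const controller = await buildKernel(t);
    // ... drive the vat to do something termination-worthy, then assert ...
    t.pass();
  });

With roughly 30 kernel-starting tests at ~5 workers each, the fully parallel version can reach ~150 xsnap processes, which lines up with the 172 observed above.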

We should probably look through all of our tests to identify the ones that start full kernels, limit the number of non-serialized kernel-starting tests in any one file, and then multiply that per-file limit by AVA's test-file parallelization factor to keep the total number of simultaneous xsnap-worker processes under some budget.


warner commented Jul 19, 2022

Commit 9ef4941 (PR #5436) changed test-terminate.js from having 1+10 tests (one test.serial() plus ten test(), which I think means we get 1 test followed by 10 tests-in-parallel), to having 1+31.

@FUDCo made a change (e8055a) to split the file into two pieces (test-terminate-critical.js and test-terminate-non-critical.js), but (1) I don't think that would help (it might even hurt, if it means AVA is more willing to run the two files in parallel), and (2) we reverted that change a month later (in 5ca5d0e, PR #5710). @mhofman do you remember why you included a revert of that change in your PR? I imagine the intent might have been to improve speed.

What I really want is a way to tell AVA "hey, there's this limited shared resource called 'xsnap processes', and you should never have more than N running at a time, and for each test() here's an option that says how many of these xsnaps it will run, please limit your parallel scheduling to avoid using too many". Or a simpler form where each test is assumed to use 1 "real kernel launch (with an XS worker cohort)" and we want to limit that set to maybe 5 simultaneous.

I'll make a dummy PR that adds some test.serial to test-terminate.js and see how it fares against CI.


turadg commented Jul 19, 2022

What I really want is a way to tell AVA "hey, there's this limited shared resource

https://github.com/avajs/ava/blob/main/docs/recipes/shared-workers.md might help. Would require upgrading to Ava 4 though.
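
For reference, a rough sketch of how that recipe could be used as a counting semaphore for kernel launches. This is only an illustration under the assumption that the AVA 4 plugin API behaves as the recipe describes; the file names, message shapes, and MAX_KERNELS value are invented:

  // kernel-semaphore-worker.mjs — the shared worker, loaded once per AVA run (hypothetical)
  const MAX_KERNELS = 5; // assumed budget for simultaneous kernels, not a value from this issue

  export default async ({ negotiateProtocol }) => {
    const protocol = negotiateProtocol(['ava-4']).ready();
    let inUse = 0;
    const waiting = [];
    for await (const message of protocol.subscribe()) {
      if (message.data.type === 'acquire') {
        if (inUse < MAX_KERNELS) {
          inUse += 1;
          message.reply({ granted: true });
        } else {
          waiting.push(message); // grant later, when a slot frees up
        }
      } else if (message.data.type === 'release') {
        const next = waiting.shift();
        if (next) {
          next.reply({ granted: true }); // hand the freed slot directly to a waiter
        } else {
          inUse -= 1;
        }
      }
    }
  };

  // kernel-slot.mjs — helper imported by each kernel-starting test file (hypothetical)
  import { registerSharedWorker } from 'ava/plugin';

  const semaphore = registerSharedWorker({
    filename: new URL('kernel-semaphore-worker.mjs', import.meta.url),
    supportedProtocols: ['ava-4'],
  });

  export const withKernelSlot = async run => {
    await semaphore.available;
    const request = semaphore.publish({ type: 'acquire' });
    await request.replies().next(); // wait for the grant before starting xsnap workers
    try {
      return await run();
    } finally {
      semaphore.publish({ type: 'release' });
    }
  };

Each kernel-starting test would wrap its body in withKernelSlot(...), so the total number of live xsnap cohorts across all test files stays under the budget regardless of how AVA schedules things.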


warner commented Jul 19, 2022

I was able to reproduce this locally by running cd packages/SwingSet && yarn test:xs test/vat-admin/terminate/test-terminate.js in a docker container (hosted on macOS) configured to provide 8GB of RAM (and 1GB of swap) to the linux kernel. It reported 1 test passed followed by error Command failed with exit code 1.

Reconfiguring Docker to provide 16GB RAM allowed the test to pass.

On the 8GB configuration, by following the emulated kernel's log (~/Library/Containers/com.docker.docker/Data/log/vm/kmsg.log), I observed the OOM Killer's activation. At the time it fired, there were 145 xsnap-worker processes, each taking about 270MB of VMSize and 10MB-13.5MB of RSS, which adds up to about 2GB of RSS. There were four Node.js processes running, only one of which had significant usage: 1.4GB of VMSize, 95MB of RSS. The log reported 0 bytes of swap remaining, out of the 1GB available.

I'm not sure how things add up, but I know there's all sorts of overhead, so I'm not entirely surprised that 2GB RSS of workers on a nominally 8GB kernel was enough to trigger the OOM Killer.


warner commented Jul 19, 2022

Changing all 32 test()s in test-terminate.js into test.serial() slows this test down on my laptop (68s instead of 48s) but allows it to complete successfully in an 8GB docker container. The associated PR does that, and appears to pass on GitHub CI (at least 3 out of 3).


mhofman commented Jul 19, 2022

@mhofman do you remember why you included a revert of that change in your PR? I imagine the intent might have been to improve speed

I reverted the split because the root cause of the previous OOM was that we left XS processes lying around. I, and I believe @FUDCo, considered that split a hack.

It seems that we now have another OOM, but this time it's due to the testing environment not accounting for the resource cost of individual tests. A one-off split would work too, but as this issue suggests, a better fix would be to tell AVA to do less parallelization, possibly based on the amount of memory available.


warner commented Jul 19, 2022

Cool, thanks. Yeah, I think t.teardown(controller.shutdown) was the correct, non-hackish fix for the accumulating-children problem, and test.serial is the best fix for the new "cumulative worker memory vs OOM Killer" problem. I just read through the AVA v4 shared-worker docs that @turadg referenced (pretty cool!); that will be the new best fix for the cumulative-worker-memory problem, since we can implement a test-suite-wide shared mutex and limit the number of simultaneous kernels to something reasonable.
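
As a reminder of what that teardown pattern looks like (sketched with a hypothetical buildKernel helper, not the actual test code):

  import test from 'ava';
  import { buildKernel } from './kernel-helpers.js'; // hypothetical

  test.serial('terminates a vat without leaking workers', async t => {
    const controller = await buildKernel(t);
    // Registering shutdown as a teardown reaps the kernel's xsnap child
    // processes even if an assertion fails partway through the test.
    t.teardown(controller.shutdown);
    // ... test body ...
    t.pass();
  });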

mergify bot closed this as completed in 890245e on Jul 20, 2022
Tartuffo added this to the Mainnet 1 milestone on Jul 20, 2022