
dirty-vat-driven BOYD scheduler #8980

Open · warner opened this issue Feb 23, 2024 · 1 comment · May be fixed by #9169
Labels: enhancement (New feature or request), SwingSet (package: SwingSet)

warner commented Feb 23, 2024

What is the Problem Being Solved?

Vats accumulate "garbage" over time: JS objects (Presences, Remotables, Representatives) which have been collected but not yet processed, as well as imported/exported refcounts that have reached zero but have not yet been processed. We defer this processing until a special dispatch.bringOutYourDead delivery, which performs an engine-level forced GC pass, examines all the finalizers that have run since the last BOYD, and checks current refcounts on every vref/baseref that might now be unreachable or unrecognizable.

We defer BOYD until specific calls, to reduce the sensitivity of normal execution (especially metering and the syscall transcript) to the details of the garbage collector. We definitely need to prevent userspace from sensing GC activity, but if we can also maintain the transcript's insensitivity, then we have more flexibility to change the engine in the future (upgrading XS or xsnap without requiring a complete vat upgrade).

We also do a BOYD just before we write out an XS heap snapshot, because the heap-writing process does a forced GC anyway, and we want to minimize the size of the snapshot (i.e. "empty your trash cans before you put them in the moving van").

However, this raised the question of how frequently we should make these BOYD calls. Too frequently, and we're spending a lot of time forcing unnecessary GC mark-and-sweep operations. Too infrequently, and a single BOYD might need to examine a large number of vrefs, which requires refcount syscalls (vatstoreGet), and which might create syscall.dropImports/retireImports calls with a large number of vrefs (which creates spikes in kernel activity that may cause problems).

The current kernel scheduler triggers a BOYD after every 2000 deliveries ("reapInterval", recorded separately for each vat in vatKeeper.updateReapInterval(), read from a kernel-wide configurable default that is set with changeKernelOptions({defaultReapInterval: NN})). However, it also triggers a heap snapshot (and thus a BOYD) on a different schedule: 3 deliveries after the start of the incarnation, and then every 200 deliveries (snapshotInitial and snapshotInterval, configurable on the kernel as a whole with changeKernelOptions({snapshotInterval: NN})). The host application can also invoke controller.reapAllVats() to trigger a BOYD on all vats, although this does not update the counters (#8665), so the usual BOYDs will still happen at their usual time.

The worst case, and the motivation for this ticket, is when a large vat is terminated, or deletes a large collection, and drops a lot of references into some other vat. The #8928 strategy limits the rate at which we send dispatch.dropExports/retireExports into the upstream vat, but if we don't also do BOYD frequently enough, we'll still be stuck with problematically-large BOYD deliveries. The largest expansion factor I've seen so far is when a terminated contract's ExitObject or SeatHandle is dropped, because these are engaged in reference cycles through v9-zoe. Each dropped object triggers about 50 syscalls in v9-zoe (to unwind the zoe-side portion of the cycle).

I want to limit each BOYD to maybe 500 syscalls, to minimize the CPU time and kvStore churn (which causes IAVL churn, which adds to the cosmos-sdk state pruning work that happens sooner or later). Based on my studies so far, 500 syscalls of work can be done in about half a second, which seems safe to do in nearly every block (as a "background" task, only done when no other work was done in that block).

So I want the scheduler to trigger a BOYD after 10 krefs have been dropped and retired.

Description of the Design

We'd replace the current vNN.reapCountdown vatKeeper key with one named vNN.dirty. The value is a JSON-serialized record with { deliveries, gcKrefs, computrons }. Each non-BOYD delivery (and its deliveryResults) gets sent to a new function, maybe named measureDirt (sketch after this list), which:

  • deserializes the old record
  • increments deliveries
  • uses the result computrons (if any) to increment computrons
  • examines the delivery to see if it's a dispatch.dropImports or other GC delivery, counts the krefs inside, and maybe increments gcKrefs
  • serializes and writes the record back to kvStore
  • finally, compares the values against the vat's trigger thresholds (vNN.reapInterval, and some new kernel-wide gcKrefsLimit)
    • if exceeded, adds the vatID to kernelKeeper's reapQueue with scheduleReap(vatID)
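
Roughly, something like this (a sketch only: the helper names, the threshold shape, and the exact set of GC delivery types are provisional):

```js
// Sketch of the proposed measureDirt(); kvStore and scheduleReap stand
// in for the usual keeper facilities.
const GC_DELIVERY_TYPES = ['dropExports', 'retireExports', 'retireImports'];

function measureDirt(kvStore, scheduleReap, vatID, delivery, deliveryResults, thresholds) {
  const key = `${vatID}.dirty`;
  // deserialize the old record, initializing it if missing
  const dirty = JSON.parse(
    kvStore.get(key) || '{"deliveries":0,"gcKrefs":0,"computrons":0}',
  );
  dirty.deliveries += 1;
  if (deliveryResults.computrons !== undefined) {
    dirty.computrons += Number(deliveryResults.computrons);
  }
  const [type, krefs] = delivery;
  if (GC_DELIVERY_TYPES.includes(type)) {
    dirty.gcKrefs += krefs.length; // count the krefs inside the GC delivery
  }
  kvStore.set(key, JSON.stringify(dirty)); // write the record back
  // compare against the vat's trigger thresholds
  if (dirty.deliveries >= thresholds.deliveries || dirty.gcKrefs >= thresholds.gcKrefs) {
    scheduleReap(vatID); // add the vatID to kernelKeeper's reapQueue
  }
}
```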

Then, the next time kernel.step() happens, it might see a vatID in the reapQueue, and process it (sketch after this list):

  • remove the vatID from reapQueue
  • perform the BOYD delivery
  • zero out the vNN.dirty counters
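
And the kernel.step() side, schematically (deliverBOYD and clearDirt are placeholders for whatever the kernel actually provides):

```js
// Inside kernel.step(): one possible way to process the reapQueue.
if (reapQueue.length > 0) {
  const vatID = reapQueue.shift(); // remove the vatID from reapQueue
  await deliverBOYD(vatID); // perform the BOYD delivery
  clearDirt(vatID); // zero out the vNN.dirty counters
}
```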

In addition, the BOYD we do as part of writing heap snapshots should zero out the counters and remove the vatID from the reapQueue. That would fix the #8665 problem, and would also defer BOYDs to the next appropriate time (avoiding rapid BOYDs when the snapshotInterval counter phases in and out of sync with the reapInterval counter).

We'll need to make the thresholds configurable. We already have changeKernelOptions({defaultReapInterval}), but it only sets the default (which gets sampled when a vat is created), and doesn't provide a way to modify a single vat's interval later. And we certainly need a new control on the GC kref threshold, so maybe we add changeKernelOptions({gcKrefReapThreshold}) (but I don't love the spelling of that).

This vNN.dirty record could also be used to schedule heap snapshots some day (#6786). Some thoughts:

  • burning computrons should definitely accelerate the heap-snapshot schedule, since they represent work that must be done on the next replay, creating surprising delays during reboot or paging-in an offline vat worker
  • but we don't record computrons for first-turn liveslots work, which includes all of the GC and BOYD deliveries
    • a large BOYD could do thousands of syscalls, but not appear to consume significant computrons
    • so we might want to add a syscall counter to dirty, and also trigger snapshots after a lot of them
    • and eventually we should start metering the GC/BOYD work, not to charge against the vat's Meter, but certainly to share with the runPolicy, and probably also to trigger snapshots
  • I'm not currently convinced that computron-counting is useful for scheduling BOYDs
    • I have a theory which could be proved/disproved by scanning the slogfiles and computing some statistics:
      • in general, most dispatch.deliver increase the number of vrefs being tracked (especially if they include vrefs as arguments)
      • in general, most dispatch.notify (promise resolutions) decrease the number of vrefs, increasing garbage
      • GC deliveries almost always decrease the number of vrefs
      • so an interesting heuristic might be to count .slots.length and delivery-type, and use them to compute a score, which might ignore dispatch.deliver but would trigger GC sooner if a lot of dispatch.notify or dispatch.dropImports/etc deliveries were made (see the sketch after this list)
  • but vNN.dirty is a decent place to track them, or to add new things to be tracked
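
To illustrate that scoring idea (weights purely illustrative, not measured):

```js
// Score a delivery's likely contribution to garbage, per the theory
// above: deliver tends to add vrefs (so ignore it), while notify and
// the GC deliveries tend to free them.
function dirtScore(deliveryType, slots) {
  switch (deliveryType) {
    case 'deliver':
      return 0; // usually increases the number of tracked vrefs
    case 'notify':
      return 1 + slots.length; // promise resolutions usually free vrefs
    case 'dropExports':
    case 'retireExports':
    case 'retireImports':
      return slots.length; // GC deliveries almost always free vrefs
    default:
      return 0;
  }
}
```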

Security Considerations

None directly: userspace code cannot observe GC or BOYD. The most significant concern is what kind of userspace behavior could provoke excessive BOYDs (either too frequent, or individually too large), e.g. creating a large collection and then dropping the last reference to it, so the whole thing gets collected in a future BOYD, which then becomes too big (this would be addressed/prevented by #8417). Changing the scheduler might change the tactics an attacker could use to provoke this, but I think a dirt-driven scheduler should generally behave better than our current reapInterval approach.

Scaling Considerations

The main purpose of this scheduler is to improve the scaling behavior, by doing BOYD more frequently when there is more garbage to be collected, and avoiding infrequent-but-large BOYDs (which trigger bad kernel behavior, #8402).

Test Plan

SwingSet unit tests that use fake vats (returning fake metering data) to measure exactly when BOYD happens.

Upgrade Considerations

Transitioning from vNN.reapCountdown to vNN.dirty is a schema upgrade that can be done lazily. We'll delete countdownToReap() and replace it with addDirt() (etc.). The new function needs to accommodate a missing vNN.dirty key by initializing a new zero record, looking for a legacy vNN.reapCountdown (and the still-useful vNN.reapInterval), and subtracting the two to deduce the new .deliveries (sketch below). We didn't have a gcKrefs or computrons counter before, so those simply start at zero.

This is a one-way upgrade, of course, since we delete vNN.reapCountdown.
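
A sketch of that lazy accommodation (helper name and key handling provisional):

```js
// On first touch, deduce the new vNN.dirty record from the legacy
// keys, then delete the old countdown (the one-way part).
function provideDirtRecord(kvStore, vatID) {
  const dirtKey = `${vatID}.dirty`;
  if (!kvStore.has(dirtKey)) {
    const dirt = { deliveries: 0, gcKrefs: 0, computrons: 0 };
    const countdownKey = `${vatID}.reapCountdown`;
    if (kvStore.has(countdownKey)) {
      // vNN.reapInterval is still useful; vNN.reapCountdown is legacy
      const interval = kvStore.get(`${vatID}.reapInterval`);
      const countdown = kvStore.get(countdownKey);
      if (interval !== 'never') {
        dirt.deliveries = Number(interval) - Number(countdown);
      }
      kvStore.delete(countdownKey);
    }
    kvStore.set(dirtKey, JSON.stringify(dirt));
  }
  return JSON.parse(kvStore.get(dirtKey));
}
```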

warner commented Mar 25, 2024

I keep running into new wrinkles as I try to minimize the changes (and make sure all the hard-to-change callers can keep using the old API), so I'll write up my plan first:

The old code:

  • stores { reapInterval } in the kvStore vNN.options
    • although this is not used by any of the code that reads from vNN.options (search for getOptions and getSourceAndOptions, used in kernel.js and vat-warehouse.js)
    • vat-warehouse.js ensureVatOnline() stashes all options in ephemeral.vats but so far nothing reads .reapInterval from it
    • options are passed to vat-loader.js create() (which is routed to each manager-type's factory), but that does not extract reapInterval
  • also stores vNN.reapInterval and vNN.reapCountdown
    • processChangeVatOptions() updates vNN.reapInterval but not the property in vNN.options, oops
    • these are read from the kvStore each time they're used (no RAM cache)
  • the kernel config takes .defaultReapInterval, stored in kvStore kernel.defaultReapInterval
    • (also .snapshotInitial and .snapshotInterval, stored similarly)
    • this is in types SwingSetKernelConfig and SwingSetConfig
  • the kernel config takes .vats[NAME].creationOptions.reapInterval for each vat
    • this appears in type BaseVatOptions, which is used by DynamicVatOptions and StaticVatOptions
    • initializeSwingset sets config.vats.comms.creationOptions.reapInterval = 'never'
  • recordVatOptions.js extracts reapInterval (or kernelKeeper.getDefaultReapInterval()) from static/dynamic options and stores them into the vatKeeper with setSourceAndOptions() and initializeReapCountdown()
    • this provides recordStatic and recordDynamic
    • kernel.js createTestVat() (used by tests) takes creationOptions.reapInterval and calls recordStatic
    • initializeKernel() calls recordStatic
  • dynamic vat creation: E(vatAdminSvc).createVat(bcap, { reapInterval })
    • NOTE: we should retain support for this (although no mainnet vats use it yet)
    • vat-vat-admin.js uses convertOptions() to check/accept options.reapInterval
      • changing this API would require upgrading vat-vat-admin, which has dependency blockers
    • device-vat-admin.js createByBundleID takes options.reapInterval
      • changing this API would require upgrading device-vat-admin, for which we lack an API
    • vat-admin-hooks.js createByID copies dynamicOptions into the create-vat event
    • kernel.js processCreateVat() calls recordDynamic
  • dynamic vat upgrade: does not accept options.reapInterval, no work needed
  • dynamic vat admin node accepts E(adminNode).changeOptions({ reapInterval }) (only)
    • NOTE: we should retain support for this (although no mainnet vats use it yet)
    • vat-vat-admin.js, device-vat-admin.js, vat-admin-hooks.js: similar path as vat creation
    • copies options into changeVatOptions event
    • kernel.js processChangeVatOptions()/setKernelVatOption() passes to vatKeeper.updateReapInterval()

We don't want / can't change vat-vat-admin or device-vat-admin, so we'll continue to accept E(vatAdminSvc).createVat(bcap, { reapInterval }) and E(adminNode).changeOptions({ reapInterval }). To minimize test churn, I intend to retain .reapInterval in the kernel config and createTestVat APIs (which means the StaticVatOptions/DynamicVatOptions types retain the property), but add an additional .reapDirtThreshold to both, with .reapInterval (if present) overriding the .deliveries component (merge sketch below).
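
Schematically, the merge might look like this (names provisional):

```js
// reapDirtThreshold supplies the new knobs, but a legacy reapInterval
// (if present) overrides the .deliveries component.
function mergeReapOptions({ reapInterval, reapDirtThreshold = {} }, kernelDefaults) {
  const threshold = { ...kernelDefaults, ...reapDirtThreshold };
  if (reapInterval !== undefined) {
    threshold.deliveries = reapInterval; // legacy option wins
  }
  return threshold; // recorded into vNN.options as .reapDirtThreshold
}

// e.g. mergeReapOptions({ reapInterval: 300 }, { deliveries: 2000, gcKrefs: 20 })
//   => { deliveries: 300, gcKrefs: 20 }
```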

The upgrade process needs to:

  • modify all vNN.options to replace .reapInterval with .reapDirtThreshold
    • using the value from vNN.reapInterval, not vNN.options, because of the "oops" above
  • delete vNN.reapInterval and vNN.reapCountdown
  • add vNN.reapDirt, with .deliveries computed from reapInterval/reapCountdown if possible
    • vatKeeper.js will have an upgradeVatState() function (not a vatKeeper method)
    • this is driven by a kernelKeeper.js maybeUpgradeState() method, gated by an in-RAM flag, once per boot
    • each call to kernel.step/run will call maybeUpgradeState() before anything else (sketch below)
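
Schematically (flag and scan shape provisional):

```js
// Once-per-boot gate in kernelKeeper.js: the in-RAM flag means the
// scan runs at most once per process, before any other kernel work.
let stateUpgraded = false;

function maybeUpgradeState(kvStore, allVatIDs) {
  if (stateUpgraded) return;
  stateUpgraded = true;
  for (const vatID of allVatIDs) {
    upgradeVatState(kvStore, vatID); // module-level function in vatKeeper.js
  }
}
// kernel.step() and kernel.run() call maybeUpgradeState() before anything else
```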

The new code will do:

  • vatKeeper:
    • store { reapDirtThreshold } in vNN.options, but cache it in RAM for faster comparisons
    • store vNN.reapDirt in the kvStore, but cache it in RAM for faster comparisons
    • have updateReapDirtThreshold() instead of updateReapInterval(), update cached value and save to kvStore
    • have addDirt instead of countdownToReap: update cache, save to kvStore, maybe schedule BOYD (sketch after this list)
  • kernelKeeper:
    • add scheduleBOYD, called from vatKeeper.addDirt
    • have get(/set)DefaultReapDirtThreshold instead of get(/set)ReapInterval
    • export DEFAULT_DELIVERIES_PER_BOYD and DEFAULT_GC_KREFS_PER_BOYD for use by initializeKernel.js
    • createStartingKernelState() takes options with defaultReapDirtThreshold, writes new kvStore keys
  • BaseVatOptions will have both .reapInterval and .reapDirtThreshold, with the former overriding the latter
  • recordVatOptions.js will merge reapInterval and reapDirtThreshold, and apply the default
  • vat-admin-hooks will support both, but vat-vat-admin and device-vat-admin will only accept reapInterval
  • initializeKernel/creationOptions and kernel.createTestVat will accept both
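
Roughly, the vatKeeper side of that (field names per the list above, details provisional):

```js
// RAM-cached dirt tracking in vatKeeper: both the threshold and the
// current dirt live in RAM for fast comparisons, with writes mirrored
// to the kvStore.
function makeDirtTracker(vatID, kvStore, kernelKeeper) {
  const { reapDirtThreshold } = JSON.parse(kvStore.get(`${vatID}.options`));
  const reapDirt = JSON.parse(kvStore.get(`${vatID}.reapDirt`));

  function addDirt(moreDirt) {
    let triggered = false;
    for (const [kind, amount] of Object.entries(moreDirt)) {
      const threshold = reapDirtThreshold[kind];
      if (threshold === 'never') continue; // never-track, never-trigger
      reapDirt[kind] += amount;
      if (reapDirt[kind] >= threshold) triggered = true;
    }
    kvStore.set(`${vatID}.reapDirt`, JSON.stringify(reapDirt)); // persist the cache
    if (triggered) kernelKeeper.scheduleBOYD(vatID);
  }
  return { addDirt };
}
```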

warner added a commit that referenced this issue Mar 29, 2024
Previously, vat state initialization (e.g. setting counters to zero)
happened lazily, the first time that `provideVatKeeper()` was
called. When creating a new vat, the pattern was:

  vk = kernelKeeper.provideVatKeeper();
  vk.setSourceAndOptions(source, options);

Now, we initialize both counters and source/options explicitly, in
`recordVatOptions`, when the static/dynamic vat is first defined:

  kernelKeeper.createVatState(vatID, source, options);

In the future, this will make it easier for `provideVatKeeper` to rely
upon the presence of all vat state keys, especially `options`.

Previously, vatKeeper.getOptions() would tolerate a missing key by
returning empty options. The one place where this was
needed (terminating a barely-created vat because startVat() failed,
using getOptions().critical) was changed, so this tolerance is no
longer needed, and was removed. The tolerance caused bug #9157 (kernel
doesn't panic when it should), which continues to be present, but at
least won't cause a crash.

refs #8980
warner added a commit that referenced this issue Mar 29, 2024
`dispatch.bringOutYourDead()`, aka "reap", triggers garbage collection
inside a vat, and gives it a chance to drop imported c-list vrefs that
are no longer referenced by anything inside the vat.

Previously, each vat had a configurable parameter named
`reapInterval`, which defaults to a kernel-wide
`defaultReapInterval` (but can be set separately for each vat). This
defaults to 1, mainly for unit testing, but real applications set it
to something like 200.

This caused BOYD to happen once every 200 deliveries, plus an extra
BOYD just before we save an XS heap-state snapshot.

This commit switches to a "dirt"-based BOYD scheduler, wherein we
consider the vat to get more and more dirty as it does work, and
eventually it reaches a `reapDirtThreshold` that triggers the
BOYD (which resets the dirt counter).

We continue to track `dirt.deliveries` as before, with the same
defaults. But we add a new `dirt.gcKrefs` counter, which is
incremented by the krefs we submit to the vat in GC deliveries. For
example, calling `dispatch.dropImports([kref1, kref2])` would increase
`dirt.gcKrefs` by two.

The `reapDirtThreshold.gcKrefs` limit defaults to 20. For normal use
patterns, this will trigger a BOYD after ten krefs have been dropped
and retired. We choose this value to allow the #8928 slow vat
termination process to trigger BOYD frequently enough to keep the BOYD
cranks small: since these will be happening constantly (in the
"background"), we don't want them to take more than 500ms or so. Given
the current size of the large vats that #8928 seeks to terminate, 10
krefs seems like a reasonable limit. And of course we don't want to
perform too many BOYDs, so `gcKrefs: 20` is about the smallest
threshold we'd want to use.

External APIs continue to accept `reapInterval`, and now also accept
`reapGCKrefs`.

* kernel config record
  * takes `config.defaultReapInterval` and `defaultReapGCKrefs`
  * takes `vat.NAME.creationOptions.reapInterval` and `.reapGCKrefs`
* `controller.changeKernelOptions()` still takes `defaultReapInterval`
   but now also accepts `defaultReapGCKrefs`

The APIs available to userspace code (through `vatAdminSvc`) are
unchanged (partially due to upgrade/backwards-compatibility
limitations), and continue to only support setting `reapInterval`.
Internally, this just modifies `reapDirtThreshold.deliveries`.

* `E(vatAdminSvc).createVat(bcap, { reapInterval })`
* `E(adminNode).upgrade(bcap, { reapInterval })`
* `E(adminNode).changeOptions({ reapInterval })`

Internally, the kernel-wide state records `defaultReapDirtThreshold`
instead of `defaultReapInterval`, and each vat records
`.reapDirtThreshold` in their `vNN.options` key instead of
`vNN.reapInterval`. The current dirt level is recorded in
`vNN.reapDirt`.

The kernel will automatically upgrade both the kernel-wide and the
per-vat state upon the first reboot with the new kernel code. The old
`reapCountdown` value is used to initialize the vat's
`reapDirt.deliveries` counter, so the upgrade shouldn't disrupt the
existing schedule. Vats which used `reapInterval = 'never'` (eg comms)
will get a `reapDirtThreshold` of all 'never' values, so they continue
to inhibit BOYD. Otherwise, all vats get a `threshold.gcKrefs` of 20.

We do not track dirt when the corresponding threshold is 'never', to
avoid incrementing the comms dirt counters forever.

This design leaves room for adding `.computrons` to the dirt record,
as well as tracking a separate `snapshotDirt` counter (to trigger XS
heap snapshots, ala #6786). We add `reapDirtThreshold.computrons`, but
do not yet expose an API to set it.

Future work includes:
* upgrade vat-vat-admin to let userspace set `reapDirtThreshold`

New tests were added to exercise the upgrade process, and other tests
were updated to match the new internal initialization pattern.

We now reset the dirt counter upon any BOYD, so this also happens to
help with #8665 (doing a `reapAllVats()` resets the delivery counters,
so future BOYDs will be delayed, which is what we want). But we should
still change `controller.reapAllVats()` to avoid BOYDs on vats which
haven't received any deliveries.

closes #8980