
fix(swingset): use "dirt" to schedule vat reap/bringOutYourDead #9169

Open · warner wants to merge 2 commits into master from warner/8980-boyd-scheduler

Conversation

@warner warner (Member) commented Mar 29, 2024

`dispatch.bringOutYourDead()`, aka "reap", triggers garbage collection
inside a vat, and gives it a chance to drop imported c-list vrefs that
are no longer referenced by anything inside the vat.

Previously, each vat has a configurable parameter named
`reapInterval`, which defaults to a kernel-wide
`defaultReapInterval` (but can be set separately for each vat). This
defaults to 1, mainly for unit testing, but real applications set it
to something like 200.

This caused BOYD to happen once every 200 deliveries, plus an extra
BOYD just before we save an XS heap-state snapshot.

This commit switches to a "dirt"-based BOYD scheduler, wherein we
consider the vat to get more and more dirty as it does work, and
eventually it reaches a `reapDirtThreshold` that triggers the
BOYD (which resets the dirt counter).

We continue to track `dirt.deliveries` as before, with the same
defaults. But we add a new `dirt.gcKrefs` counter, which is
incremented by the krefs we submit to the vat in GC deliveries. For
example, calling `dispatch.dropImports([kref1, kref2])` would increase
`dirt.gcKrefs` by two.
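
As a rough illustration, the scheduling decision amounts to comparing each dirt counter against its threshold. This is only a sketch with a hypothetical helper name, not the kernel's actual code:

```js
// sketch: has the accumulated dirt crossed any threshold since the last BOYD?
function shouldTriggerBOYD(reapDirt, reapDirtThreshold) {
  for (const [kind, threshold] of Object.entries(reapDirtThreshold)) {
    if (threshold !== 'never' && (reapDirt[kind] || 0) >= threshold) {
      return true; // e.g. 200 deliveries, or 20 gcKrefs
    }
  }
  return false;
}
```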

The `reapDirtThreshold.gcKrefs` limit defaults to 20. For normal use
patterns, this will trigger a BOYD after ten krefs have been dropped
and retired. We choose this value to allow the #8928 slow vat
termination process to trigger BOYD frequently enough to keep the BOYD
cranks small: since these will be happening constantly (in the
"background"), we don't want them to take more than 500ms or so. Given
the current size of the large vats that #8928 seeks to terminate, 10
krefs seems like a reasonable limit. And of course we don't want to
perform too many BOYDs, so `gcKrefs: 20` is about the smallest
threshold we'd want to use.

External APIs continue to accept `reapInterval`, and now also accept
`reapGCKrefs`.

* kernel config record
  * takes `config.defaultReapInterval` and `defaultReapGCKrefs`
  * takes `vat.NAME.creationOptions.reapInterval` and `.reapGCKrefs`
* `controller.changeKernelOptions()` still takes `defaultReapInterval`
  but now also accepts `defaultReapGCKrefs`
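
For example, a host application's config might look like the following sketch (option names as described above; the values, the vat name, and `sourceSpec` are placeholders):

```js
const config = {
  defaultReapInterval: 1000, // deliveries of dirt per BOYD, kernel-wide default
  defaultReapGCKrefs: 20,    // gc krefs of dirt per BOYD, kernel-wide default
  vats: {
    exampleVat: {
      sourceSpec: './vat-example.js',
      creationOptions: { reapInterval: 200, reapGCKrefs: 40 }, // per-vat overrides
    },
  },
};

// later, at runtime:
controller.changeKernelOptions({ defaultReapInterval: 2000, defaultReapGCKrefs: 50 });
```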

The APIs available to userspace code (through `vatAdminSvc`) are
unchanged (partially due to upgrade/backwards-compatibility
limitations), and continue to only support setting `reapInterval`.
Internally, this just modifies `reapDirtThreshold.deliveries`.

* `E(vatAdminSvc).createVat(bcap, { reapInterval })`
* `E(adminNode).upgrade(bcap, { reapInterval })`
* `E(adminNode).changeOptions({ reapInterval })`
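
A usage sketch of those userspace APIs (the bundlecaps and numbers are placeholders, and the destructured return shape of `createVat` is assumed):

```js
const { root, adminNode } = await E(vatAdminSvc).createVat(bcap, { reapInterval: 300 });
await E(adminNode).changeOptions({ reapInterval: 1000 });
await E(adminNode).upgrade(newBcap, { reapInterval: 500 });
```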

Internally, the kernel-wide state records `defaultReapDirtThreshold`
instead of `defaultReapInterval`, and each vat records
`.reapDirtThreshold` in their `vNN.options` key instead of
`vNN.reapInterval`. The current dirt level is recorded in
`vNN.reapDirt`.
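
Concretely, the new kvStore entries might look roughly like this (a sketch: the exact key spellings are approximated from the names above, and the values and vat ID are made up):

```js
kvStore.set('kernel.defaultReapDirtThreshold', '{"deliveries":1000,"gcKrefs":20}');
kvStore.set('v2.options', '{"reapDirtThreshold":{"deliveries":200,"gcKrefs":20}, ...}');
kvStore.set('v2.reapDirt', '{"deliveries":37,"gcKrefs":4}'); // reset at each BOYD
```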

The kernel will automatically upgrade both the kernel-wide and the
per-vat state upon the first reboot with the new kernel code. The old
`reapCountdown` value is used to initialize the vat's
`reapDirt.deliveries` counter, so the upgrade shouldn't disrupt the
existing schedule. Vats which used `reapInterval = 'never'` (eg comms)
will get a `reapDirtThreshold` of all 'never' values, so they continue
to inhibit BOYD. Otherwise, all vats get a `threshold.gcKrefs` of 20.

We do not track dirt when the corresponding threshold is 'never', to
avoid incrementing the comms dirt counters forever.
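
A sketch of that per-vat translation (the helper name is hypothetical; the real logic lives in the kernel's state-upgrade path):

```js
function upgradeVatReapState(oldReapInterval, oldReapCountdown) {
  if (oldReapInterval === 'never') {
    // e.g. comms: keep inhibiting BOYD, and track no dirt at all
    return {
      reapDirtThreshold: { deliveries: 'never', gcKrefs: 'never', computrons: 'never' },
      reapDirt: {},
    };
  }
  return {
    reapDirtThreshold: { deliveries: oldReapInterval, gcKrefs: 20 },
    // preserve the existing schedule: deliveries already performed = interval - countdown
    reapDirt: { deliveries: oldReapInterval - oldReapCountdown },
  };
}
```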

This design leaves room for adding `.computrons` to the dirt record,
as well as tracking a separate `snapshotDirt` counter (to trigger XS
heap snapshots, ala #6786). We add `reapDirtThreshold.computrons`, but
do not yet expose an API to set it.

Future work includes:

* upgrade vat-vat-admin to let userspace set `reapDirtThreshold`

New tests were added to exercise the upgrade process, and other tests
were updated to match the new internal initialization pattern.

We now reset the dirt counter upon any BOYD, so this also happens to
help with #8665 (doing a `reapAllVats()` resets the delivery counters,
so future BOYDs will be delayed, which is what we want). But we should
still change `controller.reapAllVats()` to avoid BOYDs on vats which
haven't received any deliveries.

closes #8980

@warner warner added the SwingSet package: SwingSet label Mar 29, 2024
@warner warner requested a review from mhofman March 29, 2024 16:49
@warner warner self-assigned this Mar 29, 2024
@warner warner force-pushed the warner/8980-boyd-scheduler branch 3 times, most recently from 62aa511 to 0fe9f39 Compare April 13, 2024 04:10
@warner warner force-pushed the warner/8980-boyd-scheduler branch from 0fe9f39 to 967e458 Compare April 15, 2024 15:36
@warner warner assigned mhofman and unassigned warner Apr 16, 2024
Previously, vat state initialization (e.g. setting counters to zero)
happened lazily, the first time that `provideVatKeeper()` was
called. When creating a new vat, the pattern was:

```js
vk = kernelKeeper.provideVatKeeper();
vk.setSourceAndOptions(source, options);
```

Now, we initialize both counters and source/options explicitly, in
`recordVatOptions`, when the static/dynamic vat is first defined:

```js
kernelKeeper.createVatState(vatID, source, options);
```

In the future, this will make it easier for `provideVatKeeper` to rely
upon the presence of all vat state keys, especially `options`.

Previously, vatKeeper.getOptions() would tolerate a missing key by
returning empty options. The one place where this was
needed (terminating a barely-created vat because startVat() failed,
using getOptions().critical) was changed, so this tolerance is no
longer needed, and was removed. The tolerance caused bug #9157 (kernel
doesn't panic when it should), which continues to be present, but at
least won't cause a crash.

refs #8980
@warner warner force-pushed the warner/8980-boyd-scheduler branch from 967e458 to 402811a Compare April 23, 2024 19:03
@mhofman mhofman (Member) commented Apr 24, 2024

> We choose this value to allow the #8928 slow vat
> termination process to trigger BOYD frequently enough to keep the BOYD
> cranks small: since these will be happening constantly (in the
> "background"), we don't want them to take more than 500ms or so. Given
> the current size of the large vats that #8928 seeks to terminate, 10
> krefs seems like a reasonable limit. And of course we don't want to
> perform too many BOYDs, so `gcKrefs: 20` is about the smallest
> threshold we'd want to use.

I am somewhat concerned that the cost of engine gc might mean 20 is too low. How do we validate we won't spend too much time in BOYD? Do we have an idea of what portion of the time spent in BOYD is related to the engine doing a gc mark and sweep, and how much of it is liveslots reacting to the found garbage? I assume liveslots usually has vatStore calls to make when garbage was found?

> External APIs continue to accept `reapInterval`, and now also accept `reapGCKrefs`.
>
> * kernel config record
>   * takes `config.defaultReapInterval` and `defaultReapGCKrefs`
>   * takes `vat.NAME.creationOptions.reapInterval` and `.reapGCKrefs`
> * `controller.changeKernelOptions()` still takes `defaultReapInterval`
>   but now also accepts `defaultReapGCKrefs`

This does not sound scalable for adding new dirt thresholds. Why not add a reapDirtThresholds record in configs / options?

> The kernel will automatically upgrade both the kernel-wide and the
> per-vat state upon the first reboot with the new kernel code.

I'll need to look at this. It sounds like a sort of schema migration.

> as well as tracking a separate `snapshotDirt` counter (to trigger XS
> heap snapshots, ala #6786).

I'm not sure I understand this. Is the idea to trigger snapshots every n BOYDs? Snapshots should already force a BOYD before snapshotting. The snapshotting process reveals gc to liveslots, which may have bugs around this (#6784), so the idea has been to force a BOYD before each snapshot (#7504), but we expect BOYD to be more frequent than snapshots.

Edit: I totally misunderstood snapshotDirt: it's not meant to trigger reap, or to be related to reapDirt; it will just likely use a similar accounting mechanism, with similar dirt inputs.

@mhofman mhofman (Member) left a comment

This is mostly looking good.

I think we're missing a clearReapDirt for the BOYD injected when creating a snapshot.

Wondering if the schema migration should be handled more explicitly when starting the kernel instead of optimistically running it before each crank.

I'll need to do another pass and review the tests.

Comment on lines +52 to +55
```js
assert(options.workerOptions, `vat ${vatID} options missing workerOptions`);
assert(source);
assert('bundle' in source || 'bundleName' in source || 'bundleID' in source);
assert.typeof(options, 'object');
```
mhofman (Member):

This should check that options is a non-null object before checking that options.workerOptions exists, but this seems to be a problem with the original setSourceAndOptions too.

Which brings me to the second issue: I'm not a major fan of duplicating the logic between there and here. Any way to refactor to use the same underlying implementation for both the checks and saving the values to the kvStore?
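
For illustration, the reordering being suggested might look like this (a sketch, not the actual patch):

```js
// validate `options` itself before touching its properties
assert.typeof(options, 'object');
assert(options !== null, `vat ${vatID} options must be a non-null object`);
assert(options.workerOptions, `vat ${vatID} options missing workerOptions`);
assert(source);
assert('bundle' in source || 'bundleName' in source || 'bundleID' in source);
```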

warner (Member, Author):

aye, I'll change the order of the asserts

I guess I could factor out a two-line function (plus some asserts), and call it from both places, but it didn't seem worth it at the time, and I didn't think of a good name for it. doSetSourceAndOptions(kvStore, source, options) ?

It's important that initializeVatState() remain a standalone function, not a method on vatKeeper, because it must be called before creating a vatKeeper() (which asserts the presence of .options, to populate the reapDirtThreshold RAM cache).

I'll see what I can do.

Comment on lines +19 to +24
```js
/**
 * @typedef { import('../types-external.js').SwingSetKernelConfig } SwingSetKernelConfig
 * @typedef { import('../types-external.js').SwingStoreKernelStorage } SwingStoreKernelStorage
 * @typedef { import('../types-internal.js').InternalKernelOptions } InternalKernelOptions
 * @typedef { import('../types-internal.js').ReapDirtThreshold } ReapDirtThreshold
 */
```
mhofman (Member):

We switched to TS 5.5 so we can now use @import ! See #9149

That said, there are a lot of other places in kernel code where we should switch, so I'm ok punting for now.

warner (Member, Author):

yeah I'd like to do that cleanup in a separate PR, and I'm concerned about how it will interact with our current branch policy (cherry-picking changes to a release branch)

mhofman (Member):

good call

```diff
@@ -379,7 +380,7 @@ export default function buildKernel(
  * didDelivery?: VatID, // we made a delivery to a vat, for run policy and save-snapshot
  * computrons?: BigInt, // computron count for run policy
  * meterID?: string, // deduct those computrons from a meter
- * decrementReapCount?: { vatID: VatID }, // the reap counter should decrement
+ * measureDirt?: [ VatID, Dirt ], // the dirt counter should increment
```
mhofman (Member):

Any reason to switch to a tuple instead of a record here for the return value?

warner (Member, Author):

No good ones, I think I ran into a name collision during development, but I can't find evidence of it now, so I'll switch to a record.

```diff
@@ -601,6 +607,8 @@ export default function buildKernel(
   if (!vatWarehouse.lookup(vatID)) {
     return NO_DELIVERY_CRANK_RESULTS; // can't collect from the dead
   }
+  const vatKeeper = kernelKeeper.provideVatKeeper(vatID);
+  vatKeeper.addDirt({ gcKrefs: krefs.length });
```
mhofman (Member):

Why not plumb this into deliveryCrankResults and rely on the handling of crank results to add dirt?

Let's assume we started counting the number of refs introduced in every delivery as a dirt counter (the more you have, the more likely you are to have some that are now garbage). That number is also derived from the message. In that case, does it make sense to increment here or in the results? I suppose the handling of abort/terminate may have an influence on this?

warner (Member, Author):

I think that would work. It would look like

```js
function deliveryCrankResults(vatID, status, measureDirt, meterID, gcKrefs) {
  ...
  if (measureDirt && ..) {
    if (gcKrefs) {
      dirt.gcKrefs = gcKrefs;
    }
 ..
}

...

  deliveryCrankResults(vatID, status, true, undefined, krefs.length);
```

IIRC I balked at adding another positional argument to deliveryCrankResults() which would only be used by processGCMessage() but would need to be provided by all callers, especially since we've already got the optional meterID posarg, and stopped short of converting the posargs to an options bag.

I think the abort/terminate shouldn't be a big influence.. I don't know exactly whether we should increment the dirt counter if the delivery caused a vat-fatal error or not, but I think either way would be ok.

I'll look to see if there's a decent way to refactor it as you suggest.

mhofman (Member):

Oh yeah, we should definitely avoid more positional args. That's what named args (destructured records) are for.

Btw, the above was a light suggestion, I'm really not sure if it makes sense. I just had some trouble following the flow of where dirt gets changed.

Comment on lines +381 to +382
```js
// the caller is responsible for ensuring that any changes made here
// get committed at a useful point
```
mhofman (Member):

One worry is that a rollback could revert the kernel state but leave the upgraded flag in place; however, it looks like this is always called outside savepoints. Why not run this at the beginning of every startCrank()?

warner (Member, Author):

Yeah, there's one upgraded flag per KernelKeeper object, and each kernel has exactly one KernelKeeper (created at the start of buildKernel()), but we also create a KernelKeeper when we initialize/populate the DB, and we should not be doing an upgrade at that time, which is why maybeUpgradeKernelState() is a method and not an automatic consequence of makeKernelKeeper().

It's currently called just before each call to startCrank().. I wanted to make it more explicit, because (as you noted) it's important to call it from outside a savepoint. I was worried that adding it to startCrank() would obscure the upgrade point and make it fragile under maintenance.

But.. in looking more carefully at how/when startCrank is called (it also shows up in changeKernelOptions and installBundle, to which I did not add a maybeUpgrade call), I think my PR would have a small bug. kernelKeeper.startCrank() calls kernelStorage.startCrank() in swing-store, which sets an inCrank = true flag, and calls resetCrankHash(). By not calling startCrank before maybeUpgrade (or not calling it at all, if we start with changeKernelOptions or installBundle, which sounds pretty plausible TBH), we don't include any upgrade-related kvStore changes in the crankhash. That's not the intent: upgrades are in-consensus, all nodes of a consensus kernel should upgrade at the same time, and their changes should be included in the crankhash.

So I think I need to change this as you suggest: call maybeUpgradeKernelState() from inside kernelKeeper.startCrank(), but do it specifically after that function calls kernelStorage.startCrank(). I'm not excited to conceal the fact that upgrades are happening from call sites like step() and run(), but it's more important to have these upgrade changes be captured in the hash reliably.

mhofman (Member):

> we also create a KernelKeeper when we initialize/populate the DB, and we should not be doing an upgrade at that time, which is why maybeUpgradeKernelState() is a method and not an automatic consequence of makeKernelKeeper().

That had totally skipped my mind. Can we document these constraints somewhere?

```diff
@@ -68,7 +74,8 @@ const enableKernelGC = true;
 // bundle.$BUNDLEID = JSON(bundle)
 //
 // kernel.defaultManagerType = managerType
-// kernel.defaultReapInterval = $NN
+// (old) kernel.defaultReapInterval = $NN
```
mhofman (Member):

Should we have a "formal" schema v0 designation? At some point I assume we'll have multiple "old"s.

warner (Member, Author):

Yeah, we're still in that fuzzy ad-hoc stage of this table's maturity curve, but I'll improve the comments and write down the definitions of v0/v1/whatever even though we don't have a single key that says what version the kvStore is at. For swingstore I'm working on making it precise (and doing a full+atomic migration from vN to vN+1), but for this table, the weird thing was that vatKeepers have an "upgrade your keys" function, which the kernel calls at startup (sometimes), so there's a brief window during this call where vat4 has migrated but vat5 has not, and that didn't lend itself to declaring the entire kvStore as at "version 2", and I didn't want to introduce per-vat version keys either.

```js
  return reapDirt;
}

function clearReapDirt() {
```
mhofman (Member):

We also need to call this when injecting the BOYD message in maybeSaveSnapshot.
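
Something along these lines, inside maybeSaveSnapshot (a sketch only; the surrounding helper names are hypothetical):

```js
if (shouldSnapshot) {
  await deliverBOYD(vatID);  // the BOYD injected before saving the XS heap snapshot
  vatKeeper.clearReapDirt(); // also reset the dirt counters, as a scheduled BOYD would
  await saveSnapshot(vatID);
}
```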

warner (Member, Author):

Agreed, I'll see if there's a way to factor out BOYD delivery to have just one place to call it from (we actually have three code paths: processBringOutYourDead, processUpgradeVat as the last thing sent to the old incarnation, and maybeSaveSnapshot). Ideally all three would call processBringOutYourDead but that wasn't trivial. I'll see what I can do, worst case I'll just add another clearReapDirt() call like you suggested.

Comment on lines +191 to +195
// this default "reap interval" is low for the benefit of tests:
// applications should set it to something higher (perhaps 200) based
// on their expected usage

export const DEFAULT_DELIVERIES_PER_BOYD = 1;
mhofman (Member):

I am somewhat concerned with a default value that is not "production ready", but this seems to be the existing behavior already, so out of scope. Is the builder for a test kernel the same as the production one?

warner (Member, Author):

The (weak) philosophy here was that we don't know what a good value would be, so make the default a predictable and hyper-safe one, at the cost of performance. Making it convenient for tests is another consideration, although of course it's always a lazy and inadvisable choice to prioritize tests over reality.

We might even make a hand-wavy argument that this matches XS: minimizing RAM usage (ie GCing frequently) at the cost of performance.

> Is the builder for a test kernel the same as the production one?

Not sure what you mean; our most important host application (cosmic-swingset) overrides this value, and in general we expect all host applications to do so.

mhofman (Member):

I was remembering that all tests used a single function to build the kernel, in which case we could have a value more appropriate for production as a default here. Anyway, as stated, this is not really in scope for this PR.

Comment on lines +196 to +200
```js
// We keep the vat's reap-dirt state (and threshold) in a
// write-through cache, and populate the cache at vatKeeper startup.
const reapDirtKey = `${vatID}.reapDirt`;
let reapDirt = JSON.parse(getRequired(reapDirtKey));
let reapDirtThreshold = getOptions().reapDirtThreshold;
```
mhofman (Member):

I'm having funny feelings about ad-hoc caching, in particular because I don't fully understand the impact of savepoints and crank rollbacks.

warner (Member, Author):

I probably haven't considered it thoroughly enough, but my basic idea was that crank rollbacks still happen in consensus, and still take time, so some of our reactions to their costs should still happen even if the crank gets rolled back. Like, the runPolicy should be told about computrons spent inside rolled-back cranks (we don't get more time for a block just because the vat did something weird and needs to forget the delivery).

Obviously that doesn't apply to the amount of dirt inside a vat: abandoning the external consequences of the delivery will also abandon the vat-internal changes, like an increase in the amount of dirt. So I agree it would be appropriate to not increment reapDirt in the case of an abandoned crank.

I wanted to avoid too much extra DB activity for every crank. We definitely need to write out the new dirt value, so it's persistent (and this triggers an export-data IAVL change, sigh). We almost certainly shouldn't read the dirt threshold each time, since it rarely changes (and only by setSourceAndOptions, which happens in its own non-delivery crank, which isn't subject to rollback since there's no vat).

I currently call `addDirt()` from two places (the generic one in `processDeliveryMessage`, and the `gcKrefs`-only one in `processGCMessage`), so if I remove the reapDirt cache and do a full DB read+increment+write cycle each time, I'm paying 2 or 4 DB queries per crank. If I can implement your other suggestion to pass gcKrefs into `deliveryCrankResults()`, then that is reduced to 2 per crank.

So I think I'll:

  • merge gcKrefs into deliveryCrankResults()
  • remove the RAM cache of reapDirt, doing 2 DB actions instead of 1 each delivery
  • leave the reapDirtThreshold RAM cache
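
A sketch of what the un-cached `addDirt()` could look like (helper names approximate; the JSON storage shape follows the snippet quoted above):

```js
function addDirt(moreDirt) {
  const key = `${vatID}.reapDirt`;
  const reapDirt = JSON.parse(getRequired(key)); // one read per delivery
  for (const [kind, amount] of Object.entries(moreDirt)) {
    if (reapDirtThreshold[kind] !== 'never') {
      // coerce to Number (computrons may arrive as BigInt), treat a missing key as zero
      reapDirt[kind] = (reapDirt[kind] || 0) + Number(amount || 0);
    }
  }
  kvStore.set(key, JSON.stringify(reapDirt)); // one write per delivery
}
```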

Comment on lines +254 to +255
```js
// The 'moreDirt' value might be Number or BigInt (eg
// .computrons). We coerce to Number so we can JSON-stringify.
```
mhofman (Member):

I thought we converted the computrons to Number before putting the value in the dirt record?

warner (Member, Author):

There's a duplicate coercion, yeah, I figured that was harmless.

This one was annoying: I first tried to make gcKrefs be a BigInt, to match how we handle computrons, but then I can't use simple JSON to store reapDirt or reapDirtThreshold, and holding it in RAM as a BigInt but converting back/forth to Number in the save/load functions was messy and distracting. Also, my choice to treat "key is missing" as "no dirt of that category" made things slightly harder (the `|| 0` is to ignore `gcKrefs: undefined`-like cases).

And I didn't want to go back and change computrons to be Number everything, that was too much churn. I decided that I'd treat computrons as potentially large, and gcKrefs as always pretty small, and use that to justify leaving computrons as BigInts in the existing code. But then I ignore that argument in addDirt and do not tolerate large computron sizes. Sigh.

mhofman (Member):

Oh yeah Number is totally fine here for dirt computrons. Even if we go over the MAX_SAFE_INTEGER value, the loss of precision is deterministic and doesn't impact the threshold logic.

@warner warner (Member, Author) commented May 10, 2024

> We choose this value to allow the #8928 slow vat
> termination process to trigger BOYD frequently enough to keep the BOYD
> cranks small: since these will be happening constantly (in the
> "background"), we don't want them to take more than 500ms or so. Given
> the current size of the large vats that #8928 seeks to terminate, 10
> krefs seems like a reasonable limit. And of course we don't want to
> perform too many BOYDs, so `gcKrefs: 20` is about the smallest
> threshold we'd want to use.

> I am somewhat concerned that the cost of engine gc might mean 20 is too low. How do we validate we won't spend too much time in BOYD? Do we have an idea of what portion of the time spent in BOYD is related to the engine doing a gc mark and sweep, and how much of it is liveslots reacting to the found garbage? I assume liveslots usually has vatStore calls to make when garbage was found?

We certainly need to experiment on real data and tune this. The "reapDirtThresholds.gcKrefs -> work-per-BOYD" ratio is worst-case determined by how many objects/krefs are entrained by each #8401 Zoe/ExitObject cycle, and how much work (in particular how many syscalls) Zoe does in the resulting BOYD.

My notes/predictions are in #8928 (comment) . gcKrefs: 20 and a cleanup budget of "5" ought to trigger one zoe BOYD every other empty block, which will do 500 syscalls, free 10 cycles, and take 500ms. It will take about a week to get through the cycle-breaking c-list deletions.

This might actually be too aggressive, but if so we can reduce the cleanup rate instead of reducing the gcKrefs threshold.

My worry about increasing reapDirtThresholds.gcKrefs is that it would make the BOYDs bigger (albeit less frequent). We're going to be triggering BOYDs constantly, every N blocks, and I'm worried that the slower validators/followers will fall behind. You're absolutely right that smaller+faster BOYD vs larger+slower is influenced by the non-kref-handling code, and so the "percentage of total time spent in a BOYD" equation is not independent of gcKrefs. My hunch is that it's dominated by the syscalls, rather than the engine GC, but I'll see if I can measure that.

Also, your question means I should measure the rate of non-cleanup-related gcKrefs going into a vat like zoe, and estimate what kind of BOYD rate we expect this to provoke, just in normal operation. Our baseline is one BOYD per 200 deliveries (triggered by heap snapshots), so if reapDirtThresholds.gcKrefs doesn't trigger BOYDs faster than that, we should be ok (since we'll make snapshot-triggered BOYDs reset the dirt counters, so they'll replace instead of adding).

> External APIs continue to accept `reapInterval`, and now also accept `reapGCKrefs`.
>
> * kernel config record
>   * takes `config.defaultReapInterval` and `defaultReapGCKrefs`
>   * takes `vat.NAME.creationOptions.reapInterval` and `.reapGCKrefs`
> * `controller.changeKernelOptions()` still takes `defaultReapInterval`
>   but now also accepts `defaultReapGCKrefs`

> This does not sound scalable for adding new dirt thresholds. Why not add a reapDirtThresholds record in configs / options?

I started out doing it that way, but it triggered a lot of test churn, and made it hard to distinguish between "I'm setting a threshold and I don't care about property X, use your default", vs "I'm setting a threshold and I want property X to never trigger BOYD", mostly because I stuck with a type of Number | undefined, and could only get one meaning for the undefined case. In the future, when we add a new threshold type/property, we want to be backwards-compatible with the intentions of the old config, so missing properties need to mean "use your default", and I found it hard to do that when the thresholds were all glued together into a single argument record.

I'll take another look and see if there's a way to get a cleaner API, but I have to admit it's not my highest priority.

> The kernel will automatically upgrade both the kernel-wide and the
> per-vat state upon the first reboot with the new kernel code.

> I'll need to look at this. It sounds like a sort of schema migration.

It is, for sure. I thought briefly about adding a controller.upgradeKernelState(), but I didn't like the DX of that:

  • existing host apps which don't get the memo, and fail to add the upgrade call, would break
  • to make them break cleanly, I'd need buildKernel() to check the kvStore for evidence of being out-of-date, and bail early if so (with an error pointing them at upgradeKernelState), so we don't charge ahead with bogus state
  • creating the controller also starts the kernel, and a lot of schema upgrades we can imagine (for the future) will need to be done by that point, so it's kinda too late, which means it really wants a standalone function, kind of like initializeSwingset() (which takes a kernelStorage but doesn't use/create a controller or kernel), which would make the host app responsibility be like:
```js
import { makeSwingsetController, maybeUpgradeKernelState } from '@agoric/swingset-vat';
const { kernelStorage, hostStorage } = openSwingStore(..);
maybeUpgradeKernelState(kernelStorage);
hostStorage.commit();
const controller = makeSwingsetController(kernelStorage, ..);
..
```

We've had similar discussions in #8089, and I think you convinced me to make it explicit, where the host app must call a function (at the right time), and not have the library perform an automatic upgrade, but then export-data considerations led us both to give up on using an explicit call, and instead having upgrade be implicit during the openSwingStore() call.

I think we've got similar issues here: the reapDirt upgrades will cause kvStore changes, and export-data/IAVL changes, that need to be committed. The situation is better, though, because we already have the kernelStorage object, so we've got somewhere to hold the pending changes.

I'll try to think about how we could make this an explicit call.

> as well as tracking a separate `snapshotDirt` counter (to trigger XS
> heap snapshots, ala #6786).

> I'm not sure I understand this. Is the idea to trigger snapshots every n BOYDs? Snapshots should already force a BOYD before snapshotting. The snapshotting process reveals gc to liveslots, which may have bugs around this (#6784), so the idea has been to force a BOYD before each snapshot (#7504), but we expect BOYD to be more frequent than snapshots.
>
> Edit: I totally misunderstood snapshotDirt: it's not meant to trigger reap, or to be related to reapDirt; it will just likely use a similar accounting mechanism, with similar dirt inputs.

Right, we'd have reapDirt and snapshotDirt counters, compared against reapDirtThreshold and snapshotDirtThreshold respectively. We'd want snapshots to be driven mostly by computrons and partially by deliveries, whereas we'd want BOYD to be driven mostly by memory/vref things rather than CPU-usage things; both could use the same structure, just with different thresholds.
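
For illustration only, the two schedulers might end up with thresholds like these (values made up; `snapshotDirt` is future work, not part of this PR):

```js
const reapDirtThreshold = { deliveries: 1000, gcKrefs: 20 };               // memory/vref-driven, triggers BOYD
const snapshotDirtThreshold = { deliveries: 200, computrons: 65_000_000 }; // CPU-driven, triggers XS heap snapshots
```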

@aj-agoric aj-agoric assigned warner and unassigned mhofman May 14, 2024
@warner warner (Member, Author) commented May 17, 2024

I went to move the upgrade code into startCrank as suggested, and to write a test that the crankhash is updated to include the upgrade changes, and realized that this branch wouldn't work at all, because the kernelKeeper requires an up-to-date kvStore as soon as it is created, but the normal kernel.start() process creates the kernelKeeper before doing any cranks. My original unit test created a kernel in isolation, and the new test I was writing used controller instead, and it couldn't get past creating the kernel before it asserted.

I thought about our plans for swing-store schema migration (concluding by this comment), and the constraints therein, and how it applies to this upgrade step. We don't have the same "where do the export-data changes show up?" problem, since makeSwingsetController() accepts a swing-store (vs #8089 where openSwingStore() must return one). We still have the ergonomics question of "when does the DB get mutated?" and the corresponding host-app obligation to commit those changes (with swingStore.hostStorage.commit()) at an appropriate time.

Each kernel reboot uses some particular version of the @agoric/swingset-vat package (and likewise of the swing-store package). Each kernel "run" (the committed results of a controller.run(), plus whatever IO/device inputs arrived first) uses some version, and one of the host-app responsibilities is that this version never regresses: every run must use a version that is the same as, or newer than, the version used on the previous run. That means you can upgrade your kernel between reboots, and you are not allowed to revert it to an earlier version.

In a consensus system, each "block" (or whatever they call the coordinated units of work) is executed with some particular version of the kernel/swingstore. The host-app's responsibility is to ensure that each block is executed with the same version, across all workers. You must upgrade your kernel at the same time as everybody else, even though you're allowed to make non-upgrading reboots more frequently than that.

So, I think I see three options:

  • A: add a standalone upgradeSwingset(kernelStorage) function, which looks for an explicit kvStore version key, makes upgrading changes if necessary, and always leaves it at version = CURRENT ("1" in our case). Then kernelKeeper can assert that version is as expected. In this option, makeSwingsetController() does not mutate the DB.
  • B: add an { upgrade: true } option to makeSwingsetController(), which tells it to perform an upgrade when it starts, and then have kernelKeeper assert the version as above. Then makeSwingsetController() might mutate, but only if you approve it.
  • C: have makeSwingsetController() automatically upgrade, without an extra approval flag

I prefer being very explicit about when DB changes might happen, so I don't like C. And it feels odd to have makeSwingsetController() be a source of mutation: I expect controller.run() to mutate, and of course initializeSwingset, but not a function that claims to be merely creating a control surface. B would be a compromise, to make it very clear to the caller that they must commit any changes made by makeSwingsetController(), but it sticks out like a sore thumb.

So I'm going to pick A, where we have a separate upgradeSwingset() function that must be called before makeSwingsetController(). The simplest host-app behavior would be to call it every time, in every reboot, and rely upon the fact that it won't mutate anything if we're already up-to-date. Technically, if you knew exactly which reboots might have upgraded the kernel, you could refrain from calling it most of the time, but I don't think that would make anything simpler. Just in case, I'll have upgradeSwingset() return a boolean that says whether it actually changed anything, so host-apps could include an assertion like "hey, I wasn't following an upgrade plan, nothing should have changed, bail before I commit anything".

I'm also going to move to the explicit kvStore key that says what version the schema is, and graduate from our "ad-hoc" phase. I need something for makeSwingsetController() (really makeKernelKeeper()) to assert against. The rule will be that (only) upgradeSwingset() is allowed to change this, and consistency between the version key and the rest of the data is both a pre-condition and a post-condition around that call, but we're allowed to have inconsistent data while in the middle of the call (i.e. just what you'd expect). I'll need to look at initializeSwingset and decide whether to create a v0 DB and immediately upgrade it (as is my plan with #8089), or to create a v1 DB right off the bat. It might be easier to write unit tests if there's a createV0 function available.
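
A rough sketch of option A (only `upgradeSwingset(kernelStorage)` and the boolean return are from the plan above; the version-key name, `CURRENT_VERSION`, and the migration details are assumptions):

```js
export function upgradeSwingset(kernelStorage) {
  const { kvStore } = kernelStorage;
  const oldVersion = Number(kvStore.get('version') || 0);
  if (oldVersion >= CURRENT_VERSION) {
    return false; // already up-to-date: no DB changes were made
  }
  // ... perform the vN -> vN+1 migrations here, e.g. converting
  // reapInterval/reapCountdown into reapDirtThreshold/reapDirt ...
  kvStore.set('version', `${CURRENT_VERSION}`);
  return true; // caller must commit (hostStorage.commit()) before building the controller
}
```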

And then I'll need to look at how cosmic-swingset calls this and make sure I'm not creating any problems. We'll already need to handle mutation (and thus export-data callbacks being invoked) with the #8089 changes, because the constraints there obligate us to perform upgrades as part of openSwingStore(), rather than using a separate upgrade function. But since #8089 isn't done yet, the cosmic-swingset code might not be ready for it. I think it won't require anything special, I'll add the call to upgradeSwingset() just before wherever we call makeSwingsetController(), with a note that this changes the DB, and that both the export-data callbacks (headed into IAVL) and the actual state changes will get committed as part of the first block after reboot. This should only actually change anything in the first block after a chain upgrade, which will be in-consensus.

Note that this means the upgrade changes won't be included in any crank, and they won't be included in a crankhash. (To be precise, they'll get hashed into the crankhasher created by swingStore.js when the swing-store is built, but they'll get cleared by the startCrank() that the kernel calls at the top of the first real crank, so they're effectively ignored). I think that's fine, we weren't expecting that #8089 swing-store upgrade changes (which might cause an IAVL change, even if just for the version key) would be included in a crank. We still get export-data/IAVL shadows of everything, so a buggy validator which upgrades at the wrong time will fall out of consensus right away, we just won't have a slogfile crank-hash value that explains why.

@mhofman mhofman (Member) commented May 18, 2024

> I think it won't require anything special, I'll add the call to upgradeSwingset() just before wherever we call makeSwingsetController()

That sounds right. cosmic-swingset only builds the kernel as part of an init message, which is always done while processing a block at the cosmos level, so calls to the bridge are safe during that time.

> the upgrade changes won't be included in any crank, and they won't be included in a crankhash

This part is a little unfortunate, but I guess acceptable. If we ever change how we export the swing-store kv data, we'll likely still have data to validate it in IAVL. The unfortunate part is that right now we would have no way to detect this difference using the slog only, and exporting and comparing swing-store and iavl trees is difficult compared to comparing slog entries (e.g. from a flight recorder). Can we maybe add a new slog entry for upgrades, which maybe captures the crankHasher state at the end even if it doesn't end up chained like a normal crank?
