sled agent: split ensure into "register" and "ensure state" APIs #2765

Split the sled agent's `/instances/{id}` PUT endpoint into two endpoints:

- A PUT to `/instances/{id}` "registers" an instance with a sled. This creates a record for the instance in the manager, but does not start its Propolis and does not try to drive the instance to any particular state.
- A PUT to `/instances/{id}/state` attempts to change the state of a previously-registered instance's VM by starting it, stopping it, rebooting it, initializing by live migration, or unceremoniously destroying it. (This last case is meant to provide a safety valve that lets Nexus get an unresponsive Propolis off a sled.)

This allows the instance create saga to avoid a class of problems in which an instance starts, stops (due to user input to the VM), and then is errantly restarted by a replayed saga step: because sled agent will only accept requests to run a registered instance, and stopping an instance unregisters it, a replayed "run this VM" saga node won't restart the VM. The migration saga is vulnerable to a similar class of problem, so this groundwork is necessary to write that saga correctly.

A secondary benefit of this change is that operations on running instances (like "stop" and "reboot") no longer need to construct an (unused) `InstanceHardware` to pass to the sled agent's ensure endpoint.

Update the simulated sled agent to support these APIs, update callers in Nexus to use them, and split the instance create saga's "instance ensure" step into two steps as described above. This requires some extra affordances in simulated collections to support simulated disks, since instance state changes no longer go through a path where an instance's hardware manifest is available.

Finally, add some Nexus logging to record information about CRDB updates that Nexus applies when a call to sled agent produces a new `InstanceRuntimeState`, since these are handy for debugging.

Tested: cargo test; installed Omicron locally and played around with some instances.
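To illustrate why the register/ensure-state split protects against replayed saga steps, here is a minimal in-memory sketch. It is not the sled agent's actual code: `InstanceManager`, `StateRequest`, `VmState`, `Error`, and the `u32` instance ID are all hypothetical names invented for this example, and treating a repeated `register` call as a no-op is an assumption rather than something the description above states.

```rust
use std::collections::HashMap;

/// Hypothetical state-change requests, mirroring the operations listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StateRequest {
    Running,
    Stopped,
    Reboot,
    MigrationTarget,
    Destroyed,
}

/// Hypothetical view of an instance that the manager knows about.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VmState {
    /// A record exists, but no VM has been started.
    Registered,
    Running,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Error {
    /// The instance was never registered, or was unregistered by a stop.
    NotRegistered,
}

/// Hypothetical stand-in for the sled agent's instance manager.
#[derive(Default)]
pub struct InstanceManager {
    instances: HashMap<u32, VmState>,
}

impl InstanceManager {
    /// Models PUT /instances/{id}: create a record without starting the VM.
    /// Assumption: registering an already-known instance leaves it unchanged.
    pub fn register(&mut self, id: u32) -> VmState {
        *self.instances.entry(id).or_insert(VmState::Registered)
    }

    /// Models PUT /instances/{id}/state: only registered instances are accepted.
    /// Returns the new state, or `None` if the request unregistered the instance.
    pub fn ensure_state(
        &mut self,
        id: u32,
        req: StateRequest,
    ) -> Result<Option<VmState>, Error> {
        if !self.instances.contains_key(&id) {
            return Err(Error::NotRegistered);
        }
        match req {
            StateRequest::Running
            | StateRequest::Reboot
            | StateRequest::MigrationTarget => {
                self.instances.insert(id, VmState::Running);
                Ok(Some(VmState::Running))
            }
            // Stopping or destroying removes the record, so a later
            // "run" request for the same ID is rejected.
            StateRequest::Stopped | StateRequest::Destroyed => {
                self.instances.remove(&id);
                Ok(None)
            }
        }
    }
}
```

In this model, a sequence of register, run, stop, and then a replayed run ends with `Err(Error::NotRegistered)` instead of a restarted VM, which is the property the create saga relies on.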

Also ensure that explicitly destroying a running instance properly terminates it.


Commits on Apr 12, 2023

Commits on Apr 13, 2023