sled agent: split ensure into "register" and "ensure state" APIs #2765

Merged: 5 commits, Apr 14, 2023

Commits on Apr 12, 2023

  1. sled agent: split ensure into "register" and "ensure state" APIs

    Split the sled agent's `/instances/{id}` PUT endpoint into two endpoints:
    
    - A PUT to `/instances/{id}` "registers" an instance with a sled. This creates
      a record for the instance in the manager, but does not start its Propolis and
      does not try to drive the instance to any particular state.
    - A PUT to `/instances/{id}/state` attempts to change the state of a
      previously registered instance's VM by starting it, stopping it, rebooting
      it, initializing it via live migration, or unceremoniously destroying it.
      (This last case is a safety valve that lets Nexus get an unresponsive
      Propolis off a sled.)
    
    This allows the instance create saga to avoid a class of problems in which an
    instance starts, stops (due to user input to the VM), and is then erroneously
    restarted by a replayed saga step. Because sled agent only accepts requests to
    run a registered instance, and stopping an instance unregisters it, a replayed
    "run this VM" saga node won't restart the VM. The migration saga is vulnerable
    to a similar class of problems, so this groundwork is necessary to write that
    saga correctly.
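    
    The replay protection described above can be sketched with a toy in-memory
    manager. This is a minimal model, not the sled agent's code: the names
    `InstanceManager`, `VmState`, `Requested`, `Error::NotRegistered`, and
    `on_vm_halted` are all hypothetical, and instance IDs are plain integers
    for brevity.

```rust
use std::collections::HashMap;

/// Target states a caller may ask for (illustrative subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Requested {
    Running,
    Stopped,
}

/// State tracked for a registered instance in this toy model.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VmState {
    /// A record exists, but no VM has been started.
    Registered,
    Running,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Error {
    /// The instance has no record, so state changes are refused.
    NotRegistered,
}

#[derive(Default)]
pub struct InstanceManager {
    instances: HashMap<u32, VmState>,
}

impl InstanceManager {
    /// Models PUT /instances/{id}: create a record and start nothing.
    /// Registering an already-registered instance leaves it unchanged.
    pub fn register(&mut self, id: u32) -> VmState {
        *self.instances.entry(id).or_insert(VmState::Registered)
    }

    /// Models PUT /instances/{id}/state: only registered instances
    /// may change state. Stopping removes the record.
    pub fn ensure_state(
        &mut self,
        id: u32,
        target: Requested,
    ) -> Result<Option<VmState>, Error> {
        if !self.instances.contains_key(&id) {
            return Err(Error::NotRegistered);
        }
        match target {
            Requested::Running => {
                self.instances.insert(id, VmState::Running);
                Ok(Some(VmState::Running))
            }
            Requested::Stopped => {
                self.instances.remove(&id);
                Ok(None)
            }
        }
    }

    /// Models the VM halting on its own (for example, a guest shutdown).
    /// The record is dropped, exactly as for a requested stop.
    pub fn on_vm_halted(&mut self, id: u32) {
        self.instances.remove(&id);
    }
}
```

    In this model, calling `register(7)`, then `ensure_state(7, Running)`, then
    `on_vm_halted(7)` leaves no record for instance 7, so a replayed
    `ensure_state(7, Running)` returns `Err(Error::NotRegistered)` instead of
    restarting the VM.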
    
    A secondary benefit of this change is that operations on running instances (like
    "stop" and "reboot") no longer need to construct an (unused) `InstanceHardware`
    to pass to the sled agent's ensure endpoint.
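    
    One way to picture this benefit is as two request shapes, where only
    registration carries a hardware description. The sketch below is an
    assumption-laden illustration: `SledRequest`, `StateRequested`, and the
    `Hardware` struct with its fields are hypothetical stand-ins, not the real
    `InstanceHardware` or sled agent client types. Only the two paths come from
    the description above.

```rust
/// Hypothetical stand-in for the hardware manifest sent at registration.
#[derive(Debug, Clone, PartialEq)]
pub struct Hardware {
    pub ncpus: u8,
    pub memory_mib: u64,
}

/// State changes named in the description above (variant names are assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StateRequested {
    Running,
    Stopped,
    Reboot,
    MigrationTarget,
    Destroyed,
}

/// The two PUT requests after the split.
#[derive(Debug, Clone, PartialEq)]
pub enum SledRequest {
    /// PUT /instances/{id}: the only request that needs hardware.
    Register { id: String, hardware: Hardware },
    /// PUT /instances/{id}/state: carries a target state and nothing else.
    EnsureState { id: String, target: StateRequested },
}

impl SledRequest {
    /// Path the request would be sent to.
    pub fn path(&self) -> String {
        match self {
            SledRequest::Register { id, .. } => format!("/instances/{}", id),
            SledRequest::EnsureState { id, .. } => {
                format!("/instances/{}/state", id)
            }
        }
    }
}
```

    With this shape, a caller that wants to stop or reboot an instance builds
    `SledRequest::EnsureState` from an ID and a target state alone, with no
    placeholder hardware value.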
    
    Update the simulated sled agent to support these APIs, update callers in Nexus
    to use them, and split the instance create saga's "instance ensure" step into
    two steps as described above. This requires some extra affordances in simulated
    collections to support simulated disks, since instance state changes no longer
    go through a path where an instance's hardware manifest is available.
    
    Finally, add some Nexus logging to record information about CRDB updates that
    Nexus applies when a call to sled agent produces a new `InstanceRuntimeState`,
    since these are handy for debugging.
    
    Tested: cargo test; installed Omicron locally and played around with some
    instances.
    gjcolombo committed Apr 12, 2023
    382a8c3

Commits on Apr 13, 2023

  1. Move instance unregister into its own API

    Also ensure that explicitly destroying a running instance properly terminates
    it.
    gjcolombo committed Apr 13, 2023
    a960d29
  2. improve logging

    gjcolombo committed Apr 13, 2023
    2067fc1
  3. 7e4c45b
  4. d5add9c