sled agent: split ensure into "register" and "ensure state" APIs #2765

Merged
merged 5 commits into main from gjcolombo/lets-migrate/3-split-register-and-start on Apr 14, 2023

Conversation

gjcolombo
Contributor

This is the third PR in the live migration series. Not a net negative diff this time, I'm afraid :( But at least it fixes some outstanding issues!


Split the sled agent's /instances/{id} PUT endpoint into two endpoints:

  • A PUT to /instances/{id} "registers" an instance with a sled. This creates a record for the instance in the manager, but does not start its Propolis and does not try to drive the instance to any particular state.
  • A PUT to /instances/{id}/state attempts to change the state of a previously-registered instance's VM by starting it, stopping it, rebooting it, initializing by live migration, or unceremoniously destroying it. (This last case is meant to provide a safety valve that lets Nexus get an unresponsive Propolis off a sled.)

This allows the instance create saga to avoid a class of problems in which an instance starts, stops (due to user input to the VM), and then is errantly restarted by a replayed saga step: because sled agent will only accept requests to run a registered instance, and stopping an instance unregisters it, a replayed "run this VM" saga node won't restart the VM. The migration saga is vulnerable to a similar class of problem, so this groundwork is necessary to write that saga correctly.
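
To make the replay-safety property concrete, here is a minimal, self-contained sketch. Everything in it (`InstanceManagerSketch`, `StateRequest`, the `u32` instance ids) is an illustrative stand-in, not the PR's actual types or endpoints; it only models "register, then change state," stop-unregisters-the-instance, and the idempotent stop/destroy handling discussed later in the review.

```rust
use std::collections::HashMap;

/// Hypothetical, heavily simplified stand-in for the sled agent's instance
/// manager; it tracks only which instances are registered and a coarse state.
#[derive(Default)]
struct InstanceManagerSketch {
    registered: HashMap<u32, &'static str>,
}

enum StateRequest {
    Running,
    Stopped,
    Destroyed,
}

impl InstanceManagerSketch {
    /// Models PUT /instances/{id}: record the instance, but start nothing.
    fn register(&mut self, id: u32) {
        self.registered.entry(id).or_insert("Creating");
    }

    /// Models PUT /instances/{id}/state: only registered instances can be
    /// driven to Running, so a replayed "run this VM" saga node fails instead
    /// of restarting a VM the user already stopped. Stop/destroy requests for
    /// unknown instances succeed for idempotency.
    fn put_state(
        &mut self,
        id: u32,
        req: StateRequest,
    ) -> Result<Option<&'static str>, String> {
        let is_registered = self.registered.contains_key(&id);
        match (is_registered, req) {
            (false, StateRequest::Stopped) | (false, StateRequest::Destroyed) => Ok(None),
            (false, StateRequest::Running) => Err(format!("instance {id} is not registered")),
            (true, StateRequest::Running) => {
                self.registered.insert(id, "Running");
                Ok(Some("Running"))
            }
            (true, StateRequest::Stopped) | (true, StateRequest::Destroyed) => {
                // Stopping an instance unregisters it.
                self.registered.remove(&id);
                Ok(Some("Stopped"))
            }
        }
    }
}

fn main() {
    let mut mgr = InstanceManagerSketch::default();
    mgr.register(1);
    assert_eq!(mgr.put_state(1, StateRequest::Running).unwrap(), Some("Running"));
    // The user stops the VM; stopping unregisters it.
    assert_eq!(mgr.put_state(1, StateRequest::Stopped).unwrap(), Some("Stopped"));
    // A replayed "run this VM" saga node is rejected rather than restarting the VM.
    assert!(mgr.put_state(1, StateRequest::Running).is_err());
    // A replayed stop remains a safe no-op.
    assert_eq!(mgr.put_state(1, StateRequest::Stopped).unwrap(), None);
}
```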

A secondary benefit of this change is that operations on running instances (like "stop" and "reboot") no longer need to construct an (unused) InstanceHardware to pass to the sled agent's ensure endpoint.

Update the simulated sled agent to support these APIs, update callers in Nexus to use them, and split the instance create saga's "instance ensure" step into two steps as described above. This requires some extra affordances in simulated collections to support simulated disks, since instance state changes no longer go through a path where an instance's hardware manifest is available.

Finally, add some Nexus logging to record information about CRDB updates that Nexus applies when a call to sled agent produces a new InstanceRuntimeState, since these are handy for debugging.

Tested: cargo test; installed Omicron locally and played around with some instances.

Fixes #2209. Fixes #2517.

Collaborator

@smklein smklein left a comment

Great overhaul - really appreciate this work that not only moves us closer to migration, but also helps fix our idempotency issues.

One error-handling concern aside, this LGTM.

@@ -357,8 +347,26 @@ impl super::Nexus {
opctx: &OpContext,
instance_lookup: &lookup::Instance<'_>,
) -> UpdateResult<db::model::Instance> {
let (.., authz_instance, db_instance) = instance_lookup.fetch().await?;
self.instance_set_runtime(
// TODO-correctness: This needs to be a saga for crash resiliency
Collaborator

Do we have an issue tracking this?

Also, I assume this would be a sub-saga of instance creation (kinda like how "disk creation" can be a sub-saga) if the "start" flag is true?

Contributor Author

Filed #2824 and updated the TODO in 46e7d37.

Yeah, this should ultimately be a sub-saga within instance creation. Depending on when other things land (looking at you, #2315) we might also need to reorder things in the creation saga a bit to make sure that all of the start sub-saga steps can happen sequentially.

Comment on lines 365 to 368
db_instance.runtime_state = new_runtime;
self.instance_ensure_registered(opctx, &authz_instance, &db_instance)
.await?;
Collaborator

If the instance already exists on the sled, what does the request for Creating do?

Contributor Author

It's a no-op. I clarified this a bit in cefcf42.

I could probably pick this apart a little bit more as well. The reason this exists at all is that the call to register an instance needs an InstanceHardware that specifies an initial InstanceRuntimeState. Registration doesn't actually have to have that, I think, so long as sled agent is prepared to deal with the existence of instances that have no current runtime state. LMK if you'd like me to give this a shot in this change.

Comment on lines 415 to 414
// destroyed, except that multiple requests to migrate in are allowed
// for idempotency.
Collaborator

"multiple requests to migrate in are allowed" -- phrasing? "migrate an instance"?

Contributor Author

Fixed in 8977ba4.

let allowed = match runtime.run_state {
InstanceState::Creating => true,
InstanceState::Starting => true,
InstanceState::Running => true,
InstanceState::Stopping => true,
InstanceState::Stopped => true,
InstanceState::Rebooting => true,
- InstanceState::Migrating => false,
+ InstanceState::Migrating => {
+     matches!(requested, InstanceStateRequested::MigrationTarget(_))
Collaborator

Should we check it's the same target if we're already migrating?

Contributor Author

That's tricky to do here because the body of MigrationTarget contains the source's Propolis address and Propolis ID, neither of which we'll have on this path.

The reworked migration saga registers the instance on the destination sled and passes it a specially-crafted InstanceRuntimeState that (more or less) reflects the state CRDB will get if migration actually succeeds. That means, in part, that its Propolis ID and address will be the destination Propolis's ID and address. That InstanceRuntimeState is the one that will be passed to runtime when the saga moves to the MigrationTarget state (and not what's currently in CRDB), so there's no way (without refactoring this routine to take extra information or having it do another query) to verify that the migration target params point to the instance's current source.

Comment on lines 656 to 655
/// - `Ok(false)` if the sled agent call succeeded, but this routine did not
/// update CRDB. This can happen either because sled agent didn't return
Collaborator

Suggested change
- /// - `Ok(false)` if the sled agent call succeeded, but this routine did not
- ///   update CRDB. This can happen either because sled agent didn't return
+ /// - `Ok(false)` if the sled agent call succeeded, but this routine did not
+ ///   update CRDB for a non-erroneous reason.
+ ///   This can happen either because sled agent didn't return

Contributor Author

Fixed in c1089f3.

Comment on lines 35 to 36
api.register(instance_put)?;
api.register(instance_put_state)?;
Collaborator

We don't need to do it as a part of this PR, but IMO matching the language used by Nexus might be useful?

This used to be instance_put as a "catch-all" for "align all state to this incoming request, regardless of what previously existed" -- though that has the registration and idempotency problems you described in the issues.

Now, Nexus seems to be referring to instance_put as "registration", right?

Contributor Author

Touched this up in 09a68fb.

// being launched.
if migration_params.is_none() {
inner.state.transition(InstanceState::Starting);
inner.publish_state_to_nexus().await?;
Collaborator

If this fails, are we relying on the saga to unwind our state?

If we exit with an error here, the InstanceState will be Starting, even though no propolis has actually been started. Perhaps we should handle an error here by transitioning to Failed?

Collaborator

I see we're doing something similar with setup_result below - maybe these transitions should be lumped into that block?

Contributor Author

Good catch. Fixed in 7672fcf.

inner.publish_state_to_nexus().await?;
}

let setup_result: Result<(), Error> = {
Collaborator

Is this supposed to be wrapped up in a lambda?

I believe this code is equivalent to the following: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=5e000e905c95ffe9541c0d62f66f834a

There, the errors from setting up Propolis will never be assigned to setup_result; instead they cause the propolis_ensure function to exit early.

In contrast, the following actually does let the result get a non-Ok(()) value:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=50f83c62b40e5ebf27a22876828b0542

Contributor Author

You're right, this is incorrect. Unless I'm missing something (very possible here!), turning it into a closure is nontrivial because async closures are still unstable. From what I've read, it might be possible to use a closure anyway if I write out a sufficiently gnarly Future return type for it, but in 7672fcf I went with a label-based approach that I think will do the job.
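
For readers following the closure-vs-label discussion, here is a standalone sketch of the difference; `fallible_step` and `cleanup` are hypothetical stand-ins rather than the PR's functions. Inside a plain block, `?` returns from the enclosing function, so the binding never observes the error; a labeled block (stable since Rust 1.65) routes the error into the binding instead.

```rust
// Hypothetical stand-ins for the Propolis setup work and the code that needs
// to observe its result; these are not the PR's actual functions.
fn fallible_step() -> Result<(), String> {
    Err("setup failed".to_string())
}

fn cleanup(result: &Result<(), String>) {
    println!("observed setup result: {:?}", result);
}

// Broken: `?` inside a plain block propagates the error out of the *enclosing
// function*, so `setup_result` is never assigned an Err and `cleanup` is
// skipped on failure.
fn broken() -> Result<(), String> {
    let setup_result: Result<(), String> = {
        fallible_step()?;
        Ok(())
    };
    cleanup(&setup_result);
    Ok(())
}

// Fixed with a labeled block (stable since Rust 1.65): the error is routed
// into `setup_result` instead of returning early, so `cleanup` always runs.
fn fixed() -> Result<(), String> {
    let setup_result: Result<(), String> = 'setup: {
        if let Err(e) = fallible_step() {
            break 'setup Err(e);
        }
        Ok(())
    };
    cleanup(&setup_result);
    Ok(())
}

fn main() {
    assert!(broken().is_err()); // returned early; cleanup never ran
    assert!(fixed().is_ok()); // cleanup ran with the Err in hand
}
```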

Comment on lines 182 to 195
// If the instance isn't registered, then by definition it
// isn't running here. Allow requests to stop or destroy the
// instance to succeed to provide idempotency. This has to
// be handled here (that is, on the "instance not found"
// path) to handle the case where a stop request arrived,
// Propolis handled it, sled agent unregistered the
// instance, and only then did a second stop request
// arrive.
InstanceStateRequested::Stopped
| InstanceStateRequested::Destroyed => {
return Ok(InstancePutStateResponse {
updated_runtime: None,
});
}
Collaborator

(Not a request, more just thinking out loud)

This special casing makes me wonder if the sled agent should expose a separate API for fully "destroying the instance" instead of relying on particular state transitions to pull that off for us -- similar to how we divided "creation" away from "starting".

Contributor Author

Acknowledged. I think that makes sense, but will leave things as they are in this particular PR.

@smklein smklein removed their assignment Apr 12, 2023
@gjcolombo gjcolombo force-pushed the gjcolombo/lets-migrate/3-split-register-and-start branch from 6ae798c to 315bc91 on April 12, 2023 18:49
@gjcolombo
Contributor Author

Force-pushed to pick up #2807; there are no changes in this push. Will work on feedback this afternoon PDT.

Collaborator

@smklein smklein left a comment

Thanks for incorporating the feedback -- this looks good to merge!

@gjcolombo
Contributor Author

Thanks for taking the time to review! The merge conflicts are mostly in use declarations; will clean those up and merge once main's build is fixed.

@gjcolombo gjcolombo force-pushed the gjcolombo/lets-migrate/3-split-register-and-start branch from 09a68fb to 382a8c3 on April 12, 2023 22:52
Contributor

@luqmana luqmana left a comment

Woo, thanks for getting this together Greg! One small nit, but looks good to me. Also left some comments for possible follow-up.

Comment on lines 1336 to 1338
// TODO-correctness: This is dangerous if this step is replayed, since
// a user can discover this instance and ask to start it in between
// attempts to run this step.
Contributor

Perhaps we should just have Creating always transition to Stopped. That way if we're not starting the instance we can just avoid this instance_update_runtime call altogether. Anyways, that can be done in a followup.

Contributor Author

Another approach might be to make the "Creating" DB record the output of an earlier saga step instead of refetching on line 1324. That way, if the initial record has generation 1, all the attempts to run this step will attempt to transition to generation 2, but only the first will succeed. I added a comment describing this in 7e4c45b.
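
A rough sketch of that generation-guarded update, with hypothetical names and an in-memory struct standing in for what would really be a conditional UPDATE against CRDB:

```rust
/// Hypothetical stand-in for the instance's runtime-state row; only the
/// generation number matters for this sketch.
struct InstanceRowSketch {
    generation: u64,
    run_state: &'static str,
}

/// Models the guarded write: the update lands only if it advances the row's
/// generation by exactly one, which is what lets only the first replay of the
/// saga step succeed.
fn conditional_update(row: &mut InstanceRowSketch, new_state: &'static str, new_gen: u64) -> bool {
    if new_gen == row.generation + 1 {
        row.generation = new_gen;
        row.run_state = new_state;
        true
    } else {
        false
    }
}

fn main() {
    // The earlier saga node produced the "Creating" record at generation 1.
    let mut row = InstanceRowSketch { generation: 1, run_state: "Creating" };

    // The first execution of this step proposes generation 2 and wins.
    assert!(conditional_update(&mut row, "Stopped", 2));

    // Any replay proposes the same 1 -> 2 transition and loses, so it cannot
    // clobber state changes that happened in the meantime (e.g. a user start).
    assert!(!conditional_update(&mut row, "Stopped", 2));
    assert_eq!(row.run_state, "Stopped");
    assert_eq!(row.generation, 2);
}
```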

nexus/src/app/sagas/instance_create.rs (resolved review thread)
&opctx,
&authz_instance,
&db_instance,
InstanceStateRequested::Destroyed,
Contributor

Seems a little odd that the undo for ensure_registered is via instance_request_state.

Contributor

Ah, I see sean's similar comment in instance_manager.

Contributor Author

I took another look at this and realized that the existing Destroy code is actually wrong: it removes the instance ticket, but doesn't actually do any of the other cleanup that's needed to clean up a running instance. So I refactored this in a960d29 and introduced an explicit "unregister" call.

sled-agent/src/instance_manager.rs (resolved review thread)
Also ensure that explicitly destroying a running instance properly terminates
it.
@gjcolombo
Contributor Author

@smklein @luqmana I've added an explicit "unregister this instance" API to sled agent in commit a960d29. Could you PTAL when you have a free moment? It's not hugely complicated, I think, but it's a big enough change that I'd like to get at least one set of eyes on it before merging.

nexus/src/app/instance.rs (outdated; resolved review thread)
@gjcolombo gjcolombo merged commit e4a5dd0 into main Apr 14, 2023
20 checks passed
@gjcolombo gjcolombo deleted the gjcolombo/lets-migrate/3-split-register-and-start branch April 14, 2023 00:12
@gjcolombo
Contributor Author

Thanks as always for the reviews!
