sled agent: split ensure into "register" and "ensure state" APIs #2765

Merged
merged 5 commits into main from gjcolombo/lets-migrate/3-split-register-and-start on Apr 14, 2023

Conversation

gjcolombo
Contributor

This is the third PR in the live migration series. Not a net negative diff this time, I'm afraid :( But at least it fixes some outstanding issues!


Split the sled agent's /instances/{id} PUT endpoint into two endpoints:

  • A PUT to /instances/{id} "registers" an instance with a sled. This creates a record for the instance in the manager, but does not start its Propolis and does not try to drive the instance to any particular state.
  • A PUT to /instances/{id}/state attempts to change the state of a previously-registered instance's VM by starting it, stopping it, rebooting it, initializing by live migration, or unceremoniously destroying it. (This last case is meant to provide a safety valve that lets Nexus get an unresponsive Propolis off a sled.)

This allows the instance create saga to avoid a class of problems in which an instance starts, stops (due to user input to the VM), and then is errantly restarted by a replayed saga step: because sled agent will only accept requests to run a registered instance, and stopping an instance unregisters it, a replayed "run this VM" saga node won't restart the VM. The migration saga is vulnerable to a similar class of problem, so this groundwork is necessary to write that saga correctly.
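
To make the replay-safety property concrete, here is a minimal, self-contained sketch. Everything in it (`InstanceManagerSketch`, `StateRequest`, the `u32` instance ids) is an illustrative stand-in, not the PR's actual types or endpoints; it only models "register, then change state," stop-unregisters-the-instance, and the idempotent stop/destroy handling discussed later in the review.

```rust
use std::collections::HashMap;

/// Hypothetical, heavily simplified stand-in for the sled agent's instance
/// manager; it tracks only which instances are registered and a coarse state.
#[derive(Default)]
struct InstanceManagerSketch {
    registered: HashMap<u32, &'static str>,
}

enum StateRequest {
    Running,
    Stopped,
    Destroyed,
}

impl InstanceManagerSketch {
    /// Models PUT /instances/{id}: record the instance, but start nothing.
    fn register(&mut self, id: u32) {
        self.registered.entry(id).or_insert("Creating");
    }

    /// Models PUT /instances/{id}/state: only registered instances can be
    /// driven to Running, so a replayed "run this VM" saga node fails instead
    /// of restarting a VM the user already stopped. Stop/destroy requests for
    /// unknown instances succeed for idempotency.
    fn put_state(
        &mut self,
        id: u32,
        req: StateRequest,
    ) -> Result<Option<&'static str>, String> {
        let is_registered = self.registered.contains_key(&id);
        match (is_registered, req) {
            (false, StateRequest::Stopped) | (false, StateRequest::Destroyed) => Ok(None),
            (false, StateRequest::Running) => Err(format!("instance {id} is not registered")),
            (true, StateRequest::Running) => {
                self.registered.insert(id, "Running");
                Ok(Some("Running"))
            }
            (true, StateRequest::Stopped) | (true, StateRequest::Destroyed) => {
                // Stopping an instance unregisters it.
                self.registered.remove(&id);
                Ok(Some("Stopped"))
            }
        }
    }
}

fn main() {
    let mut mgr = InstanceManagerSketch::default();
    mgr.register(1);
    assert_eq!(mgr.put_state(1, StateRequest::Running).unwrap(), Some("Running"));
    // The user stops the VM; stopping unregisters it.
    assert_eq!(mgr.put_state(1, StateRequest::Stopped).unwrap(), Some("Stopped"));
    // A replayed "run this VM" saga node is rejected rather than restarting the VM.
    assert!(mgr.put_state(1, StateRequest::Running).is_err());
    // A replayed stop remains a safe no-op.
    assert_eq!(mgr.put_state(1, StateRequest::Stopped).unwrap(), None);
}
```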

A secondary benefit of this change is that operations on running instances (like "stop" and "reboot") no longer need to construct an (unused) InstanceHardware to pass to the sled agent's ensure endpoint.

Update the simulated sled agent to support these APIs, update callers in Nexus to use them, and split the instance create saga's "instance ensure" step into two steps as described above. This requires some extra affordances in simulated collections to support simulated disks, since instance state changes no longer go through a path where an instance's hardware manifest is available.

Finally, add some Nexus logging to record information about CRDB updates that Nexus applies when a call to sled agent produces a new InstanceRuntimeState, since these are handy for debugging.

Tested: cargo test; installed Omicron locally and played around with some instances.

Fixes #2209. Fixes #2517.

Collaborator

@smklein smklein left a comment

Great overhaul - really appreciate this work that not only moves us closer to migration, but also helps fix our idempotency issues.

One error-handling concern aside, this LGTM.

@@ -357,8 +347,26 @@ impl super::Nexus {
opctx: &OpContext,
instance_lookup: &lookup::Instance<'_>,
) -> UpdateResult<db::model::Instance> {
let (.., authz_instance, db_instance) = instance_lookup.fetch().await?;
self.instance_set_runtime(
// TODO-correctness: This needs to be a saga for crash resiliency
Collaborator

Do we have an issue tracking this?

Also, I assume this would be a sub-saga of instance creation (kinda like how "disk creation" can be a sub-saga) if the "start" flag is true?

Contributor Author

Filed #2824 and updated the TODO in 46e7d37.

Yeah, this should ultimately be a sub-saga within instance creation. Depending on when other things land (looking at you, #2315) we might also need to reorder things in the creation saga a bit to make sure that all of the start sub-saga steps can happen sequentially.

Comment on lines 365 to 368
db_instance.runtime_state = new_runtime;
self.instance_ensure_registered(opctx, &authz_instance, &db_instance)
.await?;
Collaborator

If the instance already exists on the sled, what does the request for Creating do?

Contributor Author

It's a no-op. I clarified this a bit in cefcf42.

I could probably pick this apart a little bit more as well. The reason this exists at all is that the call to register an instance needs an InstanceHardware that specifies an initial InstanceRuntimeState. Registration doesn't actually have to have that, I think, so long as sled agent is prepared to deal with the existence of instances that have no current runtime state. LMK if you'd like me to give this a shot in this change.

Comment on lines 415 to 414
// destroyed, except that multiple requests to migrate in are allowed
// for idempotency.
Collaborator

"multiple requests to migrate in are allowed" -- phrasing? "migrate an instance"?

Contributor Author

Fixed in 8977ba4.

let allowed = match runtime.run_state {
InstanceState::Creating => true,
InstanceState::Starting => true,
InstanceState::Running => true,
InstanceState::Stopping => true,
InstanceState::Stopped => true,
InstanceState::Rebooting => true,
- InstanceState::Migrating => false,
+ InstanceState::Migrating => {
+     matches!(requested, InstanceStateRequested::MigrationTarget(_))
Collaborator

Should we check it's the same target if we're already migrating?

Contributor Author

That's tricky to do here because the body of MigrationTarget contains the source's Propolis address and Propolis ID, neither of which we'll have on this path.

The reworked migration saga registers the instance on the destination sled and passes it a specially-crafted InstanceRuntimeState that (more or less) reflects the state CRDB will get if migration actually succeeds. That means, in part, that its Propolis ID and address will be the destination Propolis's ID and address. That InstanceRuntimeState is the one that will be passed to runtime when the saga moves to the MigrationTarget state (and not what's currently in CRDB), so there's no way (without refactoring this routine to take extra information or having it do another query) to verify that the migration target params point to the instance's current source.

Comment on lines 656 to 655
/// - `Ok(false)` if the sled agent call succeeded, but this routine did not
/// update CRDB. This can happen either because sled agent didn't return
Collaborator

Suggested change
- /// - `Ok(false)` if the sled agent call succeeded, but this routine did not
- ///   update CRDB. This can happen either because sled agent didn't return
+ /// - `Ok(false)` if the sled agent call succeeded, but this routine did not
+ ///   update CRDB for a non-erroneous reason.
+ ///   This can happen either because sled agent didn't return

Contributor Author

Fixed in c1089f3.

Comment on lines 35 to 36
api.register(instance_put)?;
api.register(instance_put_state)?;
Collaborator

We don't need to do it as a part of this PR, but IMO matching the language used by Nexus might be useful?

This used to be instance_put as a "catch-all" for "align all state to this incoming request, regardless of what previously existed" -- though that has the registration and idempotency problems you described in the issues.

Now, Nexus seems to be referring to instance_put as "registration", right?

Contributor Author

Touched this up in 09a68fb.

// being launched.
if migration_params.is_none() {
inner.state.transition(InstanceState::Starting);
inner.publish_state_to_nexus().await?;
Collaborator

If this fails, are we relying on the saga to unwind our state?

If we exit with an error here, the InstanceState will be Starting, even though no propolis has actually been started. Perhaps we should handle an error here by transitioning to Failed?

Collaborator

I see we're doing something similar with setup_result below - maybe these transitions should be lumped into that block?

Contributor Author

Good catch. Fixed in 7672fcf.

inner.publish_state_to_nexus().await?;
}

let setup_result: Result<(), Error> = {
Collaborator

Is this supposed to be wrapped up in a lambda?

I believe this code is equivalent to the following: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=5e000e905c95ffe9541c0d62f66f834a

There, the errors from setting up Propolis will never be assigned to setup_result; instead they cause the propolis_ensure function to exit early.

In contrast, the following actually does let the result get a non-Ok(()) value:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=50f83c62b40e5ebf27a22876828b0542

Contributor Author

You're right, this is incorrect. Unless I'm missing something (very possible here!), turning it into a closure is nontrivial because async closures are still unstable. From what I've read, it might be possible to use a closure anyway if I write out a sufficiently gnarly Future return type for it, but in 7672fcf I went with a label-based approach that I think will do the job.
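
For readers following the closure-vs-label discussion, here is a standalone sketch of the difference; `fallible_step` and `cleanup` are hypothetical stand-ins rather than the PR's functions. Inside a plain block, `?` returns from the enclosing function, so the binding never observes the error; a labeled block (stable since Rust 1.65) routes the error into the binding instead.

```rust
// Hypothetical stand-ins for the Propolis setup work and the code that needs
// to observe its result; these are not the PR's actual functions.
fn fallible_step() -> Result<(), String> {
    Err("setup failed".to_string())
}

fn cleanup(result: &Result<(), String>) {
    println!("observed setup result: {:?}", result);
}

// Broken: `?` inside a plain block propagates the error out of the *enclosing
// function*, so `setup_result` is never assigned an Err and `cleanup` is
// skipped on failure.
fn broken() -> Result<(), String> {
    let setup_result: Result<(), String> = {
        fallible_step()?;
        Ok(())
    };
    cleanup(&setup_result);
    Ok(())
}

// Fixed with a labeled block (stable since Rust 1.65): the error is routed
// into `setup_result` instead of returning early, so `cleanup` always runs.
fn fixed() -> Result<(), String> {
    let setup_result: Result<(), String> = 'setup: {
        if let Err(e) = fallible_step() {
            break 'setup Err(e);
        }
        Ok(())
    };
    cleanup(&setup_result);
    Ok(())
}

fn main() {
    assert!(broken().is_err()); // returned early; cleanup never ran
    assert!(fixed().is_ok()); // cleanup ran with the Err in hand
}
```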

Comment on lines 182 to 195
// If the instance isn't registered, then by definition it
// isn't running here. Allow requests to stop or destroy the
// instance to succeed to provide idempotency. This has to
// be handled here (that is, on the "instance not found"
// path) to handle the case where a stop request arrived,
// Propolis handled it, sled agent unregistered the
// instance, and only then did a second stop request
// arrive.
InstanceStateRequested::Stopped
| InstanceStateRequested::Destroyed => {
return Ok(InstancePutStateResponse {
updated_runtime: None,
});
}
Collaborator

(Not a request, more just thinking out loud)

This special casing makes me wonder if the sled agent should expose a separate API for fully "destroying the instance" instead of relying on particular state transitions to pull that off for us -- similar to how we divided "creation" away from "starting".

Contributor Author

Acknowledged. I think that makes sense, but will leave things as they are in this particular PR.

@smklein smklein removed their assignment Apr 12, 2023
@gjcolombo gjcolombo force-pushed the gjcolombo/lets-migrate/3-split-register-and-start branch from 6ae798c to 315bc91 on April 12, 2023 18:49
@gjcolombo
Contributor Author

Force-pushed to pick up #2807; there are no changes in this push. Will work on feedback this afternoon PDT.

Collaborator

@smklein smklein left a comment

Thanks for incorporating the feedback -- this looks good to merge!

@gjcolombo
Contributor Author

Thanks for taking the time to review! The merge conflicts are mostly in use declarations; will clean those up and merge once main's build is fixed.

@gjcolombo gjcolombo force-pushed the gjcolombo/lets-migrate/3-split-register-and-start branch from 09a68fb to 382a8c3 on April 12, 2023 22:52
Contributor

@luqmana luqmana left a comment

Woo, thanks for getting this together Greg! One small nit, but looks good to me. Also left some comments for possible follow-up.

Comment on lines 1336 to 1338
// TODO-correctness: This is dangerous if this step is replayed, since
// a user can discover this instance and ask to start it in between
// attempts to run this step.
Contributor

Perhaps we should just have Creating always transition to Stopped. That way if we're not starting the instance we can just avoid this instance_update_runtime call altogether. Anyways, that can be done in a followup.

Contributor Author

Another approach might be to make the "Creating" DB record the output of an earlier saga step instead of refetching on line 1324. That way, if the initial record has generation 1, all the attempts to run this step will attempt to transition to generation 2, but only the first will succeed. I added a comment describing this in 7e4c45b.
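
A rough sketch of that generation-guarded update, with hypothetical names and an in-memory struct standing in for what would really be a conditional UPDATE against CRDB:

```rust
/// Hypothetical stand-in for the instance's runtime-state row; only the
/// generation number matters for this sketch.
struct InstanceRowSketch {
    generation: u64,
    run_state: &'static str,
}

/// Models the guarded write: the update lands only if it advances the row's
/// generation by exactly one, which is what lets only the first replay of the
/// saga step succeed.
fn conditional_update(row: &mut InstanceRowSketch, new_state: &'static str, new_gen: u64) -> bool {
    if new_gen == row.generation + 1 {
        row.generation = new_gen;
        row.run_state = new_state;
        true
    } else {
        false
    }
}

fn main() {
    // The earlier saga node produced the "Creating" record at generation 1.
    let mut row = InstanceRowSketch { generation: 1, run_state: "Creating" };

    // The first execution of this step proposes generation 2 and wins.
    assert!(conditional_update(&mut row, "Stopped", 2));

    // Any replay proposes the same 1 -> 2 transition and loses, so it cannot
    // clobber state changes that happened in the meantime (e.g. a user start).
    assert!(!conditional_update(&mut row, "Stopped", 2));
    assert_eq!(row.run_state, "Stopped");
    assert_eq!(row.generation, 2);
}
```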

nexus/src/app/sagas/instance_create.rs (resolved review thread)
&opctx,
&authz_instance,
&db_instance,
InstanceStateRequested::Destroyed,
Contributor

Seems a little odd that the undo for ensure_registered is via instance_request_state.

Contributor

Ah, I see sean's similar comment in instance_manager.

Contributor Author

I took another look at this and realized that the existing Destroy code is actually wrong: it removes the instance ticket, but doesn't actually do any of the other cleanup that's needed to clean up a running instance. So I refactored this in a960d29 and introduced an explicit "unregister" call.

sled-agent/src/instance_manager.rs (resolved review thread)
Also ensure that explicitly destroying a running instance properly terminates
it.
@gjcolombo
Contributor Author

@smklein @luqmana I've added an explicit "unregister this instance" API to sled agent in commit a960d29. Could you PTAL when you have a free moment? It's not hugely complicated, I think, but it's a big enough change that I'd like to get at least one set of eyes on it before merging.

nexus/src/app/instance.rs (outdated; resolved review thread)
@gjcolombo gjcolombo merged commit e4a5dd0 into main Apr 14, 2023
20 checks passed
@gjcolombo gjcolombo deleted the gjcolombo/lets-migrate/3-split-register-and-start branch April 14, 2023 00:12
@gjcolombo
Contributor Author

Thanks as always for the reviews!
