Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instance start needs to be a saga #2824

Closed
gjcolombo opened this issue Apr 12, 2023 · 1 comment · Fixed by #3873
Closed

Instance start needs to be a saga #2824

gjcolombo opened this issue Apr 12, 2023 · 1 comment · Fixed by #3873
Assignees
Labels
nexus Related to nexus
Milestone

Comments

@gjcolombo
Copy link
Contributor

Once #2765 lands, instance start will need to be a saga, since start requires two separate actions (registering an instance + starting an instance) that must either both complete or both be undone.

This will also be needed to fix #2315, since that will add a "assign this instance to a sled & allocate resources for it" step to this saga.

@gjcolombo
Copy link
Contributor Author

I'm actively working on this.

gjcolombo added a commit that referenced this issue Aug 13, 2023
Create a saga that starts instances. This has the following immediate
benefits:

- It's no longer possible to leak an instance registration during start;
previously this could happen if Nexus crashed while handling a start
call.
- The saga synchronizes properly with concurrent attempts to delete an
instance; the existing start routine may not be handling this correctly
(it can look up an instance and decide it's OK to start, then start
talking to sled agent about it while a deletion saga runs concurrently
and deletes the instance).
- The saga establishes networking state (Dendrite NAT entries, OPTE V2P
mappings) for a newly started instance if it wasn't previously
established. This is a stopgap measure to ensure that this state exists
when restarting an instance after a cluster is restarted. It should
eventually be replaced by a step that triggers the appropriate
networking RPW(s).

This saga can also be used, at least in theory, as a subsaga of the
instance create saga to replace that saga's logic for starting a
newly-created instance. This work isn't done in this PR, though. (The
change isn't trivial because the new start saga expects a prior instance
record as a parameter, and the create saga can't construct *a priori*
the instance record it intends to insert into CRDB.)

Tested via assorted new cargo tests and by launching a dev cluster with
the changes, stopping an instance, restarting it, and verifying that the
instance restarted correctly and that Nexus logs contained the expected
log lines.

Fixes #2824. Fixes #3813.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
nexus Related to nexus
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant