New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Instance start needs to be a saga #2824
Comments
This was referenced Jun 19, 2023
9 tasks
13 tasks
I'm actively working on this. |
gjcolombo
added a commit
that referenced
this issue
Aug 13, 2023
Create a saga that starts instances. This has the following immediate benefits: - It's no longer possible to leak an instance registration during start; previously this could happen if Nexus crashed while handling a start call. - The saga synchronizes properly with concurrent attempts to delete an instance; the existing start routine may not be handling this correctly (it can look up an instance and decide it's OK to start, then start talking to sled agent about it while a deletion saga runs concurrently and deletes the instance). - The saga establishes networking state (Dendrite NAT entries, OPTE V2P mappings) for a newly started instance if it wasn't previously established. This is a stopgap measure to ensure that this state exists when restarting an instance after a cluster is restarted. It should eventually be replaced by a step that triggers the appropriate networking RPW(s). This saga can also be used, at least in theory, as a subsaga of the instance create saga to replace that saga's logic for starting a newly-created instance. This work isn't done in this PR, though. (The change isn't trivial because the new start saga expects a prior instance record as a parameter, and the create saga can't construct *a priori* the instance record it intends to insert into CRDB.) Tested via assorted new cargo tests and by launching a dev cluster with the changes, stopping an instance, restarting it, and verifying that the instance restarted correctly and that Nexus logs contained the expected log lines. Fixes #2824. Fixes #3813.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Once #2765 lands, instance start will need to be a saga, since start requires two separate actions (registering an instance + starting an instance) that must either both complete or both be undone.
This will also be needed to fix #2315, since that will add a "assign this instance to a sled & allocate resources for it" step to this saga.
The text was updated successfully, but these errors were encountered: