
Transient failures in instance state monitor cause instances to be lost at sea #2727

Open
gjcolombo opened this issue Mar 31, 2023 · 2 comments
Labels: bug (Something that isn't working.), Sled Agent (Related to the Per-Sled Configuration and Management)
Milestone: MVP

Comments

@gjcolombo (Contributor)

Sled agent spawns a state monitor task per instance that calls Propolis's instance_state_monitor endpoint, watches for VM state transitions, processes them, and relays any resulting instance state changes to Nexus. Any error in Instance::monitor_state_task or any of its callees bubbles up through the task and causes it to exit. After this, nothing monitors Propolis for subsequent state changes or notifies Nexus if they occur.

The DNS resolution error in #2726 brought this to my immediate attention, but any failure in the monitoring task will do the trick; another not-too-esoteric one is a Propolis panic that causes the call to instance_state_monitor to fail.
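
To make the failure mode concrete, here is a rough sketch of the task's shape (hypothetical types and names, not the actual sled agent source):

```rust
use std::sync::Arc;

// Hypothetical stand-ins for the real sled-agent/Propolis types; illustrative only.
struct PropolisClient;
struct MonitorResponse {
    gen: u64,
}
#[derive(Debug)]
struct MonitorError;

impl PropolisClient {
    // Long-polls Propolis for the next state transition; fails if, e.g.,
    // Propolis panics and the connection drops.
    async fn instance_state_monitor(&self, _gen: u64) -> Result<MonitorResponse, MonitorError> {
        Ok(MonitorResponse { gen: 0 })
    }
}

struct Instance {
    propolis_client: PropolisClient,
}

impl Instance {
    // Relays the observed state to Nexus; fails on, e.g., a DNS resolution error.
    async fn publish_state_to_nexus(&self, _r: &MonitorResponse) -> Result<(), MonitorError> {
        Ok(())
    }
}

// The shape of the bug: any `?` below ends the task, nothing restarts it, and
// later Propolis transitions are never observed or reported to Nexus.
async fn monitor_state_task(instance: Arc<Instance>) -> Result<(), MonitorError> {
    let mut gen = 0;
    loop {
        let response = instance.propolis_client.instance_state_monitor(gen).await?;
        gen = response.gen + 1;
        instance.publish_state_to_nexus(&response).await?;
    }
}
```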

gjcolombo added the bug and Sled Agent labels Mar 31, 2023
@smklein (Collaborator) commented Apr 12, 2023

Good catch. I'm looking at this bug in the context of #2765.

A quick fix here would be to add a retry loop around these external calls in the state monitor (e.g., if we cannot contact Nexus, retry with exponential backoff).

However, this raises the question: if we fail to alert Nexus for long enough, or if we fail to monitor the instance, is that a valid condition for "setting the instance to failed"?
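
For illustration, a bounded retry with capped exponential backoff might look something like this (a hypothetical helper, not omicron's actual backoff machinery; the attempt cap is the natural place to hang a "give up and mark the instance Failed" decision):

```rust
use std::future::Future;
use std::time::Duration;

/// Retry a fallible call (e.g. a notification to Nexus) with capped
/// exponential backoff. Returns the last error once the attempt budget is
/// spent so the caller can decide what "giving up" should mean.
async fn retry_with_backoff<F, Fut, E>(mut call: F) -> Result<(), E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<(), E>>,
{
    const MAX_ATTEMPTS: u32 = 10;
    let mut delay = Duration::from_millis(250);
    let max_delay = Duration::from_secs(30);

    for attempt in 1..=MAX_ATTEMPTS {
        match call().await {
            Ok(()) => return Ok(()),
            // Out of attempts: surface the error instead of retrying forever.
            Err(e) if attempt == MAX_ATTEMPTS => return Err(e),
            Err(_) => {
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(max_delay);
            }
        }
    }
    unreachable!("loop always returns on success or on the final attempt")
}
```

Whether the budget should be a fixed attempt count, a wall-clock deadline, or unbounded, and whether exhausting it should drive the instance to Failed, is exactly the policy question above.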

@askfongjojo
> However, this raises the question: if we fail to alert Nexus for long enough, or if we fail to monitor the instance, is that a valid condition for "setting the instance to failed"?

It makes sense to give up after reaching a certain threshold. I mentioned in an earlier discussion today (in the context of sled fault tolerance) that we probably shouldn't automatically reap the failed instances because they could still be working perfectly from the end-user's POV. What may be a good compromise is some amount of human intervention, e.g. the failed instance status handling described here.

morlandi7 added this to the MVP milestone May 9, 2023
gjcolombo added a commit that referenced this issue May 15, 2023
gjcolombo added a commit that referenced this issue May 17, 2023
gjcolombo added a commit that referenced this issue May 17, 2023
Whenever Nexus gets a new instance runtime state from a sled agent,
compare the state to the existing runtime state to see if applying the
new state will update the instance's Propolis generation. If it will,
use the sled ID in the new record to create updated OPTE V2P mappings
and Dendrite NAT entries for the instance.

Retry with backoff when sled agent fails to publish a state update to
Nexus. This was required for correctness anyway (see #2727) but is
especially important now that there are many more ways for Nexus to fail
to apply a state update. See the comments in the new code for more
details.

In the future, it might be better to update this configuration using a
reliable persistent workflow that's triggered by Propolis location
changes. This approach will require at least some additional work in
OPTE to assign generation numbers to V2P mappings (Dendrite might have a
similar problem but I'm not as familiar with the tables Nexus is trying
to maintain in this change).
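
A loose sketch of the generation comparison described in that commit message (field and type names are illustrative, not the actual Nexus schema):

```rust
// Illustrative only: the real runtime-state record carries more fields and
// lives in Nexus's database model.
struct InstanceRuntimeState {
    propolis_generation: u64,
    sled_id: String,
}

/// Returns true when applying `incoming` would advance the instance's
/// Propolis generation, i.e. the VMM has moved and OPTE V2P mappings and
/// Dendrite NAT entries should be recreated against the sled named in the
/// incoming record.
fn propolis_generation_advanced(
    existing: &InstanceRuntimeState,
    incoming: &InstanceRuntimeState,
) -> bool {
    incoming.propolis_generation > existing.propolis_generation
}
```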