
Transient failures in instance state monitor cause instances to be lost at sea #2727

Open
gjcolombo opened this issue Mar 31, 2023 · 2 comments
Labels: bug (Something that isn't working.), Sled Agent (Related to the Per-Sled Configuration and Management)
Milestone: MVP

Comments

@gjcolombo (Contributor)

Sled agent spawns a state monitor task per instance that calls Propolis's instance_state_monitor endpoint, watches for VM state transitions, processes them, and relays any resulting instance state changes to Nexus. Any error in Instance::monitor_state_task or any of its callees bubbles up through the task and causes it to exit. After this, nothing monitors Propolis for subsequent state changes or notifies Nexus if they occur.

The DNS resolution error in #2726 brought this to my immediate attention, but any failure in the monitoring task will do the trick; another not-too-esoteric one is a Propolis panic that causes the call to instance_state_monitor to fail.
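
To make the failure mode concrete, here is a rough sketch of the task's shape (hypothetical types and names, not the actual sled agent source):

```rust
use std::sync::Arc;

// Hypothetical stand-ins for the real sled-agent/Propolis types; illustrative only.
struct PropolisClient;
struct MonitorResponse {
    gen: u64,
}
#[derive(Debug)]
struct MonitorError;

impl PropolisClient {
    // Long-polls Propolis for the next state transition; fails if, e.g.,
    // Propolis panics and the connection drops.
    async fn instance_state_monitor(&self, _gen: u64) -> Result<MonitorResponse, MonitorError> {
        Ok(MonitorResponse { gen: 0 })
    }
}

struct Instance {
    propolis_client: PropolisClient,
}

impl Instance {
    // Relays the observed state to Nexus; fails on, e.g., a DNS resolution error.
    async fn publish_state_to_nexus(&self, _r: &MonitorResponse) -> Result<(), MonitorError> {
        Ok(())
    }
}

// The shape of the bug: any `?` below ends the task, nothing restarts it, and
// later Propolis transitions are never observed or reported to Nexus.
async fn monitor_state_task(instance: Arc<Instance>) -> Result<(), MonitorError> {
    let mut gen = 0;
    loop {
        let response = instance.propolis_client.instance_state_monitor(gen).await?;
        gen = response.gen + 1;
        instance.publish_state_to_nexus(&response).await?;
    }
}
```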

gjcolombo added the bug and Sled Agent labels Mar 31, 2023
@smklein (Collaborator) commented Apr 12, 2023

Good catch. I'm looking at this bug in the context of #2765.

A quick fix here would be to add a retry loop around these external calls in the state monitor (e.g., if we cannot contact Nexus, retry with exponential backoff).

However, this raises the question: if we fail to alert Nexus for long enough, or if we fail to monitor the instance, is that a valid condition for "setting the instance to failed"?
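
For illustration, a bounded retry with capped exponential backoff might look something like this (a hypothetical helper, not omicron's actual backoff machinery; the attempt cap is the natural place to hang a "give up and mark the instance Failed" decision):

```rust
use std::future::Future;
use std::time::Duration;

/// Retry a fallible call (e.g. a notification to Nexus) with capped
/// exponential backoff. Returns the last error once the attempt budget is
/// spent so the caller can decide what "giving up" should mean.
async fn retry_with_backoff<F, Fut, E>(mut call: F) -> Result<(), E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<(), E>>,
{
    const MAX_ATTEMPTS: u32 = 10;
    let mut delay = Duration::from_millis(250);
    let max_delay = Duration::from_secs(30);

    for attempt in 1..=MAX_ATTEMPTS {
        match call().await {
            Ok(()) => return Ok(()),
            // Out of attempts: surface the error instead of retrying forever.
            Err(e) if attempt == MAX_ATTEMPTS => return Err(e),
            Err(_) => {
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(max_delay);
            }
        }
    }
    unreachable!("loop always returns on success or on the final attempt")
}
```

Whether the budget should be a fixed attempt count, a wall-clock deadline, or unbounded, and whether exhausting it should drive the instance to Failed, is exactly the policy question above.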

@askfongjojo
> However, this raises the question: if we fail to alert Nexus for long enough, or if we fail to monitor the instance, is that a valid condition for "setting the instance to failed"?

It makes sense to give up after reaching a certain threshold. I mentioned in an earlier discussion today (in the context of sled fault tolerance) that we probably shouldn't automatically reap the failed instances because they could still be working perfectly from the end-user's POV. What may be a good compromise is some amount of human intervention, e.g. the failed instance status handling described here.

morlandi7 added this to the MVP milestone May 9, 2023
gjcolombo added a commit that referenced this issue May 15, 2023
gjcolombo added a commit that referenced this issue May 17, 2023
gjcolombo added a commit that referenced this issue May 17, 2023
Whenever Nexus gets a new instance runtime state from a sled agent,
compare the state to the existing runtime state to see if applying the
new state will update the instance's Propolis generation. If it will,
use the sled ID in the new record to create updated OPTE V2P mappings
and Dendrite NAT entries for the instance.

Retry with backoff when sled agent fails to publish a state update to
Nexus. This was required for correctness anyway (see #2727) but is
especially important now that there are many more ways for Nexus to fail
to apply a state update. See the comments in the new code for more
details.

In the future, it might be better to update this configuration using a
reliable persistent workflow that's triggered by Propolis location
changes. This approach will require at least some additional work in
OPTE to assign generation numbers to V2P mappings (Dendrite might have a
similar problem but I'm not as familiar with the tables Nexus is trying
to maintain in this change).
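
A loose sketch of the generation comparison described in that commit message (field and type names are illustrative, not the actual Nexus schema):

```rust
// Illustrative only: the real runtime-state record carries more fields and
// lives in Nexus's database model.
struct InstanceRuntimeState {
    propolis_generation: u64,
    sled_id: String,
}

/// Returns true when applying `incoming` would advance the instance's
/// Propolis generation, i.e. the VMM has moved and OPTE V2P mappings and
/// Dendrite NAT entries should be recreated against the sled named in the
/// incoming record.
fn propolis_generation_advanced(
    existing: &InstanceRuntimeState,
    incoming: &InstanceRuntimeState,
) -> bool {
    incoming.propolis_generation > existing.propolis_generation
}
```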