Transient failures in instance state monitor cause instances to be lost at sea #2727
Comments
Good catch, I'm looking at this bug in the context of #2765. A quick fix here would be to add a retry loop around these external calls in the state monitor (e.g., if we cannot contact Nexus, retry with exponential backoff). However, this raises the question: if we fail to alert Nexus for long enough, or if we fail to monitor the instance, is that a valid condition for "setting the instance to failed"?
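For illustration, a minimal sketch of what such a retry loop might look like (this is not the actual sled-agent code; the `notify_nexus` callback, attempt cap, and backoff parameters are all placeholders, and it assumes a tokio runtime):

```rust
use std::time::Duration;

/// Illustrative only: retry a Nexus notification with exponential backoff,
/// giving up after a fixed number of attempts. `notify_nexus` stands in for
/// whatever call the state monitor actually makes to publish state to Nexus.
async fn notify_with_backoff<F, Fut, E>(mut notify_nexus: F) -> Result<(), E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<(), E>>,
{
    const MAX_ATTEMPTS: u32 = 8;
    let mut attempts = 0;
    let mut delay = Duration::from_millis(250);
    loop {
        match notify_nexus().await {
            Ok(()) => return Ok(()),
            Err(e) => {
                attempts += 1;
                if attempts >= MAX_ATTEMPTS {
                    // Out of attempts: surface the error to the caller, which
                    // is where the "mark the instance failed?" policy question
                    // from this comment would have to be answered.
                    return Err(e);
                }
                tokio::time::sleep(delay).await;
                // Exponential backoff, capped so a long outage doesn't grow
                // the delay without bound.
                delay = (delay * 2).min(Duration::from_secs(30));
            }
        }
    }
}
```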
It makes sense to give up after reaching a certain threshold. I mentioned in an earlier discussion today (in the context of sled fault tolerance) that we probably shouldn't automatically reap failed instances, because they could still be working perfectly from the end user's POV. A good compromise may be some amount of human intervention, e.g., the failed-instance status handling described here.
Whenever Nexus gets a new instance runtime state from a sled agent, compare the state to the existing runtime state to see if applying the new state will update the instance's Propolis generation. If it will, use the sled ID in the new record to create updated OPTE V2P mappings and Dendrite NAT entries for the instance. Retry with backoff when the sled agent fails to publish a state update to Nexus. This was required for correctness anyway (see #2727) but is especially important now that there are many more ways for Nexus to fail to apply a state update. See the comments in the new code for more details.

In the future, it might be better to update this configuration using a reliable persistent workflow that's triggered by Propolis location changes. This approach will require at least some additional work in OPTE to assign generation numbers to V2P mappings (Dendrite might have a similar problem, but I'm not as familiar with the tables Nexus is trying to maintain in this change).
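As a rough illustration of the generation check described above (the types and function here are simplified stand-ins, not Nexus's actual data model or code):

```rust
use uuid::Uuid;

/// Illustrative stand-in for the runtime-state record Nexus keeps; the real
/// type lives in Nexus's database model.
struct InstanceRuntimeState {
    propolis_generation: u64,
    sled_id: Uuid,
}

/// Only refresh OPTE V2P mappings and Dendrite NAT entries when applying the
/// incoming record would advance the instance's Propolis generation. Returns
/// the sled ID from the *new* record, since that's where Propolis now lives.
fn network_config_target(
    current: &InstanceRuntimeState,
    incoming: &InstanceRuntimeState,
) -> Option<Uuid> {
    if incoming.propolis_generation > current.propolis_generation {
        Some(incoming.sled_id)
    } else {
        None
    }
}
```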
Sled agent spawns a state monitor task per instance that calls Propolis's `instance_state_monitor` endpoint, watches for VM state transitions, processes them, and relays any resulting instance state changes to Nexus. Any error in `Instance::monitor_state_task` or any of its callees bubbles up through the task and causes it to exit. After this, nothing monitors Propolis for subsequent state changes or notifies Nexus if they occur.

The DNS resolution error in #2726 brought this to my immediate attention, but any failure in the monitoring task will do the trick; another not-too-esoteric one is a Propolis panic that causes the call to `instance_state_monitor` to fail.
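For context, a condensed sketch of the shape of that monitoring loop, with the async plumbing and the real Propolis and Nexus clients abstracted into placeholder callbacks; the point is just that `?` propagates any error out of the loop, the task ends, and nothing restarts it:

```rust
/// Placeholder types for the sketch.
#[derive(Debug, Clone)]
struct VmState;
#[derive(Debug)]
struct MonitorError;

/// Condensed, illustrative shape of the per-instance monitor. The real
/// `Instance::monitor_state_task` is an async task talking to Propolis and
/// Nexus over HTTP; what matters here is that any error returns from the
/// whole function, so the instance is no longer monitored afterward.
fn monitor_state_loop(
    mut poll_propolis: impl FnMut() -> Result<VmState, MonitorError>,
    mut publish_to_nexus: impl FnMut(VmState) -> Result<(), MonitorError>,
) -> Result<(), MonitorError> {
    loop {
        // e.g. Propolis panicked and the instance_state_monitor call failed...
        let state = poll_propolis()?;
        // ...or DNS resolution for Nexus failed (#2726). Either way the task
        // exits here and subsequent state changes go unnoticed.
        publish_to_nexus(state)?;
    }
}
```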