Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop all services if core settings upgrade fails #601

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

sisuresh
Copy link
Contributor

Resolves #547

An example of the failure that this can result in can be seen here - sisuresh#1.

Copy link
Member

@leighmcculloch leighmcculloch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have this issue on a larger stage that for this image when things go wrong, it rarely stops running. Sometimes this is okay, as it would be actually more disruptive if the failure of a single service that a dev might not even be using fails to start. But routinely these quiet failures cause us to miss a failure until some later point where we are debugging the problem indirectly.

I'm saying this because I spent yesterday debugging a failure in another part of the start script where it just kept on going, with a small error in the logs that was not easy to spot.

Instead of the change here, I'm wondering if we should set some different failure configs so that if anything goes wrong we hard exit the script, and so at the location of this check we'd do that instead of stopping services.

@sisuresh
Copy link
Contributor Author

We have this issue on a larger stage that for this image when things go wrong, it rarely stops running. Sometimes this is okay, as it would be actually more disruptive if the failure of a single service that a dev might not even be using fails to start. But routinely these quiet failures cause us to miss a failure until some later point where we are debugging the problem indirectly.

I'm saying this because I spent yesterday debugging a failure in another part of the start script where it just kept on going, with a small error in the logs that was not easy to spot.

Instead of the change here, I'm wondering if we should set some different failure configs so that if anything goes wrong we hard exit the script, and so at the location of this check we'd do that instead of stopping services.

The change in this PR prevents the github actions from passing, so we'll never publish a broken image that does this type of error checking. Hard exiting would be ideal though, but I'm not sure how to do that. If that's possible, then we should go that route.

@leighmcculloch
Copy link
Member

Looks like there are still situations where the upgrade can fail, such as if the file needs updating due to an xdr change, and the image can keep running.

@leighmcculloch leighmcculloch enabled auto-merge (squash) May 1, 2024 13:37
@sisuresh
Copy link
Contributor Author

sisuresh commented May 1, 2024

Looks like there are still situations where the upgrade can fail, such as if the file needs updating due to an xdr change, and the image can keep running.

Argh yeah good point. This check requires the previous steps to pass. A "health" check at the end where we validate that the upgrade went through would be better. The fact that the errors are just swallowed by the service is annoying though.

@sisuresh
Copy link
Contributor Author

sisuresh commented May 1, 2024

I'm looking into doing something like https://serverfault.com/a/922943.

@sisuresh
Copy link
Contributor Author

sisuresh commented May 1, 2024

@leighmcculloch was the error you ran into in the start script or within a service? I made a change to just trap on ERR and kill supervisor, but it'll only trap in the upgrade_local function for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Soroban settings upgrade failure is silent
2 participants