New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stop all services if core settings upgrade fails #601
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have this issue on a larger stage that for this image when things go wrong, it rarely stops running. Sometimes this is okay, as it would be actually more disruptive if the failure of a single service that a dev might not even be using fails to start. But routinely these quiet failures cause us to miss a failure until some later point where we are debugging the problem indirectly.
I'm saying this because I spent yesterday debugging a failure in another part of the start script where it just kept on going, with a small error in the logs that was not easy to spot.
Instead of the change here, I'm wondering if we should set some different failure configs so that if anything goes wrong we hard exit the script, and so at the location of this check we'd do that instead of stopping services.
The change in this PR prevents the github actions from passing, so we'll never publish a broken image that does this type of error checking. Hard exiting would be ideal though, but I'm not sure how to do that. If that's possible, then we should go that route. |
Looks like there are still situations where the upgrade can fail, such as if the file needs updating due to an xdr change, and the image can keep running. |
Argh yeah good point. This check requires the previous steps to pass. A "health" check at the end where we validate that the upgrade went through would be better. The fact that the errors are just swallowed by the service is annoying though. |
I'm looking into doing something like https://serverfault.com/a/922943. |
@leighmcculloch was the error you ran into in the start script or within a service? I made a change to just trap on ERR and kill supervisor, but it'll only trap in the |
Resolves #547
An example of the failure that this can result in can be seen here - sisuresh#1.