Stop all services if core settings upgrade fails #601

sisuresh · 2024-04-26T23:37:18Z

Resolves #547

An example of the failure this can cause is in sisuresh#1.

leighmcculloch

We have this issue on a larger scale with this image: when things go wrong, it rarely stops running. Sometimes this is okay, as it would actually be more disruptive if a single service that a dev might not even be using failed to start and took the whole image down. But routinely these quiet failures cause us to miss a failure until some later point, when we are debugging the problem indirectly.

I'm saying this because I spent yesterday debugging a failure in another part of the start script where it just kept on going, with a small error in the logs that was not easy to spot.

Instead of the change here, I'm wondering if we should set some different failure configs so that if anything goes wrong we hard exit the script, and so at the location of this check we'd do that instead of stopping services.
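For illustration, one common way to get a hard exit in a bash script is its fail-fast options. The sketch below is not the image's start script: `run_strict` and the failing `false` command are hypothetical stand-ins, and it only contrasts a script that keeps going after a failure with one that stops.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: not the image's real start script.

# run_strict runs a command string in a child bash with fail-fast options:
#   -e           exit on the first command that returns non-zero
#   -u           treat unset variables as errors
#   -o pipefail  a pipeline fails if any stage in it fails
run_strict() {
  bash -e -u -o pipefail -c "$1"
}

# Without the options, the failure is swallowed and the script keeps going.
lenient_status=0
bash -c 'false; echo "still running"' >/dev/null || lenient_status=$?

# With the options, the script stops at the failing command.
strict_status=0
run_strict 'false; echo "still running"' >/dev/null || strict_status=$?

echo "lenient exit: $lenient_status, strict exit: $strict_status"
```

Note that `-e` is ignored for commands that run as part of an `if` condition or on the left of `&&` or `||`, so it would not catch every failure on its own.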

sisuresh · 2024-04-30T16:29:16Z

With the change in this PR, the GitHub Actions run fails when this check fails, so we'll never publish an image that is broken in a way this type of error checking catches. Hard exiting would be ideal, but I'm not sure how to do that. If that's possible, then we should go that route.

leighmcculloch · 2024-05-01T13:32:06Z

Looks like there are still situations where the upgrade can fail, such as if the file needs updating due to an xdr change, and the image can keep running.

sisuresh · 2024-05-01T16:22:25Z

Argh yeah good point. This check requires the previous steps to pass. A "health" check at the end where we validate that the upgrade went through would be better. The fact that the errors are just swallowed by the service is annoying though.

sisuresh · 2024-05-01T16:51:23Z

I'm looking into doing something like https://serverfault.com/a/922943.

sisuresh · 2024-05-01T18:34:15Z

@leighmcculloch was the error you ran into in the start script or within a service? I made a change to just trap on ERR and kill supervisor, but it'll only trap in the upgrade_local function for now.
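A minimal sketch of the trap-on-ERR idea, assuming bash. Only the `upgrade_local` name comes from the comment above; `stop_all_services`, the failing `false` step, and the echoed messages are hypothetical stand-ins, and the real change would signal the supervisor process rather than print a message.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: the function bodies are stand-ins, not the PR's code.
# The demo runs in a child bash (fed through a here-document) so that its
# `exit 1` ends only the demo and not the calling shell.

demo_status=0
demo_output=$(bash 2>&1 <<'EOF'
stop_all_services() {
  # Assumption: the real script would tell the supervisor to shut down here.
  echo "stopping all services"
}

upgrade_local() {
  # Install the ERR trap when this function starts; any later command that
  # returns non-zero runs the trap, which stops services and exits.
  trap 'stop_all_services; exit 1' ERR
  echo "applying upgrade"
  false                    # stand-in for a failing upgrade step
  echo "upgrade applied"   # not reached
}

upgrade_local
echo "services started"    # not reached
EOF
) || demo_status=$?

printf '%s\nexit status: %s\n' "$demo_output" "$demo_status"
```

The ERR trap follows the same exceptions as `set -e`: it does not fire for a failing command in an `if` condition or on the left of `&&` or `||`. Functions do not inherit an ERR trap set by their caller unless `set -E` is on, which is one reason to install it inside the function that needs it.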

Stop all services if core settings upgrade fails

0407c44

sisuresh requested a review from leighmcculloch April 26, 2024 23:37

leighmcculloch reviewed Apr 30, 2024

Merge branch 'master' into check2

967359d

leighmcculloch enabled auto-merge (squash) May 1, 2024 13:37

Trap on ERR

473bb55
