Right now we don't terminate failed repair jobs by default. The problem is that they might have failed because of a timeout on our side and in fact still be running. This causes two problems:
In case of a timeout, SM believes that the task has failed and stopped running, so it schedules new repair jobs for the "released" hosts. This can break the one-job-per-host rule.
The goal is to call the Scylla API to kill a repair job that timed out on the job status check, to ensure that the job is no longer being handled by the Scylla server.
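A minimal sketch of the kill call, assuming Scylla's REST API exposes a `POST /storage_service/force_terminate_repair` endpoint on the default API port 10000 (verify both against the API of the target Scylla version; the function names here are illustrative):

```python
import urllib.request

SCYLLA_API_PORT = 10000  # assumed default Scylla REST API port


def terminate_url(host: str, port: int = SCYLLA_API_PORT) -> str:
    # Endpoint path is an assumption based on the Scylla REST API;
    # check the API listing of the deployed version before relying on it.
    return f"http://{host}:{port}/storage_service/force_terminate_repair"


def force_terminate_repair(host: str) -> None:
    # Fire-and-check: a 200 response means Scylla accepted the
    # termination request for ongoing repairs on that node.
    req = urllib.request.Request(terminate_url(host), method="POST")
    with urllib.request.urlopen(req, timeout=10):
        pass
```

SM would call `force_terminate_repair` for the affected host whenever the status check times out, before scheduling any new job on it.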
The timeout for the repair status check is currently set to 30 minutes. @asias Any clue what would be the best timeout to set for waiting on the repair job status?
We need an integration test covering this scenario:
- Time out the repair job.
- Assert that the job is terminated and no longer running on the Scylla server.
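The assertion side of the test can be exercised against a fake Scylla endpoint before wiring up a real cluster. The sketch below (names and endpoint path are assumptions, not the actual SM test harness) stands up a local stub server, issues the termination call, and verifies the server saw it:

```python
import http.server
import threading
import urllib.request

terminated = []  # records termination requests received by the stub


class FakeScylla(http.server.BaseHTTPRequestHandler):
    """Stub standing in for the Scylla REST API in the test."""

    def do_POST(self):
        if self.path == "/storage_service/force_terminate_repair":
            terminated.append(True)
            self.send_response(200)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep test output quiet


def run_termination_check() -> bool:
    server = http.server.HTTPServer(("127.0.0.1", 0), FakeScylla)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    # The code under test would issue this call after the status timeout.
    url = f"http://127.0.0.1:{port}/storage_service/force_terminate_repair"
    with urllib.request.urlopen(
        urllib.request.Request(url, method="POST"), timeout=5
    ):
        pass
    server.shutdown()
    return bool(terminated)
```

A real integration test would instead poll the live Scylla API after the forced termination and assert no repair is still in progress on the host.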
This may create a need to control the timeout value via YAML or other configuration.
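If the timeout does become configurable, it could be exposed as an entry in the SM YAML config; the key names below are purely illustrative, not the actual scylla-manager configuration schema:

```yaml
# Hypothetical scylla-manager configuration (illustrative key names):
repair:
  # How long to wait on a repair job status check before the job is
  # considered failed and force-terminated on the Scylla side.
  status_timeout: 30m
```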