Right now we don't terminate failed repair jobs by default. The problem is that they might have failed because of a timeout on our side and in fact still be running. This causes two problems:
In case of a timeout, SM believes that the task has failed and stopped running, so it schedules new repair jobs for the "released" hosts. This can break the one-job-per-host rule.
The goal is to call the Scylla API to kill a repair job that timed out on the job status check, to ensure that the job is no longer being handled by the Scylla server.
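A minimal sketch of the kill call, assuming Scylla's REST API exposes a `POST /storage_service/force_terminate_repair` endpoint on the default API port 10000 (verify both against the API of the target Scylla version; the function names here are illustrative):

```python
import urllib.request

SCYLLA_API_PORT = 10000  # assumed default Scylla REST API port


def terminate_url(host: str, port: int = SCYLLA_API_PORT) -> str:
    # Endpoint path is an assumption based on the Scylla REST API;
    # check the API listing of the deployed version before relying on it.
    return f"http://{host}:{port}/storage_service/force_terminate_repair"


def force_terminate_repair(host: str) -> None:
    # Fire-and-check: a 200 response means Scylla accepted the
    # termination request for ongoing repairs on that node.
    req = urllib.request.Request(terminate_url(host), method="POST")
    with urllib.request.urlopen(req, timeout=10):
        pass
```

SM would call `force_terminate_repair` for the affected host whenever the status check times out, before scheduling any new job on it.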
The timeout for the repair status check is currently set to 30 minutes. @asias Any clue what would be the best timeout to set for waiting on the repair job status?
We need an integration test covering this scenario:
- Time out the repair job.
- Assert that the job is terminated and no longer running on the Scylla server.
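The assertion side of the test can be exercised against a fake Scylla endpoint before wiring up a real cluster. The sketch below (names and endpoint path are assumptions, not the actual SM test harness) stands up a local stub server, issues the termination call, and verifies the server saw it:

```python
import http.server
import threading
import urllib.request

terminated = []  # records termination requests received by the stub


class FakeScylla(http.server.BaseHTTPRequestHandler):
    """Stub standing in for the Scylla REST API in the test."""

    def do_POST(self):
        if self.path == "/storage_service/force_terminate_repair":
            terminated.append(True)
            self.send_response(200)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep test output quiet


def run_termination_check() -> bool:
    server = http.server.HTTPServer(("127.0.0.1", 0), FakeScylla)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    # The code under test would issue this call after the status timeout.
    url = f"http://127.0.0.1:{port}/storage_service/force_terminate_repair"
    with urllib.request.urlopen(
        urllib.request.Request(url, method="POST"), timeout=5
    ):
        pass
    server.shutdown()
    return bool(terminated)
```

A real integration test would instead poll the live Scylla API after the forced termination and assert no repair is still in progress on the host.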
This may create a need to control the timeout value via YAML or other configuration.
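If the timeout does become configurable, it could be exposed as an entry in the SM YAML config; the key names below are purely illustrative, not the actual scylla-manager configuration schema:

```yaml
# Hypothetical scylla-manager configuration (illustrative key names):
repair:
  # How long to wait on a repair job status check before the job is
  # considered failed and force-terminated on the Scylla side.
  status_timeout: 30m
```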