Commit
Support resurrecting blacklisted hosts
This adds support for resurrecting blacklisted hosts in elastic mode. Currently, hosts that get blacklisted remain in the blacklist for the lifetime of the job. That behavior cannot handle a transient host failure or a scale-up after a scale-down. This is especially a problem with the Kubeflow mpi-operator on Kubernetes, as it always gives pods known hostnames from its hostfile.

This patch allows blacklisted hosts to become whitelisted again after a cooldown period. For repeat failures the cooldown period grows with an exponential backoff delay: 10s, 20s, 30s. The cooldown period is capped at 5 minutes.

Signed-off-by: Abin Shahab <ashahab@linkedin.com>
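The cooldown behavior described above can be illustrated with a minimal sketch. It is not the code from this patch: the `HostCooldown` class, its method names, and the injectable `clock` parameter are hypothetical, and the schedule assumes a 10-second base that doubles on each repeat failure and is capped at 300 seconds (5 minutes). The actual schedule in the patch may differ.

```python
import time


class HostCooldown:
    """Hypothetical helper tracking per-host blacklist cooldowns.

    Assumption: delay doubles per repeat failure from a 10 s base,
    capped at 300 s. This is an illustration, not Horovod's API.
    """

    def __init__(self, base_delay=10.0, max_delay=300.0, clock=time.monotonic):
        self._base = base_delay
        self._max = max_delay
        self._clock = clock      # injectable so the logic can be tested
        self._failures = {}      # host -> number of times blacklisted
        self._until = {}         # host -> time at which the cooldown ends

    def delay_for(self, failures):
        """Cooldown length in seconds for the given failure count (>= 1)."""
        return min(self._base * 2 ** (failures - 1), self._max)

    def blacklist(self, host):
        """Record a failure and start (or restart) the host's cooldown."""
        count = self._failures.get(host, 0) + 1
        self._failures[host] = count
        self._until[host] = self._clock() + self.delay_for(count)

    def is_blacklisted(self, host):
        """Return True while the cooldown is running; resurrect afterwards."""
        until = self._until.get(host)
        if until is None:
            return False
        if self._clock() >= until:
            # Cooldown expired: the host is usable again, but its failure
            # count is kept so that the next failure backs off for longer.
            del self._until[host]
            return False
        return True
```

Keeping the failure count after a host is resurrected is what makes the delay grow for hosts that fail repeatedly; a real implementation might also reset the count after a long healthy period.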
Showing 4 changed files with 198 additions and 12 deletions.