Implement more forgiving host blacklist policy in elastic mode #1926
Comments
Any updates or plans? |
For the policy: if we want to implement this, does "removed gracefully" mean hosts that have been removed from the "discovery_hosts" result, so that we no longer need to blacklist those hosts? |
From my user experience with Kubernetes, we'd like to 'whitelist' back hosts that are already blacklisted. That means the blacklisting action is not permanent: if a blacklisted host reported by the |
@tgaddair hi, any updates or plans about this? |
@tingweiwu I will take a look at this, thanks for reaching out. |
@zw0610 In the case of Kubernetes, you mean pods that get blacklisted can often become available again? Does this occur because of intermittent networking issues? |
@ashahab Somewhat, yes, but not because of networking issues. The resurrection (not sure of a better word) of blacklisted pods results from the scheduler. When we deploy an elastic training job with low priority alongside other jobs with high priority, pods belonging to the elastic training job may get evicted by the scheduler when no compute resources are left, and restart later once resources are released by the higher-priority jobs. |
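For context, Horovod elastic resolves the current host set through the user-supplied host discovery script, so evicted pods naturally drop out of the host set and reappear once rescheduled. Below is a minimal sketch of such a script for this Kubernetes setup, assuming a hypothetical `app=elastic-worker` label, a `default` namespace, and one slot per pod (none of these names come from mpi-operator):

```python
#!/usr/bin/env python3
# Sketch of a --host-discovery-script for horovodrun in elastic mode.
# Prints one "hostname:slots" line per worker pod that is currently Running.
# Namespace, label selector, and slot count are illustrative assumptions.
from kubernetes import client, config

NAMESPACE = "default"
LABEL_SELECTOR = "app=elastic-worker"  # hypothetical worker label
SLOTS_PER_POD = 1                      # e.g. one GPU per pod

def main():
    config.load_incluster_config()     # use load_kube_config() when run outside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
    for pod in pods.items:
        if pod.status.phase == "Running" and pod.status.pod_ip:
            # mpi-operator workers keep stable hostnames, so an evicted pod
            # that is rescheduled shows up again under the same name.
            print(f"{pod.metadata.name}:{SLOTS_PER_POD}")

if __name__ == "__main__":
    main()
```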
Can’t these pods get unique hostnames/IP addresses when they are resurrected?
|
Unfortunately no, at least with the current design of kubeflow/mpi-operator. From the very beginning of the mpi-operator design, the worker pods have been stateful because the |
Got it, that makes sense. Just out of curiosity, do you know if the new dl-training operator in kubeflow has a better way to handle this situation? Kubernetes was created with the idea of ephemeral hostnames; I wonder if the new operator can return to that principle. |
Generally it is possible to use a random suffix[*], but unfortunately it's not on the roadmap we have planned for the next two releases. Meanwhile, such a change would definitely lead to inconsistency in the user experience. [*] Users expect mpi-operator to provide a consistent user experience when deploying both regular and elastic distributed training. To use a random suffix, we can move the creation of |
@zw0610 @tgaddair Should the cooldown policy be configurable? I have it working, but I'm wondering whether users would want to configure the following parameters:
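For discussion, here is one possible shape for those knobs; the parameter names and defaults below are placeholders for illustration, not the ones used in #3319:

```python
from dataclasses import dataclass

@dataclass
class BlacklistCooldownConfig:
    """Hypothetical knobs for a configurable blacklist cooldown policy."""
    cooldown_initial_delay: float = 120.0  # seconds before a failed host may be retried
    cooldown_backoff: float = 2.0          # multiplier applied on each repeated failure
    cooldown_max_delay: float = 3600.0     # upper bound on the cooldown delay

    def delay_for(self, failure_count: int) -> float:
        """Exponential backoff capped at cooldown_max_delay."""
        delay = self.cooldown_initial_delay * self.cooldown_backoff ** max(failure_count - 1, 0)
        return min(delay, self.cooldown_max_delay)
```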
We definitely hope the resurrection can be configured. But let me explain how I understand these three parameters so we can further discuss whether all of them are required in our cases. After checking #3319, I understand how these three parameters are used. It seems we need to explain to users how
|
@zw0610 I have introduced the cooldown parameter. I am currently defaulting it to no-cooldown, since this is a new parameter and behavior and I do not want existing users to be surprised. We can make it the default after a few releases of usage in larger clusters. @tgaddair What do you think? |
Currently, a process failure results in permanently blacklisting the offending host. This is problematic, as host failures are often transient, or a host may be temporarily removed and later restored (in the case of graceful removal).
As such, we should add two policies:
This will be a follow-up to #1849.
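A minimal sketch of what a more forgiving blacklist could look like, combining a time-based cooldown for transient failures with clearing entries for hosts that leave the discovered host set, so a gracefully removed host comes back clean when re-added. Class and method names here are illustrative, not Horovod's actual implementation:

```python
import time

class ForgivingBlacklist:
    """Sketch: blacklist entries expire after a cooldown, and gracefully
    removed hosts are forgiven when they drop out of discovery."""

    def __init__(self, cooldown_seconds: float = 300.0):
        self.cooldown_seconds = cooldown_seconds
        self._blacklisted_at = {}  # host -> timestamp of the last failure

    def blacklist(self, host: str) -> None:
        self._blacklisted_at[host] = time.time()

    def is_blacklisted(self, host: str) -> bool:
        failed_at = self._blacklisted_at.get(host)
        if failed_at is None:
            return False
        if time.time() - failed_at >= self.cooldown_seconds:
            # Cooldown elapsed: forgive the host so it can rejoin.
            del self._blacklisted_at[host]
            return False
        return True

    def forgive_removed_hosts(self, discovered_hosts: set) -> None:
        """Clear entries for hosts no longer reported by discovery, so a host
        that is removed gracefully and later re-added is not still blacklisted."""
        for host in list(self._blacklisted_at):
            if host not in discovered_hosts:
                del self._blacklisted_at[host]
```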