
Implement more forgiving host blacklist policy in elastic mode #1926

Closed

tgaddair opened this issue May 1, 2020 · 14 comments · Fixed by #3319

Comments

@tgaddair
Collaborator

tgaddair commented May 1, 2020

Currently, a process failure results in the permanent blacklisting of the offending host. This is problematic, as host failures are often transient, or a host may be temporarily removed and later restored (in the case of graceful removal).

As such, we should add two policies:

  1. Apply a cooldown to failing hosts so they can be assigned again once the cooldown period has elapsed (see the sketch below).
  2. Do not blacklist hosts that are gracefully removed from the discovered set.

This will be a follow-up to #1849.
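
As a rough illustration of policy 1 (hypothetical class and method names, not Horovod's actual elastic driver API), a cooldown-aware blacklist could record when each host failed and forgive it once the cooldown has elapsed:

```python
import time

class CooldownBlacklist:
    """Sketch of policy 1: blacklist entries expire after a cooldown period."""

    def __init__(self, cooldown_seconds=120):
        self.cooldown_seconds = cooldown_seconds
        self._failed_at = {}  # host -> timestamp of the most recent failure

    def blacklist(self, host):
        self._failed_at[host] = time.time()

    def is_blacklisted(self, host):
        failed_at = self._failed_at.get(host)
        if failed_at is None:
            return False
        if time.time() - failed_at >= self.cooldown_seconds:
            # Cooldown elapsed: forgive the host so it can be assigned again.
            del self._failed_at[host]
            return False
        return True
```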

@xychu

xychu commented Dec 10, 2020

Any updates or plans?
When used with Kubernetes, training worker pods may fail or be preempted and recover later; we should allow these pods to be added back.

@xychu

xychu commented Dec 11, 2020

For policy:

Do not blacklist hosts that are gracefully removed from the discovered set.

If we want to implement this, does "removed gracefully" mean hosts that have been removed from the discover_hosts result, and that we will not need to blacklist those hosts?

@zw0610
Contributor

zw0610 commented May 20, 2021

For policy:

Do not blacklist hosts that are gracefully removed from the discovered set.

If we want to implement this, does "removed gracefully" mean hosts that have been removed from the discover_hosts result, and that we will not need to blacklist those hosts?

From my experience with Kubernetes, we'd like to 'whitelist' back hosts that are already blacklisted. That is, the blacklisting should not be permanent: if a blacklisted host is reported by discover_hosts for a certain period (x times or y seconds), we can consider the host to be alive again and 'whitelist' it back. Of course, such a 'whitelist' action relies on trusting the timeliness of the discover_hosts script.
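
A rough sketch of that 'whitelist back' rule (hypothetical names, not Horovod code): a blacklisted host that keeps appearing in the discover_hosts output for a configurable number of consecutive refreshes is trusted to be alive again and removed from the blacklist.

```python
class SightingWhitelist:
    """Sketch: whitelist a blacklisted host after repeated discover_hosts sightings."""

    def __init__(self, required_sightings=3):
        self.required_sightings = required_sightings
        self._sightings = {}  # blacklisted host -> consecutive sightings

    def update(self, blacklist, discovered_hosts):
        for host in list(blacklist):
            if host in discovered_hosts:
                self._sightings[host] = self._sightings.get(host, 0) + 1
                if self._sightings[host] >= self.required_sightings:
                    blacklist.discard(host)          # host is back; whitelist it
                    del self._sightings[host]
            else:
                self._sightings.pop(host, None)      # reset on a missed report
        return blacklist


# Example: 'worker-3' is whitelisted back after three consecutive sightings.
blacklist = {'worker-3'}
whitelist = SightingWhitelist(required_sightings=3)
for _ in range(3):
    blacklist = whitelist.update(blacklist, {'worker-0', 'worker-1', 'worker-3'})
print(blacklist)  # set()
```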

@tingweiwu

@tgaddair Hi, any updates or plans on this?

@tgaddair tgaddair assigned ashahab and unassigned tgaddair Dec 9, 2021
@ashahab
Collaborator

ashahab commented Dec 9, 2021

@tingweiwu I will take a look at this, thanks for reaching out.

@ashahab
Collaborator

ashahab commented Dec 9, 2021

@zw0610 In the case of Kubernetes, do you mean that pods which get blacklisted can often become available again? Does this occur because of intermittent networking issues?

@zw0610
Contributor

zw0610 commented Dec 10, 2021

@ashahab In a way, yes, but not because of networking issues. The resurrection (not sure of a better word) of blacklisted pods results from the scheduler.

When we deploy an elastic training job with low priority alongside other jobs with high priority, the pods belonging to the elastic training job may get evicted by the scheduler when no compute resources are left, and later restart when resources are released by the higher-priority jobs.

@ashahab
Collaborator

ashahab commented Dec 10, 2021 via email

Can’t these pods get unique hostnames/IP addresses when they are resurrected?

@zw0610
Contributor

zw0610 commented Dec 10, 2021

Unfortunately no, at least not with the contemporary design of kubeflow/mpi-operator. From the very beginning of the mpi-operator design, the worker pods have been stateful, because the hostfile (which lists the hostnames of all worker pods) read by the launcher pod is generated before the worker pods are actually created. If we used a random suffix, it would be very difficult to generate such a hostfile.
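
To illustrate the constraint (hypothetical job name and a Horovod-style hostfile format; not code from mpi-operator): the hostfile can only be rendered ahead of time because the worker pod names are deterministic, derived from the job name and an index.

```python
# Hypothetical sketch: rendering a hostfile before any worker pod exists.
# This only works because pod names are deterministic (<job>-worker-<index>);
# with a random suffix the hostnames would be unknown at this point.
job_name = 'mnist-elastic'          # assumed job name
num_workers, slots_per_worker = 4, 1

hostfile = '\n'.join(
    f'{job_name}-worker-{i} slots={slots_per_worker}' for i in range(num_workers)
)
print(hostfile)
# mnist-elastic-worker-0 slots=1
# mnist-elastic-worker-1 slots=1
# ...
```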

@ashahab
Collaborator

ashahab commented Dec 10, 2021

Got it, that makes sense. Just out of curiosity, do you know if the new dl-training operator in Kubeflow has a better way to handle this situation? Kubernetes was created with the idea of ephemeral hostnames; I wonder if the new operator can return to that principle.

@zw0610
Contributor

zw0610 commented Dec 10, 2021

Generally it is possible to use a random suffix[*], but unfortunately it's not on the roadmap we have planned for the next two releases. Meanwhile, such a change would definitely lead to inconsistency in the user experience.

[*] Users expect mpi-operator to provide a consistent experience when deploying both regular and elastic distributed training. To use a random suffix, we could move the creation of the hostfile to an init container in the launcher pod, while the discover_hosts.sh script would still be generated by mpi-operator itself. Such a change should work fine, but may introduce a bit of confusion for developers, as they may spot differences between the hostfile and discover_hosts.sh.

@ashahab
Collaborator

ashahab commented Dec 13, 2021

@zw0610 @tgaddair Should the cooldown policy be configurable? I have it working, but I am wondering whether users would want to configure the following parameters (a sketch of how they might combine follows the list):

  • --cooldown-delta-seconds: number of seconds used to increment the delay before a blacklisted host can be resurrected.
  • --cooldown-lower-limit-seconds: minimum number of seconds a host has to wait in the blacklist.
  • --cooldown-upper-limit-seconds: maximum number of seconds a host has to wait in the blacklist.
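
To make the proposal concrete, here is one way the three flags could combine (illustrative only; the actual computation in #3319 may differ): each repeated failure increases the delay by cooldown-delta-seconds, clamped between the lower and upper limits.

```python
def cooldown_period(failure_count,
                    cooldown_delta_seconds=60,
                    cooldown_lower_limit_seconds=60,
                    cooldown_upper_limit_seconds=600):
    """Illustrative cooldown: grows with repeated failures, bounded by the limits."""
    delay = failure_count * cooldown_delta_seconds
    return max(cooldown_lower_limit_seconds,
               min(delay, cooldown_upper_limit_seconds))


# A host failing for the 1st, 3rd and 20th time waits 60s, 180s and 600s (capped).
print(cooldown_period(1), cooldown_period(3), cooldown_period(20))
```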

@zw0610
Contributor

zw0610 commented Dec 14, 2021

We definitely hope the resurrection can be configured. But let me explain how I understand these three parameters, so we can discuss further whether all of them are required in our cases.

After checking #3319, I understand how these three parameters are used. It seems we need to explain to users how cooldown_period is calculated, even in the parameter descriptions. So do you think we can combine cooldown-lower-limit-seconds and cooldown-upper-limit-seconds into one parameter, say cooldown-range, so that users can specify the range as python train-elastic.py --cooldown-range [100,200] (it does not have to be square brackets)?

  • --cooldown-lower-limit-seconds: after cooldown-lower-limit-seconds, if discover_hosts.sh reports a hostname that is already in the blacklist, we shall move the host out of the blacklist and launch a worker on that host.
  • --cooldown-upper-limit-seconds: after cooldown-upper-limit-seconds, if a blacklisted host has not been reported by discover_hosts.sh, are we going to permanently blacklist this host?

cooldown-delta-seconds is not ambiguous to me either. I just wonder what benefit we get from making such a parameter configurable.
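
For what it's worth, a single --cooldown-range flag could accept the bounds with or without brackets; the snippet below is a hypothetical sketch, not the argument handling actually used by horovodrun or #3319.

```python
import argparse

def parse_cooldown_range(value):
    # Accept '100,200' as well as '[100,200]'.
    lower, upper = (int(v) for v in value.strip('[]').split(','))
    if lower > upper:
        raise argparse.ArgumentTypeError('lower bound must not exceed upper bound')
    return lower, upper

parser = argparse.ArgumentParser()
parser.add_argument('--cooldown-range', type=parse_cooldown_range, default=None,
                    help='Blacklist cooldown bounds in seconds; '
                         'omit to keep the current permanent-blacklist behavior.')

args = parser.parse_args(['--cooldown-range', '[100,200]'])
print(args.cooldown_range)  # (100, 200)
```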

@ashahab
Collaborator

ashahab commented Dec 14, 2021

@zw0610 I have introduced the cooldown-range param as you've suggested. Please review.

I am currently defaulting it to no cooldown, since this is a new parameter and behavior and I do not want to surprise existing users. We can make it the default after a few releases of usage in larger clusters. @tgaddair, what do you think?
