
Implement more forgiving host blacklist policy in elastic mode #1926

Closed

tgaddair opened this issue May 1, 2020 · 14 comments · Fixed by #3319

Comments

@tgaddair
Collaborator

tgaddair commented May 1, 2020

Currently, a process failure results in the permanent blacklisting of the offending host. This is problematic, as host failures are often transient, or a host may be temporarily removed and later restored (in the case of graceful removal).

As such, we should add two policies:

  1. Apply a cooldown to failing hosts so they can be assigned again once the cooldown period has elapsed (see the sketch below).
  2. Do not blacklist hosts that are gracefully removed from the discovered set.

This will be a follow-up to #1849.
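
As a rough illustration of policy 1 (hypothetical class and method names, not Horovod's actual elastic driver API), a cooldown-aware blacklist could record when each host failed and forgive it once the cooldown has elapsed:

```python
import time

class CooldownBlacklist:
    """Sketch of policy 1: blacklist entries expire after a cooldown period."""

    def __init__(self, cooldown_seconds=120):
        self.cooldown_seconds = cooldown_seconds
        self._failed_at = {}  # host -> timestamp of the most recent failure

    def blacklist(self, host):
        self._failed_at[host] = time.time()

    def is_blacklisted(self, host):
        failed_at = self._failed_at.get(host)
        if failed_at is None:
            return False
        if time.time() - failed_at >= self.cooldown_seconds:
            # Cooldown elapsed: forgive the host so it can be assigned again.
            del self._failed_at[host]
            return False
        return True
```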

@xychu

xychu commented Dec 10, 2020

Any updates or plans?
When used with Kubernetes, training worker pods may fail or be preempted and recover later; we should allow these pods to be added back.

@xychu

xychu commented Dec 11, 2020

For policy:

Do not blacklist hosts that are gracefully removed from the discovered set.

If we want to implement this, does "removed gracefully" mean hosts that have been removed from the discover_hosts result, and that we will not need to blacklist those hosts?

@zw0610
Contributor

zw0610 commented May 20, 2021

For policy:

Do not blacklist hosts that are gracefully removed from the discovered set.

If we want to implement this, does "removed gracefully" mean hosts that have been removed from the discover_hosts result, and that we will not need to blacklist those hosts?

From my experience with Kubernetes, we'd like to 'whitelist' back hosts that are already blacklisted. That is, the blacklisting should not be permanent: if a blacklisted host is reported by discover_hosts for a certain period (x times or y seconds), we can consider the host to be alive again and 'whitelist' it back. Of course, such a 'whitelist' action relies on trusting the timeliness of the discover_hosts script.
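
A rough sketch of that 'whitelist back' rule (hypothetical names, not Horovod code): a blacklisted host that keeps appearing in the discover_hosts output for a configurable number of consecutive refreshes is trusted to be alive again and removed from the blacklist.

```python
class SightingWhitelist:
    """Sketch: whitelist a blacklisted host after repeated discover_hosts sightings."""

    def __init__(self, required_sightings=3):
        self.required_sightings = required_sightings
        self._sightings = {}  # blacklisted host -> consecutive sightings

    def update(self, blacklist, discovered_hosts):
        for host in list(blacklist):
            if host in discovered_hosts:
                self._sightings[host] = self._sightings.get(host, 0) + 1
                if self._sightings[host] >= self.required_sightings:
                    blacklist.discard(host)          # host is back; whitelist it
                    del self._sightings[host]
            else:
                self._sightings.pop(host, None)      # reset on a missed report
        return blacklist


# Example: 'worker-3' is whitelisted back after three consecutive sightings.
blacklist = {'worker-3'}
whitelist = SightingWhitelist(required_sightings=3)
for _ in range(3):
    blacklist = whitelist.update(blacklist, {'worker-0', 'worker-1', 'worker-3'})
print(blacklist)  # set()
```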

@tingweiwu

@tgaddair Hi, any updates or plans on this?

@tgaddair tgaddair assigned ashahab and unassigned tgaddair Dec 9, 2021
@ashahab
Collaborator

ashahab commented Dec 9, 2021

@tingweiwu I will take a look at this, thanks for reaching out.

@ashahab
Collaborator

ashahab commented Dec 9, 2021

@zw0610 In the case of Kubernetes, do you mean that pods which get blacklisted can often become available again? Does this occur because of intermittent networking issues?

@zw0610
Contributor

zw0610 commented Dec 10, 2021

@ashahab In a way, yes, but not because of networking issues. The resurrection (not sure of a better word) of blacklisted pods results from the scheduler.

When we deploy an elastic training job with low priority alongside other jobs with high priority, the pods belonging to the elastic training job may get evicted by the scheduler when no compute resources are left, and later restart when resources are released by the higher-priority jobs.

@ashahab
Collaborator

ashahab commented Dec 10, 2021 via email

Can’t these pods get unique hostnames/IP addresses when they are resurrected?

@zw0610
Contributor

zw0610 commented Dec 10, 2021

Unfortunately no, at least not with the contemporary design of kubeflow/mpi-operator. From the very beginning of the mpi-operator design, the worker pods have been stateful, because the hostfile (which lists the hostnames of all worker pods) read by the launcher pod is generated before the worker pods are actually created. If we used a random suffix, it would be very difficult to generate such a hostfile.
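
To illustrate the constraint (hypothetical job name and a Horovod-style hostfile format; not code from mpi-operator): the hostfile can only be rendered ahead of time because the worker pod names are deterministic, derived from the job name and an index.

```python
# Hypothetical sketch: rendering a hostfile before any worker pod exists.
# This only works because pod names are deterministic (<job>-worker-<index>);
# with a random suffix the hostnames would be unknown at this point.
job_name = 'mnist-elastic'          # assumed job name
num_workers, slots_per_worker = 4, 1

hostfile = '\n'.join(
    f'{job_name}-worker-{i} slots={slots_per_worker}' for i in range(num_workers)
)
print(hostfile)
# mnist-elastic-worker-0 slots=1
# mnist-elastic-worker-1 slots=1
# ...
```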

@ashahab
Collaborator

ashahab commented Dec 10, 2021

Got it, that makes sense. Just out of curiosity, do you know if the new dl-training operator in Kubeflow has a better way to handle this situation? Kubernetes was created with the idea of ephemeral hostnames; I wonder if the new operator can return to that principle.

@zw0610
Contributor

zw0610 commented Dec 10, 2021

Generally it is possible to use a random suffix[*], but unfortunately it's not on the roadmap we have planned for the next two releases. Meanwhile, such a change would definitely lead to inconsistency in the user experience.

[*] Users expect mpi-operator to provide a consistent experience when deploying both regular and elastic distributed training. To use a random suffix, we could move the creation of the hostfile to an init container in the launcher pod, while the discover_hosts.sh script would still be generated by mpi-operator itself. Such a change should work fine, but may introduce a bit of confusion for developers, as they may spot differences between the hostfile and discover_hosts.sh.

@ashahab
Collaborator

ashahab commented Dec 13, 2021

@zw0610 @tgaddair Should the cooldown policy be configurable? I have it working, but I am wondering whether users would want to configure the following parameters (a sketch of how they might combine follows the list):

  • --cooldown-delta-seconds: number of seconds used to increment the delay before a blacklisted host can be resurrected.
  • --cooldown-lower-limit-seconds: minimum number of seconds a host has to wait in the blacklist.
  • --cooldown-upper-limit-seconds: maximum number of seconds a host has to wait in the blacklist.
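
To make the proposal concrete, here is one way the three flags could combine (illustrative only; the actual computation in #3319 may differ): each repeated failure increases the delay by cooldown-delta-seconds, clamped between the lower and upper limits.

```python
def cooldown_period(failure_count,
                    cooldown_delta_seconds=60,
                    cooldown_lower_limit_seconds=60,
                    cooldown_upper_limit_seconds=600):
    """Illustrative cooldown: grows with repeated failures, bounded by the limits."""
    delay = failure_count * cooldown_delta_seconds
    return max(cooldown_lower_limit_seconds,
               min(delay, cooldown_upper_limit_seconds))


# A host failing for the 1st, 3rd and 20th time waits 60s, 180s and 600s (capped).
print(cooldown_period(1), cooldown_period(3), cooldown_period(20))
```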

@zw0610
Contributor

zw0610 commented Dec 14, 2021

We definitely hope the resurrection can be configured. But let me explain how I understand these three parameters, so we can discuss further whether all of them are required in our cases.

After checking #3319, I understand how these three parameters are used. It seems we need to explain to users how cooldown_period is calculated, even in the parameter descriptions. So do you think we can combine cooldown-lower-limit-seconds and cooldown-upper-limit-seconds into one parameter, say cooldown-range, so that users can specify the range as python train-elastic.py --cooldown-range [100,200] (it does not have to be square brackets)?

  • --cooldown-lower-limit-seconds: after cooldown-lower-limit-seconds, if discover_hosts.sh reports a hostname that is already in the blacklist, we shall move the host out of the blacklist and launch a worker on that host.
  • --cooldown-upper-limit-seconds: after cooldown-upper-limit-seconds, if a blacklisted host has not been reported by discover_hosts.sh, are we going to permanently blacklist this host?

cooldown-delta-seconds is not ambiguous to me either. I just wonder what benefit we get from making such a parameter configurable.
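
For what it's worth, a single --cooldown-range flag could accept the bounds with or without brackets; the snippet below is a hypothetical sketch, not the argument handling actually used by horovodrun or #3319.

```python
import argparse

def parse_cooldown_range(value):
    # Accept '100,200' as well as '[100,200]'.
    lower, upper = (int(v) for v in value.strip('[]').split(','))
    if lower > upper:
        raise argparse.ArgumentTypeError('lower bound must not exceed upper bound')
    return lower, upper

parser = argparse.ArgumentParser()
parser.add_argument('--cooldown-range', type=parse_cooldown_range, default=None,
                    help='Blacklist cooldown bounds in seconds; '
                         'omit to keep the current permanent-blacklist behavior.')

args = parser.parse_args(['--cooldown-range', '[100,200]'])
print(args.cooldown_range)  # (100, 200)
```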

@ashahab
Collaborator

ashahab commented Dec 14, 2021

@zw0610 I have introduced the cooldown-range param as you've suggested. Please review.

I am currently defaulting it to no cooldown, since this is a new parameter and behavior and I do not want to surprise existing users. We can make it the default after a few releases of usage in larger clusters. @tgaddair, what do you think?
