Question regarding source code #207

eedalong · 2022-03-29T07:45:54Z

Elastic and non-elastic training seem to use the same failure-handling strategy: both restart all failed workers and wait until those workers finish loading data. Source code here.

eedalong · 2022-03-29T07:48:20Z

Also, I think _maybe_schedule_new_actors may never take effect, because nothing updates training_state before this function is called. Source code here.

krfricke · 2022-10-07T15:22:39Z

Hi @eedalong, we do test elastic and non-elastic fault tolerance in the unit tests and also in the Ray release tests, so it should generally work. Do you have an example where it does not?

For the code:

In your first linked code the relevant part is this:

                # Do not start new actors before resuming training
                # (this might still restart actors during training)
                start_actor_ranks.clear()

which will not force actor restart before we commence training.
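To illustrate the difference this makes, here is a minimal toy sketch. It is not the library's code: the function `ranks_to_start_before_resume` and its parameters are hypothetical names made up for this example, and the only thing taken from the snippet above is the `start_actor_ranks.clear()` step.

```python
def ranks_to_start_before_resume(failed_ranks, elastic_training):
    """Toy model: which actor ranks must be (re)started before training resumes.

    Hypothetical helper for illustration only, not part of any library.
    """
    # Assume every failed rank is initially a candidate for restart.
    start_actor_ranks = set(failed_ranks)
    if elastic_training:
        # Mirrors the quoted snippet: do not start new actors before
        # resuming training. Restarts may still happen later, during training.
        start_actor_ranks.clear()
    return start_actor_ranks


# Non-elastic: all failed ranks are restarted before training continues.
non_elastic = ranks_to_start_before_resume([1, 3], elastic_training=False)
# Elastic: training resumes right away with the remaining actors.
elastic = ranks_to_start_before_resume([1, 3], elastic_training=True)
```

In this toy model the two modes only differ in what blocks the restart of training, which is the point the quoted comment makes.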

For your second linked code, the relevant part is:

                # This may raise RayXGBoostActorAvailable
                _update_scheduled_actor_states(_training_state)

which updates the training state. This will trigger multiple times as the training futures will usually not be ready. If they are ready, training is over, so we don't care about actor states anymore.
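The polling behaviour described above can be sketched with a small simulation. This is a toy under stated assumptions, not the actual implementation: `ToyTrainingState`, `ToyActorAvailable`, `update_scheduled_actor_states`, and `poll_training` are hypothetical stand-ins, and no Ray calls are involved. Each loop iteration stands for one check in which the training futures are not ready yet.

```python
from dataclasses import dataclass, field


class ToyActorAvailable(Exception):
    """Hypothetical stand-in for the exception mentioned in the snippet."""


@dataclass
class ToyTrainingState:
    # Ranks of actors that were scheduled but are still loading data.
    pending_ranks: set = field(default_factory=set)
    # Ranks of actors that finished loading and could join training.
    ready_ranks: set = field(default_factory=set)


def update_scheduled_actor_states(state, loaded_now):
    """Move ranks that finished loading to ready; raise if any did."""
    finished = state.pending_ranks & set(loaded_now)
    if finished:
        state.pending_ranks -= finished
        state.ready_ranks |= finished
        raise ToyActorAvailable(f"ranks ready: {sorted(finished)}")


def poll_training(state, load_events, polls_until_done):
    """Return the poll index at which an actor became available, else None.

    load_events maps a poll index to the ranks that finish loading then.
    """
    for poll in range(polls_until_done):
        # Training is not finished on this poll, so the state is refreshed.
        try:
            update_scheduled_actor_states(state, load_events.get(poll, set()))
        except ToyActorAvailable:
            return poll
    # Training finished first; actor states no longer matter.
    return None


state = ToyTrainingState(pending_ranks={1, 2})
first_available = poll_training(state, {3: {2}}, polls_until_done=10)
```

The state update runs on every poll, so a scheduling step that depends on the updated state gets many chances to act before training completes.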

Does this make sense?
