Elastic training and non-elastic training seem to have the same failure-handling strategy: both restart all failed workers and wait until these workers finish loading data. Source code here.
Hi @eedalong, we do test elastic and non-elastic fault tolerance in the unit tests and also in ray release tests, so it should generally work. Do you have an example where it does not?
For the code:
In your first linked code the relevant part is this:
# Do not start new actors before resuming training
# (this might still restart actors during training)
start_actor_ranks.clear()
which will not force actor restart before we commence training.
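To illustrate the difference, here is a minimal sketch, not the library's actual code. The function name `ranks_to_start_before_resume` and its arguments are hypothetical; only the `start_actor_ranks.clear()` call is taken from the snippet above. It assumes that failed ranks are normally queued for restart before training resumes, and that elastic mode empties that queue.

```python
def ranks_to_start_before_resume(failed_ranks, elastic_training):
    """Return the actor ranks that must be started before training resumes.

    Hypothetical helper for illustration only.
    """
    # Assumption: every failed rank is initially scheduled for a restart.
    start_actor_ranks = set(failed_ranks)
    if elastic_training:
        # Do not start new actors before resuming training
        # (this might still restart actors during training)
        start_actor_ranks.clear()
    return start_actor_ranks


# Non-elastic: both failed ranks block the restart of training.
print(ranks_to_start_before_resume({1, 3}, elastic_training=False))
# Elastic: nothing blocks; training resumes with the remaining actors.
print(ranks_to_start_before_resume({1, 3}, elastic_training=True))
```

Under these assumptions, the non-elastic path waits for ranks 1 and 3, while the elastic path waits for none.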
For your second linked code, the relevant part is:
# This may raise RayXGBoostActorAvailable
_update_scheduled_actor_states(_training_state)
which updates the training state. This runs repeatedly during training, because the training futures are usually not ready yet when they are checked. Once they are ready, training is over, so we no longer care about actor states.
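The polling pattern described here can be sketched roughly as follows. This is a simplified stand-in that uses Python's `concurrent.futures` rather than Ray, and `update_actor_states`, `run_training`, and `fake_training` are hypothetical names. It assumes the training loop checks the futures with a timeout and updates actor states whenever they are not yet ready.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def update_actor_states(state):
    # Hypothetical stand-in for _update_scheduled_actor_states:
    # here it only counts how often it was called.
    state["checks"] += 1


def run_training(train_fn, poll_interval=0.01):
    state = {"checks": 0}
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(train_fn)
        while True:
            # Check whether training has finished, waiting at most poll_interval.
            done, _ = wait([future], timeout=poll_interval)
            if done:
                # Training is over, so actor states no longer matter.
                break
            # Training still running: update actor states and poll again.
            update_actor_states(state)
        return future.result(), state["checks"]


def fake_training():
    time.sleep(0.1)  # pretend to train for a while
    return "model"


result, checks = run_training(fake_training)
print(result, "state updates while training:", checks)
```

Because the poll interval is much shorter than the training time, the state update runs several times before the future completes.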