Question regarding source code #207

Open
eedalong opened this issue Mar 29, 2022 · 2 comments

Comments

@eedalong

Elastic training and non-elastic training seem to have the same failure-handling strategy: both restart all failed workers and wait until these workers finish loading data. Source code here

@eedalong
Author

eedalong commented Mar 29, 2022

Also, I think _maybe_schedule_new_actors may never take effect here, because nothing updates training_state before this function is called. Source code here

@krfricke
Collaborator

krfricke commented Oct 7, 2022

Hi @eedalong, we do test elastic and non-elastic fault tolerance in the unit tests and also in the Ray release tests, so it should generally work. Do you have an example where it does not?

For the code:

In your first linked code the relevant part is this:

                # Do not start new actors before resuming training
                # (this might still restart actors during training)
                start_actor_ranks.clear()

which means that, in elastic mode, we do not force the failed actors to restart before we resume training.
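To make the difference concrete, here is a minimal sketch, not the actual xgboost_ray code. The function name `ranks_to_restart_before_resume` and the `elastic` flag are hypothetical and used only for illustration; only `start_actor_ranks` appears in the snippet above.

```python
def ranks_to_restart_before_resume(failed_ranks, elastic):
    """Return the actor ranks that must be restarted before training resumes.

    Hypothetical helper: a simplified model of the behaviour described above.
    """
    # Start from the ranks of all actors that failed.
    start_actor_ranks = set(failed_ranks)
    if elastic:
        # Do not start new actors before resuming training; they may
        # still be rescheduled later, while training is running.
        start_actor_ranks.clear()
    return start_actor_ranks


non_elastic = ranks_to_restart_before_resume([1, 3], elastic=False)
elastic = ranks_to_restart_before_resume([1, 3], elastic=True)
```

In this toy model, the non-elastic call still blocks on ranks 1 and 3, while the elastic call returns an empty set, so training would resume with the remaining actors.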

For your second linked code, the relevant part is:

                # This may raise RayXGBoostActorAvailable
                _update_scheduled_actor_states(_training_state)

which updates the training state. This will trigger multiple times as the training futures will usually not be ready. If they are ready, training is over, so we don't care about actor states anymore.
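The polling pattern can be sketched roughly as follows. This is a self-contained toy that does not use Ray and is not the library's implementation: `ActorAvailable`, `TrainingState`, `update_scheduled_actor_states`, and `run_training` are hypothetical stand-ins, and counting polls replaces checking real training futures.

```python
from dataclasses import dataclass, field


class ActorAvailable(Exception):
    """Hypothetical stand-in for the exception mentioned in the snippet above."""


@dataclass
class TrainingState:
    # rank -> number of polls until the rescheduled actor has loaded its data
    pending: dict = field(default_factory=dict)
    polls: int = 0


def update_scheduled_actor_states(state):
    """Advance pending actors by one poll; raise once one of them is ready."""
    state.polls += 1
    for rank in list(state.pending):
        state.pending[rank] -= 1
        if state.pending[rank] <= 0:
            del state.pending[rank]
            raise ActorAvailable(rank)


def run_training(state, polls_until_done):
    """Poll until training is done or a rescheduled actor becomes ready."""
    remaining = polls_until_done
    while remaining > 0:  # the training futures are not ready yet
        update_scheduled_actor_states(state)  # may raise ActorAvailable
        remaining -= 1
    return "finished"


state = TrainingState(pending={2: 3})
try:
    outcome = run_training(state, polls_until_done=10)
except ActorAvailable as exc:
    # The caller would restart training here, now including the new actor.
    outcome = f"restart with rank {exc.args[0]}"
```

Here the state is updated on every poll, and the pending actor with rank 2 becomes ready on the third poll, long before the simulated training would finish, so the exception interrupts the loop.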

Does this make sense?
