Refactor ElasticRayExecutor as part of RayExecutor #3190
ashahab added a commit to ashahab/horovod that referenced this issue on Oct 18, 2021:
This resolves horovod#3190 with a new RayExecutor API for horovod: `RayExecutorV2`. This API supports both static (non-elastic) and elastic horovod jobs.

Example of a static job:

```python
import ray
import horovod.torch as hvd
from horovod.ray import RayExecutorV2, NoneElasticParams

ray.init()
executor = RayExecutorV2(settings, NoneElasticParams(
    np=num_workers,
    use_gpu=True
))
executor.start()

def simple_fn():
    hvd.init()
    print("hvd rank", hvd.rank())
    return hvd.rank()

result = executor.run(simple_fn)
assert len(set(result)) == hosts * num_slots
executor.shutdown()
```

Example of an elastic job:

```python
import torch
import horovod.torch as hvd

def training_fn():
    hvd.init()
    model = Model()
    torch.cuda.set_device(hvd.local_rank())

    @hvd.elastic.run
    def train(state):
        for state.epoch in range(state.epoch, epochs):
            ...
            state.commit()

    state = hvd.elastic.TorchState(model, optimizer, batch=0, epoch=0)
    state.register_reset_callbacks([on_state_reset])
    train(state)

executor = RayExecutorV2(settings, ElasticParams(use_gpu=True, cpus_per_worker=2))
executor.start()
executor.run(training_fn)
```
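The `NoneElasticParams` and `ElasticParams` objects are used but never defined in the message above. As a minimal sketch of what such parameter containers might look like — the fields beyond those shown in the examples (`min_np`, `max_np`) are assumptions, not the PR's actual definitions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NoneElasticParams:
    """Parameters for a static (fixed-size) job."""
    np: int                       # total worker count, fixed for the job
    use_gpu: bool = False
    cpus_per_worker: int = 1

@dataclass
class ElasticParams:
    """Parameters for an elastic job whose worker count may change at runtime."""
    min_np: int = 1               # assumed lower bound on workers
    max_np: Optional[int] = None  # assumed upper bound; None = unbounded
    use_gpu: bool = False
    cpus_per_worker: int = 1
```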
ashahab added a commit that referenced this issue on Nov 17, 2021:
This resolves #3190 by adding elastic params to the RayExecutor API for horovod. The API now supports both static (non-elastic) and elastic horovod jobs.

Example of a static job (identical to the current RayExecutor):

```python
import ray
import horovod.torch as hvd
from horovod.ray import RayExecutor

ray.init()
executor = RayExecutor(settings, num_workers=num_workers, use_gpu=True)
executor.start()

def simple_fn():
    hvd.init()
    print("hvd rank", hvd.rank())
    return hvd.rank()

result = executor.run(simple_fn)
assert len(set(result)) == hosts * num_slots
executor.shutdown()
```

Example of an elastic job:

```python
import torch
import horovod.torch as hvd
from horovod.ray import RayExecutor

def training_fn():
    hvd.init()
    model = Model()
    torch.cuda.set_device(hvd.local_rank())

    @hvd.elastic.run
    def train(state):
        for state.epoch in range(state.epoch, epochs):
            ...
            state.commit()

    state = hvd.elastic.TorchState(model, optimizer, batch=0, epoch=0)
    state.register_reset_callbacks([on_state_reset])
    train(state)

executor = RayExecutor(settings, min_workers=1, use_gpu=True, cpus_per_worker=2)
executor.start()
executor.run(training_fn)
```

Signed-off-by: Abin Shahab <ashahab@linkedin.com>
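The elastic example registers an `on_state_reset` callback that the message never defines. In Horovod elastic training this callback typically rescales hyperparameters after the worker set changes; a minimal sketch, with `make_reset_callback` being a hypothetical helper rather than a Horovod API:

```python
import horovod.torch as hvd

def make_reset_callback(optimizer, base_lr):
    def on_state_reset():
        # Invoked after workers join or leave: the effective global batch
        # size changed, so rescale the learning rate to the new world size.
        for param_group in optimizer.param_groups:
            param_group["lr"] = base_lr * hvd.size()
    return on_state_reset
```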
Currently, `ElasticRayExecutor` is implemented as a separate class from the normal `RayExecutor`, but this doesn't need to be the case. If we instead add the `ElasticRayExecutor` params to the `RayExecutor` constructor, we can dynamically create an autoscaling worker topology when the user provides `min_workers`, `max_workers`, or any other elastic-only params.
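As a rough illustration of that proposal, the unified constructor could pick the topology from the params it receives. Everything below — parameter defaults and the private helper names — is a hypothetical sketch of the dispatch logic, not Horovod's actual implementation:

```python
class RayExecutor:
    def __init__(self, settings, num_workers=None, min_workers=None,
                 max_workers=None, use_gpu=False, cpus_per_worker=1):
        self.settings = settings
        self.use_gpu = use_gpu
        self.cpus_per_worker = cpus_per_worker
        # Any elastic-only param flips the executor into elastic mode.
        self.elastic = min_workers is not None or max_workers is not None
        if self.elastic:
            self.min_workers = min_workers if min_workers is not None else 1
            self.max_workers = max_workers  # None = no upper bound
        elif num_workers is None:
            raise ValueError("num_workers is required for a static job")
        else:
            self.num_workers = num_workers

    def start(self):
        # Choose between an autoscaling and a fixed worker topology.
        if self.elastic:
            self._start_autoscaling_workers()  # hypothetical helper
        else:
            self._start_fixed_workers()        # hypothetical helper
```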