RayExecutor: Dynamic executor for elastic and static jobs #3230
Conversation
**Unit Test Results** — 765 files (+369), 765 suites (+369), 6h 6m 54s ⏱️ (+2h 4m 35s). Results for commit e031074, compared against base commit 06aa579. This pull request removes 1 test and adds 12. Note that renamed tests count towards both.

**Unit Test Results (with flaky tests)** — 995 files (+435), 995 suites (+435), 7h 0m 29s ⏱️ (+2h 22m 32s). Results for commit e031074, compared against base commit 06aa579. This pull request removes 1 test and adds 12. Note that renamed tests count towards both.
@ashahab Can you please explicitly state what the delta is between the standard RayExecutor and the new RayExecutorV2 in terms of API?
Is it possible to maintain compatibility across the two versions (while being OK with breaking ElasticRayExecutor)?
@richardliaw I see two options here: Option 2: Get rid of the Which option is preferred? @tgaddair, what's your thought on keeping the current API vs. making the new API identical for elastic and non-elastic?
I think (1) is the safer move given that multiple organizations depend on RayExecutor, and introducing instability into the supply chain is highly undesirable. I'm actually not that worried about the behavior of "start" being different from "run". That being said, I would be open to user feedback in the future if the APIs prove unintuitive. I would recommend documenting the difference clearly, though.
Hey @ashahab, this looks awesome. In general, I think this does a great job of cleanly unifying the elastic and non-elastic use cases. A few general observations based on the above discussion:
There is a useful benefit to https://github.com/ludwig-ai/ludwig/blob/master/ludwig/backend/ray.py#L304

So to sum up, I think preserving RayExecutor's existing API is important, and it's okay if the elastic mode behaves slightly differently for now, as long as it's documented and there is a plan for closer alignment in the future (where it makes sense). @ashahab @richardliaw, does this make sense to you, or were there other considerations I missed?
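The option the reviewers converge on (keep the existing static `RayExecutor` signature and fold the elastic parameters into the same constructor) can be sketched in plain Python. This is a hypothetical illustration, not Horovod's actual implementation: the class name `UnifiedExecutorSketch` is invented for this sketch, and only `num_workers`, `min_workers`, and `max_workers` are parameter names taken from the PR discussion.

```python
class UnifiedExecutorSketch:
    """Toy stand-in for a unified RayExecutor API (illustrative only).

    One constructor serves both modes: static jobs pass num_workers,
    elastic jobs pass min_workers/max_workers. Ambiguous combinations
    are rejected up front so existing static callers are unaffected.
    """

    def __init__(self, num_workers=None, min_workers=None, max_workers=None):
        if num_workers is not None and (
            min_workers is not None or max_workers is not None
        ):
            raise ValueError(
                "num_workers (static) and min_workers/max_workers (elastic) "
                "are mutually exclusive"
            )
        # Elastic mode is inferred from which parameters were supplied.
        self.elastic = min_workers is not None or max_workers is not None
        self.num_workers = num_workers
        self.min_workers = min_workers
        self.max_workers = max_workers
```

Under this sketch, `UnifiedExecutorSketch(num_workers=4)` behaves as a static job while `UnifiedExecutorSketch(min_workers=1, max_workers=3)` selects elastic mode, which is roughly the shape the merged API ends up taking.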
This diff looks good to me. Can you try merging master again?
This resolves horovod#3190 by adding elastic params to the RayExecutor API for horovod. This API now supports both static (non-elastic) and elastic horovod jobs.

Example of a static job (identical to the current RayExecutor):

```python
import ray
from horovod.ray import RayExecutor
import horovod.torch as hvd

ray.init()
executor = RayExecutor(settings, num_workers=num_workers, use_gpu=True)
executor.start()

def simple_fn():
    hvd.init()
    print("hvd rank", hvd.rank())
    return hvd.rank()

result = executor.run(simple_fn)
assert len(set(result)) == hosts * num_slots
executor.shutdown()
```

Example of an elastic job:

```python
from horovod.ray import RayExecutor
import horovod.torch as hvd

def training_fn():
    hvd.init()
    model = Model()
    torch.cuda.set_device(hvd.local_rank())

    @hvd.elastic.run
    def train(state):
        for state.epoch in range(state.epoch, epochs):
            ...
            state.commit()

    state = hvd.elastic.TorchState(model, optimizer, batch=0, epoch=0)
    state.register_reset_callbacks([on_state_reset])
    train(state)
    return

executor = RayExecutor(settings, min_workers=1, use_gpu=True, cpus_per_worker=2)
executor.start()
executor.run(training_fn)
```

Signed-off-by: Abin Shahab <ashahab@linkedin.com>
d199fc1
to
e031074
Compare