Add Bagua Strategy #11146
for more information, see https://pre-commit.ci
start_training was removed on master
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Awesome!
…g into bagua-plugin
LGTM !
🎉 🎉 🎉 🎉 🎉
Congrats! Awesome work!
@@ -807,7 +816,7 @@ def select_cluster_environment(self) -> ClusterEnvironment:
             rank_zero_info("Multiprocessing is handled by SLURM.")
             return SLURMEnvironment()
 
-        for env_type in (TorchElasticEnvironment, KubeflowEnvironment, LSFEnvironment):
+        for env_type in (BaguaEnvironment, TorchElasticEnvironment, KubeflowEnvironment, LSFEnvironment):
Why does Bagua need to be the first environment to check?
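To make the ordering question concrete, here is a minimal sketch of first-match environment selection. It is not the Lightning implementation: the class names, the `detect(environ)` signature, and the `HYPOTHETICAL_*` variable names are all assumptions made for illustration. Whether Bagua's real launcher variables overlap with another environment's is not established in this thread.

```python
from typing import Mapping, Optional, Sequence, Tuple, Type


class SketchEnvironment:
    """Hypothetical base class; detection succeeds if all required vars are set."""

    required_vars: Tuple[str, ...] = ()

    @classmethod
    def detect(cls, environ: Mapping[str, str]) -> bool:
        return all(name in environ for name in cls.required_vars)


class SketchElasticEnvironment(SketchEnvironment):
    # Assumed variable name, for illustration only.
    required_vars = ("HYPOTHETICAL_RANK",)


class SketchBaguaEnvironment(SketchEnvironment):
    # Assumed to need a superset of the variables above, for illustration only.
    required_vars = ("HYPOTHETICAL_RANK", "HYPOTHETICAL_BAGUA_PORT")


def select_environment(
    candidates: Sequence[Type[SketchEnvironment]],
    environ: Mapping[str, str],
) -> Optional[Type[SketchEnvironment]]:
    """Return the first candidate whose detect() succeeds, or None."""
    for env_type in candidates:
        if env_type.detect(environ):
            return env_type
    return None


bagua_launch = {"HYPOTHETICAL_RANK": "0", "HYPOTHETICAL_BAGUA_PORT": "29500"}

# Both classes detect this launch, so the tuple order decides the result.
bagua_first = select_environment(
    (SketchBaguaEnvironment, SketchElasticEnvironment), bagua_launch
)
elastic_first = select_environment(
    (SketchElasticEnvironment, SketchBaguaEnvironment), bagua_launch
)
```

Under these assumptions, the order only matters when more than one `detect()` returns true for the same launch. If the real detection checks are mutually exclusive, the position of `BaguaEnvironment` in the tuple would make no difference.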
    )

    @classmethod
    def register_plugins(cls, plugin_registry: Dict) -> None:
This method has to be named register_strategies; otherwise it won't be called:
https://github.com/PyTorchLightning/pytorch-lightning/blob/1203094a201bd38f0b8b77d93bc39fc95f06d8ae/pytorch_lightning/strategies/strategy_registry.py#L137
No test covers strategy="bagua", so this issue wasn't caught.
I will fix it in #11448, or feel free to open a separate PR to fix this.
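A minimal sketch of why a misnamed hook goes unnoticed, assuming the registry looks the classmethod up by name and skips classes that lack it. This is a simplified stand-in, not the code at the link above: `collect_strategies` and the `Sketch*` classes are hypothetical.

```python
from typing import Dict, Iterable, Type

# The hook name the registry is assumed to look for.
HOOK_NAME = "register_strategies"


class SketchStrategy:
    """Hypothetical base class for this illustration."""


class SketchGoodStrategy(SketchStrategy):
    @classmethod
    def register_strategies(cls, strategy_registry: Dict) -> None:
        strategy_registry["good"] = cls


class SketchMisnamedStrategy(SketchStrategy):
    # Old hook name: nothing looks this up, so it never runs.
    @classmethod
    def register_plugins(cls, plugin_registry: Dict) -> None:
        plugin_registry["misnamed"] = cls


def collect_strategies(
    strategy_classes: Iterable[Type[SketchStrategy]],
) -> Dict[str, Type[SketchStrategy]]:
    """Call the hook on every class that defines it; skip the rest silently."""
    registry: Dict[str, Type[SketchStrategy]] = {}
    for strategy_cls in strategy_classes:
        hook = getattr(strategy_cls, HOOK_NAME, None)
        if callable(hook):
            hook(registry)
    return registry


registry = collect_strategies([SketchGoodStrategy, SketchMisnamedStrategy])
```

In this sketch no error is raised for the misnamed class; its key is simply absent from the registry. A test that looks up the expected key, such as the string passed as `strategy=...`, would surface the problem.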
What does this PR do?
Fixes #10455
Fixes BaguaSys/bagua#304
Suggested follow ups:
bagua.distributed.run to bagua.distributed.launch to match torch
Does your PR introduce any breaking changes? If yes, please list them.
None
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃