Improve c10d::ReduceOp & torch.distributed.distributed_c10d.ReduceOp (#87555)
**What users have to do**

Except for premul_sum, users don't need to update their code. The new op requires a scaling factor, so I currently have users call the helper shown in pytorch/test/distributed/test_c10d_nccl.py, line 340 (at 5ee5f5a). An example of premul_sum usage is in pytorch/test/distributed/test_c10d_nccl.py, line 355 (at 5ee5f5a).
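For concreteness, a minimal sketch of that usage pattern, modeled on the linked test. It assumes an initialized NCCL process group and the private factory `_make_nccl_premul_sum` that the test calls; since the op is experimental, this spelling may change:

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl", ...) has already run on each rank.
rank = dist.get_rank()
tensor = torch.full((2, 2), float(rank + 1), device=f"cuda:{rank}")

# premul_sum carries a scaling factor, so it cannot be spelled as a bare
# enum member; an instance is created through a factory call instead.
op = dist._make_nccl_premul_sum(0.5)  # a Tensor factor is also accepted

# Each rank's tensor is multiplied by 0.5 before the sum across ranks.
dist.all_reduce(tensor, op=op)
```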
**What users cannot do**

As @wanchaol reported in #87303 (comment), `isinstance` checks against `torch.distributed.ReduceOp` and `copy`/`deepcopy` of reduce ops do not work. The cause of these two is that `ReduceOp` is no longer a plain Python enum but a pybind11-bound class; see the sketch just below this section.

**Suggested Improvement**

Make an API change: how we create a ReduceOp instance of premul_sum could change in the future.
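A short sketch of the two breakages as I read the linked reports (exact behavior and error messages vary by version):

```python
import copy
from torch.distributed import ReduceOp

# Both of these are natural for an enum-like type, yet both were broken,
# since ReduceOp is a pybind11-bound class rather than a Python enum:
print(isinstance(ReduceOp.SUM, ReduceOp))  # the isinstance report in #87303
print(copy.deepcopy(ReduceOp.SUM))         # the copy/deepcopy report in #84243
```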
Summary:
- Customize the metaclass of `torch.distributed.distributed_c10d.ReduceOp` for the sake of custom `__instancecheck__`
- Add `copy.copy`, `copy.deepcopy`, and `pickle` support with tests

Rel: #81272, #84243, #87191, #87303, #87555
Ref: pybind/pybind11#2696

Pull Request resolved: #88275
Approved by: https://github.com/wanchaol
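To illustrate the technique the summary above names, here is a minimal, hypothetical sketch (not PyTorch's actual code) of a metaclass whose custom `__instancecheck__` lets raw enum values pass `isinstance` checks against the wrapper class; all names are illustrative:

```python
import enum

class _ReduceOpMeta(type):
    # Accept both wrapper instances and raw RedOpType enum values as
    # instances of the wrapper class.
    def __instancecheck__(cls, instance):
        return type(instance) is cls or type(instance) is cls.RedOpType

class ReduceOpSketch(metaclass=_ReduceOpMeta):
    class RedOpType(enum.Enum):
        SUM = 0
        PREMUL_SUM = 1

# A bare enum member now passes the isinstance check against the wrapper.
assert isinstance(ReduceOpSketch.RedOpType.SUM, ReduceOpSketch)
```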
`c10d::ReduceOp` is now a struct which contains an enum class, `RedOpType`, in order to support `PREMUL_SUM` (premul_sum is only supported by the NCCL backend). This new reduce op type takes either a Python scalar or a Tensor, and that scaling value needs to be stored somewhere while keeping compatibility with dispatchable reduce ops (note that the TorchScript compiler's support is limited) and keeping `torch.distributed.ReduceOp` instances as enum-like as possible (to name a few requirements: allowing for `__members__` and `isinstance`). The op type itself is marked experimental for now, but with these requirements and the changes they caused, we have to improve the API.

The question is how to have users pass a scale value and how to create a ReduceOp (before premul_sum there was no need to create a ReduceOp instance, as it was closer to a Python enum).
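To make that constraint concrete, a hypothetical Python rendering of the shape described above (the real `c10d::ReduceOp` is a C++ struct bound via pybind11; the names and fields here are illustrative, not the actual layout):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union

import torch

class RedOpType(Enum):
    SUM = 0
    PRODUCT = 1
    MIN = 2
    MAX = 3
    PREMUL_SUM = 4

@dataclass
class ReduceOpShape:
    op: RedOpType = RedOpType.SUM
    # Only PREMUL_SUM uses this: the scalar or Tensor scaling value that
    # must live somewhere alongside the op kind.
    factor: Optional[Union[float, torch.Tensor]] = None
```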
Related:
- isinstance with torch.distributed.ReduceOp #87303
- reduce ops #84243

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @wanchaol @carmocca