Improve c10d::ReduceOp & torch.distributed.distributed_c10d.ReduceOp #87555

Open
crcrpar opened this issue Oct 22, 2022 · 2 comments
Labels: oncall: distributed, triaged
Comments

crcrpar (Collaborator) commented Oct 22, 2022

c10d::ReduceOp is now a struct containing an enum class, RedOpType, added in order to support PREMUL_SUM (premul_sum is only supported by the NCCL backend).
This new reduce op takes either a Python scalar or a Tensor, and that scaling value needs to be stored somewhere, while staying compatible with dispatchable reduce ops (note that TorchScript compiler support is limited) and keeping torch.distributed.ReduceOp instances as enum-like as possible (e.g., supporting __members__ and isinstance).

The op type itself is marked experimental for now, but given these requirements and the changes they caused, we need to improve the API.
The question is how to have users pass a scale value and how to create a ReduceOp (before premul_sum there was no need to create a ReduceOp instance, since the type was closer to a Python enum).


cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @wanchaol @carmocca

@ngimel added the oncall: distributed label Oct 23, 2022
rohan-varma (Member) commented:

@crcrpar For clarity, could you provide a snippet of what a user currently has to do, the suggested improvement, and what the API will look like afterwards?

cc @wanchaol @kwen2501 for comments.

@rohan-varma added the triaged label Oct 26, 2022
crcrpar (Collaborator, Author) commented Oct 26, 2022

What users have to do

Except for premul_sum, users don't need to update their code. The new op requires a scaling factor, so I currently have users call torch.distributed._make_nccl_premul_sum. The other ops don't require changes, e.g.

`allreduce(tensors, c10d.ReduceOp.AVG)`

An example of premul_sum usage:

`allreduce(tensors, c10d._make_nccl_premul_sum(factor))`
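
For concreteness, here is a minimal end-to-end sketch of both call styles. It assumes a NCCL process group launched via torchrun (premul_sum is NCCL-only) and uses the private `_make_nccl_premul_sum` helper mentioned above:

```python
import torch
import torch.distributed as dist

# Assumes launch via torchrun so the rank/world-size env vars are set;
# premul_sum requires the NCCL backend.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

t = torch.ones(4, device="cuda")

# Pre-existing ops work unchanged, passed as enum-like values:
dist.all_reduce(t, op=dist.ReduceOp.AVG)

# premul_sum multiplies each rank's tensor by the factor before summing.
# The factor may be a Python scalar or a Tensor.
factor = 1.0 / dist.get_world_size()
dist.all_reduce(t, op=dist._make_nccl_premul_sum(factor))

dist.destroy_process_group()
```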

What users cannot do

As @wanchaol reported in #87303 (comment), copy is disallowed.
Before the PR above (and after the premul_sum merge), isinstance didn't work.

The cause of both is that c10d::ReduceOp is now a struct with an internal enum type, RedOpType.
The motivation was to let ReduceOp hold some user-supplied data, such as a scaling factor for premul_sum.
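
Concretely, this is the kind of code that broke (illustrative; the exact behavior depends on the PyTorch version):

```python
import copy
import torch.distributed as dist

op = dist.ReduceOp.SUM

# Both worked while ReduceOp was enum-like, but broke once it became a
# pybind11-bound struct wrapping RedOpType:
isinstance(op, dist.ReduceOp)  # returned False before the __instancecheck__ fix
copy.copy(op)                  # raised before copy/deepcopy support was added
```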

Suggested improvement

Make ReduceOp more enum-like by writing a custom __instancecheck__ (sketched below).
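
A pure-Python illustration of the idea (the actual fix customizes the metaclass of the pybind11-bound class; the names here are made up for the sketch):

```python
import enum

class _RedOpType(enum.Enum):
    SUM = 0
    PREMUL_SUM = 1

class _ReduceOpMeta(type):
    # Also accept bare RedOpType members as instances of ReduceOp, so
    # isinstance() behaves the way it did when ReduceOp was a plain enum.
    def __instancecheck__(cls, obj):
        return super().__instancecheck__(obj) or isinstance(obj, _RedOpType)

class ReduceOp(metaclass=_ReduceOpMeta):
    RedOpType = _RedOpType
    SUM = _RedOpType.SUM
    PREMUL_SUM = _RedOpType.PREMUL_SUM

assert isinstance(ReduceOp.SUM, ReduceOp)  # True thanks to the metaclass
```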

API change

How we create a ReduceOp instance for premul_sum could change in the future.
@wanchaol suggested something like ReduceOp.PREMUL_SUM(scale) instead of forcing users to call dist._make_nccl_premul_sum.
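
Under that proposal (hypothetical, not implemented as of this discussion), usage would look like:

```python
# Suggested future spelling vs. today's private helper:
op = dist.ReduceOp.PREMUL_SUM(scale)     # proposed
op = dist._make_nccl_premul_sum(scale)   # current
dist.all_reduce(tensor, op=op)
```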

pytorchmergebot pushed a commit that referenced this issue Nov 15, 2022
Summary:
- Customize the metaclass of `torch.distributed.distributed_c10d.ReduceOp` for the sake of custom `__instancecheck__`
- Add `copy.copy`, `copy.deepcopy`, and `pickle` support with tests

Rel:
- #81272
- #84243
- #87191
- #87303
- #87555

Ref:
- pybind/pybind11#2696

Pull Request resolved: #88275
Approved by: https://github.com/wanchaol
kulinseth pushed a commit to kulinseth/pytorch that referenced this issue Dec 10, 2022