Improve c10d::ReduceOp & torch.distributed.distributed_c10d.ReduceOp #87555

Open
crcrpar opened this issue Oct 22, 2022 · 2 comments
Labels: oncall: distributed, triaged
Comments

crcrpar (Collaborator) commented Oct 22, 2022

c10d::ReduceOp is now a struct containing an enum class, RedOpType, added in order to support PREMUL_SUM (premul_sum is only supported by the NCCL backend).
This new reduce op takes either a Python scalar or a Tensor, and that scaling value needs to be stored somewhere, while staying compatible with dispatchable reduce ops (note that TorchScript compiler support is limited) and keeping torch.distributed.ReduceOp instances as enum-like as possible (e.g., supporting __members__ and isinstance).

The op type itself is marked experimental for now, but given these requirements and the changes they caused, we need to improve the API.
The question is how to have users pass a scale value and how to create a ReduceOp (before premul_sum there was no need to create a ReduceOp instance, since the type was closer to a Python enum).


cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @wanchaol @carmocca

@ngimel added the oncall: distributed label Oct 23, 2022
rohan-varma (Member) commented:

@crcrpar For clarity, could you provide a snippet of what a user currently has to do, the suggested improvement, and what the API will look like afterwards?

cc @wanchaol @kwen2501 for comments.

@rohan-varma added the triaged label Oct 26, 2022
crcrpar (Collaborator, Author) commented Oct 26, 2022

What users have to do

Except for premul_sum, users don't need to update their code. The new op requires a scaling factor, so I currently have users call torch.distributed._make_nccl_premul_sum. The other ops don't require changes, e.g.

`allreduce(tensors, c10d.ReduceOp.AVG)`

An example of premul_sum usage:

`allreduce(tensors, c10d._make_nccl_premul_sum(factor))`
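
For concreteness, here is a minimal end-to-end sketch of both call styles. It assumes a NCCL process group launched via torchrun (premul_sum is NCCL-only) and uses the private `_make_nccl_premul_sum` helper mentioned above:

```python
import torch
import torch.distributed as dist

# Assumes launch via torchrun so the rank/world-size env vars are set;
# premul_sum requires the NCCL backend.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

t = torch.ones(4, device="cuda")

# Pre-existing ops work unchanged, passed as enum-like values:
dist.all_reduce(t, op=dist.ReduceOp.AVG)

# premul_sum multiplies each rank's tensor by the factor before summing.
# The factor may be a Python scalar or a Tensor.
factor = 1.0 / dist.get_world_size()
dist.all_reduce(t, op=dist._make_nccl_premul_sum(factor))

dist.destroy_process_group()
```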

What users cannot do

As @wanchaol reported in #87303 (comment), copy is disallowed.
Before the PR above (and after the premul_sum merge), isinstance didn't work.

The cause of both is that c10d::ReduceOp is now a struct with an internal enum type, RedOpType.
The motivation was to let ReduceOp hold some user-supplied data, such as a scaling factor for premul_sum.
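
Concretely, this is the kind of code that broke (illustrative; the exact behavior depends on the PyTorch version):

```python
import copy
import torch.distributed as dist

op = dist.ReduceOp.SUM

# Both worked while ReduceOp was enum-like, but broke once it became a
# pybind11-bound struct wrapping RedOpType:
isinstance(op, dist.ReduceOp)  # returned False before the __instancecheck__ fix
copy.copy(op)                  # raised before copy/deepcopy support was added
```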

Suggested improvement

Make ReduceOp more enum-like by writing a custom __instancecheck__ (sketched below).
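
A pure-Python illustration of the idea (the actual fix customizes the metaclass of the pybind11-bound class; the names here are made up for the sketch):

```python
import enum

class _RedOpType(enum.Enum):
    SUM = 0
    PREMUL_SUM = 1

class _ReduceOpMeta(type):
    # Also accept bare RedOpType members as instances of ReduceOp, so
    # isinstance() behaves the way it did when ReduceOp was a plain enum.
    def __instancecheck__(cls, obj):
        return super().__instancecheck__(obj) or isinstance(obj, _RedOpType)

class ReduceOp(metaclass=_ReduceOpMeta):
    RedOpType = _RedOpType
    SUM = _RedOpType.SUM
    PREMUL_SUM = _RedOpType.PREMUL_SUM

assert isinstance(ReduceOp.SUM, ReduceOp)  # True thanks to the metaclass
```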

API change

How we create a ReduceOp instance for premul_sum could change in the future.
@wanchaol suggested something like ReduceOp.PREMUL_SUM(scale) instead of forcing users to call dist._make_nccl_premul_sum.
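
Under that proposal (hypothetical, not implemented as of this discussion), usage would look like:

```python
# Suggested future spelling vs. today's private helper:
op = dist.ReduceOp.PREMUL_SUM(scale)     # proposed
op = dist._make_nccl_premul_sum(scale)   # current
dist.all_reduce(tensor, op=op)
```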

pytorchmergebot pushed a commit that referenced this issue Nov 15, 2022
Summary:
- Customize the metaclass of `torch.distributed.distributed_c10d.ReduceOp` for the sake of custom `__instancecheck__`
- Add `copy.copy`, `copy.deepcopy`, and `pickle` support with tests

Rel:
- #81272
- #84243
- #87191
- #87303
- #87555

Ref:
- pybind/pybind11#2696

Pull Request resolved: #88275
Approved by: https://github.com/wanchaol
kulinseth pushed a commit to kulinseth/pytorch that referenced this issue Dec 10, 2022