Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reenable isinstance with torch.distributed.ReduceOp #87303

Closed
wants to merge 7 commits into from

Conversation

crcrpar
Copy link
Collaborator

@crcrpar crcrpar commented Oct 19, 2022

tentatively marking as draft as I haven't gotten a comprehensive list of side effects...

Ref: https://stackoverflow.com/questions/40244413/python-static-class-attribute-of-the-class-itself
Rel: #87191

cc @kwen2501

@pytorch-bot
Copy link

pytorch-bot bot commented Oct 19, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87303

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 Failures

As of commit 16bdeae:

The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
@malfet
Copy link
Contributor

malfet commented Oct 20, 2022

.pyi changes are pure syntactic sugar and are not used for anything in particular... Let me think of a way to fix type ownership

Copy link
Contributor

@wanchaol wanchaol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for caring about BC! Left some comment, I'm wondering if we should just do sth like ReduceOp.PREMUL_SUM(premul_value)?

@@ -46,7 +47,9 @@ struct TORCH_API ReduceOp : torch::CustomClassHolder {

ReduceOp(RedOpType op) : op_(op) {
TORCH_INTERNAL_ASSERT(
op_ != PREMUL_SUM, "PREMUL_SUM requires a scale factor tensor or scalar argument");
op_ != PREMUL_SUM,
"Use `torch.distributed._make_nccl_premul_sum` to create an instance of ReduceOp with PREMUL_SUM"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the reason user need to use this API to construct a premul_sum? could user just create this reduce op type like other reduce ops? The fact that there are two APIs constructing different reduce ops are confusing

@@ -566,8 +571,7 @@ They are used in specifying strategies for reduction collectives, e.g.,
.value("BAND", ::c10d::ReduceOp::RedOpType::BAND)
.value("BOR", ::c10d::ReduceOp::RedOpType::BOR)
.value("BXOR", ::c10d::ReduceOp::RedOpType::BXOR)
.value("PREMUL_SUM", ::c10d::ReduceOp::RedOpType::PREMUL_SUM)
.export_values();
.value("PREMUL_SUM", ::c10d::ReduceOp::RedOpType::PREMUL_SUM);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just make pybind API on ReduceOp class directly, so that user can still do things like ReduceOp.SUM (i.e. as a class attribute)?

I feel maybe we can do sth like this ReduceOp.PREMUL_SUM(premul_value) looks cleaner than _make_nccl_premul_sum

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Copy link
Contributor

@wanchaol wanchaol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is the last day of cherry picking, let's get this in once the CI passed. Could you please create an issue to follow up on a more user friendly API once you are confident to expose premul_sum as public API? Thanks!

):
self.assertTrue(isinstance(reduce_op, c10d.ReduceOp))
for scale in ([torch.tensor(1.0)], 2.0):
self.assertTrue(dist._make_nccl_premul_sum(scale), c10d.ReduceOp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't we also need isinstance here?

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 21, 2022
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
I need to get familiar with pybind11 more

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
@seemethere
Copy link
Member

@pytorchbot merge -f "needed to fix regression in 1.13, failures unrelated"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@github-actions
Copy link

Hey @crcrpar.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

atalman pushed a commit to atalman/pytorch that referenced this pull request Oct 21, 2022
tentatively marking as draft as I haven't gotten a comprehensive list of side effects...

Ref: https://stackoverflow.com/questions/40244413/python-static-class-attribute-of-the-class-itself
Rel: pytorch#87191

cc @kwen2501
Pull Request resolved: pytorch#87303
Approved by: https://github.com/wanchaol
malfet pushed a commit that referenced this pull request Oct 21, 2022
)

tentatively marking as draft as I haven't gotten a comprehensive list of side effects...

Ref: https://stackoverflow.com/questions/40244413/python-static-class-attribute-of-the-class-itself
Rel: #87191

cc @kwen2501
Pull Request resolved: #87303
Approved by: https://github.com/wanchaol

Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
# which broke an implicit contract of ReduceOp being enum-like with which users apply isinstance to
# `op`, for example, `isinstance(ReduceOp.SUM, ReduceOp)`: https://github.com/pytorch/pytorch/issues/87191
DENY_LIST = ("PREMUL_SUM", )
for _red_op_name, _red_op_value in ReduceOp.RedOpType.__members__.items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crcrpar I think this PR break the serialization of c10d.ReduceOp, with the newest nightly this now crashes:

import torch
import copy

a = torch.distributed.distributed_c10d.ReduceOp.SUM
copy.deepcopy(a)

I checked the nightly a few days ago and it works all fine, and the issue seems coming from the changes in this file, if I comment this changes, then it works again, could you help take a look on this and fix it? Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we should try to override __isinstance__ instead of overriding the attributes on class?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can confirm that I'm also running into the issue @wanchaol mentioned

wanchaol added a commit to pytorch/PiPPy that referenced this pull request Oct 24, 2022
A recent change of c10d.ReduceOp crashes deepcopy of it, pin pt version
temporarily to fix CI.
see pytorch/pytorch#87303 (comment)
wanchaol added a commit to pytorch/PiPPy that referenced this pull request Oct 24, 2022
A recent change of c10d.ReduceOp crashes deepcopy of it, pin pt version
temporarily to fix CI.
see pytorch/pytorch#87303 (comment)
wz337 pushed a commit to wz337/PiPPy that referenced this pull request Oct 24, 2022
A recent change of c10d.ReduceOp crashes deepcopy of it, pin pt version
temporarily to fix CI.
see pytorch/pytorch#87303 (comment)
sgrigory pushed a commit to sgrigory/pytorch that referenced this pull request Oct 28, 2022
tentatively marking as draft as I haven't gotten a comprehensive list of side effects...

Ref: https://stackoverflow.com/questions/40244413/python-static-class-attribute-of-the-class-itself
Rel: pytorch#87191

cc @kwen2501
Pull Request resolved: pytorch#87303
Approved by: https://github.com/wanchaol
@crcrpar crcrpar deleted the more-enum-like-reduceop branch November 1, 2022 18:58
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Nov 5, 2022
tentatively marking as draft as I haven't gotten a comprehensive list of side effects...

Ref: https://stackoverflow.com/questions/40244413/python-static-class-attribute-of-the-class-itself
Rel: pytorch#87191

cc @kwen2501
Pull Request resolved: pytorch#87303
Approved by: https://github.com/wanchaol
pytorchmergebot pushed a commit that referenced this pull request Nov 15, 2022
Summary:
- Customize the metaclass of `torch.distributed.distributed_c10d.ReduceOp` for the sake of custom `__instancecheck__`
- Add `copy.copy`, `copy.deepcopy`, and `pickle` support with tests

Rel:
- #81272
- #84243
- #87191
- #87303
- #87555

Ref:
- pybind/pybind11#2696

Pull Request resolved: #88275
Approved by: https://github.com/wanchaol
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
tentatively marking as draft as I haven't gotten a comprehensive list of side effects...

Ref: https://stackoverflow.com/questions/40244413/python-static-class-attribute-of-the-class-itself
Rel: pytorch#87191

cc @kwen2501
Pull Request resolved: pytorch#87303
Approved by: https://github.com/wanchaol
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
)

Summary:
- Customize the metaclass of `torch.distributed.distributed_c10d.ReduceOp` for the sake of custom `__instancecheck__`
- Add `copy.copy`, `copy.deepcopy`, and `pickle` support with tests

Rel:
- pytorch#81272
- pytorch#84243
- pytorch#87191
- pytorch#87303
- pytorch#87555

Ref:
- pybind/pybind11#2696

Pull Request resolved: pytorch#88275
Approved by: https://github.com/wanchaol
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged open source release notes: distributed (c10d) release notes category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants