[ReduceOP] Type bug since Torch 1.13 #90072

Closed
chongxiaoc opened this issue Dec 2, 2022 · 1 comment
Labels
oncall: distributed, triaged

Comments

chongxiaoc commented Dec 2, 2022

🐛 Describe the bug

Since Torch 1.13, the type of `ReduceOp` appears to have changed, and the following snippet raises an error:

        >>> from torch.distributed import ReduceOp
        >>> op = None
        >>> op in (ReduceOp.SUM, None)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: __eq__(): incompatible function arguments. The following argument types are supported:
            1. (self: torch._C._distributed_c10d.ReduceOp, arg0: c10d::ReduceOp::RedOpType) -> bool
            2. (self: torch._C._distributed_c10d.ReduceOp, arg0: torch._C._distributed_c10d.ReduceOp) -> bool
        Invoked with: <torch.distributed.distributed_c10d.ReduceOp object at 0x7fba78c9e0b0>, None

This affects end-to-end runs of Horovod and Lightning; see the Lightning-side issue Lightning-AI/pytorch-lightning#15802.
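For readers without a Torch 1.13 install, here is a minimal sketch of why the membership test ends up calling `__eq__` with `None`. When `in` scans a tuple, Python checks identity first and then falls back to `==`, which reaches the op's `__eq__` either directly or through the reflected fallback. `StrictOp`, `membership_raises`, and `is_sum_or_none` are hypothetical names used only for illustration; `StrictOp` merely imitates the reported behavior and is not Torch's class. The last helper shows one possible caller-side workaround, assuming the goal is just to treat `None` as "no op given".

```python
class StrictOp:
    """Hypothetical stand-in for a binding whose __eq__ rejects unknown types."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if not isinstance(other, StrictOp):
            # Imitates the reported behavior: raise instead of returning NotImplemented.
            raise TypeError("__eq__(): incompatible function arguments")
        return self.name == other.name

    def __hash__(self):
        return hash(self.name)


SUM = StrictOp("SUM")


def membership_raises(op):
    """Return True if `op in (SUM, None)` raises TypeError."""
    try:
        op in (SUM, None)
    except TypeError:
        return True
    return False


def is_sum_or_none(op):
    """Workaround sketch: check None by identity before comparing with ==."""
    return op is None or op == SUM
```

With this stand-in, `membership_raises(None)` reports the failure, while `is_sum_or_none(None)` short-circuits on the identity check and never reaches `__eq__`.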

Versions

1.13

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu

pritamdamania87 (Contributor) commented Dec 2, 2022

@crcrpar @kwen2501 This seems to be related to #84243

pritamdamania87 added the triaged and oncall: distributed labels Dec 2, 2022
kulinseth pushed a commit to kulinseth/pytorch that referenced this issue Dec 10, 2022
Improve the completeness of `ReduceOp.__eq__`.

A follow-up should support the equality operator with `RedOpType` as the first argument and `ReduceOp` as the second.

Fixes pytorch#90072

Pull Request resolved: pytorch#90088
Approved by: https://github.com/kwen2501
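
As a rough Python-level illustration of what a more complete `__eq__` can look like (the actual change lives in the C++ bindings and may well differ), an `__eq__` that returns `NotImplemented` for unsupported types lets Python fall back to its default comparison instead of raising. `TolerantOp` is a hypothetical class for this sketch only, not part of Torch.

```python
class TolerantOp:
    """Hypothetical op type whose __eq__ tolerates unrelated operands."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if not isinstance(other, TolerantOp):
            # Defer to Python's fallback, which ends in an identity comparison.
            return NotImplemented
        return self.name == other.name

    def __hash__(self):
        return hash(self.name)


SUM = TolerantOp("SUM")

# Neither expression raises with this __eq__.
none_in_ops = None in (SUM, None)
sum_equals_none = SUM == None  # noqa: E711 (deliberate == to exercise __eq__)
```

Under this pattern the membership test from the report evaluates normally, and comparing an op with `None` simply yields `False`.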