Reenable isinstance with torch.distributed.ReduceOp #87303

Closed · wants to merge 7 commits
12 changes: 12 additions & 0 deletions test/distributed/test_c10d_common.py
@@ -1622,6 +1622,18 @@ def comm_fn(tensor, group=None):
self._test_work_wait(tensor, comm_fn=comm_fn)


class ReduceOpTest(TestCase):

    def test_op_isinstance_of_reduceop(self):
        for reduce_op in (
            c10d.ReduceOp.SUM, c10d.ReduceOp.AVG, c10d.ReduceOp.PRODUCT,
            c10d.ReduceOp.MIN, c10d.ReduceOp.MAX,
            c10d.ReduceOp.BAND, c10d.ReduceOp.BOR, c10d.ReduceOp.BXOR,
        ):
            self.assertTrue(isinstance(reduce_op, c10d.ReduceOp))
        for scale in ([torch.tensor(1.0)], 2.0):
            self.assertTrue(isinstance(dist._make_nccl_premul_sum(scale), c10d.ReduceOp))


if __name__ == "__main__":
    assert (
        not torch.cuda._initialized
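For context, a minimal sketch of the behavior this test locks in, assuming a torch build with distributed support (`_make_nccl_premul_sum` is the private helper this PR routes `PREMUL_SUM` through):

import torch
import torch.distributed as dist

# The flat attributes are ReduceOp instances again, so the isinstance
# idiom that issue #87191 reported as broken works:
assert isinstance(dist.ReduceOp.SUM, dist.ReduceOp)
assert isinstance(dist.ReduceOp.MAX, dist.ReduceOp)

# PREMUL_SUM needs a scale, so it is built via the private helper;
# the result is also a ReduceOp instance.
premul = dist._make_nccl_premul_sum(2.0)
assert isinstance(premul, dist.ReduceOp)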
3 changes: 1 addition & 2 deletions torch/_C/_distributed_c10d.pyi
@@ -63,7 +63,6 @@ class DebugLevel(Enum):

class ReduceOp:

-    # note(crcrpar): These values are populated from Kind
    SUM = ...
    PRODUCT = ...
    MIN = ...
@@ -74,7 +73,7 @@ class ReduceOp:
    PREMUL_SUM = ...
    UNUSED = ...

-    class Kind(Enum): ...
+    class RedOpType(Enum): ...

class BroadcastOptions:
rootRank: int
5 changes: 4 additions & 1 deletion torch/csrc/distributed/c10d/Types.hpp
@@ -29,6 +29,7 @@ struct NCCLPreMulSumSupplement : _SupplementBase {
// Other ReduceOps that need different supplementary data can also
// derive from _SupplementBase.
struct TORCH_API ReduceOp : torch::CustomClassHolder {
+  // note(crcrpar): RedOpType could be defined outside of `ReduceOp`
  enum RedOpType : uint8_t {
    SUM = 0,
    AVG = 1,
@@ -46,7 +47,9 @@ struct TORCH_API ReduceOp : torch::CustomClassHolder {

  ReduceOp(RedOpType op) : op_(op) {
    TORCH_INTERNAL_ASSERT(
-        op_ != PREMUL_SUM, "PREMUL_SUM requires a scale factor tensor or scalar argument");
+        op_ != PREMUL_SUM,
+        "Use `torch.distributed._make_nccl_premul_sum` to create an instance of ReduceOp with PREMUL_SUM");
  }

Contributor (inline comment): What's the reason a user needs to use this API to construct a premul_sum? Could the user just create this reduce op type like the other reduce ops? The fact that there are two APIs constructing different reduce ops is confusing.

ReduceOp(RedOpType op, c10::intrusive_ptr<_SupplementBase> optional_supplement) {
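In Python terms, the assert in this constructor means a bare PREMUL_SUM cannot be constructed directly; a hedged sketch of both paths (names as they appear elsewhere in this diff):

import torch.distributed as dist

# Ordinary op types can be constructed from the nested enum, via the
# py::init<RedOpType> binding registered in init.cpp below:
op = dist.ReduceOp(dist.ReduceOp.RedOpType.SUM)

# PREMUL_SUM lacks its scale-factor supplement here, so the
# TORCH_INTERNAL_ASSERT above fires:
# dist.ReduceOp(dist.ReduceOp.RedOpType.PREMUL_SUM)  # raises
# The supported path attaches the scale via the private helper:
premul = dist._make_nccl_premul_sum(2.0)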
29 changes: 21 additions & 8 deletions torch/csrc/distributed/c10d/init.cpp
@@ -515,10 +515,14 @@ An enum-like class for built-in communication hooks: ``ALLREDUCE`` and ``FP16_CO
R"(Sets the debug level of the torch.distributed package from the
``TORCH_DISTRIBUTED_DEBUG`` environment variable.)");

  // TODO(crcrpar): Harden `ReduceOp`.
  // While keeping most op types as enum values, making `PREMUL_SUM`
  // callable, i.e., allowing `ReduceOp.PREMUL_SUM(scale)`, might be
  // better, as per @wanchaol.
  // https://pybind11.readthedocs.io/en/stable/classes.html#enumerations-and-internal-types
  py::class_<::c10d::ReduceOp> reduce_op(module, "ReduceOp", R"(
An enum-like class for available reduction operations: ``SUM``, ``PRODUCT``,
-``MIN``, ``MAX``, ``BAND``, ``BOR``, and ``BXOR``.
+``MIN``, ``MAX``, ``BAND``, ``BOR``, ``BXOR``, and ``PREMUL_SUM``.
``BAND``, ``BOR``, and ``BXOR`` reductions are not available when
using the ``NCCL`` backend.
@@ -529,13 +533,16 @@ and only for NCCL versions 2.10 or later.
``PREMUL_SUM`` multiplies inputs by a given scalar locally before reduction.
``PREMUL_SUM`` is only available with the ``NCCL`` backend,
-and only available for NCCL versions 2.11 or later.
+and only available for NCCL versions 2.11 or later. Users are supposed to
+use ``torch.distributed._make_nccl_premul_sum``.
Additionally, ``MAX``, ``MIN`` and ``PRODUCT`` are not supported for complex tensors.
The values of this class can be accessed as attributes, e.g., ``ReduceOp.SUM``.
They are used in specifying strategies for reduction collectives, e.g.,
-:func:`reduce`, :func:`all_reduce_multigpu`, etc.)");
+:func:`reduce`, :func:`all_reduce_multigpu`, etc.
+This class does not support ``__members__`` property.)");

reduce_op.def(py::init<::c10d::ReduceOp::RedOpType>())
.def_readwrite("op", &::c10d::ReduceOp::op_);
@@ -555,8 +562,14 @@ They are used in specifying strategies for reduction collectives, e.g.,
          [](const ::c10d::ReduceOp& self, const ::c10d::ReduceOp& other) {
            return self == other.op_;
          })
-      .def("__hash__", [](const ::c10d::ReduceOp& self) { return self.op_; });
+      .def("__hash__", [](const ::c10d::ReduceOp& self) {
+        return static_cast<uint8_t>(self.op_);
+      });

  // note(crcrpar): Deliberately skip
  // [`export_values`](https://pybind11.readthedocs.io/en/stable/classes.html#enumerations-and-internal-types)
  // here and manually set the values on the Python side. See the note
  // "ReduceOp static class attributes to support `isinstance`".
  py::enum_<::c10d::ReduceOp::RedOpType>(reduce_op, "RedOpType")
      .value("SUM", ::c10d::ReduceOp::RedOpType::SUM)
      .value("AVG", ::c10d::ReduceOp::RedOpType::AVG)
      .value("PRODUCT", ::c10d::ReduceOp::RedOpType::PRODUCT)
      .value("MIN", ::c10d::ReduceOp::RedOpType::MIN)
      .value("MAX", ::c10d::ReduceOp::RedOpType::MAX)
      .value("BAND", ::c10d::ReduceOp::RedOpType::BAND)
      .value("BOR", ::c10d::ReduceOp::RedOpType::BOR)
      .value("BXOR", ::c10d::ReduceOp::RedOpType::BXOR)
-      .value("PREMUL_SUM", ::c10d::ReduceOp::RedOpType::PREMUL_SUM)
-      .export_values();
+      .value("PREMUL_SUM", ::c10d::ReduceOp::RedOpType::PREMUL_SUM);
Contributor (inline comment): Can we just make the pybind API on the ReduceOp class directly, so that users can still do things like ReduceOp.SUM (i.e., as a class attribute)? I feel something like ReduceOp.PREMUL_SUM(premul_value) looks cleaner than _make_nccl_premul_sum.

-  // Ref: [Implicit
+  // note(crcrpar): This could be removed because users will not pass
+  // `RedOpType` to reduce collective ops. Ref: [Implicit
  // conversions](https://pybind11.readthedocs.io/en/stable/advanced/classes.html#implicit-conversions)
  // Let us skip the explicit construction of `c10d::ReduceOp` from
  // `c10d::ReduceOp::RedOpType` in Python.
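Because `export_values()` is skipped, RedOpType members are exported only on the nested enum; the flat attributes come from the Python loop in the next file. A small sketch of the resulting surface (assuming that loop has run at import time):

import torch.distributed as dist

# Nested enum access still works:
rt = dist.ReduceOp.RedOpType.SUM
assert isinstance(rt, dist.ReduceOp.RedOpType)

# The flat attribute is a ReduceOp instance set from Python, and the
# __hash__/__eq__ bindings above make it usable as a dict/set key:
assert isinstance(dist.ReduceOp.SUM, dist.ReduceOp)
ops = {dist.ReduceOp.SUM: "sum"}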
11 changes: 11 additions & 0 deletions torch/distributed/distributed_c10d.py
@@ -233,6 +233,17 @@ def register_backend(cls, name, func, extended_api=False):
dist_backend = Backend


# NOTE(crcrpar): [ReduceOp static class attributes to support `isinstance`]
# A ReduceOp instance of `PREMUL_SUM` is supposed to be created via `_make_nccl_premul_sum`,
# while the other `op`s (i.e., RedOpType members) can be passed directly to c10d reduce collectives.
# I changed `ReduceOp` from an enum class to a struct and introduced the `RedOpType` enum class
# for PREMUL_SUM, which broke the implicit contract that ReduceOp is enum-like, with which users
# apply isinstance to an `op`, for example `isinstance(ReduceOp.SUM, ReduceOp)`:
# https://github.com/pytorch/pytorch/issues/87191
DENY_LIST = ("PREMUL_SUM", )
for _red_op_name, _red_op_value in ReduceOp.RedOpType.__members__.items():
    setattr(ReduceOp, _red_op_name, _red_op_value if _red_op_name in DENY_LIST else ReduceOp(_red_op_value))

Contributor (inline comment): @crcrpar I think this PR breaks the serialization of c10d.ReduceOp; with the newest nightly this now crashes:

import torch
import copy

a = torch.distributed.distributed_c10d.ReduceOp.SUM
copy.deepcopy(a)

I checked a nightly from a few days ago and it worked fine. The issue seems to come from the changes in this file; if I comment out these changes, it works again. Could you help take a look and fix it? Thanks!

Contributor (inline comment): I'm wondering if we should try to override __isinstance__ instead of overriding the attributes on the class?

Contributor (inline comment): I can confirm that I'm also running into the issue @wanchaol mentioned.
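Spelled out, the loop above is roughly equivalent to the following sketch; the deepcopy failure at the end reproduces the reviewers' report above, not intended behavior:

import copy
import torch.distributed as dist

ReduceOp = dist.ReduceOp
# Every member except PREMUL_SUM becomes a full ReduceOp instance:
#   ReduceOp.SUM        -> ReduceOp(ReduceOp.RedOpType.SUM)
#   ReduceOp.PREMUL_SUM -> ReduceOp.RedOpType.PREMUL_SUM (bare enum member)
assert isinstance(ReduceOp.SUM, ReduceOp)
assert isinstance(ReduceOp.PREMUL_SUM, ReduceOp.RedOpType)

# Per the reviewer reports, the bound ReduceOp instances do not support
# pickling here, so this crashes on nightlies that include the change:
copy.deepcopy(ReduceOp.SUM)  # reported to crash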


class _reduce_op(object):
    r"""
    Deprecated enum-like class for reduction operations: ``SUM``, ``PRODUCT``,