sync_grads flag in all_gather method #11652
-
This applies to `pytorch-lightning>=2.1.3`. If you look at how `_all_gather_ddp_if_available(tensor, group=group, sync_grads=sync_grads)` is written in the Lightning source, the behaviour becomes clear.
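Below is a rough, simplified paraphrase of what such a helper does, based on the behaviour described in this thread; it is not the actual Lightning implementation, and the use of `torch.distributed.nn.functional.all_gather` for the differentiable branch is my assumption about how that path can be realized:

```python
import torch
import torch.distributed as dist

def _all_gather_sketch(tensor, group=None, sync_grads=False):
    """Simplified paraphrase of an all-gather helper -- not Lightning's actual source."""
    if not dist.is_available() or not dist.is_initialized():
        return tensor  # single-process fallback: nothing to gather
    if sync_grads:
        # Differentiable all_gather: the gathered result stays in the autograd
        # graph, so gradients flow back to the input tensor on every rank.
        from torch.distributed.nn.functional import all_gather
        gathered = all_gather(tensor, group=group)
    else:
        # Default path: the gather runs under torch.no_grad(), so the result
        # is detached and backpropagation through it is impossible.
        with torch.no_grad():
            world_size = dist.get_world_size(group)
            gathered = [torch.zeros_like(tensor) for _ in range(world_size)]
            dist.all_gather(gathered, tensor, group=group)
    return torch.stack(list(gathered))
```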
The critical aspect of this implementation is the `sync_grads` parameter. By default, `sync_grads` is set to `False`, which means `_all_gather_ddp_if_available` wraps the gather in a `torch.no_grad()` context. This matters because if `all_gather` is used during training without enabling `sync_grads` (i.e., keeping it `False`), no gradients flow back through the gathered tensor. That can silently break training, since gradients are what update the model parameters.

To demonstrate the significance of the `sync_grads` parameter, consider two simplified networks, `SimpleNetWithoutNoGrad` and `SimpleNet`. The former processes its inputs without `torch.no_grad()`, simulating the effect of `sync_grads=True`; the latter wraps its forward pass in `torch.no_grad()`, mimicking the default `sync_grads=False`. A dummy example follows.
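The original snippet is not reproduced in this thread, so here is a minimal sketch of what the comparison might look like; the class names follow the description above, while the layer sizes and the dummy input are assumptions:

```python
import torch
import torch.nn as nn

class SimpleNetWithoutNoGrad(nn.Module):
    """Normal forward pass -- analogous to gathering with sync_grads=True."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 1)

    def forward(self, x):
        return self.fc(x)

class SimpleNet(nn.Module):
    """Forward pass wrapped in torch.no_grad() -- analogous to the default sync_grads=False."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 1)

    def forward(self, x):
        with torch.no_grad():
            return self.fc(x)

x = torch.randn(8, 4)
for net in (SimpleNetWithoutNoGrad(), SimpleNet()):
    out = net(x)
    # The no_grad() version returns a tensor detached from the graph,
    # so there is nothing to backpropagate through.
    if out.requires_grad:
        out.sum().backward()
    print(type(net).__name__, "fc.weight.grad:", net.fc.weight.grad)
```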
Output: the `torch.no_grad()` variant (`SimpleNet`) ends up with `fc.weight.grad` equal to `None`, while `SimpleNetWithoutNoGrad` receives real gradient values.
This comparison highlights that when `torch.no_grad()` is used (akin to `sync_grads=False` during an `all_gather` in training), certain gradients are simply not computed, which can undermine the model's training. Therefore, when using `all_gather` during training on a tensor that the loss depends on, it is imperative to set `sync_grads=True` so that gradients are computed and propagated correctly; this lets the model learn from the aggregated data. Conversely, during validation or inference, where gradient computation is unnecessary, `sync_grads` can remain `False` for computational efficiency.

In summary, the `sync_grads` parameter of PyTorch Lightning's `all_gather` plays a pivotal role in distributed training: setting it correctly ensures gradients are computed where they are needed and safeguards the integrity of the training process.

PS: I was stuck on this same error. I used to enable `ddp_find_unused_parameters_true` by default and couldn't find the problem; only after switching the strategy to plain `ddp` did I notice that some gradients were `None` and investigate. I find it odd that the docs don't cover this properly; this was the only related issue I came across, and no one had been able to answer it. I hope this comment helps future Lightning programmers avoid getting stuck on an error like this and wasting countless hours wondering whether your math is wrong, your code is wrong, or, worse, whether you chose the wrong profession.
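For completeness, here is a minimal sketch (not from the original post) of how this plays out in a `LightningModule`; the module name, layer sizes, and loss are hypothetical, but `self.all_gather(..., sync_grads=True)` is the relevant call:

```python
import torch
import pytorch_lightning as pl

class GatherLossModule(pl.LightningModule):
    """Hypothetical module: each rank produces an embedding, and the loss
    is computed over the embeddings gathered from all ranks."""

    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Linear(16, 8)

    def training_step(self, batch, batch_idx):
        x, _ = batch
        local_emb = self.encoder(x)
        # sync_grads=True keeps the gathered tensor in the autograd graph,
        # so gradients flow back to every rank's encoder parameters.
        all_emb = self.all_gather(local_emb, sync_grads=True)
        return all_emb.pow(2).mean()  # placeholder loss over the gathered tensor

    def validation_step(self, batch, batch_idx):
        x, _ = batch
        local_emb = self.encoder(x)
        # No backprop in validation, so the default sync_grads=False is fine.
        all_emb = self.all_gather(local_emb)
        self.log("val_emb_norm", all_emb.norm())

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```

With the strategy set to plain `ddp` (rather than `ddp_find_unused_parameters_true`), forgetting `sync_grads=True` typically surfaces as `None` gradients or unused-parameter errors, which is exactly the symptom described above.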
-
The documentation about the flag `sync_grads` in the `all_gather` method is a bit mysterious. Do I have to set `sync_grads=True` if I intend to run backpropagation on the result of the gathering operation? If not, what is a situation in which `sync_grads` must be set to `True`?

To be concrete: let's say I am training on multiple GPUs using the `ddp` strategy. Each GPU computes some tensor which needs to be aggregated in order to compute the loss. Do I aggregate the tensors using the `sync_grads=True` flag? Or is there some other situation in which `sync_grads` must be set to `True`?