sync_grads flag in all_gather method #11652
-
This applies to `pytorch-lightning>=2.1.3`. If you look at how `_all_gather_ddp_if_available(tensor, group=group, sync_grads=sync_grads)` is written in the Lightning source, the behaviour becomes clear.
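Below is a rough, simplified paraphrase of what such a helper does, based on the behaviour described in this thread; it is not the actual Lightning implementation, and the use of `torch.distributed.nn.functional.all_gather` for the differentiable branch is my assumption about how that path can be realized:

```python
import torch
import torch.distributed as dist

def _all_gather_sketch(tensor, group=None, sync_grads=False):
    """Simplified paraphrase of an all-gather helper -- not Lightning's actual source."""
    if not dist.is_available() or not dist.is_initialized():
        return tensor  # single-process fallback: nothing to gather
    if sync_grads:
        # Differentiable all_gather: the gathered result stays in the autograd
        # graph, so gradients flow back to the input tensor on every rank.
        from torch.distributed.nn.functional import all_gather
        gathered = all_gather(tensor, group=group)
    else:
        # Default path: the gather runs under torch.no_grad(), so the result
        # is detached and backpropagation through it is impossible.
        with torch.no_grad():
            world_size = dist.get_world_size(group)
            gathered = [torch.zeros_like(tensor) for _ in range(world_size)]
            dist.all_gather(gathered, tensor, group=group)
    return torch.stack(list(gathered))
```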
The critical aspect of this implementation is the `sync_grads` parameter. By default, `sync_grads` is set to `False`, which means `_all_gather_ddp_if_available` wraps the gather in a `torch.no_grad()` context. This matters because if `all_gather` is used during training without enabling `sync_grads` (i.e., keeping it `False`), no gradients flow back through the gathered tensor. That can silently break training, since gradients are what update the model parameters.

To demonstrate the significance of the `sync_grads` parameter, consider two simplified networks, `SimpleNetWithoutNoGrad` and `SimpleNet`. The former processes its inputs without `torch.no_grad()`, simulating the effect of `sync_grads=True`; the latter wraps its forward pass in `torch.no_grad()`, mimicking the default `sync_grads=False`. A dummy example follows.
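The original snippet is not reproduced in this thread, so here is a minimal sketch of what the comparison might look like; the class names follow the description above, while the layer sizes and the dummy input are assumptions:

```python
import torch
import torch.nn as nn

class SimpleNetWithoutNoGrad(nn.Module):
    """Normal forward pass -- analogous to gathering with sync_grads=True."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 1)

    def forward(self, x):
        return self.fc(x)

class SimpleNet(nn.Module):
    """Forward pass wrapped in torch.no_grad() -- analogous to the default sync_grads=False."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 1)

    def forward(self, x):
        with torch.no_grad():
            return self.fc(x)

x = torch.randn(8, 4)
for net in (SimpleNetWithoutNoGrad(), SimpleNet()):
    out = net(x)
    # The no_grad() version returns a tensor detached from the graph,
    # so there is nothing to backpropagate through.
    if out.requires_grad:
        out.sum().backward()
    print(type(net).__name__, "fc.weight.grad:", net.fc.weight.grad)
```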
Output: the `torch.no_grad()` variant (`SimpleNet`) ends up with `fc.weight.grad` equal to `None`, while `SimpleNetWithoutNoGrad` receives real gradient values.
This comparison highlights that when `torch.no_grad()` is used (akin to `sync_grads=False` during an `all_gather` in training), certain gradients are simply not computed, which can undermine the model's training. Therefore, when using `all_gather` during training on a tensor that the loss depends on, it is imperative to set `sync_grads=True` so that gradients are computed and propagated correctly; this lets the model learn from the aggregated data. Conversely, during validation or inference, where gradient computation is unnecessary, `sync_grads` can remain `False` for computational efficiency.

In summary, the `sync_grads` parameter of PyTorch Lightning's `all_gather` plays a pivotal role in distributed training: setting it correctly ensures gradients are computed where they are needed and safeguards the integrity of the training process.

PS: I was stuck on this same error. I used to enable `ddp_find_unused_parameters_true` by default and couldn't find the problem; only after switching the strategy to plain `ddp` did I notice that some gradients were `None` and investigate. I find it odd that the docs don't cover this properly; this was the only related issue I came across, and no one had been able to answer it. I hope this comment helps future Lightning programmers avoid getting stuck on an error like this and wasting countless hours wondering whether your math is wrong, your code is wrong, or, worse, whether you chose the wrong profession.
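For completeness, here is a minimal sketch (not from the original post) of how this plays out in a `LightningModule`; the module name, layer sizes, and loss are hypothetical, but `self.all_gather(..., sync_grads=True)` is the relevant call:

```python
import torch
import pytorch_lightning as pl

class GatherLossModule(pl.LightningModule):
    """Hypothetical module: each rank produces an embedding, and the loss
    is computed over the embeddings gathered from all ranks."""

    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Linear(16, 8)

    def training_step(self, batch, batch_idx):
        x, _ = batch
        local_emb = self.encoder(x)
        # sync_grads=True keeps the gathered tensor in the autograd graph,
        # so gradients flow back to every rank's encoder parameters.
        all_emb = self.all_gather(local_emb, sync_grads=True)
        return all_emb.pow(2).mean()  # placeholder loss over the gathered tensor

    def validation_step(self, batch, batch_idx):
        x, _ = batch
        local_emb = self.encoder(x)
        # No backprop in validation, so the default sync_grads=False is fine.
        all_emb = self.all_gather(local_emb)
        self.log("val_emb_norm", all_emb.norm())

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```

With the strategy set to plain `ddp` (rather than `ddp_find_unused_parameters_true`), forgetting `sync_grads=True` typically surfaces as `None` gradients or unused-parameter errors, which is exactly the symptom described above.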
-
The documentation about the flag `sync_grads` in the `all_gather` method is a bit mysterious. Do I have to set `sync_grads=True` if I intend to run backpropagation on the result of the gathering operation? If not, what is a situation in which `sync_grads` must be set to `True`?

To be concrete: let's say I am training on multiple GPUs using the `ddp` strategy. Each GPU computes some tensor which needs to be aggregated in order to compute the loss. Do I aggregate the tensors using the `sync_grads=True` flag? Or is there some other situation in which `sync_grads` must be set to `True`?