Gradient checkpointing with DDP in a loop #10479
Replies: 4 comments 3 replies
-
Dear @shivammehta007, I also got this error. Has it been solved?
-
This does not appear to be a Lightning issue, but rather a limitation of DistributedDataParallel from torch.distributed, which does not support gradient checkpointing.
-
I have now solved this problem. The cause is that the model has parameters that were not used in producing the loss. Apply the following two settings, and you will find the names of the unused parameters.
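The two settings themselves were not preserved in this extract of the comment. A common pair for surfacing unused-parameter names (an assumption about what was meant here, not confirmed by the thread) is setting `TORCH_DISTRIBUTED_DEBUG=DETAIL` and constructing DDP with `find_unused_parameters=True`. A minimal single-process sketch:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Setting 1 (assumed): make torch.distributed report the names of
# parameters that never received a gradient.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

# Single-process "gloo" process group, just so DDP can be constructed.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 4)
        self.unused = torch.nn.Linear(4, 4)  # never touched in forward

    def forward(self, x):
        return self.used(x)

# Setting 2 (assumed): tell DDP to tolerate (and report) parameters
# that do not contribute to the loss.
model = DDP(Net(), find_unused_parameters=True)

loss = model(torch.randn(2, 4)).sum()
loss.backward()  # without find_unused_parameters=True, DDP raises here
dist.destroy_process_group()
```

In PyTorch Lightning the `find_unused_parameters` flag is passed through the DDP strategy/plugin rather than to DDP directly; the exact spelling depends on the Lightning version.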
-
Hi @kuixu, we can find the parameter names, but what is the actual solution? Do we need to remove those parameters, or something else? How does that resolve the issue?
-
Since my method is an autoregressive algorithm, it builds a huge gradient tape, so I am trying to do something like this:
It works fine on a single GPU, but on DDP it throws this error.
I am running it with:
Is there any workaround for this?
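One workaround not spelled out in the thread, but a known interaction between DDP and `torch.utils.checkpoint`, is to use non-reentrant checkpointing (`use_reentrant=False`, available in recent PyTorch versions). Reentrant checkpointing re-runs the forward pass inside backward in a way that confuses DDP's gradient hooks; the non-reentrant mode uses saved-tensor hooks instead and is compatible with DDP. A minimal sketch:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """A stand-in for one step of the autoregressive model."""
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)

    def forward(self, x):
        return torch.relu(self.linear(x))

block = Block()
x = torch.randn(2, 8, requires_grad=True)

# Activations inside `block` are not stored; they are recomputed during
# backward. use_reentrant=False is the DDP-friendly variant.
out = checkpoint(block, x, use_reentrant=False)
out.sum().backward()
```

If non-reentrant mode is not an option on your PyTorch version, another commonly cited workaround is constructing DDP with `static_graph=True` (exposed through Lightning's DDP strategy in newer releases), which lets DDP cooperate with reentrant checkpointing when the graph does not change between iterations.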