Accessing validation metrics at checkpoint from outside module in ddp mode #680
Replies: 7 comments
-
This is unfortunately a limitation of DDP. When you train with DDP, Lightning launches sub-processes and the training happens in those. Thus, the metrics get set on the trainer that lives in the sub-process and don't get propagated back to the main process. I'd suggest using a logger to write out metrics to disk, then reading them back from there. Another option would be to implement whatever you're trying to do in the …
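The write-to-disk workaround can be sketched with plain stdlib code (the file path and metric names here are hypothetical placeholders; in real code the save would happen inside the Lightning sub-process, e.g. from a callback on global rank 0, and the load in the main process after `trainer.fit()` returns):

```python
import json
from pathlib import Path

METRICS_FILE = Path("best_metrics.json")  # hypothetical path, pick your own

def save_best_metrics(metrics: dict) -> None:
    """Write the best metrics to disk from inside the DDP sub-process,
    so they survive after that process exits."""
    METRICS_FILE.write_text(json.dumps(metrics))

def load_best_metrics() -> dict:
    """Read the metrics back in the main process after training."""
    return json.loads(METRICS_FILE.read_text())

# Inside the sub-process (e.g. when the checkpoint improves):
save_best_metrics({"val_acc": 0.91, "epoch": 17})

# Back in the main process, after trainer.fit() has returned:
best = load_best_metrics()
print(best["val_acc"])  # 0.91
```

JSON keeps this human-readable; any format works as long as only one rank writes the file.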
-
Just noticed you're using slurm. If you let slurm manage tasks, DDP doesn't need to launch new processes, and you should be fine. |
-
Hi @neggert,
Regarding the use of …
Thanks
-
You just need to set the slurm setting …
-
Thanks, but I think that’s what I’m doing already. Here are the slurm instructions that I’m using:
Or should I set `gres` to `gpu:4`? (There are 2 GPUs per node.)
-
For 2 nodes each with 2 gpus do:
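The exact settings were lost from the archive; as a rough sketch (the job script name and the Lightning flags are assumptions and depend on your Lightning/SLURM versions), something like:

```shell
#!/bin/bash
# Hypothetical sbatch script for 2 nodes x 2 GPUs each.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2   # one task per GPU, so ranks map cleanly
#SBATCH --gres=gpu:2          # gres counts GPUs *per node*: gpu:2 here, not gpu:4

srun python train.py --gpus 2 --num_nodes 2 --distributed_backend ddp
```

The key point for the question above: `--gres` is per node, so with 2 GPUs per node it stays `gpu:2`.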
If this fails, we can reopen!
-
These are the settings I was using and that were giving me trouble. I implemented @neggert's suggestion of writing the value to disk and reading it back after training, and it does the trick. I'm still unclear on why the slurm settings did not work as expected.
-
Hi,
Thank you for this excellent library. It is saving me a ton of time.
I have recently implemented some code that trains and evaluates a model, then assigns a score to the model. This score is the model's validation accuracy at the best epoch, at which point the model is checkpointed and its weights are saved. To access the best validation accuracy, I use the following code:
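The snippet itself was lost from the archive. In Lightning versions from around this era, the best score was typically read off the checkpoint callback after training (via an attribute such as `best` or `best_model_score`, depending on the version). A stand-in sketch, since the real callback only gets populated by a full training run:

```python
class ModelCheckpointStub:
    """Stand-in for pytorch_lightning's ModelCheckpoint callback.
    In real code this is the callback attached to the Trainer; the
    attribute names here are illustrative and version-dependent."""
    def __init__(self):
        self.best = None          # best monitored value seen so far
        self.best_k_models = {}   # checkpoint path -> monitored value

    def record(self, path, val_acc):
        self.best_k_models[path] = val_acc
        self.best = max(self.best_k_models.values())

checkpoint_callback = ModelCheckpointStub()
checkpoint_callback.record("epoch=3.ckpt", 0.87)
checkpoint_callback.record("epoch=7.ckpt", 0.91)

# After trainer.fit(model), the model's score would be read as:
score = checkpoint_callback.best
print(score)  # 0.91
```

Under DDP, the trouble described below is that this object lives in the sub-process, so reading it in the main process returns nothing useful.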
This works fine with a single GPU, but when I run it in multi-node DDP (2 x 2 GPUs), I get the following message:
My guess is that in ddp mode, the callback metrics are encapsulated in a dict or similar, but I have not been able to find the exact structure.
Could you please provide some assistance?
Many thanks in advance.