Accessing validation metrics at checkpoint from outside module in ddp mode #680
Replies: 7 comments
-
This is unfortunately a limitation of DDP. When you train with DDP, Lightning launches sub-processes and the training happens in those. Thus, the metrics get set on the trainer that lives in the sub-process and don't get propagated back to the main process. I'd suggest using a logger to write out metrics to disk, then reading them back from there. Another option would be to implement whatever you're trying to do in the …
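The write-to-disk workaround can be sketched with plain stdlib code (the file path and metric names here are hypothetical placeholders; in real code the save would happen inside the Lightning sub-process, e.g. from a callback on global rank 0, and the load in the main process after `trainer.fit()` returns):

```python
import json
from pathlib import Path

METRICS_FILE = Path("best_metrics.json")  # hypothetical path, pick your own

def save_best_metrics(metrics: dict) -> None:
    """Write the best metrics to disk from inside the DDP sub-process,
    so they survive after that process exits."""
    METRICS_FILE.write_text(json.dumps(metrics))

def load_best_metrics() -> dict:
    """Read the metrics back in the main process after training."""
    return json.loads(METRICS_FILE.read_text())

# Inside the sub-process (e.g. when the checkpoint improves):
save_best_metrics({"val_acc": 0.91, "epoch": 17})

# Back in the main process, after trainer.fit() has returned:
best = load_best_metrics()
print(best["val_acc"])  # 0.91
```

JSON keeps this human-readable; any format works as long as only one rank writes the file.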
-
Just noticed you're using slurm. If you let slurm manage tasks, DDP doesn't need to launch new processes, and you should be fine. |
-
Hi @neggert,
Regarding the use of …
Thanks
-
You just need to set the slurm setting …
-
Thanks, but I think that’s what I’m doing already. Here are the slurm instructions that I’m using:
Or should I set `gres` to `gpu:4`? (There are 2 GPUs per node.)
-
For 2 nodes each with 2 gpus do:
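The exact settings were lost from the archive; as a rough sketch (the job script name and the Lightning flags are assumptions and depend on your Lightning/SLURM versions), something like:

```shell
#!/bin/bash
# Hypothetical sbatch script for 2 nodes x 2 GPUs each.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2   # one task per GPU, so ranks map cleanly
#SBATCH --gres=gpu:2          # gres counts GPUs *per node*: gpu:2 here, not gpu:4

srun python train.py --gpus 2 --num_nodes 2 --distributed_backend ddp
```

The key point for the question above: `--gres` is per node, so with 2 GPUs per node it stays `gpu:2`.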
If this fails, we can reopen!
-
These are the settings I was using and that were giving me trouble. I implemented @neggert's suggestion of writing the value to disk and reading it back after training, and it does the trick. I'm still unclear on why the slurm settings did not work as expected.
-
Hi,
Thank you for this excellent library. It is saving me a ton of time.
I have recently implemented some code that trains and evaluates a model, then assigns a score to the model. This score is the model's validation accuracy at the best epoch, at which point the model is checkpointed and its weights are saved. To access the best validation accuracy, I use the following code:
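The snippet itself was lost from the archive. In Lightning versions from around this era, the best score was typically read off the checkpoint callback after training (via an attribute such as `best` or `best_model_score`, depending on the version). A stand-in sketch, since the real callback only gets populated by a full training run:

```python
class ModelCheckpointStub:
    """Stand-in for pytorch_lightning's ModelCheckpoint callback.
    In real code this is the callback attached to the Trainer; the
    attribute names here are illustrative and version-dependent."""
    def __init__(self):
        self.best = None          # best monitored value seen so far
        self.best_k_models = {}   # checkpoint path -> monitored value

    def record(self, path, val_acc):
        self.best_k_models[path] = val_acc
        self.best = max(self.best_k_models.values())

checkpoint_callback = ModelCheckpointStub()
checkpoint_callback.record("epoch=3.ckpt", 0.87)
checkpoint_callback.record("epoch=7.ckpt", 0.91)

# After trainer.fit(model), the model's score would be read as:
score = checkpoint_callback.best
print(score)  # 0.91
```

Under DDP, the trouble described below is that this object lives in the sub-process, so reading it in the main process returns nothing useful.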
This works fine with a single GPU, but when I run it in multi-node DDP (2 x 2 GPUs), I get the following message:
My guess is that in ddp mode, the callback metrics are encapsulated in a dict or similar, but I have not been able to find the exact structure.
Could you please provide some assistance?
Many thanks in advance.