The trainer runs a single validation step after resume (not sanity) #18110
Replies: 2 comments 1 reply
Seeing a similar behavior with version |
I debugged the phenomenon step by step. The issue is that, while the trainer is in the restarting state, 0 training steps and 1 validation step are performed. Fixing this would mean correcting the restart logic, which seems to require a deep understanding of the PL code (I gave up and decided to tolerate this 1-step validation).

When loading a checkpoint,

When training begins, it first checks if
pytorch-lightning/src/lightning/pytorch/loops/training_epoch_loop.py Lines 199 to 201 in 1439da4

Next, it enters the validation loop.
pytorch-lightning/src/lightning/pytorch/loops/fit_loop.py Lines 197 to 200 in 1439da4

In EvaluationLoop's
pytorch-lightning/src/lightning/pytorch/loops/evaluation_loop.py Lines 147 to 148 in 1439da4

Following the In EvaluationLoop, |
I am encountering a weird behavior using Lightning (I use the LightningCLI as well).
When I resume a training run that has failed, with python train.py fit, the trainer runs a single validation step that is not a sanity check (I disabled sanity checks, and the state of the trainer is RunningStage.VALIDATING).
I run my main training with val_check_interval=0.5.
This is a problem, because it saves a metric point in my logs for this single batch as if it were an evaluation on the full validation set.
My metric curves show it clearly: at the point where I resume, there is a wrong point (here, much higher than the other ones).
Do you know what could cause this issue and how to get around it?
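One possible way to keep the stray point out of the logs, without touching the restart logic, is to count how many validation batches actually ran and treat a run as valid only if it covered the whole validation set. The sketch below is framework-agnostic and only illustrates the idea: `ValBatchGuard`, `run_validation`, and the batch count of 8 are hypothetical names and values, not part of Lightning.

```python
from dataclasses import dataclass


@dataclass
class ValBatchGuard:
    """Hypothetical helper (not a Lightning API): tracks validation coverage."""

    expected_batches: int  # assumed to be supplied by the user
    seen_batches: int = 0

    def start(self) -> None:
        # Reset the counter at the start of every validation run.
        self.seen_batches = 0

    def step(self) -> None:
        # Record one evaluated validation batch.
        self.seen_batches += 1

    def is_complete(self) -> bool:
        # True only if every expected batch was evaluated.
        return self.seen_batches >= self.expected_batches


def run_validation(guard: ValBatchGuard, num_batches: int) -> bool:
    """Simulate a validation run of `num_batches` batches and report coverage."""
    guard.start()
    for _ in range(num_batches):
        guard.step()
    return guard.is_complete()


guard = ValBatchGuard(expected_batches=8)
full_run_ok = run_validation(guard, 8)     # normal validation over all batches
resumed_run_ok = run_validation(guard, 1)  # the single-batch run seen after resume
print(full_run_ok, resumed_run_ok)
```

In a LightningModule, the three methods would presumably map onto the `on_validation_epoch_start`, `on_validation_batch_end`, and `on_validation_epoch_end` hooks, with the metric logged only when `is_complete()` returns true. Whether that fits depends on how the metrics are logged: values logged inside `validation_step` are aggregated automatically and would need to be logged manually instead.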