Numerical instability in mixed precision (FP16) when training with DDP #19790
WayenVan asked this question in DDP / multi-GPU / multi-node
Hi, while doing research I found that training with FP16 leads to a degradation in accuracy. This does not happen when I use my own DDP code with native PyTorch.
An interesting finding: when I run inference with the trained model (trained in FP16), the model outputs NaN if the inference trainer is set to FP32, while FP16 inference still works. I don't know whether this is caused by Lightning itself, or perhaps by a bad GPU?
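One way to narrow down whether the problem is in the checkpoint itself (rather than in the FP32 inference path) is to scan the saved weights for non-finite values after FP16 training. This is a diagnostic sketch, not part of the original post; the helper name `check_finite` is mine:

```python
import torch

def check_finite(state_dict):
    # Return the keys of any floating-point tensors that contain NaN or Inf.
    # If this list is non-empty, the FP16 training run itself overflowed,
    # and FP32 inference is merely exposing the already-corrupted weights.
    return [
        k for k, v in state_dict.items()
        if torch.is_floating_point(v) and not torch.isfinite(v).all()
    ]

# Example with a deliberately corrupted state dict:
bad = check_finite({
    "w": torch.tensor([1.0, float("nan")]),
    "b": torch.tensor([0.0]),
})
```

If the checkpoint is clean, the NaNs are produced at inference time instead, which would point at a precision-handling difference between the two trainer configurations.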
Here is a short snippet of my trainer:
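(The original snippet did not come through in the post text. As a rough stand-in, a Lightning Trainer configured for FP16 mixed-precision DDP would typically look like the sketch below; the exact devices/strategy values here are assumptions, not the author's actual settings.)

```python
# Hypothetical reconstruction -- the poster's real trainer code is not shown.
import lightning as L

trainer = L.Trainer(
    accelerator="gpu",
    devices=2,                 # assumed; the actual GPU count is not stated
    strategy="ddp",
    precision="16-mixed",      # FP16 autocast with gradient scaling
)
```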
I also tried GradScaler with enabled=True, but the problem still exists.
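For comparison, this is the shape of the native PyTorch AMP loop (the "own DDP code" path that reportedly works), shown here as a minimal single-step sketch; the tiny model and data are placeholders, and the loop falls back to plain FP32 when no GPU is available:

```python
import torch

model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

use_cuda = torch.cuda.is_available()
# GradScaler is a no-op when enabled=False, so this also runs on CPU.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x, y = torch.randn(4, 8), torch.randn(4, 1)

with torch.autocast(device_type="cuda" if use_cuda else "cpu",
                    enabled=use_cuda):
    loss = torch.nn.functional.mse_loss(model(x), y)

# Scale the loss to keep FP16 gradients from underflowing, then unscale
# inside step(); update() adjusts the scale factor for the next iteration.
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```

If this manual loop stays stable while the Lightning run does not, comparing the two configurations (e.g. which modules run under autocast, and the scaler settings) would be the next step.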