Numerical unstable in mixed precision (FP16) when training with DDP #19790

WayenVan · 2024-04-20T01:17:39Z

WayenVan
Apr 20, 2024

Hi, when I was doing research, I found that using FP16 leads to a degradation of accuracy, which is not a problem when I use my own DDP code with native pytorch.

An interest finding is that when I do inference with trained model(training with FP16), the model outputs NaN if set the inference trainer as FP32, while FP16 inference is still working. I don't know if it is caused by lightning itself, or maybe because of a bad GPU?

here is a short code of my trainer

    t = trainer.Trainer(
        accelerator='gpu',
        strategy='ddp',
        devices=2,
        callbacks=[ckpt_callback, callback],
        logger=logger,
        log_every_n_steps=50,
        max_epochs=cfg.epoch,
        sync_batchnorm=True,
        use_distributed_sampler=True,
        gradient_clip_val= 0.5,
        plugins=[
            plugins.MixedPrecision(
                precision='16-mixed',
                device='cuda',
                scaler=GradScaler(enabled=False)
            )
        ]
    )

I also tried GradScaler with enable=True, the problem is still exist.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Numerical unstable in mixed precision (FP16) when training with DDP #19790

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 0 comments

Select a reply

Numerical unstable in mixed precision (FP16) when training with DDP #19790

WayenVan Apr 20, 2024

Replies: 0 comments

WayenVan
Apr 20, 2024