PT DistributedDataParallel with mixed precision training #1473

Open
albertz opened this issue Dec 5, 2023 · 5 comments

@albertz
Member

albertz commented Dec 5, 2023

I noticed that the DistributedDataParallel module has the option mixed_precision which is for mixed precision training. We don't use that, even if the user specifies torch_amp to use mixed precision. So I wonder now, what happens if the user sets torch_distributed = {} (so using multi-GPU training via DistributedDataParallel) and also sets torch_amp = "bfloat16" (as an example)? Does this work correctly? Is this currently suboptimal? (Actually, I'm using that in some experiments, and memory consumption looks normal, just as in single-GPU training, but I did not really check carefully.)
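
For concreteness, the combination in question would look roughly like this in the config (only the two options discussed here, everything else omitted):

```python
# Illustrative excerpt of the relevant config options, not a complete config.
torch_distributed = {}    # multi-GPU training via DistributedDataParallel
torch_amp = "bfloat16"    # automatic mixed precision using bfloat16
```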

@albertz
Member Author

albertz commented Dec 5, 2023

@kuacakuaca @Judyxujj any idea? Done that before?

@albertz
Member Author

albertz commented Dec 5, 2023

From the AMP documentation on Working with Multiple GPUs, it sounds like it should already be fine, as long as we use DistributedDataParallel with one GPU per process.
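
As a minimal sketch of that pattern (plain PyTorch, not the RETURNN code path; the function name and arguments are just illustrative): one process per GPU, the model wrapped in DistributedDataParallel, and autocast entered locally in each process.

```python
# Minimal sketch of DDP + autocast, one process per GPU (illustrative only).
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(ddp_model: DDP, batch, targets, loss_fn, optimizer, device):
    optimizer.zero_grad(set_to_none=True)
    # Autocast only affects the forward pass of this process.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        output = ddp_model(batch.to(device))
        loss = loss_fn(output, targets.to(device))
    # backward() runs outside autocast; DDP overlaps the gradient allreduce with it.
    loss.backward()
    optimizer.step()
```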

@kuacakuaca
Contributor

> @kuacakuaca @Judyxujj any idea? Done that before?

no, so you didn't observe a decrease in memory consumption?

@albertz
Member Author

albertz commented Dec 6, 2023

> so you didn't observe a decrease in memory consumption?

Compared to what? Of course, enabling AMP (I usually use bfloat16 without grad scaler) reduces GPU memory. But that is not really what I wrote here. This issue is about distributed training. What I was saying is that going from single GPU to multi GPU does not reduce memory. Why should it? But that is also not really my question here; it was just an observation which could be relevant.

My question is whether it actually works correctly. The observation is a hint that AMP is probably used in some way also with distributed training (otherwise the memory consumption would not be the same as single-GPU with AMP), but I'm still not sure whether it is correct w.r.t. the distributed training. With AMP, are the gradients then also bfloat16? So with AMP and distributed training, would it allreduce the bfloat16 gradients, and thus also save communication bandwidth? Or does it maybe allreduce the wrong gradients, so that multi GPU is effectively not used correctly here? This is my actual question.
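
A small sketch of how one could check this empirically (plain PyTorch, outside of RETURNN; the helper names here are made up): print the dtype of the gradient tensors that DDP allreduces, and, if bfloat16 communication is wanted explicitly, register one of PyTorch's built-in DDP gradient compression hooks.

```python
# Illustrative helpers to inspect what DDP allreduces under AMP.
import torch
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

def inspect_grad_dtypes(ddp_model):
    # After loss.backward(), print the dtype of each gradient tensor,
    # i.e. the tensors that DDP has just allreduced.
    for name, param in ddp_model.module.named_parameters():
        if param.grad is not None:
            print(name, param.grad.dtype)

def enable_bf16_allreduce(ddp_model):
    # Optional: compress gradients to bfloat16 for the allreduce only,
    # using PyTorch's built-in DDP communication hook, to save bandwidth.
    ddp_model.register_comm_hook(state=None, hook=default_hooks.bf16_compress_hook)
```

With plain autocast I would expect the .grad tensors (and thus the allreduce) to still be float32, since the parameters themselves stay float32; the compression hook would be the explicit way to get bfloat16 communication.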

@Judyxujj
Contributor

Judyxujj commented Dec 6, 2023

@albertz I used mixed precision training together with torch distributed training in the fairseq framework to train the wav2vec2 model. With mixed precision, the training is sped up. But I haven't looked into the implementation details.
