PT DistributedDataParallel with mixed precision training #1473

Open
albertz opened this issue Dec 5, 2023 · 5 comments

@albertz
Member

albertz commented Dec 5, 2023

I noticed that the DistributedDataParallel module has the option mixed_precision which is for mixed precision training. We don't use that, even if the user specifies torch_amp to use mixed precision. So I wonder now, what happens if the user sets torch_distributed = {} (so using multi-GPU training via DistributedDataParallel) and also sets torch_amp = "bfloat16" (as an example)? Does this work correctly? Is this currently suboptimal? (Actually, I'm using that in some experiments, and memory consumption looks normal, just as in single-GPU training, but I did not really check carefully.)
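
For concreteness, the combination in question would look roughly like this in the config (only the two options discussed here, everything else omitted):

```python
# Illustrative excerpt of the relevant config options, not a complete config.
torch_distributed = {}    # multi-GPU training via DistributedDataParallel
torch_amp = "bfloat16"    # automatic mixed precision using bfloat16
```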

@albertz
Member Author

albertz commented Dec 5, 2023

@kuacakuaca @Judyxujj any idea? Done that before?

@albertz
Member Author

albertz commented Dec 5, 2023

From the AMP documentation on Working with Multiple GPUs, it sounds like it should already be fine, as long as we use DistributedDataParallel with one GPU per process.
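
As a minimal sketch of that pattern (plain PyTorch, not the RETURNN code path; the function name and arguments are just illustrative): one process per GPU, the model wrapped in DistributedDataParallel, and autocast entered locally in each process.

```python
# Minimal sketch of DDP + autocast, one process per GPU (illustrative only).
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(ddp_model: DDP, batch, targets, loss_fn, optimizer, device):
    optimizer.zero_grad(set_to_none=True)
    # Autocast only affects the forward pass of this process.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        output = ddp_model(batch.to(device))
        loss = loss_fn(output, targets.to(device))
    # backward() runs outside autocast; DDP overlaps the gradient allreduce with it.
    loss.backward()
    optimizer.step()
```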

@kuacakuaca
Contributor

> @kuacakuaca @Judyxujj any idea? Done that before?

no, so you didn't observe a decrease in memory consumption?

@albertz
Member Author

albertz commented Dec 6, 2023

> so you didn't observe a decrease in memory consumption?

Compared to what? Of course, enabling AMP (I usually use bfloat16 without grad scaler) reduces GPU memory. But that is not really what I wrote here. This issue is about distributed training. What I was saying is that going from single GPU to multi GPU does not reduce memory. Why should it? But that is also not really my question here; it was just an observation which could be relevant.

My question is whether it actually works correctly. The observation is a hint that AMP is probably used in some way also with distributed training (otherwise the memory consumption would not be the same as single-GPU with AMP), but I'm still not sure whether it is correct w.r.t. the distributed training. With AMP, are the gradients then also bfloat16? So with AMP and distributed training, would it allreduce the bfloat16 gradients, and thus also save communication bandwidth? Or does it maybe allreduce the wrong gradients, so that multi GPU is effectively not used correctly here? This is my actual question.
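
A small sketch of how one could check this empirically (plain PyTorch, outside of RETURNN; the helper names here are made up): print the dtype of the gradient tensors that DDP allreduces, and, if bfloat16 communication is wanted explicitly, register one of PyTorch's built-in DDP gradient compression hooks.

```python
# Illustrative helpers to inspect what DDP allreduces under AMP.
import torch
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

def inspect_grad_dtypes(ddp_model):
    # After loss.backward(), print the dtype of each gradient tensor,
    # i.e. the tensors that DDP has just allreduced.
    for name, param in ddp_model.module.named_parameters():
        if param.grad is not None:
            print(name, param.grad.dtype)

def enable_bf16_allreduce(ddp_model):
    # Optional: compress gradients to bfloat16 for the allreduce only,
    # using PyTorch's built-in DDP communication hook, to save bandwidth.
    ddp_model.register_comm_hook(state=None, hook=default_hooks.bf16_compress_hook)
```

With plain autocast I would expect the .grad tensors (and thus the allreduce) to still be float32, since the parameters themselves stay float32; the compression hook would be the explicit way to get bfloat16 communication.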

@Judyxujj
Contributor

Judyxujj commented Dec 6, 2023

@albertz I used mixed precision training together with torch distributed training in the fairseq framework to train the wav2vec2 model. With mixed precision, the training is sped up. But I haven't looked into the implementation details.
