Why is FSDPStrategy so slow when I train on multiple machines? #1369
Comments
Hi, can you post the CLI args or code you are using?
Just to confirm: are you running the pretraining command? Maybe try commenting this line out: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/pretrain.py#L174. We have bumped into issues with PyTorch 2.2 and torch.compile recently, so let's take this variable out of the equation.
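Rather than commenting the line out permanently, the compile call can be gated behind an environment variable so it is easy to toggle while debugging. A minimal sketch, assuming a hypothetical `maybe_compile` helper (this is not litgpt's actual API):

```python
import os

def maybe_compile(model, env_var="DISABLE_TORCH_COMPILE"):
    """Return torch.compile(model) unless the env var disables it.

    Hypothetical helper for debugging only; set DISABLE_TORCH_COMPILE=1
    to rule torch.compile out as the cause of a slowdown.
    """
    if os.environ.get(env_var, "0") == "1":
        return model  # compilation disabled: use the eager model as-is
    import torch  # imported lazily so the toggle works even without torch
    return torch.compile(model)
```

With the variable set, the model object is returned untouched, so the rest of the training script is unaffected.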
Hi, thanks for your prompt reply! Yes, I use two machines with 8 GPUs per machine. I launch with args like `fabric run --node-rank=1 --main-address=ip1 --accelerator=cuda --devices=8 --num-nodes=2 litgpt/pretrain_multinode_myllama.py --config config_hub/pretrain/myllama.yaml`. I once tried using `litgpt run`, but it did not work successfully, so based on the suggestions I changed it to `fabric run`.
Yeah, I am running the pretraining command.
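For reference, a typical two-node `fabric run` launch pairs the same command on both machines, differing only in `--node-rank`. A sketch based on the args above (`ip1` and the script/config paths are the user's placeholders, not verified values):

```shell
# On the main node (rank 0); ip1 must be this node's address, reachable from the other node
fabric run --node-rank=0 --main-address=ip1 --accelerator=cuda \
  --devices=8 --num-nodes=2 \
  litgpt/pretrain_multinode_myllama.py --config config_hub/pretrain/myllama.yaml

# On the second node (rank 1): identical command, only the rank changes
fabric run --node-rank=1 --main-address=ip1 --accelerator=cuda \
  --devices=8 --num-nodes=2 \
  litgpt/pretrain_multinode_myllama.py --config config_hub/pretrain/myllama.yaml
```

Both processes must be started; training blocks until all `--num-nodes` nodes have connected to the main address.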
Hello,
I was trying to train a 1.5B LLaMA model, but I observed an unexpected slowdown when using the FSDP strategy across two machines.
```
FLOPs not found for 'NVIDIA H800'
Measured TFLOPs: 2539.13
Epoch 1 | iter 16 step 1 | loss train: 8.515, val: n/a | iter time: 26133.73 ms (step) remaining time: 909 days, 3:20:37
Epoch 1 | iter 32 step 2 | loss train: 8.509, val: n/a | iter time: 26446.05 ms (step) remaining time: 635 days, 12:08:15
Epoch 1 | iter 48 step 3 | loss train: 8.491, val: n/a | iter time: 26204.95 ms (step) remaining time: 543 days, 2:07:38
Epoch 1 | iter 64 step 4 | loss train: 8.472, val: n/a | iter time: 26227.60 ms (step) remaining time: 496 days, 22:22:41
Epoch 1 | iter 80 step 5 | loss train: 8.492, val: n/a | iter time: 26297.45 ms (step) remaining time: 469 days, 9:35:18
Epoch 1 | iter 96 step 6 | loss train: 8.395, val: n/a | iter time: 25975.68 ms (step) remaining time: 450 days, 10:46:30
Epoch 1 | iter 112 step 7 | loss train: 8.383, val: n/a | iter time: 26152.08 ms (step) remaining time: 437 days, 4:40:59
Epoch 1 | iter 128 step 8 | loss train: 8.314, val: n/a | iter time: 26192.78 ms (step) remaining time: 427 days, 22:27:04
Epoch 1 | iter 144 step 9 | loss train: 8.411, val: n/a | iter time: 26267.13 ms (step) remaining time: 420 days, 6:28:56
```
When I train it on a single machine, the iter time is around 700 ms.
Could I get any idea about the reason, and how can I fix it? Thank you!
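The ~35x gap between single-node (~700 ms) and multi-node (~26 s) iteration times is consistent with the inter-node link being the bottleneck: FSDP all-gathers parameters for the forward and backward passes and reduce-scatters gradients every step, so roughly the full model size crosses the network several times per step. A back-of-envelope sketch, assuming bf16 weights and a plain 10 Gb/s Ethernet link between the nodes (both figures are assumptions, not measured from this setup):

```python
def fsdp_step_comm_seconds(n_params, bytes_per_param=2, link_gbps=10.0):
    """Rough lower bound on per-step FSDP communication time.

    FSDP moves ~the full parameter set twice (forward and backward
    all-gathers) plus the gradients once (reduce-scatter), so about
    3x the model size crosses the slowest link each optimizer step.
    """
    traffic_bytes = 3 * n_params * bytes_per_param
    link_bytes_per_s = link_gbps * 1e9 / 8  # Gb/s -> bytes/s
    return traffic_bytes / link_bytes_per_s

# 1.5B params in bf16 over 10 Gb/s Ethernet: ~7.2 s of traffic per step
t = fsdp_step_comm_seconds(1.5e9)
```

Under these assumptions, communication alone accounts for several seconds per step, which is the right order of magnitude for the observed ~26 s iterations. Checking the inter-node interconnect (e.g. with `NCCL_DEBUG=INFO` to see whether NCCL falls back to TCP sockets instead of InfiniBand/RoCE) would be a good next step.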