
Why is FSDPStrategy so slow when I use multiple machines? #1369

Open
Graduo opened this issue Apr 29, 2024 · 4 comments

Comments


Graduo commented Apr 29, 2024

Hello
I am trying to train a 1.5B LLaMA model, but I observed an unexpected slowdown when using the FSDP strategy across 2 machines.
```
FLOPs not found for 'NVIDIA H800'
Measured TFLOPs: 2539.13
Epoch 1 | iter 16 step 1 | loss train: 8.515, val: n/a | iter time: 26133.73 ms (step) remaining time: 909 days, 3:20:37
Epoch 1 | iter 32 step 2 | loss train: 8.509, val: n/a | iter time: 26446.05 ms (step) remaining time: 635 days, 12:08:15
Epoch 1 | iter 48 step 3 | loss train: 8.491, val: n/a | iter time: 26204.95 ms (step) remaining time: 543 days, 2:07:38
Epoch 1 | iter 64 step 4 | loss train: 8.472, val: n/a | iter time: 26227.60 ms (step) remaining time: 496 days, 22:22:41
Epoch 1 | iter 80 step 5 | loss train: 8.492, val: n/a | iter time: 26297.45 ms (step) remaining time: 469 days, 9:35:18
Epoch 1 | iter 96 step 6 | loss train: 8.395, val: n/a | iter time: 25975.68 ms (step) remaining time: 450 days, 10:46:30
Epoch 1 | iter 112 step 7 | loss train: 8.383, val: n/a | iter time: 26152.08 ms (step) remaining time: 437 days, 4:40:59
Epoch 1 | iter 128 step 8 | loss train: 8.314, val: n/a | iter time: 26192.78 ms (step) remaining time: 427 days, 22:27:04
Epoch 1 | iter 144 step 9 | loss train: 8.411, val: n/a | iter time: 26267.13 ms (step) remaining time: 420 days, 6:28:56
```
When training on a single machine, the iter time is around 700 ms.
Could I get any idea about the reason, and how can I fix it? Thank you!


lantiga commented Apr 29, 2024

Hi, can you post the CLI args or code you are using?
Also, is this with two machines and 8 GPUs per machine?


lantiga commented Apr 29, 2024

Just to confirm: are you running the pretraining command?

Maybe try to comment this line out: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/pretrain.py#L174

We have bumped into issues with PyTorch 2.2 and torch.compile recently; let's take that variable out of the equation.
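
In case it is useful, here is a minimal sketch of what that change amounts to, assuming the linked line is where the model gets wrapped with torch.compile (the exact code in litgpt/pretrain.py may differ):

```python
import torch
import torch.nn as nn

# Hypothetical toggle for illustration; in the actual script you would simply
# comment the torch.compile(...) call out.
use_compile = False

model = nn.Linear(8, 8)  # stand-in for the model built by the pretraining script
if use_compile:
    # The call the linked line performs; skipping it takes torch.compile
    # out of the equation.
    model = torch.compile(model)
```

If the multi-node slowdown disappears with compilation off, the problem is likely in the compile path rather than in FSDP itself.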


Graduo commented Apr 30, 2024

> Hi, can you post the CLI args or code you are using? Also, is this with two machines and 8 GPUs per machine?

Hi, thanks for your prompt reply! Yes, I use two machines and 8 GPUs per machine. I just use args like:
```
fabric run --node-rank=0 --main-address=ip1 --accelerator=cuda --devices=8 --num-nodes=2 litgpt/pretrain_multinode_myllama.py --config config_hub/pretrain/myllama.yaml

fabric run --node-rank=1 --main-address=ip1 --accelerator=cuda --devices=8 --num-nodes=2 litgpt/pretrain_multinode_myllama.py --config config_hub/pretrain/myllama.yaml
```

I once tried using `litgpt run`, but it did not work successfully. Based on the suggestions, I changed it to `fabric run`. The strategy is set up in the code as:

strategy = FSDPStrategy(auto_wrap_policy={Block}, state_dict_type="full", sharding_strategy="HYBRID_SHARD")
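
For context, here is a minimal sketch of how this strategy would be handed to Fabric for a 2-node, 8-GPU-per-node run, assuming the standard Lightning Fabric API (the actual litgpt pretraining script is more involved):

```python
import lightning as L
from lightning.fabric.strategies import FSDPStrategy
from litgpt.model import Block  # transformer block used as the FSDP wrapping unit

# Shard parameters at transformer-block granularity; HYBRID_SHARD shards
# within each node and replicates across nodes.
strategy = FSDPStrategy(
    auto_wrap_policy={Block},
    state_dict_type="full",  # gather a full (unsharded) state dict when saving
    sharding_strategy="HYBRID_SHARD",
)

fabric = L.Fabric(
    accelerator="cuda",
    devices=8,        # GPUs per node
    num_nodes=2,      # two machines, matching the fabric run flags above
    strategy=strategy,
)
fabric.launch()
```

With HYBRID_SHARD, every optimizer step involves a cross-node gradient all-reduce, so the inter-node interconnect is exercised on each step.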


Graduo commented Apr 30, 2024

> Just to confirm: are you running the pretraining command?
>
> Maybe try to comment this line out: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/pretrain.py#L174
>
> We have bumped into issues with PyTorch 2.2 and torch.compile recently; let's take that variable out of the equation.

Yeah, I am running the pretraining command.
