
A potential bug for multi-GPU training #1368

Open
zyushun opened this issue Apr 28, 2024 · 5 comments

zyushun commented Apr 28, 2024

Hi,

I found the following strange phenomena when running your code for tinyllama pretraining.

  1. When using multiple GPUs, I get completely different results when running the same code twice, and many loss spikes occur. See the example below for 2-card training. I use all the default settings, except that I shrink the learning rate from 4e-4 to 2e-4 and the batch size from 1024 to 512.

AdamW 2-card: run 1

wandb: 🚀 View run at https://wandb.ai/yushunzhang0410/pretrain-tiny-llama-1.1b/runs/83b8yfjz

AdamW 2-card: run 2

wandb: 🚀 View run at https://wandb.ai/yushunzhang0410/pretrain-tiny-llama-1.1b/runs/8p6axrgw

The two runs are completely different and the training fails.

  2. When simply switching the above settings to a single GPU, these issues do not occur. The two runs are mostly the same (with only slight differences) and the loss decreases stably without any spikes.

AdamW 1-card: run 1

wandb: 🚀 View run at https://wandb.ai/yushunzhang0410/pretrain-tiny-llama-1.1b/runs/kdg2qmj8

AdamW 1-card: run 2

wandb: 🚀 View run at https://wandb.ai/yushunzhang0410/pretrain-tiny-llama-1.1b/runs/vh23qd0u

The two runs are mostly the same and the loss decreases stably.

Have you encountered a similar issue? Any idea why?
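
For what it's worth, a rough sanity check along these lines could rule out the ranks starting from different weights. This is an illustrative sketch, not code from this repo: the check_param_sync helper is hypothetical, and it assumes DDP-style replication, since under a sharded strategy such as FSDP the per-rank sums would legitimately differ.

    import torch
    from lightning.fabric import Fabric

    def check_param_sync(fabric: Fabric, model: torch.nn.Module) -> None:
        # Collapse all parameters into a single scalar checksum on this rank.
        local_sum = torch.zeros((), device=fabric.device)
        for p in model.parameters():
            local_sum += p.detach().float().sum().to(fabric.device)
        # Gather one checksum per rank; with replicated (DDP-style) weights,
        # differing values would point to inconsistent initialization.
        all_sums = fabric.all_gather(local_sum)
        if fabric.global_rank == 0:
            print("per-rank parameter checksums:", all_sums.tolist())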

lantiga commented Apr 29, 2024

Thanks for the report. Can you try:

  - running with torch.compile but on PyTorch 2.3

Thanks a lot for investigating this.

cc @awaelchli for visibility

zyushun commented Apr 30, 2024

Hi,

I still encounter this issue when using your latest code from GitHub.

Four A800-80GB GPUs, AdamW, TinyLlama, all default settings. I did not change anything except the data path. I still encounter loss spikes, which do not exist in single-GPU training.

wandb: 🚀 View run at https://wandb.ai/yushunzhang0410/pretrain-tiny-llama-1.1b-litgpt-version/runs/bhiopo5z

[screenshot: wandb training loss curve showing loss spikes]

I simply used pip install 'litgpt[all]' to get all the dependencies, as you suggested on GitHub. I checked your default pretrain.py and found that I am using model.compile with PyTorch 2.3.0. This matches your suggestion of "running with torch.compile but on PyTorch 2.3".

What should I do now? Am I the only one encountering this issue? Do you see it on your side? I think you can easily reproduce it with git clone + pip install 'litgpt[all]' + run the code (just as I did).

awaelchli commented Apr 30, 2024

Your wandb log metadata suggests you are using lightning 2.2dev, which probably came with an older version of litgpt that you had. You might need this fix for pretraining, so I suggest updating lightning to the latest version first.
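
For reference, a quick way to confirm which versions the training process actually imports, independent of what pip or conda lists show (a generic check, nothing litgpt-specific):

    import lightning
    import torch

    # Print the versions the running process actually picks up; a ".dev"
    # suffix means a nightly/development build rather than a stable release.
    print("lightning:", lightning.__version__)
    print("torch:", torch.__version__)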

zyushun commented Apr 30, 2024

Hi,

Thanks for your prompt reply.

I think my lightning version is correct. My code runs in a fresh environment created yesterday, where I simply ran pip install 'litgpt[all]' as you suggested on GitHub. As confirmed in the conda list screenshot below, I am using lightning 2.3.0.dev20240328.

[screenshot: conda list output showing lightning 2.3.0.dev20240328]

Any other possible cause?

awaelchli commented:

The initialization fix I made was on April 11, so the package you have is still too old. The fix was then cherry-picked into lightning 2.2.2. So I would still update the package.
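
A rough way to check an installed build against that timeline from its version string alone, using the packaging library (illustrative; it assumes nightly builds are dated by their snapshot date and it does not inspect the code itself):

    from packaging.version import Version
    import lightning

    v = Version(lightning.__version__)
    # Stable releases from 2.2.2 onward carry the cherry-picked fix, while a
    # nightly (".devYYYYMMDD") needs a snapshot date on or after the April 11
    # merge, so 2.3.0.dev20240328 is flagged as too old here.
    has_fix = (not v.is_devrelease and v >= Version("2.2.2")) or (
        v.is_devrelease and v.dev is not None and v.dev >= 20240411
    )
    print(lightning.__version__, "includes the fix:", has_fix)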
